1 Introduction
Recent research on interpretability explained the predictions in a posthoc manner: given a trained predictive model with predicted scores , use extracted information to explain how the model made predictions. Such extracted information can be analyzed and displayed by data analysis techniques, such as gradientbased methods [19, 17, 16, 22], and sensitivity analysis [14, 12], or by using a mimic model to minimize , such as tree and rulebased models [9, 2, 8, 21]. Summing up, the posthoc methods do not change or improve the underlying predictive model and use simpler forms, such as linear models, to explain the relationships in a complex predictive model, thereby making it easier for users to understand the interpretations.
Posthoc methods, though effective in interpreting models in an easytounderstand way, have two limitations: (a) When we use different posthoc methods to explain the same predictions, the explanations may be contradictory with each other [1]. The human cost for determining these methods and the samples to train the posthoc models are prohibitive with larger datasets. Moreover, it is unclear whether we can aggregate different posthoc explanations [18]; (b) The posthoc methods focus on using simpler models to provide a higher descriptive accuracy, which measures the degree to which a posthoc method properly describes the patterns learned by a predictive model. Usually, the simplicity of posthoc methods helps users understand the patterns, but provides imperfect representations of nonlinear relationships among variables in complex blackbox models. Their interpretability is obtained at the expense of prediction accuracy [13].
To overcome these limitations, there is a need for modelbased interpretations, which come from the construction of the prediction model [13]. Such models (a) have a predefined structure and can readily describe relationships between input features and predictions once the models are learned. It does not require the determination of the posthoc
models and is estimated on all training data, therefore it avoids the interpretations being changed dramatically when different methods or subsets of the data are used. (b) Such models should interpret the relationships between input features and predictions in a simple and explicit way, such as describing feature contributions via providing feature shapes that can be easily understood by users. In the meanwhile, it should be expressive enough to properly fit the data and achieve good prediction performance.
In this work, we propose a hybrid Piecewise Linear and Deep (PiLiD) model under a Wide Deep (short for WD) scheme [3] as shown in Figure 1. The proposed model is comprised of a piecewise linear component and a nonlinear component. The first component uses piecewise linear functions to approximate the complex relationships between the input features and predictions. Such a form is explicit enough to describe the feature contributions by providing feature shapes, yet increases the expressiveness of the model. The other component uses a multilayer perceptron (MLP) to capture the feature interactions and increase the prediction performance. The two components are jointly trained and the interpretability (in the form of piecewise linear functions) is obtained once the model is learned. This work has the following contributions:

We propose the PiLiD model that enhances the interpretability under a WD scheme. Different from the posthoc methods, the interpretability is obtained once the prediction model is learned.

The proposed PiLiD model is flexible for adaptations. We develop the PiLiB model, a variant to extract interpretable higherorder interactions.

We integrate a predefined interpretable structure into the linear component to decipher the contribution of features by piecewise linear approximation. Such a form provides explicit feature shapes while preserving high expressiveness.

As a result of the jointtraining scheme, we show that the model can describe complex feature shapes while improving model performance.
2 Related Work
2.1 Wide and Deep Scheme
The Wide Deep (WD) scheme jointly trains a linear model and a deep neural network to benefit from memorization and generalization [3]. This framework takes advantage of both the Wide (linear model) and Deep (deep neural network) components by jointly training both together, and thus outperforms the models with either Wide or Deep component in a largescale recommender system [9, 4, 23]. The Wide component of the WD scheme also provides interpretations of the features’ main effect. Hence W
D has been used to develop interpretable models. Tsang et al. tsang2017detecting developed a model that modified the Deep component to only extract pairwise interactions among features. Similarly, Kraus and Feuerriegel kraus2019forecasting changed the Deep component to a recurrent neural network to capture the temporal information. Wang and Lin
[21] proposed a hybrid model, in which an interpretable model and a blackbox model function together to obtain a good tradeoff between model transparency and performance. Effective as they are, these WDbased interpretable models’ prediction performance is worse than the complex blackbox model (the Deep part alone).2.2 Interpretable Methods
There are mainly two categories of interpretable machine learning methods. The first refers to
posthoc methods which initially learn an original predictive model and then explain how the model obtains the underlying predictions. This type of models usually fall into two subcategories, predictionlevel and datasetlevel interpretations [13].The predictionlevel methods focus on locally explaining why the model makes a particular prediction. They tell users what individual features or feature interactions are the most important by standard data analysis such as the gradientbased methods [19, 17, 16, 22], the sensitivity analysis [14, 12], the stepwise feature removal approaches [15], and the mimic models [9, 2, 8, 21]. In contrast, the datasetlevel methods are interested in global explanations that explore more general relationships learned by the models. Tan et al. tan2018learning described feature contributions by providing a global additive value function. Both prediction and datasetlevel methods suffer from the aforementioned drawbacks of posthoc scheme.
Another category of approaches predefines an interpretable structure, and provides insight into relationships between input features and predictions once the model is learned [6]. Generalized additive model (GAM) can model extreme nonlinearity on individual features [5, 11], but it cannot model feature interactions. In this regard, Lou et al. lou2013accurate developed the GAM model, which first learns a base GAM without any interactions and then selects a number of pairwise interactions that minimize residuals. Unfortunately, the obtained individual and interacting feature contributions can be extremely complex because there is no regularization on their shapes, and this model is slow to converge. For this reason, Tsang et al. tsang2018neural divided a neural network into several samesized blocks and used regularization to model both univariate and highorder interactions, however, only a subset of individual features can be interpreted and the accuracy decreases because the model does not capture the nonlinearity other than the learned feature interactions. The main challenge, as the main purpose of this paper, is to come up with a structure that is simple enough to help users understand the rationale behind the predictions, while keeping the model sophisticated to take care of the noninterpretable nonlinearity in the data.
3 The Proposed PiLiD Model
3.1 Problem Setting
Vectors are represented by boldface lowercase letters, for example x. Note that the index for th vector is in brackets, such as is the th data point, and the th element of a vector w is ; matrices are denoted by boldface capital letters, for example W and the element of W is . Let be a set of data samples, where is the th data point described by features, and is a real value (class label) to be predicted in regression (classification) problems.
We assume that there exists a mapping function describing the relationship between the prediction and the individual features (main effects) and interacting features (interaction effects). To predict , we define:
(1) 
where is a constant term^{2}^{2}2Although it can be left out in some prediction problems, for instance the binary classification problems where it does not affect the relative preference between two data samples. In this study, it can be decomposed into smaller values , and each of them is set as a constant term in the corresponding marginal value function., is a marginal value function of th feature, and is a function of all features. Eq.(1) describes (a) a regression model if is the identity, and (b) a classification model if is the logistic function of the identity.
Given the following definitions, we use piecewise linear functions to approximate the marginal value functions.
Definition 1.
The characteristic points are some predefined values to partition the whole feature value scale into several pieces.
For categorical features, the characteristic points are all unique feature values. Let be the ordered values of , where and . For numerical features, let , where and , be the whole evaluation value scale. We partition the scale into equal subintervals , where are characteristic points.
Definition 2.
The feature vector of data point is defined as follows:
(2) 
where and
Definition 3.
The marginal value vector u is defined as:
(3) 
where and , is the difference of marginal values between two consecutive characteristic points.
We use an MLP to approximate the interaction effects
. Here it is a feedforward neural network with
hidden layers. There are hidden units in the the layer and . The layer matrices are denoted by, and bias vectors are denoted by
. The nonlinear activation function is denoted as
. Let and be the final weight vector and bias for the output . In this way, the MLP with layers can be represented by:(4)  
(5)  
(6) 
3.2 Model Description and Training Process
According to Definition 2, the vectorized input layer in piecewise linear component transforms the original data into a ‘wider’ vector by piecewise linear partition. The input of each unit in this layer corresponds to an element in vector , and every units are decomposed from one feature. The first layer in the piecewise linear component has units and has a form , where and . There is no any activation function. The next layer groups every units ( groups in total), and then sums the outputs of units in each group. Note that when , the piecewise linear component PiLiD degenerates to a simple linear model providing a single value describing feature contributions, i.e., feature attributions. When , all feature values become characteristic points, PiLiD has to fit the curve pointbypoint. Thus, it can fit any curve given sufficient data. Without the constrain in , it may use an extremely complex shape to fit the curve, causing the overfitting problem.
The outputs of two components are fed into a specific loss function for joint training. The joint training process uses the minibatch stochastic optimization algorithm to backpropagate the gradients from both components at the same time. The general loss function is:
(8) 
where is a neural network with parameters, are the vectors of parameters in the piecewise linear component, are the parameters, including weights matrices , bias vectors , weight vector and bias term for the output of nonlinear component, is the loss function. Specifically, is the crossentropy loss for classification problems and the mean square error for regression problems. is a predefined coefficient for the regularization function that can be or regularization.
The gradientbased algorithms can make the optimal solution robust and fast to converge by initializing the parameters in a neural network [7]
. In this study, the parameters in the piecewise linear component are not randomly initialized because their optimal values are supposed to indicate the contributions of individual features to the predictions. For this reason, we use the parameters obtained by a simple linear regression to initialize the parameters in the piecewise linear component. Such initialization forces the optimization algorithm starts from a more promising and interpretable solution region. As for the nonlinear component, we use a standard Gaussian initialization. The pseudo code for learning the proposed model is presented in Algorithm
1 (for brevity, we raise a regression problem as an example):3.3 A Variant
To explore interacting effects in nonlinear component and demonstrate the flexibility of the proposed model, we substitute the nonlinear component with the stateoftheart Neural Interaction Transparency (NIT) model to decipher the highorder feature interactions [20].
The original NIT model partitions a whole feedforward neural network into several ‘blocks’ with the same size. It uses the regularization to force some hidden units to have zero weights in each ‘block’, and adds an extra term in the loss function to encourage the model to learn smallersized interactions. We refer interested readers to Tsang et al.tsang2018neural. In this work, we propose a variant of PiLiD, namely the Piecewise Linear and Blocks (PiLiB) to model all individual features and important interacting features.
The piecewise linear component in PiLiB is same as that in PiLiD. The nonlinear component is divided into samesized block (shown on the right in Figure 1). The loss function of PiLiD is:
(9)  
(10) 
Note that if a feature is active in a block, thus the estimated interaction order is defined as . The first term in limits the maximum interaction order to be equal to or less than . Different with Tsang et al. tsang2018neural, the second term encourages the model to learn pairwise interactions. In this way, we push the modelling of the main effects to the piecewise linear component and the pairwise interactions to the nonlinear component. The algorithm for training parameters in PiLiB is in supplementary materials.
# features  10  20  30  40  50 

PiLiD1  0.2677 0.0073  0.3041 0.0147  0.2770 0.0060  0.3406 0.0139  0.3522 0.0121 
PiLiD5  0.2666 0.0070  0.3042 0.0079  0.2752 0.0067  0.3318 0.0155  0.3553 0.0152 
PiLiD10  0.2678 0.0089  0.3060 0.0158  0.2793 0.0141  0.3383 0.0184  0.3491 0.0178 
PiLiD15  0.2662 0.0070  0.3039 0.0124  0.2760 0.0087  0.3291 0.0133  0.3473 0.0134 
PiLiD20  0.2657 0.0058  0.3023 0.0077  0.2793 0.0064  0.3296 0.0129  0.3448 0.0144 
MLP  0.2705 0.0133  0.3092 0.0067  0.2816 0.0035  0.3599 0.0218  0.3681 0.0236 
SVM (rbf kernel)  0.2801 0.0047  0.3262 0.0079  0.2856 0.0058  0.3935 0.0083  0.3865 0.0073 
Random Forest  0.2703 0.0052  0.3131 0.0121  0.3160 0.0069  0.4032 0.0081  0.3998 0.0083 
Linear Regression  0.3518 0.0053  0.3952 0.0095  0.2873 0.0058  0.3925 0.0082  0.4176 0.0098 
GAM  0.2976 0.0040  0.3250 0.0060  0.2804 0.0050  0.3808 0.0052  0.3726 0.0068 
# objects  Initi.  G. Ini.  Initi.  G. Ini.  Initi.  G. Ini.  Initi.  G. Ini.  Initi.  G. Ini. 

5000  0.3755  0.3860  0.3704  0.3625  0.3752  0.3909  0.3684  0.3603  0.3686  0.3666 
10000  0.3427  0.3612  0.3407  0.3305  0.3348  0.3443  0.3359  0.3477  0.3341  0.3428 
15000  0.3182  0.3105  0.3147  0.3251  0.3101  0.3155  0.3105  0.3176  0.3087  0.3275 
20000  0.3083  0.3273  0.3066  0.3091  0.3081  0.3209  0.3045  0.3091  0.3044  0.3055 
4 Experiments
Our experiments are conducted on synthetic and real datasets. The simulations aim to answer the following questions: (a) Can the proposed PiLiD accurately describe the actual marginal value functions while outperforming other predictive models? (b) Does the initialization process help improve interpretability and accuracy? The experiments on real datasets aim to show the advantages of PiLiD and PiLiB over several stateoftheart interpretable methods.
4.1 Simulations
The synthetic data is generated as follows: (a) Randomly determine the coefficients of polynomial functions (in 10 degrees) as the actual marginal value functions. (b) Randomly determine objects and their feature values from interval . (c) Use actual marginal value functions to calculate the marginal values of objects and randomly add some feature interactions. (d) For regression problem, is the summation of marginal values and selected interactions; For binary classification problem, we use a logistic function to assign 0 and 1 labels. In this section, we present the results of regression problems due to length limit. The classification problem has similar results, which are presented in the supplementary materials.
The MLP used in PiLiD has a 1002004004002001001 structure. We use ADAM with 0.005 learning rate as the standard optimizer in Algorithm 1. For each problem setting, we run 20 trials. 80% is selected as the training set and 20% as the testing set. We present experimental results of regression problems with in Table 1. The full simulation results can refer to supplementary materials.
When the synthetic data size is small, both MLP and PiLiD perform worse than classic statistical learning models (see results in supplementary materials when ), such as SVM and Random Forest. When the data size is larger, more data are fed into the neural network, as a result, both MLP and PiLiD perform better than statistical learning models (Table 1). Encouragingly, we find that PiLiD outperforms MLP in most tasks given the same neural network setting and the number of iterations. At last, as the increase in the number of predefined intervals , the piecewise linear component becomes more complex, and in turn increases the model accuracy. This indicates that the piecewise linear component could actually improve the performance if the learned interpretability is close to the actual one. Nevertheless, we suggest setting to avoid the overfitting problem and additional computational cost.
bike sharing (MSE)  bank marketing (AUC)  spambase (AUC)  skill (AUC)  

PiLiD  0.0680.0087  0.9380.0031  0.9780.0021  0.8770.0058 
PiLiB  0.1070.0109  0.9240.0028  0.9650.0047  0.8620.0047 
MLP  0.0710.0091  0.9290.0027  0.9660.0031  0.8640.0049 
NIT  0.0900.0088  0.8940.0367  0.9610.0039  0.8580.0051 
GA  0.1370.0023  0.8820.0098  0.9350.0023  0.8510.0037 
GAM  0.2150.0015  0.8560.0032  0.9230.0009  0.8110.0023 
GAM  0.1010.0021  0.9020.0055  0.9590.0013  0.8560.0038 
Summing up, the proposed PiLiD performs better because (a) the piecewise linear form increases model expressiveness; and (b) the proposed initialization in Algorithm 1 and the joint training process help obtain optimal solutions (see results in Table 2). To demonstrate that the proposed initialization helps obtain rational interpretations, we provide an example of learned marginal value functions by two initializations in Figure 2. Obviously, although Gaussian initialization can sometimes obtain results as accurate as the proposed initialization, it fails to properly describe the actual marginal value functions.
4.2 Real Data
Bike share  Bank market  Spambase  Skill  
# objects  17,379  40,787  4,601  3,302 
# features  15  48  57  18 
We conduct experiments using four real datasets to compare the prediction performance of the proposed PiLiD and PiLiB, and a set of baseline models including global additive explanation model (GA) [18], GAM [11], GAM [10], and NIT [20] models. The datasets are described in Table 4^{3}^{3}3The datasets can be downloaded from supplementary materials..
Table 3 presents the performance of different models. Comparing with NIT and MLP, the proposed PiLiD performs better because it benefits from the extra piecewise linear component that considers more complex feature shapes and the jointly training process. Comparing with the GAM family, although GAMs can model extremely complex feature shapes, they cannot account for possible higherorder interactions (except for GAM, which only deals with the pairwise interactions). Therefore, both PiLiD and PiLiB outperform all GAMs.
To demonstrate the interpretability of PiLiD and PiLiB, we present feature shapes of Age in dataset Skill given different . First, from Figure 3(a) and other plots, we stress the importance of providing feature shapes rather than only feature attributions that uses a single value to describe the feature contribution. All other plots () can capture the changing tendency of the relationship between Age and players’ skill estimation (prediction). Such tendency indicates that, at first, the aging process positively affects players’ skill because they can gain experience over time, but it changes when players are older because their response speed is affected. Second, as stated in Schwab and Karlen schwab2019cxplain, the explanations should be robust. From plots (b) to (e) in Figure 3, the obtained feature shapes have a similar tendency given predefined number of intervals. That is because the proposed initialization enforces the optimization to start from a more promising region, and thus makes the interpretability more robust and stable.
In Figure 4, we present an example of the learned interacting effects by PiLiB
. These relationships help to understand why an email is classified as spam. For example, an email is more likely to be a spam if we observe the overuse of continuous capital letters and less use of the term
receive. Interpreting and applying these extracted interactions needs further examinations, and will be our future work.5 Conclusion
We propose a hybrid interpretable model that uses piecewise linear functions to approximate the individual feature contributions. It is flexible to adapt for other model structures. The experiments demonstrate that the model is explicit enough for users to understand and also has stateoftheart prediction performance. This research shed new light on the joint learning for interpretability and predictability, and the feasibility of using the learned interpretability to enhance the prediction performance.
6 Supplementary Material
6.1 Algorithm for Training PiLiB
Given loss function , the training process of PiLiB is in two phases. At the first stage, we jointly train the linear component with parameter regularization and deep component with only regularization. After the maximum interaction order in the deep component is smaller or equal to the allowable , we go to next phase. In the second phase, the trained ‘masks’ for the first layer are fixed and on this basis, we optimize all parameters in the model with standard regularization. The training process can be summarized by Algorithm 2.
6.2 Extend Simulation Results
Here we present full simulation results of regression and classification problems. Tables 5 to 10 present the simulation results given different numbers of objects and features for two problems. We emphasize three interesting patterns:

As stated in main paper, when the dataset is smaller, for example , since the proposed PiLiD
is under a deep learning scheme, it requires sufficient data to train parameters. Therefore,
PiLiD and MLP’s performance on simulation data are worse than some traditional predictive models. However, when the datasets are larger (), the PiLiD perform better. 
In classification problems, the performance of the proposed PiLiD is on a par with or comparable to MLP. The proposed PiLiD performs better than MLP given same regression problem settings. That is because the pretriaing process in Algorithm 1 helps PiLiD find better solutions. Moreover, the piecewise linear form also increases the expressiveness of the entire model, thereby being more powerful than a pure MLP.

There is a tendency that the predictive performance increases along with the increase of predefined number of intervals. It makes sense because the larger number of intervals, the more complex the linear component. As stated in main paper, an extreme simulation is let , then the linear component becomes a GAM, which is more powerful than linear models. However, larger number of intervals will require more computational cost and data samples, as a tradeoff, we suggest using .
# features  10  20  30  40  50 

PiLiD1  0.3002 0.0258  0.2894 0.0155  0.4217 0.0377  0.4264 0.0230  0.4397 0.0417 
PiLiD5  0.2976 0.0162  0.2920 0.0147  0.4158 0.0387  0.4252 0.0260  0.4214 0.0237 
PiLiD10  0.2993 0.0256  0.2906 0.0152  0.4206 0.0458  0.4238 0.0328  0.4415 0.0440 
PiLiD15  0.2955 0.0341  0.2896 0.0150  0.4056 0.0349  0.4196 0.0246  0.4316 0.0434 
PiLiD20  0.2954 0.0262  0.2917 0.0180  0.4133 0.0539  0.4163 0.0250  0.4262 0.0314 
MLP  0.3108 0.0402  0.3122 0.0223  0.4623 0.0418  0.4782 0.0643  0.4832 0.0563 
SVM (rbf kernel)  0.6951 0.0675  0.2803 0.0139  0.4649 0.0309  0.3759 0.0168  0.3935 0.0202 
Random Forest  0.2742 0.0119  0.3330 0.0152  0.4217 0.0251  0.5448 0.0212  0.6183 0.0264 
Linear Regression  0.7195 0.0545  0.4768 0.0150  0.6327 0.0371  0.6167 0.0392  0.6370 0.0216 
GAM  0.3302 0.0301  0.3821 0.0150  0.4767 0.0309  0.4866 0.0249  0.4984 0.0320 
# features  10  20  30  40  50 

PiLiD1  0.2839 0.0179  0.2940 0.0142  0.3376 0.0134  0.3420 0.0245  0.4561 0.0344 
PiLiD5  0.2795 0.0098  0.3035 0.0267  0.3366 0.0253  0.3361 0.0110  0.4478 0.0434 
PiLiD10  0.2818 0.0145  0.2904 0.0106  0.3310 0.0148  0.3393 0.0197  0.4314 0.0185 
PiLiD15  0.2771 0.0066  0.2942 0.0146  0.3386 0.0275  0.3340 0.0167  0.4357 0.0321 
PiLiD20  0.2770 0.0087  0.2952 0.0137  0.3279 0.0121  0.3384 0.0219  0.4321 0.0598 
MLP  0.2846 0.0171  0.3036 0.0175  0.3419 0.0213  0.3627 0.0141  0.4799 0.0441 
SVM (rbf kernel)  0.2973 0.0066  0.4005 0.0144  0.3140 0.0069  0.3815 0.0111  0.4877 0.0126 
Random Forest  0.2710 0.0060  0.3129 0.0059  0.3891 0.0104  0.3923 0.0092  0.5996 0.0168 
Linear Rregression  0.3458 0.0066  0.5705 0.0201  0.3718 0.0092  0.3834 0.0111  0.4894 0.0121 
GAM  0.3240 0.0043  0.4106 0.0233  0.3570 0.0064  0.3679 0.0129  0.4755 0.0203 
# features  10  20  30  40  50 

PiLiD1  0.2593 0.0115  0.2972 0.0087  0.3496 0.0539  0.3380 0.0170  0.3470 0.0295 
PiLiD5  0.2570 0.0069  0.2999 0.0115  0.3312 0.0141  0.3412 0.0274  0.3439 0.0203 
PiLiD10  0.2567 0.0071  0.3017 0.0108  0.3214 0.0111  0.3309 0.0127  0.3398 0.0122 
PiLiD15  0.2569 0.0076  0.2974 0.0100  0.3226 0.0124  0.3386 0.0270  0.3370 0.0259 
PiLiD20  0.2592 0.0084  0.2963 0.0116  0.3215 0.0116  0.3296 0.0199  0.3367 0.0156 
MLP  0.2592 0.0064  0.3065 0.0092  0.3463 0.0172  0.3411 0.0164  0.3500 0.0194 
SVM (rbf kernel)  0.2598 0.0071  0.3034 0.0065  0.3325 0.0076  0.3426 0.0077  0.3520 0.0077 
Random Forest  0.2619 0.0096  0.3079 0.0068  0.3357 0.0072  0.4424 0.0116  0.4146 0.0114 
Linear Regression  0.3006 0.0080  0.3605 0.0077  0.4165 0.0097  0.4664 0.0095  0.4574 0.0088 
GAM  0.2790 0.0072  0.3301 0.0070  0.3715 0.0088  0.4069 0.0096  0.3984 0.0088 
# features  10  20  30  40  50 

PiWiLiD1  0.8405 0.0018  0.9079 0.0041  0.9241 0.0020  0.8870 0.0018  0.9137 0.0068 
PiWiLiD5  0.8439 0.0019  0.9093 0.0039  0.9241 0.0022  0.8854 0.0017  0.9128 0.0073 
PiWiLiD10  0.8441 0.0018  0.9076 0.0040  0.9226 0.0024  0.8870 0.0023  0.9132 0.0070 
PiWiLiD15  0.8432 0.0017  0.9075 0.0043  0.9244 0.0025  0.8857 0.0019  0.9136 0.0077 
PiWiLiD20  0.8455 0.0019  0.9085 0.0040  0.9221 0.0024  0.8872 0.0018  0.9142 0.0071 
MLP  0.8428 0.0043  0.9071 0.0036  0.9225 0.0056  0.8867 0.0029  0.9146 0.0065 
SVM (rbf kernel)  0.8029 0.0223  0.8357 0.0078  0.8943 0.0014  0.8468 0.0016  0.8637 0.0013 
Random Forest  0.8503 0.0036  0.8997 0.0094  0.9264 0.0064  0.8929 0.0056  0.9088 0.0054 
Logistic Regression  0.7921 0.0013  0.8074 0.0055  0.8633 0.0021  0.7989 0.0019  0.8215 0.0015 
GAM  0.8398 0.0014  0.8867 0.0028  0.9119 0.0036  0.8386 0.0025  0.8953 0.0034 
# features  10  20  30  40  50 

PiWiLiD1  0.8401 0.0077  0.9458 0.0044  0.9271 0.0056  0.9116 0.0713  0.9404 0.0073 
PiWiLiD5  0.8402 0.0079  0.9454 0.0043  0.9272 0.0052  0.9085 0.0649  0.9403 0.0073 
PiWiLiD10  0.8401 0.0079  0.9456 0.0042  0.9271 0.0054  0.9136 0.0473  0.9408 0.0072 
PiWiLiD15  0.8399 0.0078  0.9479 0.0040  0.9275 0.0055  0.9071 0.0594  0.9409 0.0074 
PiWiLiD20  0.8412 0.0082  0.9481 0.0040  0.9276 0.0054  0.9128 0.0458  0.9410 0.0071 
MLP  0.8428 0.0080  0.9463 0.0041  0.9265 0.0055  0.9107 0.0679  0.9418 0.0079 
SVM (rbf kernel)  0.7229 0.0112  0.8526 0.0078  0.8094 0.0101  0.7940 0.0646  0.8874 0.0023 
Random Forest  0.7403 0.0099  0.8297 0.0067  0.8738 0.0071  0.8653 0.0462  0.8588 0.0094 
Logistic Regression  0.7866 0.0113  0.8044 0.0065  0.6263 0.0162  0.6953 0.0570  0.8195 0.0035 
GAM  0.7689 0.0074  0.8750 0.0058  0.9175 0.0046  0.8756 0.0652  0.9212 0.0074 
# features  10  20  30  40  50 

PiWiLiD1  0.7281 0.0061  0.9064 0.0027  0.9057 0.0042  0.9406 0.0028  0.9206 0.0047 
PiWiLiD5  0.7307 0.0061  0.9065 0.0027  0.9059 0.0042  0.9411 0.0031  0.9205 0.0056 
PiWiLiD10  0.7320 0.0071  0.9066 0.0028  0.9064 0.0039  0.9406 0.0030  0.9209 0.0054 
PiWiLiD15  0.7329 0.0070  0.9066 0.0029  0.9067 0.0042  0.9411 0.0031  0.9212 0.0074 
PiWiLiD20  0.7337 0.0067  0.9068 0.0029  0.9081 0.0040  0.9420 0.0030  0.9215 0.0071 
MLP  0.7297 0.0057  0.9086 0.0032  0.9078 0.0038  0.9417 0.0032  0.9211 0.0048 
SVM (rbf kernel)  0.7077 0.0086  0.7782 0.0042  0.8094 0.0101  0.8007 0.0066  0.8874 0.0023 
Random Forest  0.7103 0.0086  0.7790 0.0080  0.7484 0.0074  0.8043 0.0102  0.8588 0.0094 
Logistic Regression  0.6082 0.0049  0.7579 0.0030  0.7138 0.0043  0.7981 0.0049  0.8096 0.0086 
GAM  0.6808 0.0072  0.8349 0.0041  0.8520 0.0043  0.8854 0.0047  0.8644 0.0084 
6.3 Realworld Datasets
The datasets used in main paper can be downloaded from the following websites:

Bank marketing: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
References
 [1] (2018) Towards robust interpretability with selfexplaining neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 7786–7795. Cited by: §1.
 [2] (2016) Interpretable deep models for icu outcome prediction. In AMIA Annual Symposium Proceedings, Vol. 2016, pp. 371. Cited by: §1, §2.2.
 [3] (2016) Wide and deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10. Cited by: §1, §2.1.
 [4] (2018) DeepFM: an endtoend wide & deep learning framework for ctr prediction. arXiv preprint arXiv:1804.04950. Cited by: §2.1.
 [5] (1986) Generalized additive models. Statistical Science 1 (3), pp. 297–318. Cited by: §2.2.

[6]
(2019)
Forecasting remaining useful life: interpretable deep learning approach via variational bayesian inferences
. Decision Support Systems 125, pp. 113100. Cited by: §2.2.  [7] (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §3.2.
 [8] (2015) Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model. The Annals of Applied Statistics 9 (3), pp. 1350–1371. Cited by: §1, §2.2.

[9]
(2019)
Combining decision trees and neural networks for learningtorank in personal search
. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2032–2040. Cited by: §1, §2.1, §2.2.  [10] (2013) Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 623–631. Cited by: §4.2.
 [11] (2012) Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 150–158. Cited by: §2.2, §4.2.
 [12] (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §1, §2.2.
 [13] (2019) Interpretable machine learning: definitions, methods, and applications. arXiv preprint arXiv:1901.04592. Cited by: §1, §1, §2.2.
 [14] (2016) Why should I trust you? explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §1, §2.2.
 [15] (2019) CXPlain: causal explanations for model interpretation under uncertainty. In Advances in Neural Information Processing Systems, pp. 10220–10230. Cited by: §2.2.
 [16] (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 3145–3153. Cited by: §1, §2.2.
 [17] (2017) Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 3319–3328. Cited by: §1, §2.2.
 [18] (2018) Learning global additive explanations for neural nets using model distillation. arXiv preprint arXiv:1801.08640. Cited by: §1, §4.2.
 [19] (2018) Detecting statistical interactions from neural network weights. International Conference on Learning Representations. Cited by: §1, §2.2.
 [20] (2018) Neural interaction transparency (nit): disentangling learned interactions for improved interpretability. In Advances in Neural Information Processing Systems, pp. 5804–5813. Cited by: §3.3, §4.2.
 [21] (2019) Hybrid predictive model: when an interpretable model collaborates with a blackbox model. Journal of Machine Learning Research 1, pp. 1–48. Cited by: §1, §2.1, §2.2.
 [22] (2015) Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579. Cited by: §1, §2.2.

[23]
(2017)
Wide and deep convolutional neural networks for electricitytheft detection to secure smart grids
. IEEE Transactions on Industrial Informatics 14 (4), pp. 1606–1615. Cited by: §2.1.