1 Introduction
Thanks to recent advances in machine learning research, predictive performance has continually improved, and machine learning has surpassed human performance on some tasks in computer vision and natural language processing [he2016, zhang2019]. As a result, machine learning has recently come into wide use in services and products.
Recently, in supervised learning, deep neural networks (DNNs) have been essential to achieving high predictive performance. In general, the benefits of DNNs are: 1) they automatically extract meaningful features from the original representation of inputs, for example, RGB values of pixels in images and sequences of words in documents; 2) they achieve high predictive performance using the extracted features; and 3) they connect feature extraction and prediction seamlessly and learn their parameters efficiently via backpropagation. However, because DNNs generally have an extremely large number of parameters and are composed of multiple nonlinear functions, it is difficult for humans to interpret how DNNs arrive at their predictions.
For reliability, it is important that the predictions made by machine learning methods are interpretable by humans, i.e., that one can understand which features are important for the predictions and how the predictions are made. Therefore, in recent years interpretable machine learning has been intensively studied not only in the machine learning community [wilson2017] but also in various other areas such as computer vision [escalante2018], natural language processing [mathews2019], and materials and medical science [dimiduk2018, tjoa2019]. The representative methods for interpretable machine learning are Local Interpretable Model-agnostic Explanations (LIME) [ribeiro2016] and SHapley Additive exPlanations (SHAP) [lundberg2017]. However, as stated in [ribeiro2018], these methods have two main drawbacks: 1) the pseudo-samples used for learning the interpretable model might not come from the original input distribution, and 2) learning the interpretable model is time-consuming because it is executed in the test phase.
Linear models with feature selection, such as Lasso [tibshirani1996], are easy for humans to interpret, as each weight indicates the importance of its corresponding feature. However, when the data distribution is intrinsically nonlinear, the predictive performance of linear models is low because a linear function is not flexible enough. Moreover, we sometimes need to transform the original representation so that it is compatible with linear models. For example, in text classification tasks, text documents are originally represented by word sequences of varying length, and they are transformed into fixed-length bag-of-words vectors whose length equals the vocabulary size. This transformation can remove important features included in the original representation.
To combine the high predictive performance of DNNs and the high interpretability of linear models in a single model, in this paper we propose Neural Generators of Sparse Local Linear models (NGSLLs). The NGSLL generates sparse linear weights for each sample using DNNs, and then makes predictions using the weights. An overview of the proposed model is illustrated in Figure 1. The model takes two types of representations as inputs: the original representation (e.g., a word sequence) and its simplified one (e.g., bag-of-words). From these inputs, it first generates sample-wise dense weights from the original representation using the Weight Generator Network (WGN). Then, the generated dense weights are fed into the Hot Gate Module (HGM) to generate a gate vector that marks the important features and eliminates the weights associated with the remaining irrelevant features. The dense weights become sparse through the element-wise product with the gate vector. Finally, the model outputs the prediction as the inner product between the sparse weights and the simplified representation. The parameters of the model are end-to-end trainable via backpropagation. To learn the parameters robustly when the number of selected features is set to quite a small value, we train the model with a coarse-to-fine training procedure.
The principal benefit of the NGSLL is that it can exploit the rich information in the original representation for prediction with the simplified representation. Consider sentiment classification on movie review texts, where we are given a sequence of words as the original representation and a binary bag-of-words vector as the simplified one. The sentiment polarity of a word changes with its context. For example, the polarity of the word “funny” should be positive in the context “This is a funny comedy”, and negative in the context “The actor is too bad to funny”. Although such differences in polarity cannot be reflected in the weights of ordinary linear models, the NGSLL can capture them by generating sample-wise weights using the WGN. Additionally, the generated weights are interpretable because they are the parameters of linear models and only a small number of weights are selected by the HGM.
We demonstrate the effectiveness of the NGSLL by conducting experiments on image and text classification tasks. The experimental results show that 1) its predictive performance is significantly better than that of interpretable linear models and of recent neural-network-based models for interpretable machine learning, 2) the sample-wise sparse weights generated by the NGSLL are appropriate and helpful for interpreting why each sample is classified as it is, and 3) prediction with the NGSLL is computationally more efficient than with LIME and SHAP.
2 Related Work
In this section, we introduce existing research related to our work and explain how our work differs from it.
In a model-agnostic setting, it is assumed that machine learning models with arbitrary architectures, such as DNNs and random forests, are acceptable as the prediction models, and that the parameters of the prediction models and their gradients are therefore not available for interpretation. Thus, the goal in this setting is to explain the prediction model by making use of the relationship between its inputs and the outputs of the prediction model for those inputs. For example, Chen et al. proposed explaining prediction models by finding important features in inputs based on the mutual information between their response variables and the outputs of the prediction model for those inputs [chen2018].
Recently, perturbation-based approaches have been well studied in the model-agnostic setting [ribeiro2016, lundberg2017, ribeiro2018, fong2017, adler2018]. These approaches attempt to explain the sensitivity of the prediction models to changes in their inputs by using pseudo-samples generated by perturbations. The representative methods are LIME [ribeiro2016] and SHAP [lundberg2017]. LIME randomly draws pseudo-samples from a distribution centered at a simplified input to be explained, and fits a linear model to predict the outputs of the prediction model for the pseudo-samples. SHAP builds upon the idea of LIME and uses Shapley values to quantify the importance of the features of a given input. However, as described in the introduction, these methods have drawbacks in that the pseudo-samples might not follow the original input distribution, and learning the linear model is time-consuming.
Our work does not take a model-agnostic approach; i.e., the prediction and interpretation models can be learned simultaneously, or the prediction model itself is interpretable. As the NGSLL finishes learning its parameters using only the original input data in the training phase, it avoids the drawbacks of the perturbation-based approach.
Attentive Mixtures of Experts (AME) is related to our work [schwab2019]. With AME, each feature in the input is associated with an expert, and the prediction is made by the weighted sum of the outputs of the experts. By learning the weights through optimizing an objective inspired by Granger causality, AME performs the selection of important features and the prediction simultaneously. The functional differences between the NGSLL and AME are that 1) the NGSLL can make use of the original representation of inputs, and 2) the NGSLL can specify the number of features used in prediction.
In the computer vision community, explanation methods that highlight the attentive regions of an image for a prediction generated by CNNs have been actively studied. For example, Class Activation Mapping (CAM) [zhou2016] computes channel-wise importance by inserting global average pooling before the last fully-connected layer of CNNs, and shows which regions are attentive by visualizing the sum of the channels weighted by their importance. Grad-CAM [selvaraju2017] is a popular gradient-based approach that determines the attentive regions based on the gradients of the outputs of CNNs. Although these approaches are successfully used in various applications with images, they focus specifically on explaining CNNs for images. Unlike these approaches, the NGSLL is a general framework for interpretable machine learning, which can be used independently of the type of data. Moreover, the NGSLL can incorporate domain-specific knowledge, e.g., CNNs optimized for images and LSTM-based architectures for natural language, into the WGN.
Finally, we describe the relationship between the NGSLL and local linear models, which assign weights to each sample. Some works proposed learning such models with regularization so that similar samples have similar weights [hallac2015, yamada2017]. In the setting of those studies, the similarity between samples is defined in advance. The NGSLL can be regarded as employing this regularization implicitly, without a predefined similarity, since similar sample-wise weights are generated for similar samples by the shared DNN used as the WGN.
3 Proposed Model
In this section, we describe the NGSLL in detail. For the sake of explanation, we mainly consider the problem of binary classification, but the model can easily be extended to regression and multi-class classification.
Suppose that we are given training data containing a set of samples, each consisting of two types of inputs and a class label. For ease of explanation, we omit the sample index in what follows. The first input is the original representation of a sample, which contains rich information about the sample. In the case of text classification, it is, for example, a sequence of words or characters. The second input is a simplified representation of the original one; it should capture the essence of the original representation, and the meaning of each of its dimensions should be easy to understand. For example, a bag-of-words representation is one such simplified representation for text. The remaining element of each sample is its class label. Here, if the simplified representation is sparse, a mask vector can be used to exclude predefined irrelevant features from it. When the simplified representation is a bag-of-words vector, the mask marks a dimension as excluded if that dimension of the simplified representation is zero, that is, its corresponding feature has zero frequency, and as available otherwise. In the test phase, we assume that both the original and the simplified representations of a test sample are given. Our aim is to generate sample-wise sparse linear weights such that high predictive performance and interpretability of the prediction are achieved simultaneously.
Before explaining the details of the NGSLL, we summarize the computation flow of the model from inputs to prediction with reference to Figure 1. First, the model generates a sample-wise weight vector from the original representation using the Weight Generator Network (WGN). Then, the weight vector and the mask vector are fed into the Hot Gate Module (HGM) to generate a gate vector, which is defined as
(1) 
Here, a gate value of one indicates that the gate associated with the corresponding dimension of the weight vector is open. The role of the HGM is to keep only the dimensions of the weight vector that are necessary for the prediction and to remove the remaining dimensions. Using the weight vector and the gate vector, the sample-wise sparse weight vector is obtained by
(2) 
where the weight vector and the gate vector are combined by the element-wise product. Finally, the prediction for the sample is made by the following equation:
(3) 
In the following two subsections, we describe the details of the two key parts of the NGSLL, the WGN and the HGM.
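The computation flow summarized above can be illustrated with a minimal NumPy sketch. The names `predict_with_sparse_weights`, `dense_weights`, `gate` and `simplified_input` are hypothetical, the example vectors merely stand in for the outputs of the WGN and the HGM, and the logistic link applied to the inner product is an assumption for the binary classification case rather than a detail confirmed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_with_sparse_weights(dense_weights, gate, simplified_input):
    """Gate the dense weights, then score the simplified input linearly."""
    sparse_weights = dense_weights * gate               # element-wise product, as in (2)
    score = float(sparse_weights @ simplified_input)    # inner product, as in (3)
    # Assumption: a logistic link turns the score into a class probability.
    return sparse_weights, sigmoid(score)

dense_weights = np.array([0.8, -1.5, 0.1, 2.0])     # stand-in for the WGN output
gate = np.array([0.0, 1.0, 0.0, 1.0])               # stand-in for the HGM output
simplified_input = np.array([1.0, 1.0, 0.0, 1.0])   # e.g., a binary bag-of-words vector

sparse_weights, prob = predict_with_sparse_weights(dense_weights, gate, simplified_input)
```

Only the two gated dimensions contribute to the score, so the nonzero entries of `sparse_weights` can be read directly as the evidence for the prediction.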
3.1 Weight Generator Network (WGN)
The WGN is used for generating a sample-wise weight vector from the original representation of a sample. More specifically, extracting features from the original representation and generating weights from the features are executed consecutively in the WGN. Since feature extraction is a common process for DNNs, we can reuse an existing network architecture and its pretrained model that work well for a supervised learning task related to the one we want to solve. Then, to transform the features into the weight vector, we stack fully-connected (FC) layers on top of the feature-extraction network.
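As a rough illustration of the second stage, the following NumPy sketch stacks FC layers on top of an already extracted feature vector and outputs one weight per dimension of the simplified representation. The function names, the ReLU activation between FC layers, the random initialization and all sizes are assumptions made only for this example; the feature extractor itself is replaced by a random vector.

```python
import numpy as np

def init_fc_stack(rng, in_dim, hidden_dim, out_dim, num_hidden_layers):
    """Create (weight matrix, bias) pairs for a stack of FC layers."""
    dims = [in_dim] + [hidden_dim] * num_hidden_layers + [out_dim]
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def fc_stack_forward(features, layers):
    """Map extracted features to a sample-wise weight vector."""
    h = features
    for i, (W, b) in enumerate(layers):
        h = h @ W + b
        if i < len(layers) - 1:
            h = np.maximum(h, 0.0)  # assumption: ReLU between FC layers
    return h                        # last layer is linear: weights may be negative

rng = np.random.default_rng(0)
features = rng.normal(size=32)      # stand-in for the feature extractor output
layers = init_fc_stack(rng, in_dim=32, hidden_dim=16, out_dim=8, num_hidden_layers=2)
weight_vector = fc_stack_forward(features, layers)
```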
3.2 Hot Gate Module (HGM)
A naive solution for making predictions using only important features is to take the largest absolute values of the generated weight vector and then use only the corresponding weights in (3). However, this solution does not necessarily select important features correctly, because it prevents the model from being learned in an end-to-end manner. To avoid this drawback, the NGSLL uses the HGM. Figure 2 illustrates an example of the behavior of the HGM.
The HGM is given two inputs: the weight vector generated by the WGN and the mask vector. When finding important features using the weight vector, its absolute values are meaningful. However, since the absolute value function is not differentiable at zero, we use the element-wise square instead. Generating a hot gate vector in which the dimensions associated with the largest squared values are one requires a non-differentiable operation, which prevents end-to-end training of the proposed model via backpropagation. To avoid this drawback, we employ Gumbel softmax [jang2016], a continuous approximation to sampling one-hot vectors from a categorical distribution. We generate the gate vector by sequentially sampling one-hot vectors using Gumbel softmax and then summing them up, that is,
(4) 
Here, each element of each sequentially sampled gate vector can be calculated as follows:
(5) 
where the output is controlled by a temperature parameter, and the noise term is a sample drawn from the standard Gumbel distribution, which can be obtained by the following process:
(6) 
As shown in (5), Gumbel softmax requires the logarithm of a normalized probability vector with one entry per dimension of the weight vector. In the NGSLL, we calculate it from the element-wise squared weights. Then, the model updates it at each step to prevent an element that is already one in a previously generated one-hot gate vector from being selected again. Each element of the updated vector at each step is calculated as follows:
(7) 
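where a masked element leads the corresponding gate value in (5) to be zero. Here, the mask vector at each step indicates whether each element has already been set to one in a previously generated gate vector. More specifically, given the mask vector and the gate vector at the current step, the next mask vector is obtained by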
(8) 
where the one-hot vector used in the update is one only at the element selected at the current step and zero at the other elements.
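The sequential sampling procedure of this subsection can be sketched in NumPy as follows. This is a forward-pass illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the probabilities are assumed to be the squared weights normalized over the dimensions that are still available, masked dimensions are given a log-probability of negative infinity, and the dimension removed after each step is taken to be the argmax of the relaxed one-hot sample.

```python
import numpy as np

def sample_gumbel(rng, size):
    # Standard Gumbel noise: -log(-log(u)) with u drawn uniformly from (0, 1).
    u = np.clip(rng.uniform(size=size), 1e-12, 1.0 - 1e-12)
    return -np.log(-np.log(u))

def gumbel_softmax(log_probs, temperature, rng):
    # Relaxed one-hot sample: softmax of (log-probabilities + Gumbel noise) / temperature.
    logits = (log_probs + sample_gumbel(rng, log_probs.shape)) / temperature
    logits = logits - logits.max()  # for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

def sample_k_hot_gate(weights, mask, k, temperature, rng):
    available = mask.astype(bool).copy()
    if k > available.sum():
        raise ValueError("k exceeds the number of unmasked dimensions")
    gate = np.zeros(weights.shape, dtype=float)
    chosen = []
    for _ in range(k):
        # Assumption: probabilities are the normalized squared weights
        # over the dimensions that are still available.
        squared = weights[available] ** 2 + 1e-12
        log_probs = np.full(weights.shape, -np.inf)
        log_probs[available] = np.log(squared / squared.sum())
        soft_one_hot = gumbel_softmax(log_probs, temperature, rng)
        gate += soft_one_hot                 # sum of the sampled gate vectors
        winner = int(np.argmax(soft_one_hot))
        chosen.append(winner)
        available[winner] = False            # never pick the same dimension twice
    return gate, chosen

rng = np.random.default_rng(0)
weights = np.array([0.1, -3.0, 0.2, 2.5, 0.0])
mask = np.array([1, 1, 1, 1, 0])             # last dimension is excluded in advance
gate, chosen = sample_k_hot_gate(weights, mask, k=2, temperature=0.1, rng=rng)
```

With a small temperature each relaxed sample is close to a one-hot vector, so the summed gate is close to a vector with exactly `k` ones; with a large temperature the samples spread over all available dimensions.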
3.3 Coarse-to-Fine Training
With the NGSLL, the parameters to be learned are only those of the WGN, and they can be estimated by minimizing the cross-entropy loss between the true label and the predicted one in the standard supervised way for deep learning.
However, when the number of selected features is set to quite a small value, ordinary training often did not work well in our preliminary experiments. This is because the gradients for the dimensions whose gate value is zero are not propagated to the WGN. More specifically, backpropagation requires the gradient of (2) with respect to the parameters of the WGN, which is calculated as follows:
(9) 
From this equation, the values of the gradient vector are determined by the gate vector, and the dimensions that are zero in the gate vector also become zero in the gradient vector. As a result, as the number of selected features gets smaller, many parameters cannot be updated sufficiently and learning converges to a poor local optimum. We overcome this difficulty by employing a coarse-to-fine training procedure in which a coarse training phase and a fine training phase are executed step by step.
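The vanishing-gradient effect described above can be checked numerically with a small sketch. It treats the gate vector as a constant, which is a simplification of the actual computation, and differentiates the linear score of the gated weights with respect to the dense weights by central finite differences; all names and values are hypothetical.

```python
import numpy as np

def score(weights, gate, simplified_input):
    # Linear score of the gated weights on the simplified representation.
    return float((weights * gate) @ simplified_input)

def numerical_grad(weights, gate, simplified_input, eps=1e-6):
    # Central finite differences with respect to each dense weight.
    grad = np.zeros_like(weights)
    for d in range(weights.size):
        step = np.zeros_like(weights)
        step[d] = eps
        grad[d] = (score(weights + step, gate, simplified_input)
                   - score(weights - step, gate, simplified_input)) / (2 * eps)
    return grad

weights = np.array([0.8, -1.5, 0.1, 2.0])
gate = np.array([0.0, 1.0, 0.0, 1.0])       # only two gates are open
simplified_input = np.array([1.0, 1.0, 1.0, 1.0])
grad = numerical_grad(weights, gate, simplified_input)
```

The gradient is exactly zero wherever the gate is closed, so the corresponding outputs of the WGN receive no learning signal from this sample.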
In the coarse training phase, we first set the number of selected features to a high value and the temperature parameter of Gumbel softmax (5) to a relatively large value. As the temperature gets larger, the gate vector becomes closer to a sample from a uniform distribution. By learning the parameters under such a setting, we can prevent the emergence of zero gradients. As a result, the parameters can be sufficiently updated, although the HGM has little effect on the parameters yet.
After the loss has converged in the coarse training phase, we move to the fine training phase. We reset the number of selected features and the temperature to the desired values, and then resume training the model from the point reached in the coarse training phase.
4 Experiments
In this section, we show experimental results on image and text classification tasks. The aim of the experiments is to show that: 1) the NGSLL can achieve high predictive performance; 2) the sparse weights generated by the NGSLL are appropriate and helpful for the interpretation of predictions; and 3) the NGSLL is computationally efficient in the test phase. All the experiments were run on a computer with an Intel Core i9 7900X 3.30GHz CPU, an NVIDIA GeForce RTX2080Ti GPU, and 64GB of main memory.
Setup of the NGSLL.
The hyperparameters of the NGSLL to be tuned are the number of FC layers connecting the WGN and the HGM, the number of units in the FC layers, and the temperature parameter of Gumbel softmax. For the number of FC layers and the number of units, we choose, from predefined candidate values, the ones that achieve the best accuracy on the validation set for each dataset. The temperature is set separately for the coarse and the fine training. The parameters of the NGSLL are optimized by Adam [kingma2014] with the default hyperparameters in the coarse training and by Momentum SGD [qian1999], with a specified decay rate of the first-order moment, in the fine training. Here, we set the learning rate of Momentum SGD to one tenth of the final learning rate of Adam.
NGSLL  0.994 ()  0.995 ()  0.994 ()
NGSLL without HGM  0.967 ()  0.991 ()  0.991 ()
Ridge  0.526 ()  0.561 ()  0.582 ()
Lasso  0.523 ()  0.549 ()  0.582 ()
AME ()  0.973 ()
AME ()  0.977 ()
DNN  0.992 ()
Table 1: Predictive accuracy on the binary-class MNIST dataset. The three columns correspond to the settings of the number of selected features. Numbers in brackets are standard deviations. Note that the accuracy of AME and DNN is constant across the columns.
Comparing models.
We compare the NGSLL with various interpretable models: Lasso [tibshirani1996], Ridge [hoerl1970], AME [schwab2019], LIME [ribeiro2016] and SHAP [lundberg2017]. Lasso and Ridge are linear models with L1 and L2 regularization, respectively. Unlike the NGSLL, the weights of these models are shared across all samples. We use scikit-learn (https://scikit-learn.org/stable/) for their implementations, and their hyperparameter, the strength of regularization, is optimized by cross validation. AME is a method for estimating feature importance with neural networks based on the attentive mixture of experts. We use the authors’ implementation (https://github.com/d909b/ame) for the MNIST dataset with a small change. AME can control the strength of the regularization for the feature importance through a hyperparameter. As in the original paper on AME, we set it to 0 and 0.05. LIME and SHAP are model-agnostic methods that learn sample-wise weights for each test sample using the predictions for its perturbed samples. We use the authors’ implementations of LIME (https://github.com/marcotcr/lime) and SHAP (https://github.com/slundberg/shap). As these methods do not produce prediction models, we use them only for comparing computational efficiency in the test phase. For reference, we also compare the NGSLL with a standard DNN that takes the original representation as input. The DNN architecture is almost the same as the WGN of the NGSLL, but its last FC layer directly outputs label probabilities instead of sample-wise weights. Since the DNN is not interpretable, it is inappropriate for our task, but it can be regarded as an upper bound for the predictive performance of the NGSLL. To evaluate the impact of the HGM, we also compare the NGSLL with the NGSLL without HGM. As the latter generates sample-wise weights without passing them through the HGM, its weights are dense.
For Lasso, Ridge and the NGSLL without HGM, accuracy at a given number of selected features is evaluated by using only that many weights with the largest absolute values.
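The truncation used for these dense baselines can be sketched as follows. The function name `keep_top_k_abs` and the example vector are hypothetical; the sketch simply zeroes out all but the `k` weights with the largest absolute values before prediction.

```python
import numpy as np

def keep_top_k_abs(weights, k):
    """Zero out all but the k entries with the largest absolute value."""
    if k >= weights.size:
        return weights.copy()
    keep = np.argsort(-np.abs(weights), kind="stable")[:k]
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep]
    return sparse

dense = np.array([0.3, -2.0, 0.05, 1.2, -0.7])
sparse = keep_top_k_abs(dense, k=2)
```

Note that this selection is applied after training, so the dense model is never optimized for the sparsity level at which it is evaluated.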
4.1 Handwritten Digits Image Classification
Dataset preparation.
In this experiment, we use MNIST [lecun1998], a standard benchmark dataset for handwritten digit classification. Although the original MNIST has 10 classes corresponding to the digits from 0 to 9, we split the classes into two groups, i.e., we assign one label to the images with digits 0 to 4 and the other label to the images with digits 5 to 9. This is to show that the NGSLL can generate sparse weights appropriate for each image, because the weights should differ even among images in the same class due to the large variation in their digits. We call this dataset binary-class MNIST. The original MNIST contains 60,000 training images and 10,000 test images. With the binary-class MNIST, we use 5,000 images randomly selected from the training images as validation images, and the remaining 55,000 images as training images. In this fashion, we create five training and validation sets with different random seeds to evaluate the robustness of the NGSLL. We use the original test images for evaluation. The images are grayscale and their resolution is 28 × 28. Although each pixel of the images has an integer value from 0 to 255, we rescale it and use the rescaled image as the original representation. Its simplified representation is constructed by downsampling the image to a lower resolution and flattening it into a vector.
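The preparation steps above can be sketched on a single synthetic image as follows. Several details are assumptions made only for illustration, since they are not recoverable from the text: the label values 0 and 1, rescaling pixel intensities to the range [0, 1], and downsampling by 2 × 2 average pooling to 14 × 14. The function names are hypothetical.

```python
import numpy as np

def to_binary_label(digit):
    # Assumption: label 0 for digits 0-4 and label 1 for digits 5-9.
    return int(digit >= 5)

def scale_pixels(image_uint8):
    # Assumption: integer intensities 0..255 are rescaled to the range [0, 1].
    return image_uint8.astype(np.float64) / 255.0

def downsample_and_flatten(image, factor=2):
    # Average pooling over non-overlapping factor x factor blocks, then flattening.
    height, width = image.shape
    pooled = image.reshape(height // factor, factor,
                           width // factor, factor).mean(axis=(1, 3))
    return pooled.reshape(-1)

image = np.zeros((28, 28), dtype=np.uint8)   # synthetic stand-in for an MNIST image
image[0:2, 0:2] = 255                        # one bright 2 x 2 block in the corner
original = scale_pixels(image)               # original representation
simplified = downsample_and_flatten(original)  # simplified representation
label = to_binary_label(7)
```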
WGN architecture.
We use a simple three-layer CNN architecture as the WGN. Each layer consists of a sequence of convolution, ReLU and max pooling modules. The same architecture is used in the DNN for fair comparison. The output of the third layer in the WGN is converted to the weight vector via FC layers.
NGSLL  2.55 ()  6.19 ()  10.76 ()
SHAP  699.65 ()
LIME  1109.95 ()
Table 2: Average running time taken to produce an interpretation for a sample. The three NGSLL columns correspond to the settings of the number of selected features.
Predictive performance.
Table 1 shows the predictive performance of each model on the binary-class MNIST dataset. The table shows the following three facts. First, the NGSLL outperforms the other models regardless of the number of selected features. As the accuracy of Ridge and Lasso shows, it is very hard to solve this task with such ordinary linear models on the simplified representation. Although the NGSLL also uses the simplified representation in prediction (3), it achieves high accuracy by using the sample-wise weights generated by the WGN. Second, the accuracy of the NGSLL is comparable to that of the DNN. This implies that the NGSLL can effectively transfer useful information contained in the original representation to the prediction with the simplified one. Third, even when the number of selected features is small, the NGSLL achieves high accuracy, whereas the accuracy of the NGSLL without HGM decreases. This is because, by employing the HGM, the NGSLL learns parameters optimized for the given number of selected features.
Weights visualization.
Figure 3 shows the weights generated by the NGSLL for an image of each digit. For each image in binary-class MNIST, it is important for correct prediction to assign weights with the sign of its label to the black pixels in its original representation. With the NGSLL without HGM, weights with the correct sign can be generated. However, it is difficult to interpret which regions in the image are important for prediction, since large weights are assigned to many regions. On the other hand, the NGSLL with the HGM, in both settings shown, appropriately generates sparse weights only on black pixels by capturing the shape of the digits in the images. Thanks to the sparse weights, one can easily understand which regions in the image are useful for prediction.
Computational time.
Table 2 shows the average running time taken to produce an interpretation for a sample. Note again that, as SHAP and LIME are model-agnostic, they learn a linear model for interpretation by making use of the outputs of the given prediction model for perturbed versions of the given sample. Here, the prediction model for SHAP and LIME in this experiment is a random forest, and its running time for a sample is 0.6 ms on average. As shown in the table, the NGSLL can generate linear weights for interpretation about 270 to 430 times faster than SHAP and LIME. We also found that the running time of the NGSLL increases with the number of selected features. This is because the HGM generates one-hot vectors repeatedly, and therefore the time increases linearly with the number of selected features.
4.2 Text Classification
Dataset preparation.
In this experiment, we use three text classification datasets, referred to as MPQA [wiebe2005], Subj [pang2002] and TREC [li2002]. MPQA is a dataset for predicting the opinion polarity (positive or negative) of a phrase. It contains 10,603 phrases in total. Subj is a dataset for predicting the subjectivity (subjective or objective) of a sentence about a movie. It contains 10,000 sentences in total. TREC is a dataset for predicting the type of answer expected by a factoid question sentence. It contains 5,952 sentences in total, and each sentence has one of six answer types: abbreviation, entity, description, human, location and numeric value. For each of the datasets, we create five sets by randomly splitting it into training, validation and test parts. We eliminate in advance stop words and the words appearing only once in each dataset. As a result, the vocabulary sizes of MPQA, Subj and TREC are 2,661, 9,956 and 3,164, respectively. Each sentence or phrase is used as a word sequence for the original representation, and as a bag-of-words vector for the simplified one. Here, when unknown words appear in a word sequence, we replace them with a special symbol.
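The text preprocessing described above can be sketched in plain Python as follows. The stop-word list, the `<unk>` symbol, the tokenized toy documents and the function names are all hypothetical, and the bag-of-words vector is assumed to be binary; whether rare words are dropped or mapped to the special symbol in the word sequence is also an assumption of this sketch.

```python
from collections import Counter

STOPWORDS = {"the", "is", "a", "of"}   # hypothetical, tiny stop-word list
UNK = "<unk>"                          # hypothetical special symbol for unknown words

def build_vocab(tokenized_docs):
    """Keep non-stop words that appear more than once in the whole dataset."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc
                     if tok not in STOPWORDS)
    kept = sorted(tok for tok, c in counts.items() if c > 1)
    return {tok: i for i, tok in enumerate(kept)}

def to_sequence(doc, vocab):
    """Original representation: word sequence with unknown words replaced."""
    return [tok if tok in vocab else UNK for tok in doc if tok not in STOPWORDS]

def to_bag_of_words(doc, vocab):
    """Simplified representation: binary vector over the vocabulary."""
    vec = [0] * len(vocab)
    for tok in doc:
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

docs = [["the", "movie", "is", "funny"],
        ["a", "funny", "comedy"],
        ["the", "movie", "of", "year"]]
vocab = build_vocab(docs)
sequence = to_sequence(docs[1], vocab)
bow = to_bag_of_words(docs[1], vocab)
```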
WGN architecture.
We use sentence CNN [kim2014], a popular and standard CNN architecture for sentence classification, as the WGN. All the parameters of the WGN are learned from scratch. As in the experiments on image classification, the same architecture is used in the DNN.
Predictive performance.
Figure 4 shows the accuracy of each model for varying numbers of selected features on the three text datasets. We obtained findings similar to those on the binary-class MNIST in the following two respects. First, the NGSLL outperforms Ridge and Lasso regardless of the number of selected features. This indicates that the WGN in the NGSLL takes advantage of the information in word sequences for accurate prediction. Second, the NGSLL maintains high accuracy even when the number of selected features becomes small, as shown especially on Subj and TREC. Conversely, a result that differs from that on the binary-class MNIST is that the accuracy of the NGSLL is lower than that of the DNN on MPQA and TREC, while their accuracies are almost the same on Subj. Although the reason is not obvious, this result suggests that an accuracy gap between the NGSLL and the DNN can emerge depending on the choice of dataset and WGN architecture.
Weights visualization.
Figure 5 shows visualizations of the sparse weights generated by the NGSLL with the HGM for sentences from TREC. As TREC is a task of predicting an answer type from a sentence, large weights should be assigned to words suitable for predicting the answer type. As shown in the results for a small number of selected features, the NGSLL can find the representative word for each answer type, e.g., “far” for answer type (b) numerical value. Although words irrelevant to the answer type tend to remain when the number of selected features becomes large, the representative word is consistently selected as an important one.
5 Conclusion
In this paper, we have proposed neural generators of sparse local linear models (NGSLLs), which combine the high predictive performance of DNNs with the high interpretability of linear models. The NGSLL generates sparse and interpretable linear weights for each sample using a DNN, and then makes predictions using the weights. Here, by using our gating module based on Gumbel softmax, only the weights associated with important features become large and the remaining weights shrink to zero. According to the experimental results on image and text classification tasks, we have found that 1) the NGSLL can robustly achieve high accuracy regardless of the number of selected features, 2) the NGSLL can provide reasonable evidence for its predictions through visualization of the weights, and 3) prediction by the NGSLL is computationally efficient.
Acknowledgment
A part of this work was supported by JSPS KAKENHI Grant Number JP18K18112.