1. Introduction
Machine learning systems are routinely making significant decisions in important domains such as medical practice, autonomous driving, criminal justice, and military decision making (Goodfellow et al., 2016). As the impact of machine-made decisions increases, the demand for clear interpretations of machine learning systems grows ever stronger against the blind deployment of decision machines (Goodman and Flaxman, 2016). Accurately and reliably interpreting a machine learning model is key to many significant tasks, such as identifying failure modes (Agrawal et al., 2016), building trust with human users (Ribeiro et al., 2016), discovering new knowledge (Rather et al., 2017), and avoiding unfairness issues (Zemel et al., 2013).

The interpretation problem of machine learning models has been studied for decades. Conventional models, such as Logistic Regression and Support Vector Machines, have been well interpreted from both practical and theoretical perspectives (Bishop, 2007).
(Bishop, 2007). Powerful nonnegative and sparse constraints are also developed to enhance the interpretability of conventional models by sparse feature selection
(Lee et al., 2007; Hoyer, 2002). However, due to the complex network structure of a deep neural network, the interpretation problem of modern deep models is yet a challenging field that awaits further exploration.As to be reviewed in Section 2, the existing studies interpret a deep neural network in three major ways. The hidden neuron analysis methods (Mahendran and Vedaldi, 2015; Yosinski et al., 2015; Ngiam et al., 2011; Dosovitskiy and Brox, 2016) analyze and visualize the features learned by the hidden neurons of a neural network; the model mimicking methods (Ba and Caruana, 2014; Che et al., 2015; Hinton et al., 2015; Bastani et al., 2017) build a transparent model to imitate the classification function of a deep neural network; the local explanation methods (Shrikumar et al., 2017; Fong and Vedaldi, 2017; Sundararajan et al., 2017; Smilkov et al., 2017) study the predictions on local perturbations of an input instance, so as to provide decision features for interpretation. All these methods gain useful insights into the mechanism of deep models. However, there is no guarantee that what they compute as an interpretation is truthfully the exact behavior of a deep neural network. As demonstrated by Ghorbani (Ghorbani et al., 2017), most existing interpretation methods are inconsistent and fragile, because two perceptively indistinguishable instances with the same prediction result can be easily manipulated to have dramatically different interpretations.
Can we compute an exact and consistent interpretation for a pretrained deep neural network? In this paper, we provide an affirmative answer, as well as an elegant closed form solution for the family of piecewise linear neural networks. Here, a piecewise linear neural network (PLNN) (Harvey et al., 2017) is a neural network that adopts a piecewise linear activation function, such as MaxOut (Goodfellow et al., 2013) and the family of ReLU (Glorot et al., 2011; Nair and Hinton, 2010; He et al., 2015). The wide applications (LeCun et al., 2015) and great practical successes (Krizhevsky et al., 2012) of PLNNs call for exact and consistent interpretations of the overall behaviour of this type of neural network.

We make the following technical contributions. First, we prove that a PLNN is mathematically equivalent to a set of local linear classifiers, each of which is a linear classifier that classifies a group of instances within a convex polytope in the input space. Second, we propose a method named OpenBox to provide an exact interpretation of a PLNN by computing its equivalent set of local linear classifiers in closed form. Third, we interpret the classification result of each instance by the decision features of its local linear classifier. Since all instances in the same convex polytope share the same local linear classifier, our interpretations are consistent per convex polytope. Fourth, we apply OpenBox to study the effect of nonnegative and sparse constraints on the interpretability of PLNNs, and find that a PLNN trained with these constraints selects meaningful features that dramatically improve interpretability. Last, we conduct extensive experiments on both synthetic and real-world data sets to verify the effectiveness of our method.
2. Related Works
How to interpret the overall mechanism of deep neural networks is an emerging and challenging problem.
2.1. Hidden Neuron Analysis Methods
The hidden neuron analysis methods (Mahendran and Vedaldi, 2015; Yosinski et al., 2015; Ngiam et al., 2011; Dosovitskiy and Brox, 2016) interpret a pretrained deep neural network by visualizing, revert-mapping or labeling the features that are learned by the hidden neurons.
Yosinski et al. (Yosinski et al., 2015) visualized the live activations of the hidden neurons of a ConvNet, and proposed a regularized optimization to produce a qualitatively better visualization. Erhan et al. (Erhan et al., 2009) proposed an activation maximization method and a unit sampling method to visualize the features learned by hidden neurons. Cao et al. (Cao et al., 2015) visualized a neural network’s attention on its target objects by a feedback loop that infers the activation status of the hidden neurons. Li et al. (Li et al., 2015)
visualized the compositionality of clauses by analyzing the outputs of hidden neurons in a neural model for Natural Language Processing.
To understand the features learned by the hidden neurons, Mahendran et al. (Mahendran and Vedaldi, 2015) proposed a general framework that revert-maps the features learned from an image to reconstruct the image. Dosovitskiy et al. (Dosovitskiy and Brox, 2016) performed the same task as Mahendran et al. (Mahendran and Vedaldi, 2015) did by training an up-convolutional neural network.
Zhou et al. (Zhou et al., 2017) interpreted a CNN by labeling each hidden neuron with a best-aligned human-understandable semantic concept. However, it is hard to obtain a gold-standard data set with accurate and complete labels for all human semantic concepts.
The hidden neuron analysis methods provide useful qualitative insights into the properties of each hidden neuron. However, qualitatively analyzing every neuron does not provide much actionable and quantitative interpretation about the overall mechanism of the entire neural network (Frosst and Hinton, 2017).
2.2. Model Mimicking Methods
By imitating the classification function of a neural network, the model mimicking methods (Ba and Caruana, 2014; Che et al., 2015; Hinton et al., 2015; Bastani et al., 2017) build a transparent model that is easy to interpret and achieves a high classification accuracy.
Ba et al. (Ba and Caruana, 2014) proposed a model compression method to train a shallow mimic network using the training instances labeled by one or more deep neural networks. Hinton et al. (Hinton et al., 2015)
proposed a distillation method that distills the knowledge of a large neural network by training a relatively small network to mimic the prediction probabilities of the original large network. To improve the interpretability of the distilled knowledge, Frosst and Hinton (Frosst and Hinton, 2017) extended the distillation method (Hinton et al., 2015) by training a soft decision tree to mimic the prediction probabilities of a deep neural network.
Che et al. (Che et al., 2015) proposed a mimic learning method to learn interpretable phenotype features. Wu et al. (Wu et al., 2018) proposed a tree regularization method that uses a binary decision tree to mimic and regularize the classification function of a deep time-series model.
The mimic models built by model mimicking methods are much simpler to interpret than deep neural networks. However, due to the reduced model complexity of a mimic model, there is no guarantee that a deep neural network with a large VC-dimension (Sontag, 1998; Koiran and Sontag, 1996; Harvey et al., 2017) can be successfully imitated by a simpler shallow model. Thus, there is always a gap between the interpretation of a mimic model and the actual overall mechanism of the target deep neural network.
2.3. Local Interpretation Methods
The local interpretation methods (Shrikumar et al., 2017; Fong and Vedaldi, 2017; Sundararajan et al., 2017; Smilkov et al., 2017) compute and visualize the important features for an input instance by analyzing the predictions of its local perturbations.
Simonyan et al. (Simonyan et al., 2013) generated a class-representative image and a class-saliency map for each class of images by computing the gradient of the class score with respect to an input image. Ribeiro et al. (Ribeiro et al., 2016) proposed LIME to interpret the predictions of any classifier by learning an interpretable model in the local region around an input instance.
Zhou et al. (Zhou et al., 2016) proposed CAM to identify discriminative image regions for each class of images using the global average pooling in CNNs. Selvaraju et al. (Selvaraju et al., 2016) generalized CAM (Zhou et al., 2016) with Grad-CAM, which identifies important regions of an image by flowing class-specific gradients into the final convolutional layer of a CNN.
Koh et al. (Koh and Liang, 2017) used influence functions to trace a model’s prediction and identify the training instances that are the most responsible for the prediction.
The local interpretation methods generate an insightful individual interpretation for each input instance. However, the interpretations for perceptively indistinguishable instances may not be consistent (Ghorbani et al., 2017), and can be purposefully manipulated by a simple transformation of the input instance without affecting the prediction result (Kindermans et al., 2017).
3. Problem Definition
For a PLNN $N$ that contains $L$ layers of neurons, we write the $l$-th layer of $N$ as $N_l$. Hence, $N_1$ is the input layer, $N_L$ is the output layer, and the other layers $N_l$, $l \in \{2, \ldots, L-1\}$, are hidden layers. A neuron in a hidden layer is called a hidden neuron. Let $n_l$ represent the number of neurons in $N_l$; the total number of hidden neurons in $N$ is computed by $n = \sum_{l=2}^{L-1} n_l$.

Denote by $u_{l,i}$ the $i$-th neuron in $N_l$, by $b_{l,i}$ its bias, by $o_{l,i}$ its output, and by $z_{l,i}$ the total weighted sum of its inputs. For all the neurons in $N_l$, we write their biases as a vector $\mathbf{b}_l = [b_{l,1}, \ldots, b_{l,n_l}]^\top$, their outputs as a vector $\mathbf{o}_l = [o_{l,1}, \ldots, o_{l,n_l}]^\top$, and their inputs as a vector $\mathbf{z}_l = [z_{l,1}, \ldots, z_{l,n_l}]^\top$.
Neurons in successive layers are connected by weighted edges. Denote by $W_l[i,j]$ the weight of the edge between the $j$-th neuron in $N_{l-1}$ and the $i$-th neuron in $N_l$; that is, $W_l$ is an $n_l$-by-$n_{l-1}$ matrix. For $l \in \{2, \ldots, L\}$, we compute $\mathbf{z}_l$ by

$$\mathbf{z}_l = W_l \mathbf{o}_{l-1} + \mathbf{b}_l. \qquad (1)$$
Denote by $f$ the piecewise linear activation function for each neuron in the hidden layers of $N$. We have $o_{l,i} = f(z_{l,i})$ for all $l \in \{2, \ldots, L-1\}$. We extend $f$ to apply to vectors in an elementwise fashion, such that $f(\mathbf{z}_l) = [f(z_{l,1}), \ldots, f(z_{l,n_l})]^\top$. Then, we compute $\mathbf{o}_l$ for all $l \in \{2, \ldots, L-1\}$ by

$$\mathbf{o}_l = f(\mathbf{z}_l). \qquad (2)$$
An input instance of $N$ is denoted by $\mathbf{x} \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is a $d$-dimensional input space. $\mathbf{x}$ is also called an instance for short.

Denote by $x_i$ the $i$-th dimension of $\mathbf{x}$. The input layer $N_1$ contains $n_1 = d$ neurons, where $o_{1,i} = x_i$ for all $i \in \{1, \ldots, d\}$. That is, $\mathbf{o}_1 = \mathbf{x}$.
The output of $N$ is $F(\mathbf{x}) \in \mathcal{Y}$, where $\mathcal{Y}$ is an $n_L$-dimensional output space. The output layer $N_L$ adopts the softmax function to compute the output by $F(\mathbf{x}) = \mathrm{softmax}(\mathbf{z}_L)$.
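To make the notation concrete, the forward propagation of Equations 1 and 2 can be sketched in a few lines of NumPy. The toy layer sizes, random weights, and function names below are our own illustrative assumptions, not the networks used in the experiments; ReLU stands in for the piecewise linear activation $f$.

```python
import numpy as np

def relu(z):
    """ReLU, a piecewise linear activation with two pieces."""
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, weights, biases):
    """Compute F(x): z_l = W_l o_{l-1} + b_l (Eq. 1), o_l = f(z_l) for hidden
    layers (Eq. 2), and softmax on the output layer's z_L."""
    o = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ o + b
        o = relu(z) if l < len(weights) - 1 else softmax(z)
    return o

rng = np.random.default_rng(0)
# A toy 2-4-4-2 PLNN: input layer N_1, two hidden layers, output layer N_L.
sizes = [2, 4, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

y = forward(np.array([0.5, -1.0]), weights, biases)
# The softmax output is a probability vector over the n_L classes.
assert np.isclose(y.sum(), 1.0)
```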
Table 1. Frequently used notations.

Notation  Description
$u_{l,i}$  The $i$-th neuron in layer $N_l$.
$n_l$  The number of neurons in layer $N_l$.
$n$  The total number of hidden neurons in $N$.
$z_{l,i}$  The input of the $i$-th neuron in layer $N_l$.
$s_{l,i}$  The state of the $i$-th neuron in layer $N_l$.
$C_h$  The $h$-th configuration of the PLNN $N$.
$P_h$  The $h$-th convex polytope, determined by $C_h$.
$F_h$  The $h$-th linear classifier, determined by $C_h$.
$\Phi_h$  The set of linear inequalities that define $P_h$.
A PLNN $N$ works as a classification function $F: \mathcal{X} \to \mathcal{Y}$ that maps an input $\mathbf{x} \in \mathcal{X}$ to an output $F(\mathbf{x}) \in \mathcal{Y}$. It is widely known that $F$ is a piecewise linear function (Pascanu et al., 2013; Montufar et al., 2014). However, due to the complex network structure of a PLNN, the overall behaviour of $F$ is hard to understand. Thus, a PLNN is usually regarded as a black box.
How to interpret the overall behavior of a PLNN in a humanunderstandable manner is an interesting problem that has attracted much attention in recent years.
Following a principled approach of interpreting a machine learning model (Bishop, 2007), we regard an interpretation of a PLNN as the decision features that define the decision boundary of . We call a model interpretable if it explicitly provides its interpretation (i.e., decision features) in closed form.
Definition 3.1.
Given a fixed PLNN $N$ with constant structure and parameters, our task is to interpret the overall behaviour of $N$ by computing an interpretable model that satisfies the following requirements.

Exactness: the interpretable model is mathematically equivalent to $N$, such that the interpretations it provides truthfully describe the exact behaviour of $N$.

Consistency: the interpretable model provides similar interpretations for the classifications of similar instances.
Table 1 summarizes a list of frequently used notations.
4. The OpenBox Method
In this section, we describe the OpenBox method, which produces an exact and consistent interpretation of a PLNN by computing an interpretation model in a piecewise linear closed form.
We first define the configuration of a PLNN $N$, which specifies the activation status of each hidden neuron in $N$. Then, we illustrate how to interpret the classification result of a fixed instance. Last, we illustrate how to interpret the overall behavior of $N$ by computing an interpretation model that is mathematically equivalent to $N$.
4.1. The Configuration of a PLNN
For a hidden neuron $u_{l,i}$, the piecewise linear activation function $f$ has the following form:

$$f(z_{l,i}) = \begin{cases} r_1 z_{l,i} + t_1, & \text{if } z_{l,i} \in \chi_1, \\ \quad\vdots \\ r_\omega z_{l,i} + t_\omega, & \text{if } z_{l,i} \in \chi_\omega, \end{cases} \qquad (3)$$

where $\omega$ is a constant integer, $f$ consists of $\omega$ linear functions, $r_1, \ldots, r_\omega$ are constant slopes, $t_1, \ldots, t_\omega$ are constant intercepts, and $\{\chi_1, \ldots, \chi_\omega\}$ is a collection of constant real intervals that partition $\mathbb{R}$.
Given a fixed PLNN $N$, an instance $\mathbf{x}$ determines the value of $z_{l,i}$, and further determines which linear function in $f$ applies. According to which linear function in $f$ is applied, we encode the activation status of each hidden neuron by $\omega$ states, each of which uniquely corresponds to one of the $\omega$ linear functions of $f$. Denoting by $s_{l,i}$ the state of $u_{l,i}$, we have $s_{l,i} = k$ if and only if $z_{l,i} \in \chi_k$ ($k \in \{1, \ldots, \omega\}$). Since the inputs $z_{l,i}$ differ from neuron to neuron, the states of different hidden neurons may differ from each other.

Denote by a vector $\mathbf{s}_l = [s_{l,1}, \ldots, s_{l,n_l}]^\top$ the states of all hidden neurons in $N_l$. The configuration of $N$ is an $n$-dimensional vector $C = [\mathbf{s}_2^\top, \ldots, \mathbf{s}_{L-1}^\top]^\top$, which specifies the states of all hidden neurons in $N$.

The configuration of a fixed PLNN $N$ is uniquely determined by the instance $\mathbf{x}$. We write the function that maps an instance $\mathbf{x}$ to a configuration as $C = \mathrm{conf}(\mathbf{x})$.
For a neuron $u_{l,i}$, denote by variables $a_{l,i}$ and $e_{l,i}$ the slope and intercept, respectively, of the linear function that corresponds to the state $s_{l,i}$. $a_{l,i}$ and $e_{l,i}$ are uniquely determined by $s_{l,i}$, such that $a_{l,i} = r_k$ and $e_{l,i} = t_k$ if and only if $s_{l,i} = k$ ($k \in \{1, \ldots, \omega\}$).
For all hidden neurons in $N_l$, we write the variables of slopes and intercepts as $\mathbf{a}_l = [a_{l,1}, \ldots, a_{l,n_l}]^\top$ and $\mathbf{e}_l = [e_{l,1}, \ldots, e_{l,n_l}]^\top$, respectively. Then, we rewrite the activation function for all neurons in a hidden layer $N_l$ as

$$\mathbf{o}_l = f(\mathbf{z}_l) = \mathbf{a}_l \odot \mathbf{z}_l + \mathbf{e}_l, \qquad (4)$$

where $\mathbf{a}_l \odot \mathbf{z}_l$ is the Hadamard product between $\mathbf{a}_l$ and $\mathbf{z}_l$.
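For ReLU, Equation 3 has $\omega = 2$ pieces, $f(z) = 0$ on $(-\infty, 0)$ and $f(z) = z$ on $[0, \infty)$, so the state encoding and the slope-intercept rewriting of Equation 4 can be sketched as follows. The variable names and the small example are our own assumptions.

```python
import numpy as np

def relu_states(z):
    """State vector: 1 if z < 0 (piece f(z) = 0), 2 if z >= 0 (piece f(z) = z)."""
    return np.where(z < 0, 1, 2)

def slopes_intercepts(states):
    """Slope a and intercept e of the active linear piece, per neuron."""
    a = np.where(states == 2, 1.0, 0.0)
    e = np.zeros_like(a)  # both ReLU pieces have intercept 0
    return a, e

z = np.array([-1.5, 0.3, 2.0, -0.2])
s = relu_states(z)
a, e = slopes_intercepts(s)
# Equation 4: o_l = a_l ⊙ z_l + e_l reproduces the ReLU output exactly.
assert np.allclose(a * z + e, np.maximum(z, 0.0))
```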
Next, we interpret the classification result of a fixed instance.
4.2. Exact Interpretation for the Classification Result of a Fixed Instance
Given a fixed PLNN $N$, we interpret the classification result of a fixed instance $\mathbf{x}$ by deriving the closed form of $F(\mathbf{x})$ as follows.
By plugging Equation 4 into Equation 1, we rewrite $\mathbf{z}_l$ as

$$\mathbf{z}_l = W_l(\mathbf{a}_{l-1} \odot \mathbf{z}_{l-1} + \mathbf{e}_{l-1}) + \mathbf{b}_l = \hat{W}_l \mathbf{z}_{l-1} + W_l \mathbf{e}_{l-1} + \mathbf{b}_l, \qquad (5)$$

where $\hat{W}_l = W_l \circ \mathbf{a}_{l-1}$, and $\circ$ is an extended version of the Hadamard product, such that the entry at the $i$-th row and $j$-th column of $W_l \circ \mathbf{a}_{l-1}$ is $W_l[i,j]\, a_{l-1,j}$.
By iteratively plugging Equation 5 into itself, we can write $\mathbf{z}_l$ for all $l \in \{3, \ldots, L\}$ as

$$\mathbf{z}_l = \hat{W}_l \cdots \hat{W}_3\, \mathbf{z}_2 + \sum_{h=3}^{l} \hat{W}_l \cdots \hat{W}_{h+1} \left( W_h \mathbf{e}_{h-1} + \mathbf{b}_h \right).$$

By plugging $\mathbf{z}_2 = W_2 \mathbf{o}_1 + \mathbf{b}_2$ and $\mathbf{o}_1 = \mathbf{x}$ into the above equation, we rewrite $\mathbf{z}_l$, for all $l \in \{2, \ldots, L\}$, as

$$\mathbf{z}_l = \hat{W}_{l/1}\, \mathbf{x} + \hat{\mathbf{b}}_{l/1}, \qquad (6)$$

where $\hat{W}_{l/1} = \hat{W}_l \cdots \hat{W}_3 W_2$ is the coefficient matrix of $\mathbf{x}$, and $\hat{\mathbf{b}}_{l/1}$ is the sum of the remaining constant terms. The subscript $l/1$ indicates that Equation 6 is equivalent to the PLNN's forward propagation from layer $N_1$ to layer $N_l$.
Since the output of $N$ on an input $\mathbf{x}$ is $F(\mathbf{x}) = \mathrm{softmax}(\mathbf{z}_L)$, the closed form of $F(\mathbf{x})$ is

$$F(\mathbf{x}) = \mathrm{softmax}\left( \hat{W}_{L/1}\, \mathbf{x} + \hat{\mathbf{b}}_{L/1} \right). \qquad (7)$$

For a fixed PLNN $N$ and a fixed instance $\mathbf{x}$, $\hat{W}_{L/1}$ and $\hat{\mathbf{b}}_{L/1}$ are constant parameters uniquely determined by the fixed configuration $\mathrm{conf}(\mathbf{x})$. Therefore, for a fixed input instance $\mathbf{x}$, $F(\mathbf{x})$ is a linear classifier whose decision boundary is explicitly defined by $\hat{W}_{L/1}$ and $\hat{\mathbf{b}}_{L/1}$.
Inspired by the interpretation method widely used for conventional linear classifiers, such as Logistic Regression and linear SVM (Bishop, 2007), we interpret the prediction on a fixed instance $\mathbf{x}$ by the decision features of $F(\mathbf{x})$. Specifically, the entries of the $i$-th row of the coefficient matrix $\hat{W}_{L/1}$ in Equation 7 are the decision features for the $i$-th class of instances.
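The derivation of Equations 5–7 can be checked numerically: folding the active slopes into the weights layer by layer yields a linear function of $\mathbf{x}$ that agrees exactly with ordinary forward propagation. The toy ReLU network below is our own assumption, used only to illustrate the algebra.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [3, 5, 4, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

def closed_form(x, weights, biases):
    """Return (W_hat, b_hat) with z_L = W_hat @ x + b_hat for x's configuration."""
    W_hat, b_hat = weights[0], biases[0]
    z = W_hat @ x + b_hat
    for W, b in zip(weights[1:], biases[1:]):
        a = (z >= 0).astype(float)      # active ReLU slopes for this layer
        # z_next = W (a ⊙ z) + b = (W ∘ a) z + b  -- Equation 5
        W_hat = (W * a) @ W_hat
        b_hat = (W * a) @ b_hat + b
        z = W_hat @ x + b_hat
    return W_hat, b_hat

x = rng.standard_normal(3)
W_hat, b_hat = closed_form(x, weights, biases)

# Verify exactness against the ordinary forward propagation.
o = x
for W, b in zip(weights[:-1], biases[:-1]):
    o = np.maximum(W @ o + b, 0.0)
z_L = weights[-1] @ o + biases[-1]
assert np.allclose(W_hat @ x + b_hat, z_L)
# Row i of W_hat holds the decision features for class i at instance x.
```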
Equation 7 provides a straightforward way to interpret the classification result of a fixed instance. However, individually interpreting the classification result of every single instance falls far short of understanding the overall behavior of a PLNN $N$. Next, we describe how to interpret the overall behavior of $N$ by computing an interpretation model that is mathematically equivalent to $N$.
4.3. Exact Interpretation of a PLNN
A fixed PLNN $N$ with $n$ hidden neurons has at most $\omega^n$ configurations. We represent the $h$-th configuration by $C_h \in \mathcal{C}$, where $\mathcal{C}$ is the set of all configurations of $N$.

Recall that each instance $\mathbf{x} \in \mathcal{X}$ uniquely determines a configuration $\mathrm{conf}(\mathbf{x})$. Since the volume of $\mathcal{C}$, denoted by $|\mathcal{C}|$, is at most $\omega^n$, but the number of instances in $\mathcal{X}$ can be arbitrarily large, it is clear that at least one configuration in $\mathcal{C}$ should be shared by more than one instance in $\mathcal{X}$.

Denote by $P_h = \{\mathbf{x} \in \mathcal{X} \mid \mathrm{conf}(\mathbf{x}) = C_h\}$ the set of instances that have the same configuration $C_h$. We prove in Theorem 4.1 that for any configuration $C_h \in \mathcal{C}$, $P_h$ is a convex polytope in $\mathcal{X}$.
Theorem 4.1.
Given a fixed PLNN $N$ with $n$ hidden neurons, for all $C_h \in \mathcal{C}$, $P_h$ is a convex polytope in $\mathcal{X}$.
Proof.
We prove by showing that $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a finite set of linear inequalities with respect to $\mathbf{x}$.
When $l = 2$, we have $\mathbf{z}_2 = W_2 \mathbf{x} + \mathbf{b}_2$. For $l \in \{3, \ldots, L-1\}$, it follows from Equation 6 that $\mathbf{z}_l = \hat{W}_{l/1}\, \mathbf{x} + \hat{\mathbf{b}}_{l/1}$, which is a linear function of $\mathbf{x}$, because $\hat{W}_{l/1}$ and $\hat{\mathbf{b}}_{l/1}$ are constant parameters when $C_h$ is fixed. In summary, given a fixed $C_h$, $z_{l,i}$ is a linear function of $\mathbf{x}$ for every hidden neuron $u_{l,i}$.
We show that $P_h$ is a convex polytope by showing that $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of linear inequalities with respect to $\mathbf{x}$. Recall that $s_{l,i} = k$ if and only if $z_{l,i} \in \chi_k$ ($k \in \{1, \ldots, \omega\}$). Denote by $\psi$ the bijective function that maps a state $s_{l,i}$ to a real interval in $\{\chi_1, \ldots, \chi_\omega\}$, such that $\psi(s_{l,i}) = \chi_k$ if and only if $s_{l,i} = k$ ($k \in \{1, \ldots, \omega\}$). Then, $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of constraints, denoted by $\Phi_h = \{ z_{l,i} \in \psi(s_{l,i}) \mid u_{l,i} \text{ is a hidden neuron} \}$. Since $z_{l,i}$ is a linear function of $\mathbf{x}$ and $\psi(s_{l,i})$ is a real interval, each constraint in $\Phi_h$ is equivalent to two linear inequalities with respect to $\mathbf{x}$. Therefore, $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of linear inequalities, which means $P_h$ is a convex polytope. ∎
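The proof of Theorem 4.1 is constructive, and the inequalities it describes can be materialized in code: for a ReLU network, each hidden neuron contributes one half-space to the polytope of the instance's configuration. The helper below is a sketch on a toy network of our own; it omits the distinction between strict and non-strict inequalities.

```python
import numpy as np

def polytope_inequalities(x, weights, biases):
    """Return (A, c) such that the convex polytope of x's configuration is
    {x' : A @ x' + c >= 0} (one half-space per hidden neuron)."""
    A_rows, c_rows = [], []
    W_hat, b_hat = weights[0], biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        z = W_hat @ x + b_hat              # z_l as a linear function of x
        sign = np.where(z >= 0, 1.0, -1.0) # orient each half-space
        A_rows.append(sign[:, None] * W_hat)  # sign * (W_hat x + b_hat) >= 0
        c_rows.append(sign * b_hat)
        a = (z >= 0).astype(float)         # fold active slopes into next layer
        W_hat = (W * a) @ W_hat
        b_hat = (W * a) @ b_hat + b
    return np.vstack(A_rows), np.concatenate(c_rows)

rng = np.random.default_rng(2)
sizes = [2, 4, 3, 2]  # 4 + 3 = 7 hidden neurons -> 7 half-spaces
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(2)
A, c = polytope_inequalities(x, weights, biases)
# x itself must satisfy all inequalities of its own polytope.
assert np.all(A @ x + c >= 0)
```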
According to Theorem 4.1, all instances sharing the same configuration $C_h$ form a unique convex polytope $P_h$ that is explicitly defined by the linear inequalities in $\Phi_h$. Since $C_h$ also determines the linear classifier for a fixed instance in Equation 7, all instances in the same convex polytope $P_h$ share the same linear classifier determined by $C_h$.
Denote by $F_h$ the linear classifier that is shared by all instances in $P_h$. We can interpret $N$ as a set of local linear classifiers (LLCs), each LLC being a linear classifier $F_h$ that applies to all instances in a convex polytope $P_h$. Denoting by a tuple $(P_h, F_h)$ the $h$-th LLC, a fixed PLNN $N$ is equivalent to a set of LLCs, denoted by $\mathcal{F} = \{ (P_h, F_h) \mid C_h \in \mathcal{C} \}$. We use $\mathcal{F}$ as our final interpretation model for $N$.
For a fixed PLNN $N$, if the states of the hidden neurons were independent, the PLNN would have $\omega^n$ configurations, which means $\mathcal{F}$ would contain $\omega^n$ LLCs. However, due to the hierarchical structure of a PLNN, the states of a hidden neuron in $N_l$ strongly correlate with the states of the neurons in the former layers $N_2, \ldots, N_{l-1}$. Therefore, the volume of $\mathcal{C}$ is much less than $\omega^n$, and the number of local linear classifiers in $\mathcal{F}$ is much less than $\omega^n$. We discuss this phenomenon later in Table 3 and Section 5.4.
In practice, we do not need to compute the entire set of LLCs in $\mathcal{F}$ all at once. Instead, we can first compute an active subset of $\mathcal{F}$, that is, the set of LLCs that are actually used to classify the available set of instances. Then, we can update $\mathcal{F}$ whenever a new LLC is used to classify a newly arriving instance.
Algorithm 1 summarizes the OpenBox method, which computes $\mathcal{F}$ as the active set of LLCs that are actually used to classify the set of training instances, denoted by $\mathcal{D}$.
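The idea of computing the active set of LLCs with one forward pass per training instance can be sketched as follows; the function names, the dictionary keyed by configuration, and the toy ReLU network are our own assumptions, not a reproduction of Algorithm 1.

```python
import numpy as np

def configuration(x, weights, biases):
    """Concatenated ReLU states of all hidden neurons: the configuration of x."""
    states, o = [], x
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ o + b
        states.append(z >= 0)
        o = np.maximum(z, 0.0)
    return tuple(np.concatenate(states).tolist())

def closed_form(x, weights, biases):
    """The LLC applied to x: z_L = W_hat @ x + b_hat (cf. Equations 5-7)."""
    W_hat, b_hat = weights[0], biases[0]
    for W, b in zip(weights[1:], biases[1:]):
        a = (W_hat @ x + b_hat >= 0).astype(float)
        W_hat, b_hat = (W * a) @ W_hat, (W * a) @ b_hat + b
    return W_hat, b_hat

def openbox_active_set(X, weights, biases):
    """Map each distinct configuration seen in X to the LLC it determines."""
    llcs = {}
    for x in X:
        C = configuration(x, weights, biases)
        if C not in llcs:
            llcs[C] = closed_form(x, weights, biases)
    return llcs

rng = np.random.default_rng(3)
sizes = [2, 4, 3, 2]  # toy PLNN with 4 + 3 = 7 hidden neurons
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

X = rng.standard_normal((200, 2))
llcs = openbox_active_set(X, weights, biases)
# The active set is typically far smaller than the 2^7 possible configurations.
assert 1 <= len(llcs) <= 2 ** 7
```

Updating the set for a newly arriving instance is the same dictionary lookup: if its configuration is unseen, its LLC is computed and added.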
Now, we are ready to introduce how to interpret the classification result of an instance $\mathbf{x} \in P_h$. First, we interpret the classification result of $\mathbf{x}$ using the decision features of $F_h$ (Section 4.2). Second, we interpret why $\mathbf{x}$ is contained in $P_h$ using the polytope boundary features (PBFs), which are the decision features of the polytope boundaries. More specifically, a polytope boundary of $P_h$ is defined by a linear inequality in $\Phi_h$. By Equation 6, $z_{l,i}$ is a linear function with respect to $\mathbf{x}$. The PBFs are the coefficients of $\mathbf{x}$ in $z_{l,i}$.
We also discover that some linear inequalities in $\Phi_h$ are redundant, in the sense that their hyperplanes do not intersect with $P_h$. To simplify our interpretation of the polytope boundaries, we remove such redundant inequalities by Caron's method (Caron et al., 1989) and focus on studying the PBFs of the non-redundant ones.

The advantages of OpenBox are threefold. First, our interpretation is exact, because the set of LLCs in $\mathcal{F}$ is mathematically equivalent to the classification function $F$ of $N$. Second, our interpretation is group-wise consistent: all instances in the same convex polytope are classified by exactly the same LLC, so the interpretations are consistent within each convex polytope. Last, our interpretation is easy to compute, since OpenBox computes $(P_h, F_h)$ by a one-time forward propagation through $N$ for each training instance.
5. Experiments
In this section, we evaluate the performance of OpenBox and compare it with the state-of-the-art method LIME (Ribeiro et al., 2016). In particular, we address the following questions: (1) What do the LLCs look like? (2) Are the interpretations produced by LIME and OpenBox exact and consistent? (3) Are the decision features of LLCs easy to understand, and can we improve the interpretability of these features by the nonnegative and sparse constraints? (4) How do we interpret the PBFs of LLCs? (5) How effective are the interpretations of OpenBox in hacking and debugging a PLNN model?
Table 2 shows the details of the six models we used. For both PLNN and PLNN-NS, we use the same network structure described in Table 3, and adopt the widely used activation function ReLU (Glorot et al., 2011). We apply the nonnegative and sparse constraints proposed by Chorowski et al. (Chorowski and Zurada, 2015) to train PLNN-NS. Since our goal is to comprehensively study the interpretation effectiveness of OpenBox rather than to achieve state-of-the-art classification performance, we use relatively simple network structures for PLNN and PLNN-NS, which are still powerful enough to achieve significantly better classification performance than Logistic Regression (LR). The decision features of LR, LR-F, LR-NS and LR-NS-F are used as baselines to compare with the decision features of LLCs.
The Python code of LIME is published by its authors (https://github.com/marcotcr/lime). The other methods and models are implemented in Matlab. PLNN and PLNN-NS are trained using the DeepLearnToolBox (Palm, 2012). All experiments are conducted on a PC with a Core i7-3370 CPU (3.40 GHz), 16GB main memory, and a 5,400 rpm hard drive running Windows 7.
We use the following data sets. Detailed information of the data sets is shown in Table 4.
Synthetic (SYN) Data Set. As shown in Figure 1(a), this data set contains 20,000 instances uniformly sampled from a quadrangle in the 2-dimensional Euclidean space. The red and blue points are positive and negative instances, respectively. We use all instances in SYN as training data to visualize the LLCs of a PLNN.
Models  PLNN  PLNN-NS  LR  LR-F  LR-NS  LR-NS-F
NS  
Flip  
Data Sets  # Neurons  PLNN  PLNN-NS
SYN  
FMNIST1  
FMNIST2  
Data Sets  Training Data  Testing Data  
# Positive  # Negative  # Positive  # Negative  
SYN  6,961  13,039  N/A  N/A 
FMNIST1  4,000  4,000  3,000  3,000 
FMNIST2  4,000  4,000  3,000  3,000 
FMNIST1 and FMNIST2 Data Sets. Each of these data sets contains two classes of images in the Fashion MNIST data set (Xiao et al., 2017). FMNIST1 consists of the images of Ankle Boot and Bag. FMNIST2 consists of the images of Coat and Pullover. All images in FMNIST1 and FMNIST2 are 28-by-28 grayscale images. We represent an image by cascading the 784 pixel values into a 784-dimensional feature vector. The Fashion MNIST data set is available online (https://github.com/zalandoresearch/fashionmnist).
5.1. What Do the LLCs Look Like?
We demonstrate our claim in Theorem 4.1 by visualizing the LLCs of the PLNN trained on SYN.
Figures 1(a) and 1(b) show the training instances of SYN and the prediction results of PLNN, respectively. Since all instances are used for training, the prediction accuracy is 99.9%.
In Figure 1(c), we plot all instances with the same configuration in the same colour. Clearly, all instances with the same configuration are contained in the same convex polytope. This demonstrates our claim in Theorem 4.1.
Figure 1(d) shows the LLCs whose convex polytopes cover the decision boundary of PLNN and contain both positive and negative instances. The solid lines show the decision boundaries of the LLCs, which capture the difference between positive and negative instances, and together form the overall decision boundary of PLNN. A convex polytope that does not cover the decision boundary of PLNN contains a single class of instances. The LLCs of these convex polytopes capture the common features of the corresponding class of instances. As analyzed in the following subsections, the set of LLCs produces exactly the same predictions as PLNN, and also captures meaningful decision features that are easy to understand.
5.2. Are the Interpretations Exact and Consistent?
Exact and consistent interpretations are naturally favored by human minds. In this subsection, we systematically study the exactness and consistency of the interpretations of LIME and OpenBox on FMNIST1 and FMNIST2. Since LIME is too slow to process all instances within 24 hours, for each of FMNIST1 and FMNIST2, we uniformly sample 600 instances from the testing set and conduct the following experiments on the sampled instances.
We first analyze the exactness of interpretation by comparing the predictions computed by the local interpretable model of LIME, the LLCs of OpenBox, and PLNN, respectively. The prediction for an instance is the probability of classifying it as a positive instance.
In Figure 2, since LIME does not guarantee zero approximation error on the local predictions of PLNN, the predictions of LIME are not exactly the same as PLNN on FMNIST1, and are dramatically different from PLNN on FMNIST2. The difference in predictions is more significant on FMNIST2, because the images in FMNIST2 are more difficult to distinguish, which makes the decision boundary of PLNN more complicated and harder to approximate. We can also see that some predictions of LIME exceed the valid probability range of $[0, 1]$. This is because the output of the interpretable model of LIME is not a probability at all. As a result, it is arguable that the interpretations computed by LIME may not truthfully describe the exact behavior of PLNN. In contrast, since the set of LLCs computed by OpenBox is mathematically equivalent to the classification function of PLNN, the predictions of OpenBox are exactly the same as those of PLNN on all instances. Therefore, the decision features of the LLCs exactly describe the overall behavior of PLNN.
Next, we study the interpretation consistency of LIME and OpenBox by analyzing the similarity between the interpretations of similar instances.
In general, a consistent interpretation method should provide similar interpretations for similar instances. For an instance $\mathbf{x}$, denote by $\mathbf{x}'$ the nearest neighbor of $\mathbf{x}$ by Euclidean distance, and denote by $\mathbf{w}$ and $\mathbf{w}'$ the decision features for the classifications of $\mathbf{x}$ and $\mathbf{x}'$, respectively. We measure the consistency of interpretation by the cosine similarity between $\mathbf{w}$ and $\mathbf{w}'$, where a larger cosine similarity indicates a better interpretation consistency.

As shown in Figure 3, the cosine similarity of OpenBox is equal to 1 on about 50% of the instances, because OpenBox consistently gives the same interpretation for all instances in the same convex polytope. Since the nearest neighbours $\mathbf{x}$ and $\mathbf{x}'$ may not belong to the same convex polytope, the cosine similarity of OpenBox is not always equal to 1 on all instances. In contrast, since LIME computes an individual interpretation based on the unique local perturbations of every single instance, the cosine similarity of LIME is significantly lower than that of OpenBox on all instances. This demonstrates the superior interpretation consistency of OpenBox.
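The nearest-neighbour consistency measure used in this subsection can be sketched as follows; the random feature vectors stand in for per-instance decision features and are our own toy data, not the experimental results.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def interpretation_consistency(X, features):
    """Cosine similarity between each instance's decision features and those
    of its nearest neighbour (Euclidean distance, excluding itself)."""
    sims = []
    for i, x in enumerate(X):
        d = np.linalg.norm(X - x, axis=1)
        d[i] = np.inf                      # exclude the instance itself
        j = int(np.argmin(d))
        sims.append(cosine_similarity(features[i], features[j]))
    return np.array(sims)

rng = np.random.default_rng(4)
X = rng.standard_normal((50, 10))          # toy instances
feats = rng.standard_normal((50, 10))      # toy per-instance decision features
sims = interpretation_consistency(X, feats)
assert sims.shape == (50,)
```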
In summary, the interpretations of OpenBox are exact, and are much more consistent than the interpretations of LIME.
5.3. Decision Features of LLCs and the Effect of Nonnegative and Sparse Constraints
Besides exactness and consistency, a good interpretation should also have a strong semantic meaning, so that the "thoughts" of an intelligent machine can be easily understood by a human brain. In this subsection, we first show the meaning of the decision features of LLCs, and then study the effect of the nonnegative and sparse constraints in improving the interpretability of the decision features. The decision features of PLNN and PLNN-NS are computed by OpenBox. The decision features of LR, LR-F, LR-NS and LR-NS-F are used as baselines. Table 5 shows the accuracy of all models.
Figure 4 shows the decision features of all models on FMNIST1. Interestingly, the decision features of PLNN are as easy to understand as those of LR and LR-F. All these features clearly highlight meaningful image parts, such as the ankle and heel of Ankle Boot, and the upper left corner of Bag. A closer look at the average images suggests that these decision features describe the difference between Ankle Boot and Bag.
The decision features of PLNN capture more detailed differences between Ankle Boot and Bag than the decision features of LR and LR-F. This is because each LLC of PLNN only captures the difference between the subset of instances within a convex polytope, whereas LR and LR-F capture the overall difference between all instances of Ankle Boot and Bag. The accuracies of PLNN, LR and LR-F are comparable because the instances of Ankle Boot and Bag are easy to distinguish. However, as shown in Figure 5, when the instances are hard to distinguish, PLNN captures much more detailed features than LR and LR-F, and achieves a significantly better accuracy.
Figure 5 shows the decision features of all models on FMNIST2. As shown, LR and LR-F capture decision features with a strong semantic meaning, such as the collar and breast of Coat, and the shoulder of Pullover. However, these features are too general to accurately distinguish between Coat and Pullover. Therefore, LR and LR-F do not achieve a high accuracy. Interestingly, the decision features of PLNN capture much more detail than LR and LR-F, which leads to the superior accuracy of PLNN.
The superior accuracy of PLNN comes at the cost of cluttered decision features that may be hard to understand. Fortunately, applying the nonnegative and sparse constraints to PLNN effectively improves the interpretability of the decision features without affecting the classification accuracy.
Table 5. Accuracy of all models on FMNIST1 and FMNIST2.

Data Set  FMNIST1  FMNIST2
Accuracy  Train  Test  Train  Test
LR  0.998  0.997  0.847  0.839
LR-F  0.998  0.997  0.847  0.839
PLNN  1.000  0.999  0.907  0.868
LR-NS  0.772  0.776  0.711  0.698
LR-NS-F  0.989  0.989  0.782  0.791
PLNN-NS  1.000  0.999  0.894  0.867
As shown in Figures 4 and 5, the decision features of PLNN-NS highlight similar image parts as LR-NS and LR-NS-F, and are much easier to understand than the decision features of PLNN. In particular, as shown in Figure 5, the decision features of PLNN-NS clearly highlight the collar and breast of Coat, and the shoulder of Pullover, which are much easier to understand than the cluttered features of PLNN. These results demonstrate the effectiveness of the nonnegative and sparse constraints in selecting meaningful features. Moreover, the decision features of PLNN-NS capture more details than LR-NS and LR-NS-F; thus PLNN-NS achieves an accuracy comparable with PLNN, and significantly outperforms LR-NS and LR-NS-F on FMNIST2.
In summary, the decision features of LLCs are easy to understand, and the nonnegative and sparse constraints are highly effective in improving the interpretability of the decision features of LLCs.
5.4. Are PBFs of LLCs Easy to Understand?
The polytope boundary features (PBFs) of the polytope boundaries (PBs) interpret why an instance is contained in the convex polytope of an LLC. In this subsection, we systematically study the semantic meaning of PBFs. Limited by space, we only use the PLNN-NS models trained on FMNIST1 and FMNIST2 as the target models to interpret. The LLCs of PLNN-NS are computed by OpenBox.
Recall that a PB is defined by a linear inequality w·x + b ≥ 0, where the PBFs are the coefficients in w. Since the activation function is ReLU, the state of the corresponding hidden neuron is either 1 or 0. Since the values of PBFs are nonnegative for PLNNNS, for a convex polytope P, if the state is 1, then the images in P strongly correlate with the PBFs of the PB; if the state is 0, then the images in P are not strongly correlated with the PBFs of the PB.
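This correspondence between hidden neurons, PBs, and PBFs can be sketched in a few lines for the one-hidden-layer case (names are illustrative): each hidden neuron contributes one boundary, its incoming weights are the PBFs, and the side of the boundary on which the instance falls is the neuron's state.

```python
import numpy as np

def polytope_boundaries(x, W1, b1):
    """List the polytope boundaries (PBs) of the convex polytope containing x
    for a one-hidden-layer ReLU network.

    Each hidden neuron j defines the PB  w_j . x + b_j = 0; the coefficients
    w_j are its PBFs, and the neuron's state (1 or 0) records on which side
    of the PB the instance x lies."""
    states = (W1 @ x + b1 > 0).astype(int)
    return [(W1[j], b1[j], int(states[j])) for j in range(len(b1))]
```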
The above analysis of PBs and PBFs is demonstrated by the results in Tables 6 and 7 and Figure 6. Take the first convex polytope in Table 6 as an example: its PBs have the PBFs shown in Figures 6(b) and 6(c), which show the features of Ankle Boot and Bag, respectively. Therefore, this convex polytope contains images of both Ankle Boot and Bag. A careful study of the other results suggests that the PBFs of the convex polytopes are easy to understand and accurately describe the images in each convex polytope.
We can also see that the PBFs in Figure 6 look similar to the decision features of PLNNNS in Figures 4 and 5. This shows the strong correlation between the features learned by different neurons of PLNNNS, which is probably caused by the hierarchical network structure. Due to this strong correlation between neurons, the number of configurations observed in practice is much smaller than the theoretical maximum, as shown in Table 3.
Surprisingly, as shown in Table 7, the top-1 convex polytope on FMNIST2 contains more than 98% of the training instances. On these instances, the training accuracy of the LLC is much higher than the training accuracies of LRNS and LRNSF. This means that the training instances in the top-1 convex polytope are much easier to separate linearly than the full set of training instances in FMNIST2. From this perspective, PLNNNS behaves like a "divide and conquer" strategy, which sets aside a small proportion of instances that hinder classification accuracy, so that the majority of instances can be better separated by an LLC. As shown by the top-2 and top-3 convex polytopes in Table 7, the set-aside instances are grouped in their own convex polytopes, where the corresponding LLCs also achieve very high accuracy. Table 6 shows a similar phenomenon on FMNIST1. However, since the instances in FMNIST1 are easy to separate linearly, the training accuracy of PLNNNS only marginally outperforms LRNS and LRNSF.
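This divide-and-conquer view can be checked mechanically: grouping instances by their activation configuration recovers per-polytope counts and accuracies of the kind reported in Tables 6 and 7. A sketch under the same one-hidden-layer assumption (names are illustrative):

```python
import numpy as np
from collections import defaultdict

def polytope_stats(X, y, W1, b1, predict):
    """Group instances by their activation configuration (i.e. by convex
    polytope) and report (size, accuracy) for each polytope, largest first."""
    groups = defaultdict(list)
    for i, x in enumerate(X):
        groups[tuple((W1 @ x + b1 > 0).astype(int))].append(i)
    stats = [(len(idx), float(np.mean([predict(X[i]) == y[i] for i in idx])))
             for idx in groups.values()]
    return sorted(stats, reverse=True)
```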
5.5. Can We Hack a Model Using OpenBox?
Knowing what an intelligent machine "thinks" gives us the privilege to "hack" it. Here, to hack a target model is to significantly change its prediction on an instance x by modifying as few features of x as possible. In general, the biggest change of prediction is achieved by modifying the most important decision features. A more precise interpretation of the target model reveals the important decision features more accurately, and thus requires modifying fewer features to achieve a bigger change of prediction. Following this idea, we apply LIME and OpenBox to hack PLNNNS, and compare the quality of their interpretations by comparing the change of PLNNNS's prediction when modifying the same number of decision features.
For an instance x, we hack PLNNNS by setting the values of a few top-weighted decision features of x to zero, such that the prediction of PLNNNS on x changes significantly. The change of prediction is evaluated by two measures. First, the change of prediction probability (CPP) is the absolute change of the probability of classifying x as a positive instance. Second, the number of label-changed instances (NLCI) is the number of instances whose predicted label changes after being hacked. Again, due to the inefficiency of LIME, we use the sampled data sets in Section 5.2 for evaluation.
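The hacking procedure for a single instance can be sketched as follows, assuming a logistic output over linear decision-feature weights w and bias b (names are illustrative; in the experiment these weights come from the interpretation method under test):

```python
import numpy as np

def hack_instance(x, w, b, k):
    """Zero the k top-weighted decision features of x and return the change
    of prediction probability (CPP) under a logistic output, together with
    whether the predicted label flipped (counted toward NLCI)."""
    prob = lambda v: 1.0 / (1.0 + np.exp(-(w @ v + b)))
    top = np.argsort(-np.abs(w))[:k]   # indices of the top-weighted features
    x_hacked = x.copy()
    x_hacked[top] = 0.0
    cpp = abs(prob(x) - prob(x_hacked))
    label_changed = (prob(x) > 0.5) != (prob(x_hacked) > 0.5)
    return cpp, label_changed
```

Averaging `cpp` and counting `label_changed` over a sample of instances yields the CPP and NLCI curves of the kind plotted in Figure 7.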
In Figure 7, the average CPP and NLCI of OpenBox are always higher than those of LIME on both data sets. This demonstrates that the interpretations computed by OpenBox are more effective than those of LIME when applied to hack the target model.
Interestingly, the advantage of OpenBox is more significant on FMNIST1 than on FMNIST2. This is because, as shown in Figure 2(a), the prediction probabilities of most instances in FMNIST1 are either 1.0 or 0.0, which provides little gradient information for LIME to accurately approximate the classification function of PLNNNS. In this case, the decision features computed by LIME cannot describe the exact behavior of the target model.
In summary, since OpenBox produces exact and consistent interpretations of a target model, it achieves superior hacking performance over LIME.
Table 6. Convex polytopes (CPs) of PLNNNS on FMNIST1.

CP    #Ankle Boot    #Bag     Accuracy
1     3,991          3,997    0.999
2     9              0        1.000
3     0              3        1.000
Table 7. Convex polytopes (CPs) of PLNNNS on FMNIST2.

CP    #Coat    #Pullover    Accuracy
1     3,932    3,942        0.894
2     32       10           0.905
3     18       0            0.944
5.6. Can We Debug a Model Using OpenBox?
Intelligent machines are not perfect, and predictions occasionally fail. When such a failure occurs, we can apply OpenBox to interpret why an instance is misclassified.
Figure 8 shows some images that are misclassified by PLNNNS with a high probability. In Figures 8(a)-(c), the original image is a Coat; however, since the scattered mosaic pattern on the cloth hits more features of Pullover than of Coat, the image is classified as a Pullover with a high probability. In Figures 8(d)-(f), the original image is a Pullover; however, it is misclassified as a Coat because the white collar and breast hit the typical features of Coat, and the dark shoulder and sleeves miss the most significant features of Pullover. Similarly, the Ankle Boot in Figure 8(g) highlights more features on the upper left corner, so it is misclassified as a Bag. The Bag in Figure 8(j) is misclassified as an Ankle Boot because it hits the ankle and heel features of Ankle Boot but misses the typical features of Bag on the upper left corner.
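Diagnoses of this kind can be reproduced mechanically: given the linear weights w of the LLC that applies to a misclassified image, the per-feature contributions w_i * x_i show exactly which decision features the image "hits". A minimal sketch (function name illustrative):

```python
import numpy as np

def top_hit_features(x, w, k=5):
    """Return the k features with the largest contribution w_i * x_i to the
    LLC's score on x, i.e. the decision features the image 'hits' hardest.
    Strongly negative contributions, analogously, mark 'missed' features."""
    contrib = w * x
    order = np.argsort(-contrib)[:k]
    return [(int(i), float(contrib[i])) for i in order]
```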
In conclusion, as demonstrated by the misclassified examples in Figure 8, OpenBox accurately interprets the misclassifications, which is potentially useful in debugging abnormal behaviors of the interpreted model.
6. Conclusions and Future Work
In this paper, we tackle the challenging problem of interpreting PLNNs. By studying the states of hidden neurons and the configuration of a PLNN, we prove that a PLNN is mathematically equivalent to a set of LLCs, which can be efficiently computed by the proposed OpenBox method. Extensive experiments show that the decision features and the polytope boundary features of LLCs provide exact and consistent interpretations of the overall behavior of a PLNN. Such interpretations are highly effective in hacking and debugging PLNN models. As future work, we will extend our work to interpret more general neural networks that adopt smooth activation functions, such as sigmoid and tanh.
References
 Agrawal et al. (2016) Aishwarya Agrawal, Dhruv Batra, and Devi Parikh. 2016. Analyzing the behavior of visual question answering models. arXiv:1606.07356 (2016).
 Ba and Caruana (2014) Jimmy Ba and Rich Caruana. 2014. Do deep nets really need to be deep?. In NIPS. 2654–2662.
 Bastani et al. (2017) Osbert Bastani, Carolyn Kim, and Hamsa Bastani. 2017. Interpreting Blackbox Models via Model Extraction. arXiv:1705.08504 (2017).
 Bishop (2007) C Bishop. 2007. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, New York (2007).
 Cao et al. (2015) C. Cao, X. Liu, Y Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, et al. 2015. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV. 2956–2964.
 Caron et al. (1989) RJ Caron, JF McDonald, and CM Ponic. 1989. A degenerate extreme point strategy for the classification of linear constraints as redundant or necessary. JOTA 62, 2 (1989), 225–237.
 Che et al. (2015) Z. Che, S. Purushotham, R. Khemani, and Y. Liu. 2015. Distilling knowledge from deep networks with applications to healthcare domain. arXiv:1512.03542 (2015).
 Chorowski and Zurada (2015) Jan Chorowski and Jacek M Zurada. 2015. Learning understandable neural networks with nonnegative weight constraints. TNNLS 26, 1 (2015), 62–69.
 Dosovitskiy and Brox (2016) Alexey Dosovitskiy and Thomas Brox. 2016. Inverting visual representations with convolutional networks. In CVPR. 4829–4837.
 Erhan et al. (2009) D. Erhan, Yoshua Bengio, A. Courville, and P. Vincent. 2009. Visualizing higherlayer features of a deep network. University of Montreal 1341 (2009), 3.
 Fong and Vedaldi (2017) Ruth Fong and Andrea Vedaldi. 2017. Interpretable Explanations of Black Boxes by Meaningful Perturbation. arXiv:1704.03296 (2017).
 Frosst and Hinton (2017) Nicholas Frosst and Geoffrey Hinton. 2017. Distilling a Neural Network Into a Soft Decision Tree. arXiv:1711.09784 (2017).
 Ghorbani et al. (2017) Amirata Ghorbani, Abubakar Abid, and James Zou. 2017. Interpretation of Neural Networks is Fragile. arXiv:1710.10547 (2017).
 Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In ICAIS. 315–323.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.
 Goodfellow et al. (2013) Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. 2013. Maxout networks. arXiv:1302.4389 (2013).
 Goodman and Flaxman (2016) B. Goodman and S. Flaxman. 2016. European Union regulations on algorithmic decision-making and a "right to explanation". arXiv:1606.08813 (2016).
 Harvey et al. (2017) Nick Harvey, Chris Liaw, and Abbas Mehrabian. 2017. Nearly-tight VC-dimension bounds for piecewise linear neural networks. arXiv:1703.02930 (2017).

 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV. 1026–1034.
 Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv:1503.02531 (2015).
 Hoyer (2002) Patrik O Hoyer. 2002. Nonnegative sparse coding. In WNNSP. 557–565.
 Kindermans et al. (2017) Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T Schütt, Sven Dähne, Dumitru Erhan, and Been Kim. 2017. The (Un) reliability of saliency methods. arXiv:1711.00867 (2017).
 Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv:1703.04730 (2017).
 Koiran and Sontag (1996) Pascal Koiran and Eduardo D Sontag. 1996. Neural networks with quadratic VC dimension. In NIPS. 197–203.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In NIPS. 1097–1105.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436.
 Lee et al. (2007) Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. 2007. Efficient sparse coding algorithms. In NIPS. 801–808.
 Li et al. (2015) Jiwei Li, Xinlei Chen, Eduard Hovy, and Dan Jurafsky. 2015. Visualizing and understanding neural models in NLP. arXiv:1506.01066 (2015).
 Mahendran and Vedaldi (2015) Aravindh Mahendran and Andrea Vedaldi. 2015. Understanding deep image representations by inverting them. In CVPR. 5188–5196.
 Montufar et al. (2014) Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. 2014. On the number of linear regions of deep neural networks. In NIPS. 2924–2932.
 Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML. 807–814.
 Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal deep learning. In ICML. 689–696.
 Palm (2012) R. B. Palm. 2012. Prediction as a candidate for learning deep hierarchical models of data. (2012).
 Pascanu et al. (2013) Razvan Pascanu, Guido Montufar, and Yoshua Bengio. 2013. On the number of response regions of deep feed forward networks with piecewise linear activations. arXiv:1312.6098 (2013).
 Rather et al. (2017) Nadeem N Rather, Chintan O Patel, and Sharib A Khan. 2017. Using Deep Learning Towards Biomedical Knowledge Discovery. IJMSC 3, 2 (2017), 1.
 Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. Why should i trust you?: Explaining the predictions of any classifier. In KDD. ACM, 1135–1144.
 Selvaraju et al. (2016) R. R Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. 2016. Grad-CAM: Why did you say that? visual explanations from deep networks via gradient-based localization. arXiv:1610.02391 (2016).
 Shrikumar et al. (2017) A. Shrikumar, P. Greenside, and A. Kundaje. 2017. Learning important features through propagating activation differences. arXiv:1704.02685 (2017).
 Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv:1312.6034 (2013).
 Smilkov et al. (2017) D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg. 2017. SmoothGrad: removing noise by adding noise. arXiv:1706.03825 (2017).
 Sontag (1998) Eduardo D Sontag. 1998. VC dimension of neural networks. NATO ASI Series F Computer and Systems Sciences 168 (1998), 69–96.
 Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic Attribution for Deep Networks. arXiv:1703.01365 (2017).
 Wu et al. (2018) M. Wu, M. C Hughes, S. Parbhoo, M. Zazzi, V. Roth, and F. DoshiVelez. 2018. Beyond Sparsity: Tree Regularization of Deep Models for Interpretability. AAAI (2018).
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747 (2017).
 Yosinski et al. (2015) J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. 2015. Understanding neural networks through deep visualization. arXiv:1506.06579 (2015).
 Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In ICML. 325–333.
 Zhou et al. (2017) Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. 2017. Interpreting Deep Visual Representations via Network Dissection. arXiv:1711.05611 (2017).

 Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In CVPR. 2921–2929.