Exact and Consistent Interpretation for Piecewise Linear Neural Networks: A Closed Form Solution

02/17/2018 ∙ by Lingyang Chu, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Simon Fraser University

Strong intelligent machines powered by deep neural networks are increasingly deployed as black boxes to make decisions in risk-sensitive domains, such as finance and medicine. To reduce potential risk and build trust with users, it is critical to interpret how such machines make their decisions. Existing works interpret a pre-trained neural network by analyzing hidden neurons, mimicking pre-trained models or approximating local predictions. However, these methods do not provide a guarantee on the exactness and consistency of their interpretation. In this paper, we propose an elegant closed form solution named OpenBox to compute exact and consistent interpretations for the family of Piecewise Linear Neural Networks (PLNN). The main idea is to first transform a PLNN into a mathematically equivalent set of linear classifiers, then interpret each linear classifier by the features that dominate its prediction. We further apply OpenBox to demonstrate the effectiveness of non-negative and sparse constraints on improving the interpretability of PLNNs. Extensive experiments on both synthetic and real-world data sets clearly demonstrate the exactness and consistency of our interpretation.


1. Introduction

More and more machine learning systems are routinely making significant decisions in important domains, such as medical practice, autonomous driving, criminal justice, and military decision making (Goodfellow et al., 2016). As the impact of machine-made decisions increases, the demand for clear interpretations of machine learning systems is growing ever stronger against the blind deployment of decision machines (Goodman and Flaxman, 2016). Accurately and reliably interpreting a machine learning model is the key to many significant tasks, such as identifying failure models (Agrawal et al., 2016), building trust with human users (Ribeiro et al., 2016), discovering new knowledge (Rather et al., 2017), and avoiding unfairness issues (Zemel et al., 2013).

The interpretation problem of machine learning models has been studied for decades. Conventional models, such as Logistic Regression and Support Vector Machine, have all been well interpreted from both practical and theoretical perspectives (Bishop, 2007). Powerful non-negative and sparse constraints have also been developed to enhance the interpretability of conventional models by sparse feature selection (Lee et al., 2007; Hoyer, 2002). However, due to the complex network structure of a deep neural network, interpreting modern deep models remains a challenging problem that awaits further exploration.

As reviewed in Section 2, existing studies interpret a deep neural network in three major ways. The hidden neuron analysis methods (Mahendran and Vedaldi, 2015; Yosinski et al., 2015; Ngiam et al., 2011; Dosovitskiy and Brox, 2016) analyze and visualize the features learned by the hidden neurons of a neural network; the model mimicking methods (Ba and Caruana, 2014; Che et al., 2015; Hinton et al., 2015; Bastani et al., 2017) build a transparent model to imitate the classification function of a deep neural network; the local explanation methods (Shrikumar et al., 2017; Fong and Vedaldi, 2017; Sundararajan et al., 2017; Smilkov et al., 2017) study the predictions on local perturbations of an input instance, so as to provide decision features for interpretation. All these methods provide useful insights into the mechanism of deep models. However, there is no guarantee that what they compute as an interpretation truthfully reflects the exact behavior of a deep neural network. As demonstrated by Ghorbani et al. (Ghorbani et al., 2017), most existing interpretation methods are inconsistent and fragile, because two perceptually indistinguishable instances with the same prediction result can be easily manipulated to have dramatically different interpretations.

Can we compute an exact and consistent interpretation for a pre-trained deep neural network? In this paper, we provide an affirmative answer, as well as an elegant closed form solution for the family of piecewise linear neural networks. Here, a piecewise linear neural network (PLNN) (Harvey et al., 2017) is a neural network that adopts a piecewise linear activation function, such as MaxOut (Goodfellow et al., 2013) and the family of ReLU (Glorot et al., 2011; Nair and Hinton, 2010; He et al., 2015). The wide applications (LeCun et al., 2015) and great practical successes (Krizhevsky et al., 2012) of PLNNs call for exact and consistent interpretations of the overall behaviour of this type of neural network. We make the following technical contributions.

First, we prove that a PLNN is mathematically equivalent to a set of local linear classifiers, each of which is a linear classifier that classifies a group of instances within a convex polytope in the input space. Second, we propose a method named OpenBox to provide an exact interpretation of a PLNN by computing its equivalent set of local linear classifiers in closed form. Third, we interpret the classification result of each instance by the decision features of its local linear classifier. Since all instances in the same convex polytope share the same local linear classifier, our interpretations are consistent per convex polytope. Fourth, we also apply OpenBox to study the effect of non-negative and sparse constraints on the interpretability of PLNNs. We find that a PLNN trained with these constraints selects meaningful features that dramatically improve the interpretability. Last, we conduct extensive experiments on both synthetic and real-world data sets to verify the effectiveness of our method.

The rest of this paper is organized as follows. We review the related works in Section 2. We formulate the problem in Section 3 and present OpenBox in Section 4. We report the experimental results in Section 5, and conclude the paper in Section 6.

2. Related Works

How to interpret the overall mechanism of deep neural networks is an emergent and challenging problem.

2.1. Hidden Neuron Analysis Methods

The hidden neuron analysis methods (Mahendran and Vedaldi, 2015; Yosinski et al., 2015; Ngiam et al., 2011; Dosovitskiy and Brox, 2016) interpret a pre-trained deep neural network by visualizing, revert-mapping or labeling the features that are learned by the hidden neurons.

Yosinski et al. (Yosinski et al., 2015) visualized the live activations of the hidden neurons of a ConvNet, and proposed a regularized optimization to produce a qualitatively better visualization. Erhan et al. (Erhan et al., 2009) proposed an activation maximization method and a unit sampling method to visualize the features learned by hidden neurons. Cao et al. (Cao et al., 2015) visualized a neural network’s attention on its target objects by a feedback loop that infers the activation status of the hidden neurons. Li et al. (Li et al., 2015) visualized the compositionality of clauses by analyzing the outputs of hidden neurons in a neural model for Natural Language Processing.

To understand the features learned by the hidden neurons, Mahendran et al. (Mahendran and Vedaldi, 2015) proposed a general framework that revert-maps the features learned from an image to reconstruct the image. Dosovitskiy et al. (Dosovitskiy and Brox, 2016) performed the same task as Mahendran et al. (Mahendran and Vedaldi, 2015) by training an up-convolutional neural network.

Zhou et al. (Zhou et al., 2017) interpreted a CNN by labeling each hidden neuron with the best-aligned human-understandable semantic concept. However, it is hard to obtain a gold-standard data set with accurate and complete labels of all human semantic concepts.

The hidden neuron analysis methods provide useful qualitative insights into the properties of each hidden neuron. However, qualitatively analyzing every neuron does not provide much actionable and quantitative interpretation about the overall mechanism of the entire neural network (Frosst and Hinton, 2017).

2.2. Model Mimicking Methods

By imitating the classification function of a neural network, the model mimicking methods (Ba and Caruana, 2014; Che et al., 2015; Hinton et al., 2015; Bastani et al., 2017) build a transparent model that is easy to interpret and achieves a high classification accuracy.

Ba et al. (Ba and Caruana, 2014) proposed a model compression method to train a shallow mimic network using the training instances labeled by one or more deep neural networks. Hinton et al. (Hinton et al., 2015) proposed a distillation method that distills the knowledge of a large neural network by training a relatively smaller network to mimic the prediction probabilities of the original large network. To improve the interpretability of distilled knowledge, Frosst and Hinton (Frosst and Hinton, 2017) extended the distillation method (Hinton et al., 2015) by training a soft decision tree to mimic the prediction probabilities of a deep neural network.

Che et al. (Che et al., 2015) proposed a mimic learning method to learn interpretable phenotype features. Wu et al. (Wu et al., 2018) proposed a tree regularization method that uses a binary decision tree to mimic and regularize the classification function of a deep time-series model.

The mimic models built by model mimicking methods are much simpler to interpret than deep neural networks. However, due to the reduced model complexity of a mimic model, there is no guarantee that a deep neural network with a large VC-dimension (Sontag, 1998; Koiran and Sontag, 1996; Harvey et al., 2017) can be successfully imitated by a simpler shallow model. Thus, there is always a gap between the interpretation of a mimic model and the actual overall mechanism of the target deep neural network.

2.3. Local Interpretation Methods

The local interpretation methods (Shrikumar et al., 2017; Fong and Vedaldi, 2017; Sundararajan et al., 2017; Smilkov et al., 2017) compute and visualize the important features for an input instance by analyzing the predictions of its local perturbations.

Simonyan et al. (Simonyan et al., 2013) generated a class-representative image and a class-saliency map for each class of images by computing the gradient of the class score with respect to an input image. Ribeiro et al. (Ribeiro et al., 2016) proposed LIME to interpret the predictions of any classifier by learning an interpretable model in the local region around the input instance.

Zhou et al. (Zhou et al., 2016) proposed CAM to identify discriminative image regions for each class of images using the global average pooling in CNNs. Selvaraju et al. (Selvaraju et al., 2016) generalized CAM (Zhou et al., 2016) by Grad-CAM, which identifies important regions of an image by flowing class-specific gradients into the final convolutional layer of a CNN.

Koh et al. (Koh and Liang, 2017) used influence functions to trace a model’s prediction and identify the training instances that are the most responsible for the prediction.

The local interpretation methods generate an insightful individual interpretation for each input instance. However, the interpretations for perceptually indistinguishable instances may not be consistent (Ghorbani et al., 2017), and can be purposefully manipulated by a simple transformation of the input instance without affecting the prediction result (Kindermans et al., 2017).

3. Problem Definition

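For a PLNN $\mathcal{N}$ that contains $L$ layers of neurons, we write the $l$-th layer of $\mathcal{N}$ as $\mathcal{L}_l$. Hence, $\mathcal{L}_1$ is the input layer, $\mathcal{L}_L$ is the output layer, and the other layers $\mathcal{L}_l$, $l \in \{2, \ldots, L-1\}$, are hidden layers. A neuron in a hidden layer is called a hidden neuron. Let $n_l$ represent the number of neurons in $\mathcal{L}_l$; the total number of hidden neurons in $\mathcal{N}$ is computed by $N = \sum_{l=2}^{L-1} n_l$.

Denote by $u_i^{(l)}$ the $i$-th neuron in $\mathcal{L}_l$, by $b_i^{(l-1)}$ its bias, by $a_i^{(l)}$ its output, and by $z_i^{(l)}$ the total weighted sum of its inputs. For all the $n_l$ neurons in $\mathcal{L}_l$, we write their biases as a vector $\mathbf{b}^{(l-1)}$, their outputs as a vector $\mathbf{a}^{(l)}$, and their inputs as a vector $\mathbf{z}^{(l)}$.

Neurons in successive layers are connected by weighted edges. Denote by $W_{ij}^{(l)}$ the weight of the edge between the $i$-th neuron in $\mathcal{L}_{l+1}$ and the $j$-th neuron in $\mathcal{L}_l$, that is, $W^{(l)}$ is an $n_{l+1}$-by-$n_l$ matrix. For $l \in \{1, \ldots, L-1\}$, we compute $\mathbf{z}^{(l+1)}$ by

$$\mathbf{z}^{(l+1)} = W^{(l)} \mathbf{a}^{(l)} + \mathbf{b}^{(l)} \tag{1}$$

Denote by $f: \mathbb{R} \rightarrow \mathbb{R}$ the piecewise linear activation function for each neuron in the hidden layers of $\mathcal{N}$. We have $a_i^{(l)} = f(z_i^{(l)})$ for all $l \in \{2, \ldots, L-1\}$. We extend $f$ to apply to vectors in an element-wise fashion, such that $f(\mathbf{z}^{(l)}) = [f(z_1^{(l)}), \ldots, f(z_{n_l}^{(l)})]^\top$. Then, we compute $\mathbf{a}^{(l)}$ for all $l \in \{2, \ldots, L-1\}$ by

$$\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)}) \tag{2}$$

An input instance of $\mathcal{N}$ is denoted by $\mathbf{x} \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ is a $d$-dimensional input space. $\mathbf{x}$ is also called an instance for short.

Denote by $x_i$ the $i$-th dimension of $\mathbf{x}$. The input layer $\mathcal{L}_1$ contains $n_1 = d$ neurons, where $a_i^{(1)} = x_i$ for all $i \in \{1, \ldots, d\}$. That is, $\mathbf{a}^{(1)} = \mathbf{x}$.

The output of $\mathcal{N}$ is $\mathbf{a}^{(L)} \in \mathcal{Y}$, where $\mathcal{Y} \subseteq \mathbb{R}^{n_L}$ is an $n_L$-dimensional output space. The output layer $\mathcal{L}_L$ adopts the softmax function to compute the output by $\mathbf{a}^{(L)} = \mathrm{softmax}(\mathbf{z}^{(L)})$.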
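The forward propagation described by Equations 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes ReLU as the piecewise linear activation, and the toy layer sizes, the random parameters and the names `forward`, `weights` and `biases` are hypothetical.

```python
import numpy as np

def relu(z):
    # Piecewise linear activation f, applied element-wise (assumed to be ReLU here).
    return np.maximum(z, 0.0)

def softmax(z):
    # Numerically stable softmax used by the output layer.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(weights, biases, x):
    """Return the network output and the weighted input sums of layers 2..L."""
    a = np.asarray(x, dtype=float)       # the input layer outputs the instance itself
    zs = []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b                    # Equation 1
        zs.append(z)
        a = relu(z)                      # Equation 2
    z_out = weights[-1] @ a + biases[-1]
    zs.append(z_out)
    return softmax(z_out), zs

# Hypothetical toy network: 2 inputs, hidden layers of 4 and 3 neurons, 2 classes.
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

output, zs = forward(weights, biases, [0.5, -1.0])
num_hidden = sum(sizes[1:-1])            # total number of hidden neurons
```

Here `weights[i]` has one row per neuron of the next layer and one column per neuron of the current layer, which matches the matrix shape stated in the text.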

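| Notation | Description |
| --- | --- |
| $u_i^{(l)}$ | The $i$-th neuron in layer $\mathcal{L}_l$. |
| $n_l$ | The number of neurons in layer $\mathcal{L}_l$. |
| $N$ | The total number of hidden neurons in $\mathcal{N}$. |
| $z_i^{(l)}$ | The input of the $i$-th neuron in layer $\mathcal{L}_l$. |
| $c_i^{(l)}$ | The configuration of the $i$-th neuron in layer $\mathcal{L}_l$. |
| $C_h$ | The $h$-th configuration of the PLNN $\mathcal{N}$. |
| $P_h$ | The $h$-th convex polytope determined by $C_h$. |
| $F_h(\cdot)$ | The $h$-th linear classifier that is determined by $C_h$. |
| $Q_h$ | The set of linear inequalities that define $P_h$. |

Table 1. Frequently used notations.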

A PLNN $\mathcal{N}$ works as a classification function $F: \mathcal{X} \rightarrow \mathcal{Y}$ that maps an input $\mathbf{x} \in \mathcal{X}$ to an output $\mathbf{a}^{(L)} \in \mathcal{Y}$. It is widely known that $F(\cdot)$ is a piecewise linear function (Pascanu et al., 2013; Montufar et al., 2014). However, due to the complex network of a PLNN, the overall behaviour of $F(\cdot)$ is hard to understand. Thus, a PLNN is usually regarded as a black box.

How to interpret the overall behavior of a PLNN in a human-understandable manner is an interesting problem that has attracted much attention in recent years.

Following a principled approach of interpreting a machine learning model (Bishop, 2007), we regard an interpretation of a PLNN $\mathcal{N}$ as the decision features that define the decision boundary of $F(\cdot)$. We call a model interpretable if it explicitly provides its interpretation (i.e., decision features) in closed form.

Definition 3.1.

Given a fixed PLNN $\mathcal{N}$ with constant structure and parameters, our task is to interpret the overall behaviour of $\mathcal{N}$ by computing an interpretable model $\mathcal{M}$ that satisfies the following requirements.

  • Exactness: $\mathcal{M}$ is mathematically equivalent to $\mathcal{N}$ such that the interpretations provided by $\mathcal{M}$ truthfully describe the exact behaviour of $\mathcal{N}$.

  • Consistency: $\mathcal{M}$ provides similar interpretations for classification of similar instances.

Table 1 summarizes a list of frequently used notations.

4. The OpenBox Method

In this section, we describe the OpenBox method, which produces an exact and consistent interpretation of a PLNN by computing an interpretation model $\mathcal{M}$ in a piecewise linear closed form.

We first define the configuration of a PLNN $\mathcal{N}$, which specifies the activation status of each hidden neuron in $\mathcal{N}$. Then, we illustrate how to interpret the classification result of a fixed instance. Last, we illustrate how to interpret the overall behavior of $\mathcal{N}$ by computing an interpretation model $\mathcal{M}$ that is mathematically equivalent to $\mathcal{N}$.

4.1. The Configuration of a PLNN

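For a hidden neuron $u_i^{(l)}$, the piecewise linear activation function $f(z_i^{(l)})$ is in the following form.

$$f(z_i^{(l)}) = \begin{cases} r_1 z_i^{(l)} + t_1, & \text{if } z_i^{(l)} \in I_1 \\ r_2 z_i^{(l)} + t_2, & \text{if } z_i^{(l)} \in I_2 \\ \quad \vdots \\ r_k z_i^{(l)} + t_k, & \text{if } z_i^{(l)} \in I_k \end{cases} \tag{3}$$

where $k \geq 1$ is a constant integer, $f(z_i^{(l)})$ consists of $k$ linear functions, $\{r_1, \ldots, r_k\}$ are constant slopes, $\{t_1, \ldots, t_k\}$ are constant intercepts, and $\{I_1, \ldots, I_k\}$ is a collection of constant real intervals that partition $\mathbb{R}$.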
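As a concrete illustration that is not part of the original text, ReLU fits the form of Equation 3 with two linear pieces. Assigning the point $z = 0$ to the first interval is an arbitrary choice made here, since both pieces agree at that point.

$$f(z) = \max(0, z) = \begin{cases} 0 \cdot z + 0, & \text{if } z \in I_1 = (-\infty, 0] \\ 1 \cdot z + 0, & \text{if } z \in I_2 = (0, +\infty) \end{cases}$$

In this case the number of pieces is 2, the slopes are 0 and 1, and both intercepts are 0.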

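Given a fixed PLNN $\mathcal{N}$, an instance $\mathbf{x}$ determines the value of $z_i^{(l)}$, and further determines a linear function in $f(z_i^{(l)})$ to apply. According to which linear function in $f(z_i^{(l)})$ is applied, we encode the activation status of each hidden neuron by $k$ states, each of which uniquely corresponds to one of the $k$ linear functions of $f(z_i^{(l)})$. Denote by $c_i^{(l)} \in \{1, \ldots, k\}$ the state of $u_i^{(l)}$; we have $z_i^{(l)} \in I_q$ if and only if $c_i^{(l)} = q$ ($q \in \{1, \ldots, k\}$). Since the inputs $z_i^{(l)}$ differ from neuron to neuron, the states of different hidden neurons may differ from each other.

Denote by a vector $\mathbf{c}^{(l)} = [c_1^{(l)}, \ldots, c_{n_l}^{(l)}]$ the states of all hidden neurons in $\mathcal{L}_l$. The configuration of $\mathcal{N}$ is an $N$-dimensional vector, denoted by $C = [\mathbf{c}^{(2)}, \ldots, \mathbf{c}^{(L-1)}]$, which specifies the states of all hidden neurons in $\mathcal{N}$.

The configuration $C$ of a fixed PLNN is uniquely determined by the instance $\mathbf{x}$. We write the function that maps an instance $\mathbf{x} \in \mathcal{X}$ to a configuration $C \in \{1, \ldots, k\}^N$ as $\mathrm{conf}: \mathcal{X} \rightarrow \{1, \ldots, k\}^N$.

For a neuron $u_i^{(l)}$, denote by variables $r_i^{(l)}$ and $t_i^{(l)}$ the slope and intercept, respectively, of the linear function that corresponds to the state $c_i^{(l)}$. $r_i^{(l)}$ and $t_i^{(l)}$ are uniquely determined by $c_i^{(l)}$, such that $r_i^{(l)} = r_q$ and $t_i^{(l)} = t_q$, if and only if $c_i^{(l)} = q$ ($q \in \{1, \ldots, k\}$).

For all hidden neurons in $\mathcal{L}_l$, we write the variables of slopes and intercepts as $\mathbf{r}^{(l)} = [r_1^{(l)}, \ldots, r_{n_l}^{(l)}]^\top$ and $\mathbf{t}^{(l)} = [t_1^{(l)}, \ldots, t_{n_l}^{(l)}]^\top$, respectively. Then, we rewrite the activation function for all neurons in a hidden layer $\mathcal{L}_l$ as

$$f(\mathbf{z}^{(l)}) = \mathbf{r}^{(l)} \circ \mathbf{z}^{(l)} + \mathbf{t}^{(l)} \tag{4}$$

where $\mathbf{r}^{(l)} \circ \mathbf{z}^{(l)}$ is the Hadamard product between $\mathbf{r}^{(l)}$ and $\mathbf{z}^{(l)}$.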
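The configuration function and the Hadamard form of Equation 4 can be sketched as follows. This is an illustrative sketch that assumes ReLU with two states (state 1 for a non-positive input, state 2 for a positive input); the names `conf`, `layer_states`, `SLOPES` and `INTERCEPTS` and the toy network are hypothetical.

```python
import numpy as np

# ReLU written as two linear pieces (assumed encoding):
# state 1: z <= 0 -> slope 0, intercept 0; state 2: z > 0 -> slope 1, intercept 0.
SLOPES = np.array([0.0, 1.0])
INTERCEPTS = np.array([0.0, 0.0])

def layer_states(z):
    # State of every neuron in one hidden layer, encoded as 1 or 2.
    return np.where(z > 0, 2, 1)

def conf(weights, biases, x):
    """Map an instance to its configuration: one state vector per hidden layer."""
    a = np.asarray(x, dtype=float)
    states = []
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ a + b                                # Equation 1
        c = layer_states(z)
        r, t = SLOPES[c - 1], INTERCEPTS[c - 1]      # slope and intercept per neuron
        a = r * z + t                                # Equation 4 (Hadamard form)
        states.append(c)
    return states

# Hypothetical toy network with 5 + 4 = 9 hidden neurons.
rng = np.random.default_rng(1)
sizes = [3, 5, 4, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

x = rng.standard_normal(3)
C = np.concatenate(conf(weights, biases, x))         # configuration of the whole network

# The Hadamard form agrees with the usual definition of ReLU.
z = rng.standard_normal(6)
c = layer_states(z)
hadamard_form = SLOPES[c - 1] * z + INTERCEPTS[c - 1]
```

The configuration `C` has one entry per hidden neuron, so two instances share a configuration exactly when every hidden neuron applies the same linear piece to both.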

Next, we interpret the classification result of a fixed instance.

4.2. Exact Interpretation for the Classification Result of a Fixed Instance

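Given a fixed PLNN $\mathcal{N}$, we interpret the classification result of a fixed instance $\mathbf{x}$ by deriving the closed form of $F(\mathbf{x})$ as follows.

Following Equations 2 and 4, we have, for all $l \in \{2, \ldots, L-1\}$,

$$\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)}) = \mathbf{r}^{(l)} \circ \mathbf{z}^{(l)} + \mathbf{t}^{(l)}$$

By plugging $\mathbf{a}^{(l)}$ into Equation 1, we rewrite $\mathbf{z}^{(l+1)}$ as

$$\mathbf{z}^{(l+1)} = W^{(l)} \left( \mathbf{r}^{(l)} \circ \mathbf{z}^{(l)} + \mathbf{t}^{(l)} \right) + \mathbf{b}^{(l)} = \tilde{W}^{(l)} \mathbf{z}^{(l)} + \tilde{\mathbf{b}}^{(l)} \tag{5}$$

where $\tilde{\mathbf{b}}^{(l)} = W^{(l)} \mathbf{t}^{(l)} + \mathbf{b}^{(l)}$, and $\tilde{W}^{(l)} = W^{(l)} \circ \mathbf{r}^{(l)}$ is an extended version of Hadamard product, such that the entry at the $i$-th row and $j$-th column of $\tilde{W}^{(l)}$ is $\tilde{W}_{ij}^{(l)} = W_{ij}^{(l)} r_j^{(l)}$.

By iteratively plugging Equation 5 into itself, we can write $\mathbf{z}^{(l+1)}$ for all $l \in \{2, \ldots, L-1\}$ as

$$\mathbf{z}^{(l+1)} = \prod_{h=0}^{l-2} \tilde{W}^{(l-h)} \mathbf{z}^{(2)} + \tilde{\mathbf{b}}^{(l)} + \sum_{h=2}^{l-1} \prod_{q=0}^{l-h-1} \tilde{W}^{(l-q)} \tilde{\mathbf{b}}^{(h)}$$

where each product is taken from left to right in order of increasing index, so that $\prod_{h=0}^{l-2} \tilde{W}^{(l-h)} = \tilde{W}^{(l)} \tilde{W}^{(l-1)} \cdots \tilde{W}^{(2)}$.

By plugging $\mathbf{z}^{(2)} = W^{(1)} \mathbf{a}^{(1)} + \mathbf{b}^{(1)}$ and $\mathbf{a}^{(1)} = \mathbf{x}$ into the above equation, we rewrite $\mathbf{z}^{(l+1)}$, for all $l \in \{2, \ldots, L-1\}$, as

$$\mathbf{z}^{(l+1)} = \prod_{h=0}^{l-2} \tilde{W}^{(l-h)} W^{(1)} \mathbf{x} + \prod_{h=0}^{l-2} \tilde{W}^{(l-h)} \mathbf{b}^{(1)} + \tilde{\mathbf{b}}^{(l)} + \sum_{h=2}^{l-1} \prod_{q=0}^{l-h-1} \tilde{W}^{(l-q)} \tilde{\mathbf{b}}^{(h)} = \hat{W}^{(1:l)} \mathbf{x} + \hat{\mathbf{b}}^{(1:l)} \tag{6}$$

where $\hat{W}^{(1:l)} = \prod_{h=0}^{l-2} \tilde{W}^{(l-h)} W^{(1)}$ is the coefficient matrix of $\mathbf{x}$, and $\hat{\mathbf{b}}^{(1:l)}$ is the sum of the remaining terms. The superscript $(1:l)$ indicates that $\hat{W}^{(1:l)} \mathbf{x} + \hat{\mathbf{b}}^{(1:l)}$ is equivalent to PLNN’s forward propagation from layer $\mathcal{L}_1$ to layer $\mathcal{L}_l$.

Since the output of $\mathcal{N}$ on an input $\mathbf{x} \in \mathcal{X}$ is $F(\mathbf{x}) = \mathbf{a}^{(L)} = \mathrm{softmax}(\mathbf{z}^{(L)})$, the closed form of $F(\mathbf{x})$ is

$$F(\mathbf{x}) = \mathrm{softmax}\left( \hat{W}^{(1:L-1)} \mathbf{x} + \hat{\mathbf{b}}^{(1:L-1)} \right) \tag{7}$$

For a fixed PLNN $\mathcal{N}$ and a fixed instance $\mathbf{x}$, $\hat{W}^{(1:L-1)}$ and $\hat{\mathbf{b}}^{(1:L-1)}$ are constant parameters uniquely determined by the fixed configuration $C = \mathrm{conf}(\mathbf{x})$. Therefore, for a fixed input instance $\mathbf{x}$, $F(\mathbf{x})$ is a linear classifier whose decision boundary is explicitly defined by $\hat{W}^{(1:L-1)} \mathbf{x} + \hat{\mathbf{b}}^{(1:L-1)}$.

Inspired by the interpretation method widely used for conventional linear classifiers, such as Logistic Regression and linear SVM (Bishop, 2007), we interpret the prediction on a fixed instance $\mathbf{x}$ by the decision features of $F(\mathbf{x})$. Specifically, the entries of the $i$-th row of $\hat{W}^{(1:L-1)}$ are the decision features for the $i$-th class of instances.

Equation 7 provides a straightforward way to interpret the classification result of a fixed instance. However, individually interpreting the classification result of every single instance falls far short of understanding the overall behavior of a PLNN $\mathcal{N}$. Next, we describe how to interpret the overall behavior of $\mathcal{N}$ by computing an interpretation model $\mathcal{M}$ that is mathematically equivalent to $\mathcal{N}$.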
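The closed form of Equation 7 can be checked numerically. The sketch below composes the per-layer affine maps of Equation 5 into a single coefficient matrix and bias for one instance, and compares the resulting linear classifier with ordinary forward propagation. It assumes ReLU (slopes 0 or 1, intercepts 0); the function name `local_linear_classifier` and the toy network are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(weights, biases, x):
    # Ordinary forward propagation with ReLU hidden layers.
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(W @ a + b, 0.0)
    return softmax(weights[-1] @ a + biases[-1])

def local_linear_classifier(weights, biases, x):
    """Return (W_hat, b_hat) such that F(x) = softmax(W_hat @ x + b_hat) at this x."""
    x = np.asarray(x, dtype=float)
    W_hat, b_hat = weights[0], biases[0]       # input of the first hidden layer
    for W, b in zip(weights[1:], biases[1:]):
        z = W_hat @ x + b_hat                  # input of the current hidden layer
        r = (z > 0).astype(float)              # ReLU slopes; intercepts are all 0
        W_tilde = W * r                        # W_tilde[i, j] = W[i, j] * r[j]
        b_hat = W_tilde @ b_hat + b            # bias of the composed affine map
        W_hat = W_tilde @ W_hat                # coefficient matrix of the composed map
    return W_hat, b_hat

# Hypothetical toy network: 3 inputs, hidden layers of 5 and 4 neurons, 2 classes.
rng = np.random.default_rng(2)
sizes = [3, 5, 4, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

x = rng.standard_normal(3)
W_hat, b_hat = local_linear_classifier(weights, biases, x)
closed_form = softmax(W_hat @ x + b_hat)
direct = forward(weights, biases, x)
decision_features = W_hat                      # row i holds the features of class i
```

Because the ReLU intercepts are zero, the adjusted bias of Equation 5 reduces to the original bias, which is why the loop adds `b` unchanged.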

4.3. Exact Interpretation of a PLNN

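A fixed PLNN $\mathcal{N}$ with $N$ hidden neurons has at most $k^N$ configurations. We represent the $h$-th configuration by $C_h \in \mathcal{C}$, where $\mathcal{C} \subseteq \{1, \ldots, k\}^N$ is the set of all configurations of $\mathcal{N}$.

Recall that each instance $\mathbf{x} \in \mathcal{X}$ uniquely determines a configuration $\mathrm{conf}(\mathbf{x}) \in \mathcal{C}$. Since the volume of $\mathcal{C}$, denoted by $|\mathcal{C}|$, is at most $k^N$, but the number of instances in $\mathcal{X}$ can be arbitrarily large, it is clear that at least one configuration in $\mathcal{C}$ should be shared by more than one instance in $\mathcal{X}$.

Denote by $P_h = \{\mathbf{x} \in \mathcal{X} \mid \mathrm{conf}(\mathbf{x}) = C_h\}$ the set of instances that have the same configuration $C_h$. We prove in Theorem 4.1 that for any configuration $C_h \in \mathcal{C}$, $P_h$ is a convex polytope in $\mathcal{X}$.

Theorem 4.1.

Given a fixed PLNN $\mathcal{N}$ with $N$ hidden neurons, $\forall C_h \in \mathcal{C}$, $P_h = \{\mathbf{x} \in \mathcal{X} \mid \mathrm{conf}(\mathbf{x}) = C_h\}$ is a convex polytope in $\mathcal{X}$.

Proof.

We prove the theorem by showing that $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a finite set of linear inequalities with respect to $\mathbf{x}$.

When $l = 2$, we have $\mathbf{z}^{(2)} = W^{(1)} \mathbf{x} + \mathbf{b}^{(1)}$. For $l \in \{3, \ldots, L-1\}$, it follows from Equation 6 that $\mathbf{z}^{(l)} = \hat{W}^{(1:l-1)} \mathbf{x} + \hat{\mathbf{b}}^{(1:l-1)}$, which is a linear function of $\mathbf{x}$, because $\hat{W}^{(1:l-1)}$ and $\hat{\mathbf{b}}^{(1:l-1)}$ are constant parameters when $C_h$ is fixed. In summary, given a fixed $C_h$, $\mathbf{z}^{(l)}$ is a linear function of $\mathbf{x}$ for all $l \in \{2, \ldots, L-1\}$.

We show that $P_h$ is a convex polytope by showing that $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of $2N$ linear inequalities with respect to $\mathbf{x}$. Recall that $z_i^{(l)} \in I_q$ if and only if $c_i^{(l)} = q$ ($q \in \{1, \ldots, k\}$). Denote by $\psi: \{1, \ldots, k\} \rightarrow \{I_1, \ldots, I_k\}$ the bijective function that maps a configuration $c_i^{(l)}$ to a real interval in $\{I_1, \ldots, I_k\}$, such that $\psi(c_i^{(l)}) = I_q$ if and only if $c_i^{(l)} = q$ ($q \in \{1, \ldots, k\}$). Then, $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of constraints, denoted by $Q_h = \{ z_i^{(l)} \in \psi(c_i^{(l)}) \mid i \in \{1, \ldots, n_l\},\ l \in \{2, \ldots, L-1\} \}$. Since $z_i^{(l)}$ is a linear function of $\mathbf{x}$ and $\psi(c_i^{(l)})$ is a real interval, each constraint in $Q_h$ is equivalent to two linear inequalities with respect to $\mathbf{x}$. Therefore, $\mathrm{conf}(\mathbf{x}) = C_h$ is equivalent to a set of $2N$ linear inequalities, which means $P_h$ is a convex polytope. ∎

According to Theorem 4.1, all instances sharing the same configuration $C_h$ form a unique convex polytope $P_h$ that is explicitly defined by the linear inequalities in $Q_h$. Since $C_h$ also determines the linear classifier for a fixed instance in Equation 7, all instances in the same convex polytope $P_h$ share the same linear classifier determined by $C_h$.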
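The claim of Theorem 4.1 can be illustrated numerically. The sketch below builds the linear inequalities of the polytope that contains a given instance and then checks, on random samples, that a point satisfies all inequalities exactly when it has the same configuration. It assumes ReLU, for which every interval is a half-line, so each hidden neuron contributes one inequality rather than two. Points that lie exactly on a boundary are ignored, and the name `polytope` and the toy network are hypothetical.

```python
import numpy as np

def polytope(weights, biases, x):
    """Return (A, c, states): the polytope of x is {y : A @ y <= c}, boundaries aside."""
    x = np.asarray(x, dtype=float)
    W_hat, b_hat = weights[0], biases[0]
    rows, rhs, states = [], [], []
    for W, b in zip(weights[1:], biases[1:]):
        z = W_hat @ x + b_hat                  # linear in x once the configuration is fixed
        active = z > 0
        sign = np.where(active, 1.0, -1.0)
        # active neuron:   w . y + beta >= 0  is written as  -(w . y) <= beta
        # inactive neuron: w . y + beta <= 0  is written as    w . y  <= -beta
        rows.append(-sign[:, None] * W_hat)
        rhs.append(sign * b_hat)
        states.append(active)
        W_tilde = W * active.astype(float)     # ReLU slopes are 0 or 1
        W_hat, b_hat = W_tilde @ W_hat, W_tilde @ b_hat + b
    return np.vstack(rows), np.concatenate(rhs), np.concatenate(states)

# Hypothetical toy network: 2 inputs, hidden layers of 4 and 3 neurons, 2 classes.
rng = np.random.default_rng(2)
sizes = [2, 4, 3, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]

x = np.array([0.3, -0.7])
A, c, states = polytope(weights, biases, x)

# Random points around x: inside the polytope if and only if the configuration matches.
samples = x + 0.5 * rng.standard_normal((1000, 2))
inside = np.array([np.all(A @ y <= c) for y in samples])
same_conf = np.array(
    [np.array_equal(polytope(weights, biases, y)[2], states) for y in samples]
)
```

Each row of `A` is the coefficient vector of one boundary hyperplane, so the rows play the role of the polytope boundary features discussed later in this section.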

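Input: a fixed PLNN $\mathcal{N}$, and the set of training instances $D_{train} \subset \mathcal{X}$ used to train $\mathcal{N}$.
Output: $\mathcal{M}$, a set of active LLCs.
1:  Initialization: $\mathcal{M} = \emptyset$, $\mathcal{C} = \emptyset$.
2:  for each $\mathbf{x} \in D_{train}$ do
3:     Compute the configuration by $C_h \leftarrow \mathrm{conf}(\mathbf{x})$.
4:     if $C_h \notin \mathcal{C}$ then
5:        $\mathcal{C} \leftarrow \mathcal{C} \cup \{C_h\}$ and $\mathcal{M} \leftarrow \mathcal{M} \cup \{(F_h(\mathbf{x}), P_h)\}$.
6:     end if
7:  end for
8:  return $\mathcal{M}$.
Algorithm 1. OpenBox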

Denote by $F_h(\cdot)$ the linear classifier that is shared by all instances in $P_h$; we can then interpret $\mathcal{N}$ as a set of local linear classifiers (LLCs), each LLC being a linear classifier $F_h(\cdot)$ that applies to all instances in a convex polytope $P_h$. Denote by a tuple $(F_h(\mathbf{x}), P_h)$ the $h$-th LLC; a fixed PLNN $\mathcal{N}$ is equivalent to a set of LLCs, denoted by $\mathcal{M} = \{(F_h(\mathbf{x}), P_h) \mid C_h \in \mathcal{C}\}$. We use $\mathcal{M}$ as our final interpretation model for $\mathcal{N}$.

For a fixed PLNN $\mathcal{N}$, if the states of the $N$ hidden neurons are independent, the PLNN has $k^N$ configurations, which means $\mathcal{M}$ contains $k^N$ LLCs. However, due to the hierarchical structure of a PLNN, the states of a hidden neuron in $\mathcal{L}_l$ strongly correlate with the states of the neurons in the former layers $\mathcal{L}_q$ ($q < l$). Therefore, the volume of $\mathcal{C}$ is much less than $k^N$, and the number of local linear classifiers in $\mathcal{M}$ is much less than $k^N$. We discuss this phenomenon later in Table 3 and Section 5.4.

In practice, we do not need to compute the entire set of LLCs in $\mathcal{M}$ all at once. Instead, we can first compute an active subset of $\mathcal{M}$, that is, the set of LLCs that are actually used to classify the available set of instances. Then, we can update $\mathcal{M}$ whenever a new LLC is used to classify a newly arriving instance.

Algorithm 1 summarizes the OpenBox method, which computes $\mathcal{M}$ as the active set of LLCs that are actually used to classify the set of training instances, denoted by $D_{train}$.
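The loop of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration rather than the authors' Matlab implementation: it assumes ReLU, represents each LLC by its coefficient matrix and bias together with the inequalities of its polytope, and uses hypothetical names (`openbox`, `llc_and_polytope`) and a hypothetical toy network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(weights, biases, x):
    a = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(W @ a + b, 0.0)
    return softmax(weights[-1] @ a + biases[-1])

def llc_and_polytope(weights, biases, x):
    """Configuration key, linear classifier and polytope inequalities for x (ReLU)."""
    x = np.asarray(x, dtype=float)
    W_hat, b_hat = weights[0], biases[0]
    rows, rhs, key = [], [], []
    for W, b in zip(weights[1:], biases[1:]):
        active = (W_hat @ x + b_hat) > 0
        sign = np.where(active, 1.0, -1.0)
        rows.append(-sign[:, None] * W_hat)    # polytope boundary coefficients
        rhs.append(sign * b_hat)
        key.extend(active.tolist())
        W_tilde = W * active.astype(float)
        W_hat, b_hat = W_tilde @ W_hat, W_tilde @ b_hat + b
    return tuple(key), (W_hat, b_hat), (np.vstack(rows), np.concatenate(rhs))

def openbox(weights, biases, train_instances):
    """Collect the active LLCs: one (classifier, polytope) pair per observed configuration."""
    active_llcs = {}
    for x in train_instances:
        key, classifier, poly = llc_and_polytope(weights, biases, x)
        if key not in active_llcs:             # a configuration that was not seen before
            active_llcs[key] = (classifier, poly)
    return active_llcs

# Hypothetical toy network and training set.
rng = np.random.default_rng(3)
sizes = [2, 4, 3, 2]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
biases = [rng.standard_normal(sizes[i + 1]) for i in range(len(sizes) - 1)]
D_train = rng.uniform(-2.0, 2.0, size=(500, 2))

M = openbox(weights, biases, D_train)
num_hidden = sum(sizes[1:-1])

def predict_with_llc(x):
    # Look up the LLC of x by its configuration and apply the linear classifier.
    key, _, _ = llc_and_polytope(weights, biases, x)
    (W_hat, b_hat), _ = M[key]
    return softmax(W_hat @ x + b_hat)
```

Using the configuration as a dictionary key keeps the membership test of line 4 cheap, and the number of stored LLCs can never exceed the number of training instances.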

Now, we are ready to introduce how to interpret the classification result of an instance $\mathbf{x} \in P_h$. First, we interpret the classification result of $\mathbf{x}$ using the decision features of $F_h(\mathbf{x})$ (Section 4.2). Second, we interpret why $\mathbf{x}$ is contained in $P_h$ using the polytope boundary features (PBFs), which are the decision features of the polytope boundaries. More specifically, a polytope boundary of $P_h$ is defined by a linear inequality $z_i^{(l)} \in \psi(c_i^{(l)})$ in $Q_h$. By Equation 6, $z_i^{(l)}$ is a linear function with respect to $\mathbf{x}$. The PBFs are the coefficients of $\mathbf{x}$ in $z_i^{(l)}$.

We also find that some linear inequalities in $Q_h$ are redundant, in that their hyperplanes do not intersect $P_h$. To simplify our interpretation of the polytope boundaries, we remove such redundant inequalities by Caron’s method (Caron et al., 1989) and focus on studying the PBFs of the non-redundant ones.

The advantages of OpenBox are three-fold. First, our interpretation is exact, because the set of LLCs in $\mathcal{M}$ is mathematically equivalent to the classification function $F(\cdot)$ of $\mathcal{N}$. Second, our interpretation is group-wise consistent, because all instances in the same convex polytope are classified by exactly the same LLC, and thus the interpretations are consistent with respect to a given convex polytope. Last, our interpretation is easy to compute, since OpenBox computes $\mathcal{M}$ by a one-time forward propagation through $\mathcal{N}$ for each instance in the training set $D_{train}$.

5. Experiments

In this section, we evaluate the performance of OpenBox, and compare it with the state-of-the-art method LIME (Ribeiro et al., 2016). In particular, we address the following questions: (1) What do the LLCs look like? (2) Are the interpretations produced by LIME and OpenBox exact and consistent? (3) Are the decision features of LLCs easy to understand, and can we improve the interpretability of these features by non-negative and sparse constraints? (4) How can we interpret the PBFs of LLCs? (5) How effective are the interpretations of OpenBox in hacking and debugging a PLNN model?

Table 2 shows the details of the six models we used. For both PLNN and PLNN-NS, we use the same network structure described in Table 3, and adopt the widely used activation function ReLU (Glorot et al., 2011). We apply the non-negative and sparse constraints proposed by Chorowski et al. (Chorowski and Zurada, 2015) to train PLNN-NS. Since our goal is to comprehensively study the interpretation effectiveness of OpenBox rather than to achieve state-of-the-art classification performance, we use relatively simple network structures for PLNN and PLNN-NS, which are still powerful enough to achieve significantly better classification performance than Logistic Regression (LR). The decision features of LR, LR-F, LR-NS and LR-NSF are used as baselines to compare with the decision features of LLCs.

The Python code of LIME is published by its authors (https://github.com/marcotcr/lime). The other methods and models are implemented in Matlab. PLNN and PLNN-NS are trained using the DeepLearnToolBox (Palm, 2012). All experiments are conducted on a PC with a Core-i7-3370 CPU (3.40 GHz), 16GB main memory, and a 5,400 rpm hard drive running Windows 7 OS.

We use the following data sets. Detailed information of the data sets is shown in Table 4.

Synthetic (SYN) Data Set. As shown in Figure 1(a), this data set contains 20,000 instances uniformly sampled from a quadrangle in 2-dimensional Euclidean space. The red and blue points are positive and negative instances, respectively. We use all instances in SYN as training data to visualize the LLCs of a PLNN.

| Models | PLNN | PLNN-NS | LR | LR-F | LR-NS | LR-NSF |
| --- | --- | --- | --- | --- | --- | --- |
| NS | | ✓ | | | ✓ | ✓ |
| Flip | | | | ✓ | | ✓ |

Table 2. The models to interpret. LR is Logistic Regression. NS means non-negative and sparse constraints. Flip means the model is trained on the instances with flipped labels.
Data Sets # Neurons PLNN PLNN-NS
SYN
FMNIST-1
FMNIST-2
Table 3. The network structures and the number of configurations of PLNN and PLNN-NS. The neurons in successive layers are initialized to be fully connected. $k$ is the number of linear functions of ReLU, and $N$ is the number of hidden neurons.
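| Data Sets | Training Data: # Positive | Training Data: # Negative | Testing Data: # Positive | Testing Data: # Negative |
| --- | --- | --- | --- | --- |
| SYN | 6,961 | 13,039 | N/A | N/A |
| FMNIST-1 | 4,000 | 4,000 | 3,000 | 3,000 |
| FMNIST-2 | 4,000 | 4,000 | 3,000 | 3,000 |

Table 4. Detailed description of data sets.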
Figure 1. The LLCs of the PLNN trained on SYN: (a) training data of SYN, (b) prediction results of PLNN, (c) convex polytopes, (d) LLCs.

FMNIST-1 and FMNIST-2 Data Sets. Each of these data sets contains two classes of images in the Fashion MNIST data set (Xiao et al., 2017). FMNIST-1 consists of the images of Ankle Boot and Bag. FMNIST-2 consists of the images of Coat and Pullover. All images in FMNIST-1 and FMNIST-2 are 28-by-28 grayscale images. We represent an image by cascading the 784 pixel values into a 784-dimensional feature vector. The Fashion MNIST data set is available online (https://github.com/zalandoresearch/fashion-mnist).

5.1. What Do the LLCs Look Like?

We demonstrate our claim in Theorem 4.1 by visualizing the LLCs of the PLNN trained on SYN.

Figures 1(a)-(b) show the training instances of SYN and the prediction results of PLNN, respectively. Since all instances are used for training, the prediction accuracy is 99.9%.

In Figure 1(c), we plot all instances with the same configuration in the same colour. Clearly, all instances with the same configuration are contained in the same convex polytope. This demonstrates our claim in Theorem 4.1.

Figure 1(d) shows the LLCs whose convex polytopes cover the decision boundary of PLNN and contain both positive and negative instances. The solid lines show the decision boundaries of the LLCs, which capture the difference between positive and negative instances, and form the overall decision boundary of PLNN. A convex polytope that does not cover the boundary of PLNN contains a single class of instances. The LLCs of these convex polytopes capture the common features of the corresponding class of instances. As analyzed in the following subsections, the set of LLCs produces exactly the same predictions as PLNN, and also captures meaningful decision features that are easy to understand.

5.2. Are the Interpretations Exact and Consistent?

Exact and consistent interpretations are naturally favored by human minds. In this subsection, we systematically study the exactness and consistency of the interpretations of LIME and OpenBox on FMNIST-1 and FMNIST-2. Since LIME is too slow to process all instances in 24 hours, for each of FMNIST-1 and FMNIST-2, we uniformly sample 600 instances from the testing set, and conduct the following experiments on the sampled instances.

We first analyze the exactness of interpretation by comparing the predictions computed by the local interpretable model of LIME, the LLCs of OpenBox, and PLNN, respectively. The prediction of an instance is the probability of classifying it as a positive instance.

In Figure 2, since LIME does not guarantee zero approximation error on the local predictions of PLNN, the predictions of LIME are not exactly the same as PLNN on FMNIST-1, and are dramatically different from PLNN on FMNIST-2. The difference of predictions is more significant on FMNIST-2, because the images in FMNIST-2 are more difficult to distinguish, which makes the decision boundary of PLNN more complicated and harder to approximate. We can also see that the predictions of LIME exceed 1. This is because the output of the interpretable model of LIME is not a probability at all. As a result, the interpretations computed by LIME may not truthfully describe the exact behavior of PLNN. In contrast, since the set of LLCs computed by OpenBox is mathematically equivalent to the classification function $F(\cdot)$ of PLNN, the predictions of OpenBox are exactly the same as PLNN on all instances. Therefore, the decision features of LLCs exactly describe the overall behavior of PLNN.

Next, we study the interpretation consistency of LIME and OpenBox by analyzing the similarity between the interpretations of similar instances.

In general, a consistent interpretation method should provide similar interpretations for similar instances. For an instance $\mathbf{x}$, denote by $\mathbf{x}'$ the nearest neighbor of $\mathbf{x}$ by Euclidean distance, and by $\gamma$ and $\gamma'$ the decision features for the classification of $\mathbf{x}$ and $\mathbf{x}'$, respectively. We measure the consistency of interpretation by the cosine similarity between $\gamma$ and $\gamma'$, where a larger cosine similarity indicates a better interpretation consistency.
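The consistency measure described above can be sketched as follows. The function names and the four-instance toy data are hypothetical and serve only to illustrate the computation: each instance is paired with its Euclidean nearest neighbour, and the cosine similarity between their decision feature vectors is reported.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_neighbor_consistency(X, G):
    """X: (n, d) instances. G: (n, m) decision features, one row per instance.

    Returns the index of each instance's nearest neighbour and the cosine
    similarity between the decision features of the two.
    """
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared distances
    np.fill_diagonal(d2, np.inf)                               # exclude the instance itself
    nn = d2.argmin(axis=1)
    cs = np.array([cosine_similarity(G[i], G[j]) for i, j in enumerate(nn)])
    return nn, cs

# Hypothetical toy data: two tight pairs of instances.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
G = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
nn, cs = nearest_neighbor_consistency(X, G)
```

In this toy example the first pair has parallel decision features, so its cosine similarity is 1, while the second pair differs and yields a lower value.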

As shown in Figure 3, the cosine similarity of OpenBox is equal to 1 on about 50% of the instances, because OpenBox consistently gives the same interpretation for all instances in the same convex polytope. Since an instance and its nearest neighbour may not belong to the same convex polytope, the cosine similarity of OpenBox is not equal to 1 on all instances. In contrast, since LIME computes an individual interpretation based on the unique local perturbations of every single instance, the cosine similarity of LIME is significantly lower than that of OpenBox on all instances. This demonstrates the superior interpretation consistency of OpenBox.

In summary, the interpretations of OpenBox are exact, and are much more consistent than the interpretations of LIME.

Figure 2. The predictions of LIME, OpenBox and PLNN on (a) FMNIST-1 and (b) FMNIST-2. We sort the results by PLNN’s predictions in descending order.
Figure 3. The cosine similarity (CS) between the decision features of each instance and its nearest neighbour on (a) FMNIST-1 and (b) FMNIST-2. The results of LIME and OpenBox are separately sorted by cosine similarity in descending order.

5.3. Decision Features of LLCs and the Effect of Non-negative and Sparse Constraints

Besides exactness and consistency, a good interpretation should also have a strong semantic meaning, such that the “thoughts” of an intelligent machine can be easily understood by a human brain. In this subsection, we first show the meaning of the decision features of LLCs, then study the effect of the non-negative and sparse constraints in improving the interpretability of the decision features. The decision features of PLNN and PLNN-NS are computed by OpenBox. The decision features of LR, LR-F, LR-NS and LR-NSF are used as baselines. Table 5 shows the accuracy of all models.

Figure 4 shows the decision features of all models on FMNIST-1. Interestingly, the decision features of PLNN are as easy to understand as the decision features of LR and LR-F. All these features clearly highlight meaningful image parts, such as the ankle and heel of Ankle Boot, and the upper left corner of Bag. A closer look at the average images suggests that these decision features describe the difference between Ankle Boot and Bag.

The decision features of PLNN capture more detailed differences between Ankle Boot and Bag than the decision features of LR and LR-F. This is because the LLCs of PLNN only capture the difference between a subset of instances within a convex polytope, whereas LR and LR-F capture the overall difference between all instances of Ankle Boot and Bag. The accuracies of PLNN, LR and LR-F are comparable because the instances of Ankle Boot and Bag are easy to distinguish. However, as will be shown in Figure 5, when the instances are hard to distinguish, PLNN captures much more detailed features than LR and LR-F, and achieves a significantly better accuracy.

(a) Avg. Image (b) LR (c) LR-NS (d) PLNN (e) PLNN-NS (f) Avg. Image (g) LR-F (h) LR-NSF (i) PLNN (j) PLNN-NS
Figure 4. The decision features of all models on FMNIST-1. (a)-(e) and (f)-(j) show the average image and the decision features of all models for Ankle Boot and Bag, respectively. For PLNN and PLNN-NS, we show the decision features of the LLC whose convex polytope contains the most instances.
(a) Avg. Image (b) LR (c) LR-NS (d) PLNN (e) PLNN-NS (f) Avg. Image (g) LR-F (h) LR-NSF (i) PLNN (j) PLNN-NS
Figure 5. The decision features of all models on FMNIST-2. (a)-(e) and (f)-(j) show the average image and the decision features of all models for Coat and Pullover, respectively. For PLNN and PLNN-NS, we show the decision features of the LLC whose convex polytope contains the most instances.

Figure 5 shows the decision features of all models on FMNIST-2. As shown, LR and LR-F capture decision features with a strong semantic meaning, such as the collar and breast of Coat, and the shoulder of Pullover. However, these features are too general to accurately distinguish between Coat and Pullover. Therefore, LR and LR-F do not achieve a high accuracy. Interestingly, the decision features of PLNN capture much more detail than those of LR and LR-F, which leads to the superior accuracy of PLNN.

The superior accuracy of PLNN comes at the cost of cluttered decision features that may be hard to understand. Fortunately, applying non-negative and sparse constraints on PLNN effectively improves the interpretability of the decision features without affecting the classification accuracy.

| Model | FMNIST-1 Train | FMNIST-1 Test | FMNIST-2 Train | FMNIST-2 Test |
|---|---|---|---|---|
| LR | 0.998 | 0.997 | 0.847 | 0.839 |
| LR-F | 0.998 | 0.997 | 0.847 | 0.839 |
| PLNN | 1.000 | 0.999 | 0.907 | 0.868 |
| LR-NS | 0.772 | 0.776 | 0.711 | 0.698 |
| LR-NSF | 0.989 | 0.989 | 0.782 | 0.791 |
| PLNN-NS | 1.000 | 0.999 | 0.894 | 0.867 |

Table 5. The training and testing accuracy of all models.

As shown in Figures 4 and 5, the decision features of PLNN-NS highlight similar image parts as LR-NS and LR-NSF, and are much easier to understand than the decision features of PLNN. In particular, as shown in Figure 5, the decision features of PLNN-NS clearly highlight the collar and breast of Coat, and the shoulder of Pullover, which are much easier to understand than the cluttered features of PLNN. These results demonstrate the effectiveness of non-negative and sparse constraints in selecting meaningful features. Moreover, the decision features of PLNN-NS capture more details than those of LR-NS and LR-NSF. As a result, PLNN-NS achieves an accuracy comparable to that of PLNN, and significantly outperforms LR-NS and LR-NSF on FMNIST-2.

In summary, the decision features of LLCs are easy to understand, and the non-negative and sparse constraints are highly effective in improving the interpretability of the decision features of LLCs.

5.4. Are PBFs of LLCs Easy to Understand?

The polytope boundary features (PBFs) of polytope boundaries (PBs) interpret why an instance is contained in the convex polytope of an LLC. In this subsection, we systematically study the semantic meaning of PBFs. Limited by space, we only use the PLNN-NS models trained on FMNIST-1 and FMNIST-2 as the target models to interpret. The LLCs of PLNN-NS are computed by OpenBox.

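Recall that a PB is defined by a linear inequality on the input of a hidden neuron, where the PBFs are the coefficients of the instance in that inequality. Since the activation function is ReLU, each inequality states either that the input of the neuron is positive (the neuron is active) or that it is non-positive (the neuron is inactive). Since the values of PBFs are non-negative for PLNN-NS, if a PB of a convex polytope requires the neuron to be active, then the images in the polytope strongly correlate with the PBFs of that PB; if it requires the neuron to be inactive, then the images in the polytope are not strongly correlated with the PBFs of that PB.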
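The relationship between the states of hidden neurons, PBs and the locally linear function can be illustrated with a small ReLU network. The sketch below is not the OpenBox implementation. It uses the standard fact that a ReLU network is linear once the active/inactive state of every hidden neuron is fixed. The network sizes, the random weights and the function names `local_linear_model` and `forward` are assumptions made for this example, and the output is the pre-softmax score rather than a probability.

```python
import numpy as np

def local_linear_model(x, weights, biases):
    """For a ReLU network, return the activation pattern at x and the
    linear function (A, c) that the network computes on the region of x.

    weights/biases: per-layer parameters; the last layer is linear.
    pb_coefs[k] holds the coefficients of x in the input of the k-th
    hidden neuron (the role played by PBFs in the text).
    """
    A = np.eye(len(x))              # current layer input equals A @ x + c
    c = np.zeros(len(x))
    pattern, pb_coefs = [], []
    for W, b in zip(weights[:-1], biases[:-1]):
        A, c = W @ A, W @ c + b     # neuron inputs as a linear function of x
        z = A @ x + c
        active = (z > 0).astype(float)   # ReLU state of each hidden neuron
        pattern.extend(active.astype(int))
        pb_coefs.extend(A)               # one coefficient vector per neuron
        # Apply ReLU on this region: inactive neurons output zero.
        A, c = active[:, None] * A, active * c
    A, c = weights[-1] @ A, weights[-1] @ c + biases[-1]
    return np.array(pattern), np.array(pb_coefs), A, c

def forward(x, weights, biases):
    """Ordinary forward pass, used to check the local linear model."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(W @ h + b, 0.0)
    return weights[-1] @ h + biases[-1]

# Hypothetical network: 4 inputs, hidden layers of 5 and 3 neurons, 2 outputs.
rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 2]
weights = [rng.normal(size=(m, n))
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.normal(size=m) for m in layer_sizes[1:]]
x = rng.normal(size=4)

pattern, pb_coefs, A, c = local_linear_model(x, weights, biases)
local_output = A @ x + c
network_output = forward(x, weights, biases)
```

Here `local_output` and `network_output` agree up to floating-point error, and every instance that produces the same `pattern` shares the same `A` and `c`.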

The above analysis of PBs and PBFs is demonstrated by the results in Tables 6 and 7, and Figure 6. Take the first convex polytope in Table 6 as an example: its PBs are the two linear inequalities whose PBFs, shown in Figures 6(b)-(c), capture the features of Ankle Boot and Bag, respectively. Therefore, the convex polytope contains images of both Ankle Boot and Bag. A careful study of the other results suggests that the PBFs of the convex polytopes are easy to understand and accurately describe the images in each convex polytope.

We can also see that the PBFs in Figure 6 look similar to the decision features of PLNN-NS in Figures 4 and 5. This shows the strong correlation between the features learned by different neurons of PLNN-NS, which is probably caused by the hierarchical network structure. Due to the strong correlation between neurons, the number of distinct configurations is much smaller than the theoretical upper bound, as shown in Table 3.

Surprisingly, as shown in Table 7, the top-1 convex polytope on FMNIST-2 contains more than 98% of the training instances. On these instances, the training accuracy of the LLC is much higher than the training accuracies of LR-NS and LR-NSF. This means that the training instances in the top-1 convex polytope are much easier to separate linearly than the full set of training instances in FMNIST-2. From this perspective, PLNN-NS behaves like a “divide and conquer” strategy, which sets aside a small proportion of instances that hinder the classification accuracy so that the majority of the instances can be better separated by an LLC. As shown by the top-2 and top-3 convex polytopes in Table 7, the set-aside instances are grouped in their own convex polytopes, where the corresponding LLCs also achieve a very high accuracy. Table 6 shows a similar phenomenon on FMNIST-1. However, since the instances in FMNIST-1 are easy to separate linearly, the training accuracy of PLNN-NS only marginally outperforms LR-NS and LR-NSF.
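The per-polytope instance counts reported in Tables 6 and 7 come from grouping instances by their configuration. A minimal sketch of that grouping is given below, assuming a single hidden ReLU layer with random weights. The names `configuration` and `polytope_sizes` and the synthetic data are hypothetical and only stand in for a trained PLNN-NS.

```python
from collections import Counter

import numpy as np

def configuration(x, W1, b1):
    """Activation pattern of the hidden ReLU layer for instance x."""
    return tuple((W1 @ x + b1 > 0).astype(int))

def polytope_sizes(X, W1, b1):
    """Count instances per configuration, largest convex polytope first."""
    counts = Counter(configuration(x, W1, b1) for x in X)
    return counts.most_common()

# Hypothetical layer: 3 hidden neurons on 2 input features.
rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 2))
b1 = rng.normal(size=3)
X = rng.normal(size=(200, 2))

sizes = polytope_sizes(X, W1, b1)
for pattern, count in sizes[:3]:
    print(pattern, count)   # the three most populated polytopes
```

Restricting an accuracy computation to the instances that share one pattern would give the per-polytope LLC accuracy listed in the tables.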

Figure 6. (a)-(d) show the PBFs of the PLNN-NS on FMNIST-1. (e)-(h) show the PBFs of the PLNN-NS on FMNIST-2.
Figure 7. The hacking performance of LIME and OpenBox. (a)-(b) show the average (Avg.) CPP on FMNIST-1 and FMNIST-2, respectively. (c)-(d) show the NLCI on FMNIST-1 and FMNIST-2, respectively.

5.5. Can We Hack a Model Using OpenBox?

Knowing what an intelligent machine “thinks” provides us the privilege to “hack” it. Here, to hack a target model is to significantly change its prediction on an instance by modifying as few features of the instance as possible. In general, the biggest change of prediction is achieved by modifying the most important decision features. A more precise interpretation of the target model reveals the important decision features more accurately, and thus requires modifying fewer features to achieve a bigger change of prediction. Following this idea, we apply LIME and OpenBox to hack PLNN-NS, and compare the quality of their interpretations by comparing the change of PLNN-NS’s prediction when modifying the same number of decision features.

For each instance, we take the decision features for its classification. We hack PLNN-NS by setting the values of a few top-weighted decision features in the instance to zero, such that the prediction of PLNN-NS on the instance changes significantly. The change of prediction is evaluated by two measures. First, the change of prediction probability (CPP) is the absolute change of the probability of classifying the instance as a positive instance. Second, the number of label-changed instances (NLCI) is the number of instances whose predicted label changes after being hacked. Again, due to the inefficiency of LIME, we use the sampled data sets in Section 5.2 for evaluation.
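The hacking procedure and its two measures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the experimental code. The target model is replaced by a single hypothetical linear classifier with a sigmoid output, "top-weighted" is taken to mean largest absolute weight, and a probability of at least 0.5 is taken as the positive label. The names `hack`, `cpp_and_nlci`, `predict_proba` and `features_of` are invented for the example.

```python
import numpy as np

def hack(x, decision_features, k):
    """Set to zero the k features of x with the largest absolute weights."""
    top = np.argsort(-np.abs(decision_features))[:k]
    x_hacked = x.copy()
    x_hacked[top] = 0.0
    return x_hacked

def cpp_and_nlci(X, predict_proba, features_of, k):
    """Average CPP and NLCI after hacking every instance in X."""
    before = np.array([predict_proba(x) for x in X])
    after = np.array([predict_proba(hack(x, features_of(x), k)) for x in X])
    avg_cpp = float(np.mean(np.abs(after - before)))
    # Assumed labelling rule: positive when the probability is >= 0.5.
    nlci = int(np.sum((before >= 0.5) != (after >= 0.5)))
    return avg_cpp, nlci

# Hypothetical stand-in for the target model: one linear classifier.
w = np.array([3.0, -2.0, 0.5, 0.0])
b = -1.0

def predict_proba(x):
    """Probability of the positive class under the stand-in model."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def features_of(x):
    """Decision features of the stand-in model (the same for every x)."""
    return w

X = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, 0.0]])
avg_cpp, nlci = cpp_and_nlci(X, predict_proba, features_of, k=1)
```

Repeating the call for increasing `k` would trace curves of the kind shown in Figure 7, with `features_of` supplied by whichever interpretation method is being evaluated.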

In Figure 7, the average CPP and NLCI of OpenBox are always higher than those of LIME on both data sets. This demonstrates that the interpretations computed by OpenBox are more effective than those of LIME when they are applied to hack the target model.

Interestingly, the advantage of OpenBox is more significant on FMNIST-1 than on FMNIST-2. This is because, as shown in Figure 2(a), the prediction probabilities of most instances in FMNIST-1 are either 1.0 or 0.0, which provides little gradient information for LIME to accurately approximate the classification function of the PLNN-NS. In this case, the decision features computed by LIME cannot describe the exact behavior of the target model.

In summary, since OpenBox produces exact and consistent interpretations for a target model, it achieves a better hacking performance than LIME.

| CP | PBs | #Ankle Boot | #Bag | Accuracy |
|---|---|---|---|---|
| 1 | / / | 3,991 | 3,997 | 0.999 |
| 2 | / | 9 | 0 | 1.000 |
| 3 | / / | 0 | 3 | 1.000 |

Table 6. The PBs of the top-3 convex polytopes (CP) containing the most instances in FMNIST-1. “/” indicates a redundant linear inequality. Accuracy is the training accuracy of LLC on each CP.

| CP | #Coat | #Pullover | Accuracy |
|---|---|---|---|
| 1 | 3,932 | 3,942 | 0.894 |
| 2 | 32 | 10 | 0.905 |
| 3 | 18 | 0 | 0.944 |

Table 7. The PBs of the top-3 convex polytopes (CP) containing the most instances in FMNIST-2. Accuracy is the training accuracy of LLC on each CP.

5.6. Can We Debug a Model Using OpenBox?

Intelligent machines are not perfect and their predictions occasionally fail. When such a failure occurs, we can apply OpenBox to interpret why an instance is mis-classified.

Figure 8 shows some images that are mis-classified by PLNN-NS with a high probability. In Figures 8(a)-(c), the original image is a Coat. However, since the scattered mosaic pattern on the cloth hits more features of Pullover than of Coat, the original image is classified as a Pullover with a high probability. In Figures 8(d)-(f), the original image is a Pullover. However, it is mis-classified as a Coat because the white collar and breast hit the typical features of Coat, and the dark shoulder and sleeves miss the most significant features of Pullover. Similarly, the Ankle Boot in Figure 8(g) highlights more features on the upper left corner, and is thus mis-classified as a Bag. The Bag in Figure 8(j) is mis-classified as an Ankle Boot because it hits the ankle and heel features of Ankle Boot, but misses the typical features of Bag on the upper left corner.

In conclusion, as demonstrated by the mis-classified examples in Figure 8, OpenBox accurately interprets the mis-classifications, which is potentially useful in debugging abnormal behaviors of the interpreted model.

Subfigure captions: (a) CO, (b) CO: 0.04, (c) PU: 0.96; (d) PU, (e) CO: 1.00, (f) PU: 0.00; (g) AB, (h) AB: 0.16, (i) BG: 0.84; (j) BG, (k) AB: 1.00, (l) BG: 0.00.
Figure 8. The mis-classified images of (a) Coat (CO), (d) Pullover (PU), (g) Ankle Boot (AB), and (j) Bag (BG). (a), (d), (g) and (j) show the original images. For the remaining subfigures, the caption shows the prediction probability of the corresponding class; the image shows the decision features supporting the prediction of the corresponding class.

6. Conclusions and Future Work

In this paper, we tackle the challenging problem of interpreting PLNNs. By studying the states of hidden neurons and the configuration of a PLNN, we prove that a PLNN is mathematically equivalent to a set of LLCs, which can be efficiently computed by the proposed method. Extensive experiments show that the decision features and the polytope boundary features of LLCs provide exact and consistent interpretations of the overall behavior of a PLNN. Such interpretations are highly effective in hacking and debugging PLNN models. As future work, we will extend our work to interpret more general neural networks that adopt smooth activation functions, such as sigmoid and tanh.

References