Previous works have found difficulties in recovering images from hidden representations in a neural network [15, 5], and it is often unclear what information is being discarded.
Various methods have been proposed to interpret neural networks. The mainstream approach is to calculate the gradient of the loss function with respect to the input image [20, 14]. Dosovitskiy et al. proposed up-convolution networks to invert CNN feature maps back to images. Another direction for model interpretation is to determine the receptive field of a neuron, or to extract the image regions that contribute most to the network's decision [23, 8, 10]. Other works focus on model-agnostic interpretations [17, 2, 13, 12]. Different from previous works, we consider explainable neural network models.
The recently proposed invertible network [6, 7, 22] is able to accurately reconstruct the inputs to a layer from its outputs without harming its classification accuracy. For an invertible classifier, information is only discarded at the final pooling layer and fully-connected layer, while preceding layers preserve all information of the input. This property hints at the potential to unravel the black-box and manipulate data both in the input domain and the feature domain.
In this paper, we introduce a novel method to explain the decision of a network. We show that an invertible classifier can be viewed as a two-stage model: (1) an invertible transform from the input space to the feature space; (2) a linear classifier in the feature space. For a linear classifier, we can determine the decision boundary and explain its prediction; using the invertible transform, we can determine the corresponding boundary and explanation in the input space.
After determining the projection onto the decision boundary, we perform a Taylor expansion around the projection to locally approximate the neural network as a linear function. We then define feature importance using the same method as in the linear-classifier case.
Our main contributions can be summarized as follows:
- We explicitly determine the decision boundary of a neural network classifier and explain its decisions based on the boundary.
- We use a Taylor expansion to locally approximate the neural network as a linear function and define the numerical importance of each feature as in a linear classifier.
2 Invertible Networks
The network is composed of different invertible modules, followed by a global average pooling layer and a fully-connected layer. Details for each invertible module are described in the following sections.
2.1 Invertible Block
An invertible block serves a similar role to a building block such as a residual block, except that it is invertible. For the invertible block in Fig. 2, we follow the structure of the reversible residual network. The input x is split into two parts x1 and x2 by channel, such that x1 and x2 have the same shape. The corresponding outputs are y1 and y2, with the same shape as the input. F (and similarly G) represents a function with parameters to learn, and can be any continuous function whose output has the same shape as its input; an example of F is shown in Fig. 2. F and G can be convolutional layers for 2D inputs and FC layers for 1D inputs. The forward pass and inversion are calculated as:

y1 = x1 + F(x2),  y2 = x2 + G(y1)    (1)

x2 = y2 - G(y1),  x1 = y1 - F(x2)    (2)
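The coupling structure of such a reversible block can be checked with a small numerical sketch. Here F and G are placeholder element-wise maps standing in for the learned sub-networks; inversion recovers the input up to floating-point error.

```python
import numpy as np

def rev_block_forward(x1, x2, F, G):
    """Forward pass of a reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_block_inverse(y1, y2, F, G):
    """Exact inversion: x2 = y2 - G(y1), x1 = y1 - F(x2)."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# Placeholder nonlinear maps; in the network these are convolutional
# (2D inputs) or fully-connected (1D inputs) sub-networks.
F = lambda t: np.tanh(t)
G = lambda t: np.maximum(t, 0.0)  # ReLU

x1, x2 = np.random.randn(8), np.random.randn(8)
y1, y2 = rev_block_forward(x1, x2, F, G)
r1, r2 = rev_block_inverse(y1, y2, F, G)
```

Note that invertibility holds for any F and G, since each subtraction in the inverse cancels the corresponding addition in the forward pass.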
2.2 Invertible Pooling with 2D Wavelet Transform
An invertible pooling halves the spatial size of a feature map while allowing the input to be reconstructed from its output. We use the 2D wavelet transform at level 1, as shown in Fig. 3. Each channel of a tensor is a 2D image. A 2D image is transformed into 4 sub-images whose height and width are half those of the original image, and the four sub-images are stacked into 4 channels. The inversion is calculated by the inverse 2D wavelet transform.
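A minimal sketch of a level-1 Haar transform (one common choice of 2D wavelet; whether the paper uses exactly this filter is an assumption) illustrates the exact invertibility of this pooling:

```python
import numpy as np

def haar_pool(img):
    """Level-1 2D Haar transform: one HxW channel -> four (H/2)x(W/2) sub-images."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # approximation (low-pass)
    lh = (a - b + c - d) / 2.0   # horizontal detail
    hl = (a + b - c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return np.stack([ll, lh, hl, hh])  # sub-images stacked into 4 channels

def haar_unpool(sub):
    """Inverse transform: recovers the original image exactly."""
    ll, lh, hl, hh = sub
    img = np.zeros((2 * ll.shape[0], 2 * ll.shape[1]))
    img[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    img[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    img[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    img[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return img
```

Since the transform is orthogonal, no information is lost: the spatial resolution is traded for channels.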
2.3 Inverse of Batch Normalization
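At inference time, batch normalization applies a per-channel affine map using stored running statistics, so it can be inverted in closed form whenever the scale gamma is non-zero. A minimal sketch (scalar per-channel statistics, for brevity):

```python
import numpy as np

def bn_forward(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode batch norm: an affine map per channel."""
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def bn_inverse(y, gamma, beta, mean, var, eps=1e-5):
    """Invert the affine map (requires gamma != 0)."""
    return (y - beta) * np.sqrt(var + eps) / gamma + mean
```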
2.4 Linear Layer
The feature space usually has a high dimension compared to the number of classes for the final prediction. The mapping from the high-dimensional feature space to the low-dimensional output is typically performed with an average pooling layer and a fully-connected (FC) layer in a convnet. These two steps combined are still a linear transform and can be denoted as:

z = A M x = W x    (3)

where x is a feature vector of length HWC reshaped to 1D, H and W are the spatial sizes, and C is the channel number; M is a block-wise constant matrix of size C x HWC, representing the average pooling operation; A is the weight of a FC layer with size N x C, where N is the number of classes; and W = A M combines the two steps into one transform matrix.
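The claim that average pooling followed by a FC layer collapses into one matrix can be verified numerically. The sizes below are hypothetical, and x is assumed to be flattened channel-major:

```python
import numpy as np

H, W, C, n_class = 4, 4, 3, 10  # hypothetical sizes

# M averages each channel's H*W entries: a block-wise constant matrix of
# shape (C, C*H*W), acting on the channel-major flattened feature x.
M = np.zeros((C, C * H * W))
for c in range(C):
    M[c, c * H * W:(c + 1) * H * W] = 1.0 / (H * W)

A = np.random.randn(n_class, C)   # FC layer weight
Wmat = A @ M                      # combined transform, shape (n_class, C*H*W)

feat = np.random.randn(C, H, W)
x = feat.reshape(-1)              # channel-major flatten
# The two-step pipeline and the single matrix give the same logits.
two_step = A @ feat.mean(axis=(1, 2))
one_step = Wmat @ x
```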
2.5 Structure of Invertible Network
The structure of a classification network is shown in Fig. 4. The network is invertible because all of its modules are invertible. The input image is fed into a batch normalization layer followed by an invertible pooling layer. The invertible pooling layer increases the channel number by a factor of 4 and is essential to make the tensor have an even number of channels, in order to keep the same shape for x1 and x2 as in formula (1).
The network is divided into stages, where an invertible pooling layer connects two adjacent stages. Within each stage, multiple invertible blocks are stacked. The output from the final stage is fed into a linear layer defined in Sec. 2.4. The probability that the current input belongs to a certain class is calculated as a softmax of the logits.
2.6 Reconstruction Accuracy of Inversion
We build an invertible network of 110 layers. We train the network on the CIFAR10 dataset and reconstruct the input image from the output of the final invertible block. Results are shown in Fig. 6. The distance between the reconstruction and the input is negligible, validating the accuracy of the inversion.
3 Interpret Model Decision
3.1 Notations of Network
The invertible network classifier can be viewed as a two-stage model:
(1) The data x is transformed from the input space to the feature space by an invertible function T:

z = T(x)    (4)

(2) The features pass through a linear classifier g, whose parameters W and b are defined in Sec. 2.4. The operation of g is defined as:

g_i(z) = <w_i, z> + b_i,  i = 1, ..., N    (5)

where w_i is the weight vector for class i, which is also the i-th row of W in Sec. 2.4; b_i is the bias for class i; <., .> is the inner-product operation; and N is the total number of classes.
3.2 Determine the Decision Boundary
The decision boundary between class a and class b is the set of features at which the two classes have equal scores:

<w_a - w_b, z> + (b_a - b_b) = 0    (6)

The solutions to formula (6) lie on a high-dimensional plane in the feature space and can be determined explicitly. Since T is invertible, we can map the decision boundary from the feature space to the input domain.
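For two classes a and b, the orthogonal projection of a feature z onto the hyperplane where their scores are equal has a closed form. A sketch with a hypothetical linear head:

```python
import numpy as np

def project_to_boundary(z, Wmat, bias, a, b):
    """Project feature z onto the hyperplane where classes a and b score
       equally: <w_a - w_b, z> + (b_a - b_b) = 0."""
    w = Wmat[a] - Wmat[b]
    c = bias[a] - bias[b]
    return z - ((w @ z + c) / (w @ w)) * w

# hypothetical 3-class linear head on 5-dimensional features
Wmat, bias = np.random.randn(3, 5), np.random.randn(3)
z = np.random.randn(5)
z_star = project_to_boundary(z, Wmat, bias, 0, 1)
```

By construction z_star is the nearest point to z on the boundary, so the two logits coincide there.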
3.3 Model Decision Interpretation
3.3.1 Interpret linear models
We first consider a linear classifier for a binary problem, as in Fig. 6. For a data point x and its projection x0 onto the decision boundary, x0 is the nearest point to x on the boundary; the vector (x - x0) can be regarded as the explanation for the decision:

e(x) = x - x0    (7)
3.3.2 Interpret non-linear models
The last layer of a neural network classifier is a linear classifier, so we can calculate the projection z0 of a feature z onto the boundary as in the linear case. With an invertible network, we can find the corresponding inputs, denoted x = T^{-1}(z) and x0 = T^{-1}(z0) respectively, where T is the transform function as in equation (4). The vector (x - x0) is the explanation in the input domain:

e(x) = T^{-1}(z) - T^{-1}(z0)    (8)

where z is the point in the feature space corresponding to x in the input space; z0 is the projection of z onto the boundary in the feature space; and T^{-1} is the inversion of T, as shown in Fig. 7.
3.3.3 Feature importance
For a linear model, ignoring the bias, the function for the log-probability is:

f(x) = <w, x> = sum_d w_d x_d,  d = 1, ..., D    (9)

where D is the dimension of x. Let x0 be the projection of x onto the boundary, which is also the nearest point on the boundary, and let w_d be the weight for dimension d. The explanation in dimension d is (x_d - x0_d); its contribution to f is w_d (x_d - x0_d). Therefore, we define w_d (x_d - x0_d) as the importance of feature d for data x.
We use a Taylor expansion around x0 to approximate the neural network with a linear model locally:

f(x) ≈ f(x0) + <∇f(x0), x - x0>    (10)

For this local linear classifier, the importance of each feature is:

I(x) = ∇f(x0) ⊙ (x - x0)    (11)

where ⊙ is the element-wise product, and I(x) is a vector with the same number of elements as x.
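A sketch of the importance computation, using a numerical gradient at the projection x0 and a toy score function f (both hypothetical stand-ins for the network's logit difference):

```python
import numpy as np

def feature_importance(f, x, x0, eps=1e-5):
    """Importance of each input dimension: grad f(x0) * (x - x0),
       where x0 is the projection of x onto the decision boundary."""
    grad = np.zeros_like(x0)
    for d in range(x0.size):            # central-difference gradient at x0
        e = np.zeros_like(x0); e[d] = eps
        grad[d] = (f(x0 + e) - f(x0 - e)) / (2 * eps)
    return grad * (x - x0)              # element-wise product

# toy smooth score function (hypothetical logit difference)
f = lambda v: np.tanh(v[0]) + 0.5 * v[1] ** 2 - v[2]
x = np.array([1.0, 2.0, 0.5])
x0 = np.array([0.5, 1.0, 1.0])
imp = feature_importance(f, x, x0)
```

In practice the gradient would come from backpropagation rather than finite differences; the finite-difference loop is only to keep the sketch self-contained.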
4 Experiments
4.1 Decision Boundary Visualization
For a D-dimensional input space, the decision boundary is a (D-1)-dimensional subspace. For ease of visualization, we perform experiments on a 2D simulated dataset, whose decision boundary is a 1D curve.
The data points for two classes are distributed around two interleaving half circles. As shown in Fig. 8, two classes are colored with red and green. The decision boundary is colored in blue. We visualize the decision boundary in both the input domain and the feature domain.
Visualization of the decision boundary can be used to understand the behavior of a neural network. We give an example to visualize the influence of training set size on the decision boundary in Fig. 8. From left to right, the figure shows the decision boundary when training with 1% and 100% of data, respectively. As the number of training examples increases, the margin of separation in the feature domain increases, and the decision boundary in the input domain gradually captures the moon-shaped distribution. Furthermore, the decision boundary can be used to determine how the network generalizes to unseen data.
4.2 Feature Importance
We validate the proposed feature-importance method on a simulated dataset. We create a 2-class, 10-dimensional dataset, of which only 3 dimensions are informative. We train an invertible network and compute the importance of each dimension. Results are shown in Fig. 9. An oracle model should give equal importance to the 3 informative variables (indexed 1, 3 and 9) and zero importance to the others. Our invertible network successfully picks out the informative features and generates feature importances comparable to a random forest; both models select the correct features.
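A rough stand-in for this experiment can be built with scikit-learn; note that `make_classification` with `shuffle=False` places the informative dimensions first (columns 0-2), unlike the indices reported above, and the random forest serves as the importance baseline:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in for the simulated data: 10 features, only 3
# informative, 2 classes.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=3,
                           n_redundant=0, n_repeated=0, n_classes=2,
                           shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = rf.feature_importances_
# the informative dimensions (0, 1, 2 here) should dominate the scores
print(np.argsort(imp)[::-1][:3])
```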
4.3 Explain a Convolutional Invertible Network
We train a convolutional invertible classifier, achieving over 99% accuracy on the MNIST test set. For an input image, we select the classes with the top-2 predicted probabilities, determine the decision boundary between these two classes as in Sec. 3.2, calculate the projection onto the boundary, and interpolate between the input image and its projection in the feature domain.
Results are shown in Fig. 10. Note that for each row, only one input image (leftmost) is provided; the projection (rightmost) is calculated from the model rather than found by searching for the nearest example in the dataset, so the projection demonstrates the behavior of the network. As discussed in Sec. 3.3, the difference between a data point and its projection onto the boundary can be viewed as the explanation. For example, for an image of 8, its left half vanishes in the interpolation, which explains why it is not classified as 3; for an image of 7, a bottom line appears in the interpolation, which explains why it is not classified as 2.
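The interpolation described above is linear in the feature domain, with each intermediate point mapped back through the network inversion. A sketch with a placeholder `T_inverse` (the real one is the invertible network's inverse pass):

```python
import numpy as np

def interpolate_to_boundary(z, z_star, T_inverse, n_steps=8):
    """Invert evenly spaced feature-space points between a feature z and
       its boundary projection z_star back to the input domain."""
    ts = np.linspace(0.0, 1.0, n_steps)
    return [T_inverse((1 - t) * z + t * z_star) for t in ts]

# With the identity as a stand-in inversion, the endpoints of the path
# are the original feature and its projection.
path = interpolate_to_boundary(np.zeros(4), np.ones(4), lambda v: v)
```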
5 Conclusion
We propose a method to explicitly determine the decision boundary of an invertible neural network classifier, and define the explanation for a model decision as well as feature importance based on the boundary. We validate our method in experiments and demonstrate that the transparency of invertible networks has great potential for explainable models.
-  (2011) Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding, pp. 29–39. Cited by: §1.
-  (2015) On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one 10 (7), pp. e0130140. Cited by: §1.
-  (2016) Can we open the black box of AI?. Nature News 538 (7623), pp. 20. Cited by: §1.
-  (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167. Cited by: §1.
-  (2016) Inverting visual representations with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4829–4837. Cited by: §1, §1.
-  (2017) The reversible residual network: backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214–2224. Cited by: §1, §2.1.
-  (2018) i-RevNet: deep invertible networks. arXiv preprint arXiv:1802.07088. Cited by: §1.
-  (2017) Learning how to explain neural networks: patternnet and patternattribution. arXiv preprint arXiv:1705.05598. Cited by: §1.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §2.6.
-  (2017) Explaining the unexplained: a class-enhanced attentive response (clear) approach to understanding deep neural networks. In IEEE Computer Vision and Pattern Recognition (CVPR) Workshop, Cited by: §1.
-  (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §1.
-  (2019) Efficient interpretation of deep learning models using graph structure and cooperative game theory: application to ASD biomarker discovery. In International Conference on Information Processing in Medical Imaging, pp. 718–730. Cited by: §1.
-  (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §1.
-  (2015) Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196. Cited by: §1.
-  (2016) Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision 120 (3), pp. 233–255. Cited by: §1.
-  (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
-  (2017) Axiomatic attribution for deep networks. arXiv preprint arXiv:1703.01365. Cited by: §1.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
-  (2015) Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §1.
-  (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §1.
-  (2018) Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8827–8836. Cited by: §1.
-  (2019) Invertible network for classification and biomarker selection for asd. arXiv preprint arXiv:1907.09729. Cited by: §1.
-  (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595. Cited by: §1.