Understanding Interpretability by generalized distillation in Supervised Classification

by   Adit Agarwal, et al.

The ability to interpret decisions taken by Machine Learning (ML) models is fundamental to encourage trust and reliability in different practical applications. Recent interpretation strategies focus on human understanding of the underlying decision mechanisms of the complex ML models. However, these strategies are restricted by the subjective biases of humans. To dissociate from such human biases, we propose an interpretation-by-distillation formulation that is defined relative to other ML models. We generalize the distillation technique for quantifying interpretability, using an information-theoretic perspective, removing the role of ground-truth from the definition of interpretability. Our work defines the entropy of supervised classification models, providing bounds on the entropy of Piece-Wise Linear Neural Networks (PWLNs), along with the first theoretical bounds on the interpretability of PWLNs. We evaluate our proposed framework on the MNIST, Fashion-MNIST and Stanford40 datasets and demonstrate the applicability of the proposed theoretical framework in different supervised classification scenarios.






1 Introduction

Trust and reliability form the core principles in various safety-critical applications such as medicine and autonomous driving. With the growing demand for more complex data-driven computational models for such applications, model interpretation has become the focus of major research in the past few years 46160.

The deployment of deep learning models in critical domains, such as law and medicine, requires unbiased computational estimates of interpretability. This would allow policy-makers in these domains to evaluate and select computationally more interpretable models for different tasks, free from subjective human biases.

The motivations and difficulties associated with interpretability are explained by Lipton:2018:MMI:3236386.3241340. The lack of a single notion of interpretability leaves the applications of deep learning models vulnerable, thus amplifying the need for a robust quantification of interpretability. This lack of a proper measurable bias-free definition for interpretability poses a serious problem when deep learning models fail silently, leaving end-users with no clues on the possible correction mechanisms, which can be fatal in safety-critical scenarios.

Inspired by the work of DBLP:journals/corr/abs-1806-10080, we define interpretability from an information-theoretic perspective and decouple human understanding from the current notion of interpretability. We define the process of interpretation as a communication mechanism between two models and define interpretability relative to their model entropies. Since the abstraction levels defined in DBLP:journals/corr/abs-1806-10080 do not properly define the entropies of Machine Learning (ML) models, our work proposes a novel way of defining these entropies. Further, we derive tight lower bounds on the entropy of supervised classification models using graph theory.

Our interpretation-by-distillation framework for quantifying interpretability (refer Fig. 1) generalizes the common distillation technique 44873, providing researchers, corporations and policy-makers with better, unbiased interpretability estimates. Further, we provide the first theoretical guarantees on the interpretability of black-box Piece-Wise Linear Networks (PWLNs), when interpreted by another PWLN. The major contributions of our proposed theoretical framework are:

  • We remove the accuracy-interpretability trade-off, present in most previous works. We propose that the interpretability of any ML model depends only on the decision structure of the model, independent of the ground-truth.

  • Local surrogates such as SHAP NIPS2017_7062 and LIME DBLP:journals/corr/RibeiroSG16 provide a localized view of the robustness of classifiers only around individual data points. Our work, on the other hand, proposes a global metric for defining the interpretability of a model in relation to another model.

  • Our empirical interpretation-by-distillation mechanism represents a computational approach to interpretability. We generalize standard distillation, in which a lower-complexity model is used to interpret a higher-complexity model in line with generic human understanding, by considering the entire spectrum of computational complexity of learning models.

2 Related Work

The necessity for humans to have confidence in the predictions of deep learning models EU has led to the development of various explanation mechanisms DBLP:journals/corr/abs-1802-01933. These mainly explore two directions: model-based and post-hoc methods.

Model-based interpretation mechanisms focus on building interpretable ML models from the bottom up using basic decision mechanisms, retaining the complexity of deep neural networks while making their decisions easier to interpret. InterpretableAlzheimers, murdoch2019interpretable, Caruana:2015:IMH:2783258.2788613 and Abdul:2018:TTE:3173574.3174156 have explored model-based methods extensively, but these methods have not been able to perform at par with existing complex deep learning models.

Post-hoc interpretation mechanisms, on the other hand, have mainly focused on the visualization of deep neural networks such as CNNs DBLP:journals/corr/abs-1802-00614. Previous works by Kim2018InterpretabilityBF, DBLP:journals/corr/SelvarajuDVCPB16 and DBLP:journals/corr/ShrikumarGK17 use visual cues such as Concept Activation Vectors (CAV), Grad-CAM and layer-wise relevance scores Bach2015OnPE respectively to enable human understanding of complex ML models. However, these face a major challenge due to the fragile nature of the proposed interpretations, which decreases human understanding of, and trust in, practical systems using these cues Ghorbani2018InterpretationON.

Figure 1: Interpretation as a communication mechanism between known model A and black-box model B, where A performs a series of (possibly infinite) queries to B, until it emulates B’s decision boundary and no more information gain is possible.

Our interpretation-by-distillation mechanism considers interpretation by any supervised learning model, moving beyond its usual association with human understanding (see Figure 1). DBLP:journals/corr/DhurandharILS17aa present a work most similar to ours, but it suffers from the accuracy-interpretability trade-off, which does not arise in our work: we de-couple the accuracy/performance of the ML model from its interpretability. In practical settings, interpretability can be computed by our empirical interpretation mechanism. As compared to knowledge distillation 44873, where a small network (student) is taught by a larger pre-trained network (teacher), our interpretation-by-distillation mechanism covers the entire spectrum of their relative computational complexities (refer Section 4.2). As a special case of interest, we consider PWLNs and derive theoretical bounds on their interpretability, in terms of their complexity as defined in previous works.

2.1 Complexity of PWLNs

A lot of previous research has focused on deriving the complexity of PWLNs. zaslavsky1975 first showed that the number of cells formed by an arrangement of $n$ hyperplanes in general position in a $d$-dimensional space is $\sum_{i=0}^{d}\binom{n}{i}$. serra2018bounding propose tight upper and lower bounds on the maximum number of linear regions for ReLU and maxout networks. NIPS2014_5422 and NotesLinearRegions present tight lower and upper bounds respectively on the maximum number of linear regions for ReLU networks, and complexityAverage provide average-case complexities for ReLU networks in terms of the number of linear regions.

While we directly use most of these results for deriving the upper and average bounds on the entropy of PWLNs, we present a tighter lower bound on the entropy by incorporating the graph coloring concept.
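The hyperplane-arrangement count above can be evaluated directly; a minimal sketch (the function name is ours):

```python
from math import comb

def zaslavsky_cells(n_hyperplanes: int, d: int) -> int:
    """Maximum number of cells formed by an arrangement of n hyperplanes
    in general position in d-dimensional space: sum_{i=0}^{d} C(n, i)
    (Zaslavsky, 1975)."""
    return sum(comb(n_hyperplanes, i) for i in range(d + 1))

# Three lines in the plane (d = 2) create at most 1 + 3 + 3 = 7 regions.
print(zaslavsky_cells(3, 2))  # -> 7
```

For a single-layer ReLU network, each hidden neuron contributes one hyperplane, so this count gives the number of cells identified in the input space.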

3 Background

We consider two models A and B that take an $n$-dimensional input, say $x$, and predict outputs $y_A$ and $y_B$ in their respective output spaces. The input to both models need not be the same (as is the case in Section 7.2). These models are trained on a dataset containing $N$ data points and $k$ output classes.

B represents the "black-box" model being interpreted, while A represents the "known" model used for interpreting B, as shown in Figure 1. Model A need not be more interpretable than B from a human perspective. The interpretation process is performed using a dataset of $N$ data points, where each label is the prediction of B on the corresponding sample after training.

A cell formed in the input space by a model is defined as a bounded region of the input space in which all points map to the same output. For PWLNs, a cell indicates a bounded region in which all points map to the same output and share the same linear piece of the PWLN. Based on this, we derive the definition of model complexity from serra2018bounding, as in Definition 1.

Definition 1

Model Complexity: The complexity of a supervised classification model is defined as the number of unique cells identified by the model in the input space.

Throughout the paper, we denote the complexity of any model $M$ as $C_M$, and the upper, lower and average bounds on $C_M$ as $C^{up}_M$, $C^{low}_M$ and $C^{avg}_M$ respectively.

The theoretical bounds on the complexity of PWLNs from previous research, used in our theoretical derivations, are given in the Supplementary Material.

Based on Definition 1, we define model entropy as:

Definition 2

Model Entropy: The entropy of a supervised classification model is defined as the number of unique ways of assigning output classes to different cells identified in the input space by the model.

This definition of model entropy does not take into account the shape of the cells. Throughout the paper, we denote the entropy of any model $M$ as $H_M$, and the upper, lower and average bounds on $H_M$ as $H^{up}_M$, $H^{low}_M$ and $H^{avg}_M$ respectively.

3.1 Relation between Complexity and Entropy

A rudimentary approximation of the relationship between the complexity and entropy of any supervised classification model $M$, for a $k$-class classification problem, based on Definitions 1 and 2, is given by $H_M \leq k^{C_M}$, representing the maximal possible number of ways of assigning the output classes to the $C_M$ cells.
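As a quick sanity check of this relation, the naive bound can be evaluated directly (a sketch; the function name is ours):

```python
def naive_entropy_bound(k: int, complexity: int) -> int:
    """Upper bound on model entropy from Section 3.1: the maximal
    number of ways of assigning k output classes to C(M) cells,
    ignoring which cells are adjacent."""
    return k ** complexity

# A model identifying 7 cells in a 3-class problem admits
# at most 3 ** 7 = 2187 class assignments.
print(naive_entropy_bound(3, 7))  # -> 2187
```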

4 Problem Formulation

As shown in Figure 1, we have two models A and B. A belongs to the class of known models, i.e. those models whose decision mechanism is available to us through access to the model internals, including structure and parameters. B belongs to the class of black-box models, i.e. those models whose internals are completely unknown. The semantic understanding of "known" is unimportant in our context. The interpretation mechanism is then defined as the process of communication between these models, where A tries to emulate B's decision boundary. This process can continue indefinitely, until A has learnt all the information possible about B's decision mechanism.

The relative simplicity of model A w.r.t. B is not important in our work: our aim is only to demonstrate the explainability of B in terms of A, not to find the most compact model. Based on the definitions given in DBLP:journals/corr/abs-1806-10080, we formally define interpretability and its associated concepts as follows:

Definition 3

Complete Interpretation: Complete Interpretation is defined as the process of model updation by communication between the known model A and the target black-box model B through an infinite set of query data points until no further information gain about the decision boundary of model B is possible.

To define the notion of interpretability, we first define two terms, $H_{initial}$ and $H_{final}$, which represent the entropies of the communication process between models A and B, i.e. the uncertainty about B's decision boundary before and after the process of complete interpretation respectively. Both can also be expressed in terms of the model entropies $H_A$ and $H_B$.
However, it is not possible to consider the entire input space due to its massive cardinality. So, for practical purposes, the process of complete interpretation (Definition 3) is relaxed into that of empirical interpretation (Definition 4).

Definition 4

Empirical Interpretation: Empirical Interpretation is defined as the process of model updation by communication between the known model A and the target black-box model B using a finite subset of query data points.

For Empirical Interpretation, the model entropy of B cannot be determined; empirically, the interpretability measure depends only on model A. Hence, $H_{initial}$ and $H_{final}$ are approximated as $H^A_{initial}$ and $H^A_{final}$, which represent the entropy of model A before and after the process of Empirical Interpretation respectively.
Based on these concepts, we define interpretability as:

Definition 5

Interpretability: Interpretability ($\mathcal{I}$) is the ratio between the information gain about target model B's decision boundary through interpretation and the initial uncertainty about B's decision boundary. More formally,

$$\mathcal{I} = \frac{H_{initial} - H_{final}}{H_{initial}} \qquad (1)$$

This interpretability formulation measures the maximal information gain about the decision boundary of black-box model B through minimal querying by known model A. Thus, an interpretability value of 0.9 means that 90% of the decision boundary of B can be matched by A on a particular dataset. The process of interpretation in Definition 5 is modelled as the optimization of maximizing $\mathcal{I}$, solved by approximating the decision boundary of B from that of A, where B is unknown and A is known.
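The ratio in Definition 5 can be sketched directly (the zero-uncertainty convention is our assumption):

```python
def interpretability(h_initial: float, h_final: float) -> float:
    """Definition 5: information gain about B's decision boundary,
    normalised by the initial uncertainty about that boundary."""
    if h_initial == 0:
        return 1.0  # assumed convention: nothing left to learn about B
    return (h_initial - h_final) / h_initial

# If interpretation reduces the uncertainty from 10 to 1,
# 90% of B's decision boundary has been matched by A.
print(interpretability(10.0, 1.0))  # -> 0.9
```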

4.1 Bounds on Entropy and Interpretability

Based on the formulation in Section 3.1, we obtain the upper and average bounds on the entropy for PWLNs, in terms of the corresponding bounds on their complexities. However, we derive tighter lower bound estimates for these entropies.

To obtain the lower bounds, we model the entropy of PWLNs in terms of the graph coloring problem. The linear cells identified by the PWLN in the input space are modelled as the vertices of a graph $G$, and the $k$ output classes represent the different colors available for coloring the vertices of $G$. The entropy of the PWLN can now be modelled as the number of ways of coloring the vertices of $G$ such that no two adjacent vertices are assigned the same color.

We model the calculation of this entropy as an iterative process through all the layers of the PWLN. At each iteration, the adjacent linear cells of the input space are approximated as a path graph in each dimension, and the chromatic polynomial of the path graph is used to count the number of valid colorings. Based on this construction, we obtain a lower bound on the entropy of an $L$-layer PWLN with an $n_0$-dimensional input, $k$ output classes and $n_l$ ReLU neurons in the $l$-th layer. The exact expression and its detailed derivation are given in the Supplementary Material.
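The per-dimension counting step can be sketched with the chromatic polynomial of a path graph, $P(P_m, k) = k(k-1)^{m-1}$ (a sketch under the path-graph approximation described above; the function name is ours):

```python
def path_colorings(cells: int, k: int) -> int:
    """Chromatic polynomial of a path graph with `cells` vertices,
    evaluated at k colors: k * (k - 1) ** (cells - 1). Counts the
    proper colorings of a chain of adjacent linear cells along one
    input dimension."""
    return k * (k - 1) ** (cells - 1)

# A chain of 3 adjacent cells with 2 classes admits only the
# 2 alternating class assignments.
print(path_colorings(3, 2))  # -> 2
```

This is strictly smaller than the naive $k^{C}$ count whenever adjacent cells are forced to differ, which is what makes the resulting entropy bound tighter.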

Table 1: Relative bias-variance trade-off between model A and model B, along with the interpretability bounds for PWLNs under complete interpretation. Example cases of PWLNs include: A a single-layer ReLU network interpreting B a deep ReLU network; A a deep ReLU network interpreting B a single-layer ReLU network; and both A and B single-layer ReLU networks. Two of the remaining bias-variance combinations are not possible.

4.2 Interpretation based on Bias-Variance trade-off

We consider different cases for models A and B based on the relative bias-variance trade-off. Table 1 provides a description of the 9 possible cases for complete interpretation, with examples of PWLNs and their bounds.

This is in contrast to the typical interpretation-by-distillation situation considered in most previous works on interpretability and distillation, where A is simpler than B, i.e. A has higher bias and lower variance than B. The distillation technique 44873 presents a similar idea of transferring knowledge from a complex to a simple network, in contrast to our work, which considers the entire spectrum of complexity. In this typical distillation case (represented in Table 1), A can be considered as the minimum description length encoding of B.

5 Theoretical Bounds on Interpretability for PWLNs

We determine the theoretical bounds on interpretability with both models A and B as PWLNs, for complete interpretation. We denote the upper, lower and average bounds on the interpretability of model B by model A as $\mathcal{I}^{up}$, $\mathcal{I}^{low}$ and $\mathcal{I}^{avg}$ respectively. Table 1 presents the obtained bounds on interpretability for PWLNs. The derivations are given in the Supplementary Material.

6 Calculation of Empirical Interpretability

Empirical interpretation is determined based on a global surrogate model, as a generalization of distillation 44873. It is computed by solving the optimization problem of interpreting model B on the set $D = \{(x_i, \hat{y}_i)\}_{i=1}^{N}$, where $\hat{y}_i$ represents the prediction of model B on the $i$-th example. Before interpretation, let model A output probability vectors on $D$ as $P^0 = \{p^0_1, \ldots, p^0_N\}$. After interpretation, let model A output probability vectors on $D$ as $P^1 = \{p^1_1, \ldots, p^1_N\}$. Here, $P^0$ and $P^1$ represent the initial state (before interpretation) and final state (after interpretation) of model A respectively. The initial state of model A comes from the initial parameter set assigned to model A. Figure 2 represents the basic idea of the 5-step formulation used for calculating empirical interpretability, using the following color coding:

Figure 2: Empirical Interpretation (see text for details on color coding of steps 1-5)
  1. Obtain the predictions of black-box model B on the input set.

  2. Transform the predictions of model B (a transformation is applied only for the setting of Section 7.2; the predictions are used directly otherwise).

  3. Obtain the probability outputs and predicted classes of model A on the input before interpretation.

  4. Train model A on the (transformed) predictions of model B.

  5. Obtain the probability outputs and predicted classes of model A on the input after interpretation.

Let the set of predicted classes by model A on the input before interpretation be denoted as $Y^0 = \{y^0_1, \ldots, y^0_N\}$, where $y^0_i$ is the class receiving the maximum probability from A on the $i$-th sample. Similarly, the set of predicted classes after interpretation is denoted as $Y^1 = \{y^1_1, \ldots, y^1_N\}$.

Consider the $i$-th sample. Based on the above formulation, let $a_i$ denote the probability assigned by model A to the predicted class of model B, before the process of interpretation; similarly, let $b_i$ denote the corresponding probability after the process of interpretation. On the other hand, let $a^{max}_i$ and $b^{max}_i$ denote the maximum probability values assigned to any class by A before and after the process of interpretation respectively.

Now, we define the entropies $H^A_{initial}$ and $H^A_{final}$. Rather than using the general Shannon notion of entropy over the full probability vectors, we calculate the entropies using differences between probabilities, as a lower difference implies more information gain:

$$H^A_{initial} = \sum_{i=1}^{N} \left( a^{max}_i - a_i \right), \qquad H^A_{final} = \sum_{i=1}^{N} \left( b^{max}_i - b_i \right) \qquad (3)$$

where $a_i$ and $b_i$ denote the probability assigned by model A to the predicted class of model B on the $i$-th sample before and after interpretation, and $a^{max}_i$ and $b^{max}_i$ denote the maximum class probability output by A before and after interpretation respectively.
The empirical interpretability can now be formulated based on the definition of interpretability in Equation (1) and entropies defined as in Equation (3).
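Under one plausible reading of the difference-based entropies above (the array layout and function name are our assumptions), the computation looks like:

```python
import numpy as np

def empirical_entropies(P0, P1, b_pred):
    """Difference-based entropies: for each sample, the gap between
    model A's maximum class probability and the probability A assigns
    to black-box model B's predicted class. P0/P1 are (N, k) arrays of
    A's class probabilities before/after interpretation; b_pred holds
    B's predicted class indices."""
    idx = np.arange(len(b_pred))
    h_initial = float(np.sum(P0.max(axis=1) - P0[idx, b_pred]))
    h_final = float(np.sum(P1.max(axis=1) - P1[idx, b_pred]))
    return h_initial, h_final

# Two samples, three classes; B predicts classes [0, 2].
P0 = np.array([[0.2, 0.7, 0.1],   # A's probabilities before interpretation
               [0.5, 0.3, 0.2]])
P1 = np.array([[0.8, 0.1, 0.1],   # A's probabilities after interpretation
               [0.1, 0.2, 0.7]])
h0, h1 = empirical_entropies(P0, P1, np.array([0, 2]))
# h0 = (0.7 - 0.2) + (0.5 - 0.2) = 0.8; h1 = 0.0,
# so the empirical interpretability (h0 - h1) / h0 is 1.0.
```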

7 Experimental Setup and Results

Through these experiments, we demonstrate the applicability of our proposed interpretation-by-distillation framework in different supervised classification scenarios. Our experiments demonstrate the stability of our interpretation-by-distillation framework as well as its conformity to human understanding, while interpreting state-of-the-art deep learning models.

7.1 Overview of Experimental Setup

Figure 3: Left: Empirical Interpretability when an InceptionV3 network trained on the Stanford40 dataset is interpreted by another InceptionV3 network trained on different cropped versions of the same set of images. Right: An example from the Stanford40 dataset labelled as ”Holding An Umbrella”, showing the original image (top right), Original Cropped Image (bottom left), Cropped Top Left Image (top left) and Cropped Bottom Right Image (bottom right).

In Experiment 7.2, our quantification conforms with the human-understandable visual explanations of the predictions of the state-of-the-art InceptionV3 network 7780677. In Experiment 7.3, we break down the complex structure of a MiniVGGNet into a human-understandable ensemble of simpler models through our interpretation-by-distillation framework. In Experiment 7.4, we demonstrate the stability of our interpretation formulation for PWLNs (based on our theoretical formulation). Dataset: In Experiment 7.2, we use the Stanford40 dataset 6126386, which contains 9072 RGB images corresponding to 40 different human actions. We use the popular MNIST lecun-mnisthandwrittendigit-2010 and Fashion-MNIST xiao2017/online datasets for all other experiments. Both datasets contain 60,000 training and 10,000 test grayscale images (of size 28×28), with 10 output classes.
Data Pre-Processing: We normalize the data using a min-max normalization and perform a 5-fold cross validation split on the training set, considering 4 folds for interpretation and the remaining fold for cross-validation. Empirical interpretability is calculated on the test set.
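The pre-processing described above can be sketched as follows (the fold assignment and seeding are our assumptions; the paper does not specify its exact split):

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale the data to [0, 1], as in the paper's pre-processing."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def five_fold_indices(n: int, seed: int = 0):
    """Split n training indices into 5 folds: 4 folds are used for
    interpretation and 1 is held out for cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), 5)
    return folds[:4], folds[4]

x = min_max_normalize(np.array([0.0, 5.0, 10.0]))   # -> [0.0, 0.5, 1.0]
interp_folds, val_fold = five_fold_indices(60_000)  # 48,000 / 12,000 split
```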
Notations: In the experiments below, we report the batch size and learning rate for each configuration. Also, the term "Piece-Wise Linear Neural Network with ReLU activations" is abbreviated as PWLN-R.
We implement the experimental setup in Python using the TensorFlow and Keras libraries. The experiments were conducted on an NVIDIA Tesla V100 GPU node with 192GB RAM running CentOS Linux. (The code will be made publicly available upon acceptance.)

7.2 Explaining the predictions

Visual explanation of the predictions of black-box models is key to human understanding of these models for various tasks. While our interpretation-by-distillation framework is completely theoretical; through this experiment, we provide a connection between our interpretability measure and human understanding of the decision structure of black-box models. We determine the parts of the input image which affect the classifier prediction in the maximum capacity, thus explaining where the classifier ”looks” in the image for the task-at-hand.
Model Configurations: Both models A and B are InceptionV3 networks, pre-trained on ImageNet. We add a global-average pooling layer and two dense layers at the end for fine-tuning.

B is trained on the original Stanford40 images while A is trained on cropped portions of the images in the Stanford40 dataset. We use the annotations and bounding boxes for the associated objects provided with the Stanford-40 dataset. This experiment falls under the relative complexity case of both models having similar complexity in Table 1.
Hyperparameters and Pre-Processing: The same batch size and learning rate are used for both models. The models are first trained for 5 epochs using the RMSProp optimizer, and then with the SGD optimizer for 150 epochs. We also augment the dataset using the Keras libraries. The first 249 layers of the models are kept fixed during training. All cropped images are resized to a 200×200 shape across all 3 channels.
Explanation: Figure 3 demonstrates the behaviour of interpretability when the Stanford40 images are cropped in three different ways. A higher interpretability is obtained when the cropped images contain the objects in focus (e.g. the bottom-left image in Figure 3 Right, containing the objects "Person" and "Umbrella" for the action "Holding An Umbrella"). The interpretability is lower when cropping is done from the top-left and bottom-right corners (with the same size as that of the bounding box; see Figure 3 Right), as shown by the red and green bars respectively in Figure 3. Note that the accuracy remains the same, at around 70%, for all three cropping cases.

Figure 4: Evaluation of Empirical Interpretability on MNIST and Fashion-MNIST. Different ensembles (A) are used to interpret a MiniVGGNet (B).

Figure 5: Effect of the number of samples on Empirical Interpretability, when a 1-layer PWLN-R (a), a Decision Tree (b) and an SVM (c) are used to interpret a 4-layer PWLN-R.

7.3 Explaining the Black Box Model

Using ensembles of human understandable models to interpret black-box models encourages the breakdown of the complex decision mechanism of the black-box model into a human understandable form. Further, using an ensemble to interpret black-box model B removes the bias of the choice of model A prevalent in previous model-based interpretation mechanisms.
Model Configurations: We take B as a MiniVGGNet, and use different ensembles as A to interpret B. This experiment falls under the relative complexity case in Table 1 where A is simpler than B.
Hyperparameters: Both models use the Adam optimizer and are trained for 150 epochs each.
Explanation: Figure 4 demonstrates the interpretability of a CNN in terms of a simple ensemble; in particular, the ensemble of SVM, Logistic Regression and Decision Tree interprets B better than other, more complex ensembles on both datasets.

7.4 Effect of the number of Queries

As explained in Section 4, the process of complete interpretation is relaxed into the process of empirical interpretation for practical purposes. Due to the finite number of queries in empirical interpretation, the choice of dataset affects empirical interpretability. However, as the number of queries increases, the empirical interpretation mechanism approaches complete interpretation, which removes the bias introduced by the choice of dataset and suggests that interpretability values should converge. Hence, we study the effect of the number of queries on empirical interpretability (keeping the size of the test set used for interpretability calculations fixed) and demonstrate this convergence.
Model Configurations: We fix model B as a 4-layer PWLN-R with 512, 256, 128 and 64 neurons in its 4 layers respectively. We use three different architectures for model A: a 1-layer PWLN-R with 256 hidden neurons, a Decision Tree with Gini criterion and an SVM with RBF kernel.
Hyperparameters: We train both B and A for 30 epochs each, using the Adam optimizer and a Truncated Normal initializer. We determine the mean and deviation over three runs for each point.
Explanation: Figure 5 demonstrates that as the number of queries used by A to perform empirical interpretation on B increases (from 1% to 100% of the dataset), A is better able to match the decision boundary of the black-box model B, and a stable value of interpretability is achieved (demonstrated by the decreasing deviations of interpretability values). As the number of query points approaches 100% of the data, we observe that the interpretability converges.

All the models perform better (in terms of fidelity) on MNIST as compared to Fashion-MNIST, owing to the higher complexity of Fashion-MNIST. However, as shown in Fig. 5(b), despite obtaining higher accuracy on MNIST, the interpretability of the Decision Tree classifier is lower, showing that our interpretability formulation is decoupled from accuracy.

8 Discussion and Future Work

Our interpretation-by-distillation framework provides a novel interpretability definition from an information-theoretic perspective, with the aim of providing policymakers an unbiased interpretability estimate. We also provide the first theoretical bounds on the interpretability of ReLU networks and demonstrate stability and conformity of our proposed formulation to human understanding.

The current theoretical interpretability estimates are quite far from the obtained empirical estimates (due to the exponential nature of the complexity estimates from previous literature), but they present a good starting point for developing tighter estimates. As tighter complexity bounds are derived in the future, the theoretical measures will more closely track the empirical ones.

In our future work, we also plan to explore the effect of complexity of the dataset used for interpretation. As the dataset complexity increases, it becomes increasingly difficult to emulate the decision boundary of the black box model. We plan to take this into consideration for our interpretability formulation. Further, our model entropy definition currently only considers the arrangement of the decision boundary. Our future work would focus on incorporating the shape of the decision boundary as well into our entropy definition.

Our future work would also aim to derive tight upper and average bounds on the entropies of PWLNs, like the tight lower bounds derived in the current work. Further, our proposed theoretical interpretability bounds are limited to PWLNs, since previous literature has explored complexity bounds only for PWLNs. As and when the complexities of networks such as CNNs are defined, our work can be extended to these networks as well.

The current work presents a first direction towards a new definition for interpretability, dissociated from human understanding. This presents researchers with a previously unexplored notion of quantifying interpretability using information theory which we hope inspires other researchers to explore further.


9 Supplementary Material

9.1 Complexity of PWLNs

Consider two PWLNs with ReLU activations, say $M_1$ and $M_2$. $M_1$ is a single-layer PWLN with $n$ hidden neurons. $M_2$ is a deep PWLN with $L$ layers and $n_l$ neurons in its $l$-th hidden layer, with $N = \sum_{l} n_l$ hidden neurons in total. Then, the complexity bounds are summarized as:

  • For model $M_1$, the upper bound on the complexity follows from zaslavsky1975's count of the cells formed by a hyperplane arrangement.

  • For model $M_2$, the upper bound on the complexity is given by serra2018bounding.

  • For model $M_2$, the lower bound on the complexity is given by NIPS2014_5422. We do not consider the lower bound given by serra2018bounding, as it is more restrictive in its conditions than that of NIPS2014_5422.

  • The average-case complexities of $M_1$ and $M_2$ are given by complexityAverage, in terms of the number of breakpoints in the non-linearity of the activation function of the network (one breakpoint for ReLU).

9.2 Interpretability Formulation

As defined in the interpretation mechanism, we have two models A and B, where B is the black-box model being interpreted by the known model A. We define the entropy $H_{initial}$ as the number of mappings in the input space identified by model B but not identified by model A, before the process of complete interpretation. Similarly, we define $H_{final}$ as the number of mappings in the input space identified by model B but not identified by model A, after the process of complete interpretation. Hence, the interpretability is given by $\mathcal{I} = (H_{initial} - H_{final}) / H_{initial}$.

9.3 Model Entropy

The derivation here is based on a concept similar to the one used by Montufar et al. NIPS2014_5422 for deriving the lower bound on the maximal number of linear regions of deep rectifier networks. The derivation of NIPS2014_5422 clearly demonstrates that every linear region identified in the input space by the PWLN maps to the same region in the output space; hence our derivation holds for PWLNs, as for PWLNs cells are the same as linear regions. We demonstrate the derivation for rectifier networks, but the result can be extended to other PWLNs using a similar idea.

Consider an $L$-layered deep neural network composed of ReLU activations and containing $n_l$ neurons in its $l$-th layer. Let $n_0$ represent the number of input variables, where $n_l \ge n_0$. Now, partition the set of neurons in the $l$-th layer into $n_0$ subsets, each with cardinality $p = n_l/n_0$. For simplicity, we assume $n_0$ divides $n_l$ and there are no remaining neurons; however, the construction can easily be modified for the case of remaining neurons as well.

As demonstrated in NIPS2014_5422, an alternating sum of $p$ rectifier units divides the input space into $p$ equal-length segments. If we consider the rectifier units in the $j$-th subset, we can choose the input weights and biases of the units in this subset such that they are sensitive only to the $j$-th coordinate $x_j$ of the input; the output activations of these units are given in NIPS2014_5422.

Figure S1: Linear Regions in each input dimension represented as a path graph

The alternating sum of the rectifier units of the $j$-th subset produces a new function which effectively acts only on the scalar input $x_j$.

This construction identifies $p$ cells in each coordinate $x_j$ of the input. Now, considering all the $n_0$ subsets, each of which operates on a distinct dimension of the input, we get a total of $p^{n_0}$ hypercubes which all map into the same output space.

As illustrated in the construction in NIPS2014_5422, the identified input cells are continuous along each dimension. Thus, the cells can be identified as forming a graph, where each cell represents a vertex and any two continuous/adjacent cells are connected directly via an edge. The linear cells thus form a linear chain along each dimension as shown in Figure S1, which when modelled graphically represents a path graph.
Based on the definition of the entropy of a supervised classification model as given in Definition 2, we can visualize the problem of assigning classes to the cells in the input space as a graph coloring problem, where there are $p$ vertices per dimension and $c$ colors (representing the $c$ classes). For the graph coloring problem, the chromatic polynomial 10.2307/1967597 is a graph polynomial which counts the number of colorings possible for the graph as a function of the number of colors.

Having modelled the cells in each dimension as a path graph, our problem is essentially converted into a path-graph-coloring problem with $p$ vertices and $c$ colors. For a path graph, the chromatic polynomial is given by $c(c-1)^{v-1}$, where $v$ is the number of vertices. So, the number of possible colorings for the defined path graph is $c(c-1)^{p-1}$, for each subset defined in the $l$-th layer. Extending this to all $n_0$ subsets, the total number of possible colorings is $\left[c(c-1)^{p-1}\right]^{n_0}$. This holds for layers $l = 1, \dots, L-1$. Hence, the total number of possible colorings on the cells in the input space, up to layer $L-1$, is given by $\prod_{l=1}^{L-1}\left[c(c-1)^{\frac{n_l}{n_0}-1}\right]^{n_0}$.

For the last layer $L$, the number of linear cells formed per dimension is $n_L/n_0$. Hence, using a similar construction as earlier, we can say that the last layer defines $n_L/n_0$ linear cells (forming a path graph) in each of the $n_0$ input dimensions, so the total number of possible colorings induced by the last layer is $\left[c(c-1)^{\frac{n_L}{n_0}-1}\right]^{n_0}$. As a result, the total number of possible colorings in the input space formed by the entire network is $\prod_{l=1}^{L}\left[c(c-1)^{\frac{n_l}{n_0}-1}\right]^{n_0}$.
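The counting argument above can be sketched numerically; `path_colorings` is the chromatic polynomial of a path graph evaluated at $c$ colors, and the per-layer total follows the product over the $n_0$ independent dimensions described in the derivation (function names are ours):

```python
def path_colorings(p, c):
    """Chromatic polynomial of a path graph on p vertices evaluated at
    c colors: the first vertex has c choices, each later vertex c - 1."""
    return c * (c - 1) ** (p - 1)

def layer_colorings(p, c, n0):
    """Colorings induced by one layer: an independent path of p cells
    along each of the n0 input dimensions."""
    return path_colorings(p, c) ** n0

# Example: 3 cells per dimension, 2 classes, 2 input dimensions.
print(path_colorings(3, 2))      # 2 * 1 * 1 = 2
print(layer_colorings(3, 2, 2))  # 2 ** 2 = 4
```

The full-network count is then just the product of `layer_colorings` over all layers, mirroring the closing expression of the derivation.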

9.4 Interpretability of PWLNs based on the relative Bias-Variance Trade-Off

Let model $A$ be a 1-layer PWLN with ReLU activations and $n_A$ hidden neurons, and let model $B$ be a deep PWLN with ReLU activations, $L$ layers, and $k$ neurons in each hidden layer. Both models take an $n_0$-dimensional input. Let the total number of neurons in the models $A$ and $B$ be given by $N_A$ and $N_B$ respectively. Based on the complexity bounds defined in Section 3, we can obtain the values given in Table S1.

Table S1: Bounds on Complexity and Entropy for PWLNs with ReLU activations.

Thus, the various upper and lower bounds on interpretability follow directly from the complexity and entropy bounds in Table S1.

The formulae in Table 1 are simple applications of the formulae presented above.

9.5 Experiment: Effect of the Optimizer

Figure S2: Empirical Interpretability between models A and B computed by different optimizers.

We study the effect of different optimization techniques (an implementation choice) for the optimization process (given at the end of Section 4) on our empirical interpretability.
Model Configurations: Models A and B have the same configurations as in Section 7.4. The experiment is performed on the MNIST dataset. The box plots are constructed over repeated runs of the optimization.
Explanation: The box plots in Figure S2 demonstrate that both RMSProp and Adam are very stable and perform better than the other optimizers when used by model A to interpret model B. Further, due to the finite number of queries, different optimizers affect the optimization process differently.
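The sensitivity to the optimizer can be illustrated on a toy quadratic loss; the setup below is not the paper's experiment, just a self-contained sketch showing that, under a fixed step budget, plain SGD and a hand-rolled Adam reach the minimizer with different accuracy:

```python
def loss_grad(w, target=3.0):
    """Gradient of the toy loss (w - target)**2."""
    return 2.0 * (w - target)

def sgd(w, steps, lr=0.1):
    """Vanilla gradient descent for a fixed number of steps."""
    for _ in range(steps):
        w -= lr * loss_grad(w)
    return w

def adam(w, steps, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam with bias-corrected first/second moment estimates."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = loss_grad(w)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w

print(round(sgd(0.0, 50), 3))       # 3.0
print(adam(0.0, 1000, lr=0.01))     # close to 3.0, but along a different path
```

Both optimizers approach the same minimizer, but their trajectories and residual errors under a fixed budget differ, which is the effect the box plots quantify for the distillation objective.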

9.6 Experiment: Explaining PWLNs

Figure S3: Evaluation of Empirical Interpretability on MNIST and Fashion MNIST. An ensemble of an SVM, Decision Tree and 1-layer PWLN-R (A) is used to interpret a 4-layer PWLN-R (B)

We use an ensemble to interpret PWLNs, breaking down their complex decision structures into ensembles of simpler models. This extends the experiment in Section 7.3.
Model Configurations: We fix model B as a 4-layer PWLN-R with 512, 256, 128 and 64 neurons in its four layers respectively. As model A, we use an ensemble of an SVM, a decision tree and a 1-layer PWLN-R with 256 neurons, combined with the model averaging technique.
Hyperparameters: Both models are trained for 30 epochs using the Adam optimizer.

Explanation: Figure S3 shows that the decision structure of the black-box 4-layer PWLN-R can be explained in terms of the simplified decision structure of the ensemble A with high interpretability.
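The model-averaging combination used for model A can be sketched as a simple mean of the members' class-probability vectors (a pure-Python illustration; in the experiment the members are the SVM, the decision tree, and the 1-layer PWLN-R):

```python
def ensemble_average(member_probs):
    """Model averaging: per-class mean of each member's predicted
    class-probability vector."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(n_classes)]

def ensemble_predict(member_probs):
    """Predicted class = argmax of the averaged probabilities."""
    avg = ensemble_average(member_probs)
    return max(range(len(avg)), key=avg.__getitem__)

# Example: three members voting over two classes.
probs = [[0.2, 0.8], [0.6, 0.4], [0.4, 0.6]]
print(ensemble_average(probs))
print(ensemble_predict(probs))  # 1
```

Averaging probabilities (rather than hard votes) keeps the ensemble's decision surface piecewise-smooth, which makes it a natural simplified surrogate for the deeper PWLN.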

9.7 Double Descent Behaviour

Figure S4: Double ascent behaviour when a 1-layer PWLN-R with 512 neurons is interpreted by a 1-layer PWLN-R with varying numbers of hidden neurons.

The behaviour of modern deep learning methods is quite at odds with the classical U-shaped risk curve predicted by the bias-variance trade-off: heavily over-parameterized architectures nevertheless tend to obtain high accuracy on both the train and test sets.

doubledescent demonstrates the double descent curve, which better explains the behaviour of modern deep learning networks. We determine whether our interpretation-by-distillation mechanism is in conformance with the equivalent double ascent curve of interpretability.
Model Configurations: B is a 1-layer PWLN-R with 512 hidden neurons; A is a 1-layer PWLN-R with an increasing number of hidden neurons. This covers the entire relative complexity spectrum between A and B.
Hyperparameters: We train both models for 20 epochs with the Adam optimizer, a truncated-normal kernel initializer, all biases initialized to 1, and no regularizers. We report the mean and deviation over three runs.
Explanation: Figure S4 demonstrates the conformity of our interpretability formulation with the double ascent behaviour. This shows that our formulation replicates behaviours that are fundamental in modern deep learning.