# Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation

Current methods to interpret deep learning models by generating saliency maps generally rely on two key assumptions. First, they use first-order approximations of the loss function, neglecting higher-order terms such as the loss curvature. Second, they evaluate each feature's importance in isolation, ignoring feature inter-dependencies. In this work, we study the effect of relaxing these two assumptions. First, by characterizing a closed-form formula for the Hessian matrix of a deep ReLU network, we prove that, for a classification problem with a large number of classes, if an input has a high-confidence classification score, the inclusion of the Hessian term has a small impact on the final solution. We prove this result by showing that in this case the Hessian matrix is approximately of rank one and its leading eigenvector is almost parallel to the gradient of the loss function. Our empirical experiments on ImageNet samples are consistent with our theory. This result can have implications for other related problems such as adversarial examples as well. Second, we compute the importance of group-features in deep learning interpretation by introducing a sparsity regularization term. We use the L_0-L_1 relaxation technique along with proximal gradient descent to efficiently compute group-feature importance scores. Our empirical results indicate that considering group features can improve deep learning interpretation significantly.

## 1 Introduction

The growing use of deep learning in sensitive applications such as medicine, autonomous driving, and finance raises concerns about human trust in machine learning systems. For trained models, a central question is test-time interpretability: how can humans understand the reasoning behind model predictions? A common interpretation approach is to identify the importance of input features for a model’s prediction. A saliency map can then visualize important pixels of an image (Simonyan et al., 2014; Sundararajan et al., 2017) or words in a sentence (Li et al., 2016).

In the last couple of years, several approaches have been proposed to tackle this problem. For example, Simonyan et al. (2014) compute the gradient of the class score with respect to the input, while Smilkov et al. (2017) compute the average gradient-based importance values generated from several noisy versions of the input. Sundararajan et al. (2017) define a baseline, which represents an input absent of information, and determine feature importance by accumulating gradient information along the path from the baseline to the original input. Alvarez-Melis & Jaakkola (2018) build interpretable neural networks by learning basis concepts that satisfy an interpretability criterion, while Adebayo et al. (2018) propose methods to assess the scope and quality of saliency maps. Although these methods can produce visually pleasing results, they often make weak model approximations (Adebayo et al., 2018; Nie et al., 2018) and can be sensitive to noise and adversarial perturbations (Kindermans et al., 2017; Ghorbani et al., 2017).

Existing deep learning interpretation methods mainly rely on two key assumptions:

• The gradient-based loss surrogate assumption: For computational efficiency, several existing methods (e.g. (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017)) assume that the loss function is almost linear at the test point. Thus, they use variations of the input gradient to compute feature importance.

• The isolated feature importance assumption: Current methods evaluate the importance of each feature in isolation, assuming all other features are fixed. Features, however, may have complex inter-dependencies that can be learned by the model.

In this work, we study the impact of relaxing these assumptions in deep learning interpretation. To relax the first assumption, we use the second-order approximation of the loss function by keeping the Hessian term in its Taylor’s expansion. For a deep ReLU network and the cross entropy loss function, we compute the Hessian term in closed-form. Using the closed-form formula for the Hessian matrix, we prove the following result:

###### Theorem 1 (informal version)

If the probability of the predicted class is close to one and the number of classes is large, solutions of the first-order and second-order interpretation methods are sufficiently close to each other.

We present a formal version of this result in Theorem 5. We validate this result empirically as well. For example, in ImageNet which has more than 1,000 classes, we show that incorporating the Hessian term in deep learning interpretation has small impact for most images. This is consistent with our theory.

The key proof idea of this result follows from the fact that when the number of classes is large and the confidence in the predicted class is high, the Hessian of the loss function is approximately of rank one. More specifically, the square of the largest eigenvalue is significantly larger than the sum of the squares of the remaining eigenvalues. Moreover, the corresponding eigenvector is approximately parallel to the gradient vector (Theorem 4). As a result, the first-order and second-order methods perform similarly. Note that this result can be extended to other related problems such as adversarial examples, where most common methods are based on the first-order approximation of the loss function (Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Carlini & Wagner, 2016).

In the second part of the paper, we relax the second assumption of current interpretation approaches (i.e., the isolated feature importance assumption). To incorporate feature inter-dependencies in deep learning interpretation, we define the importance function over subsets of features, referred to as group-features. We adjust the subset size on a per-example basis using an unsupervised approach, making the interpretation method context-aware. Including the group-feature in deep learning interpretation makes the optimization combinatorial. To circumvent computational issues, we use an $L_0$-$L_1$ relaxation, as is common in compressive sensing (Candes & Tao, 2005; Donoho, 2006), the LASSO regression (Tibshirani, 1996), etc. To efficiently compute a solution for the relaxed optimization, we employ proximal gradient descent (Parikh & Boyd, 2014). Our empirical results indicate that incorporating group-features significantly improves the quality of interpretation results.

Below we summarize our contributions in this paper:

• We study the impact of the second-order approximation of the loss function in deep learning interpretation. We prove that, under certain conditions, solutions of the first-order and second-order interpretation methods are sufficiently close to each other (Theorems 4 and 5). This result can be insightful in other related problems such as adversarial examples. Our empirical results on ImageNet samples are consistent with our theory.

• To prove the above Hessian result, we compute the Hessian matrix of a deep ReLU network in a closed-form. This result can be of independent interest to readers (Proposition 1 and Theorem 2).

• Finally, we include inter-dependencies among features in deep learning interpretation by computing the importance of group-features. Borrowing some results from compressive sensing (Section 4 and Appendix Section E), we develop a computationally efficient approach to solve the underlying optimization. Our empirical results indicate that considering group features significantly improves deep learning interpretation.

In what follows, we explain these results in more detail. All proofs are presented in the Supplementary Materials.

## 2 Problem Setup and Notation

Consider a prediction problem from input variables (features) $X \in \mathcal{X}$ to an output variable $Y \in \mathcal{Y}$. For example, in the image classification problem, $\mathcal{X}$ is the space of images and $\mathcal{Y}$ is the set of labels $\{1, 2, \dots, c\}$. We observe samples $(x, y)$ from these variables. Let $P_{X,Y}$ be the observed empirical distribution (for simplicity, we hide the dependency of $P_{X,Y}$ on the number of samples). The empirical risk minimization (ERM) approach computes the optimal predictor $f_\theta$, with $\theta \in \Theta$, for a loss function $\ell(\cdot,\cdot)$ using the following optimization:

$$\min_{\theta \in \Theta} \; \mathbb{E}_{P_{X,Y}}\left[\ell(f_\theta(x), y)\right]. \tag{1}$$

Let $S$ be a subset of the feature indices $\{1, \dots, d\}$ with cardinality $k$, where $d$ is the input dimension. For a given sample $(x, y)$, let $x_S$ indicate the features of $x$ in positions $S$. We refer to $x_S$ as a group-feature of $x$. The importance of a group-feature $x_S$ is proportional to the change in the loss function when $x_S$ is perturbed. We select the group-feature with maximum importance and visualize that subset in a saliency map.

###### Definition 1 (Group-Feature Importance Function)

Let $\theta^*$ be the optimizer of the ERM problem (1). For a given sample $(x, y)$, we define the group-feature importance function as follows:

$$I^{k,\rho}_{\theta^*}(x,y) := \max_{\tilde{x}} \; \ell(f_{\theta^*}(\tilde{x}), y) \quad \text{s.t.} \quad \|\tilde{x} - x\|_0 \le k, \;\; \|\tilde{x} - x\|_2 \le \rho, \tag{2}$$

where $\|\cdot\|_0$ counts the number of non-zero elements of its argument (known as the $L_0$ norm). The parameter $k$ characterizes an upper bound on the cardinality of the group-features. The parameter $\rho$ characterizes an upper bound on the $L_2$ norm of feature perturbations.

If $\tilde{x}^*$ is the solution of optimization (2), then the vector $\tilde{x}^* - x$ contains the feature importance values that are visualized in the saliency map. Note that when $k = 1$ this definition simplifies to current feature importance formulations, which consider features in isolation. When $k > 1$, our formulation can capture feature inter-dependencies. The parameters $k$ and $\rho$ in general depend on the test sample (i.e., the size of the group-features is different for each image and model). We introduce an unsupervised metric to determine these parameters in Section 4.1, but assume these parameters are given for the time being.

The cardinality constraint $\|\tilde{x} - x\|_0 \le k$ (i.e., the constraint on the group-feature size) leads to a combinatorial optimization problem in general. Such a sparsity constraint has appeared in different problems such as compressive sensing (Candes & Tao, 2005; Donoho, 2006) and LASSO regression (Tibshirani, 1996). Under certain conditions, one can show that, without loss of generality, the $L_0$ norm can be relaxed with the (convex) $L_1$ norm (Appendix Section E).

Our goal is to solve optimization (2), which is non-linear and non-concave in $\tilde{x}$. Current approaches do not consider the cardinality constraint and optimize by linearizing the objective function (i.e., using the gradient). To incorporate group features into the current methods, we can add the constraints of optimization (2) to the objective function using Lagrange multipliers. This yields the following Context-Aware First-Order (CAFO) interpretation function.

###### Definition 2 (The CAFO Interpretation)

For a given sample $(x, y)$, we define the Context-Aware First-Order (CAFO) importance function as follows:

$$\tilde{I}^{\lambda_1,\lambda_2}_{\theta^*}(x,y) := \max_{\Delta} \; \nabla_x \ell(f_{\theta^*}(x), y)^t \Delta - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2 \tag{3}$$

where $\lambda_1$ and $\lambda_2$ are non-negative regularization parameters. We refer to the objective of this optimization as $\tilde{\ell}(\Delta)$, hiding its dependency on $(x, y)$ and $\theta^*$ to simplify notation.

Large values of the regularization parameters $\lambda_1$ and $\lambda_2$ in optimization (3) correspond to small values of the parameters $k$ and $\rho$ in optimization (2). Incorporating group-features naturally leads to a sparsity regularizer through the $L_1$ penalty. Note that this is not a hard constraint that forces a sparse interpretation. Instead, given a proper choice of the regularization coefficients, the interpretation will reflect the sparsity used by the underlying model. In Section 4.1, we detail our method for setting $\lambda_1$ for a given test sample (context-aware) based on the sparsity ratio of CAFO’s optimal solution. Moreover, in Appendix Section E, we show that under some general conditions, optimization (3) can be solved efficiently and its solution matches that of the original optimization (2).
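Because objective (3) separates across coordinates, its maximizer can be written down directly when $\lambda_2 > 0$: each gradient entry is soft-thresholded by $\lambda_1$ and scaled by $1/(2\lambda_2)$. The sketch below illustrates this on a random stand-in gradient; the function names and the toy values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def cafo_closed_form(grad, lam1, lam2):
    """Maximizer of grad^T d - lam1*||d||_1 - lam2*||d||_2^2 (assumes lam2 > 0)."""
    # Per coordinate: shrink |grad_i| by lam1, clip at zero, keep the sign.
    shrunk = np.sign(grad) * np.maximum(np.abs(grad) - lam1, 0.0)
    return shrunk / (2.0 * lam2)

def cafo_objective(delta, grad, lam1, lam2):
    """Value of the CAFO objective (3) at delta."""
    return grad @ delta - lam1 * np.abs(delta).sum() - lam2 * (delta @ delta)

rng = np.random.default_rng(0)
grad = rng.normal(size=8)  # hypothetical stand-in for the input gradient
delta = cafo_closed_form(grad, lam1=0.5, lam2=1.0)
print(delta)
```

Entries whose gradient magnitude is at most $\lambda_1$ come out exactly zero, which is how the $L_1$ penalty produces a sparse saliency map.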

To have a better approximation of the loss function, we use the second-order Taylor expansion of the loss function around the point $x$ as follows:

$$\ell(f_{\theta^*}(\tilde{x}), y) \approx \ell(f_{\theta^*}(x), y) + \nabla_x \ell(f_{\theta^*}(x), y)^t \Delta + \frac{1}{2} \Delta^t H_x \Delta \tag{4}$$

where $\Delta := \tilde{x} - x$ and $H_x$ is the Hessian of the loss function with respect to the input features (note that $\theta^*$ and $y$ are fixed). This second-order expansion of the loss function decreases the interpretation’s model approximation error.

We show that by choosing proper values for the regularization parameters, the resulting optimization using the second-order surrogate loss is a strictly convex minimization (or, equivalently, concave maximization) problem, allowing for efficient optimization using gradient descent (Theorem 3). Moreover, even though the Hessian matrix can be expensive to compute for large neural networks, gradient updates of our method only require the Hessian-vector product (i.e., $H_x \Delta$), which can be computed efficiently (Pearlmutter, 1994). This yields the following Context-Aware Second-Order (CASO) interpretation function.

###### Definition 3 (The CASO Interpretation)

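For a given sample $(x, y)$, we define the Context-Aware Second-Order (CASO) importance function as follows:

$$\tilde{I}^{\lambda_1,\lambda_2}_{\theta^*}(x,y) := \max_{\Delta} \; \nabla_x \ell(f_{\theta^*}(x), y)^t \Delta + \frac{1}{2} \Delta^t H_x \Delta - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2 \tag{5}$$

We refer to the objective of this optimization as $\tilde{\ell}(\Delta)$. $\lambda_1$ and $\lambda_2$ are defined as in (3).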

## 3 Understanding Impact of the Hessian in Deep Learning Interpretation

The Hessian term is by definition useful when the loss function at the test point has high curvature. However, given the linear nature of popular network architectures with piecewise linear activations (e.g., ReLU (Glorot et al., 2011), Maxout (Goodfellow et al., 2013)), do these regions of high curvature even exist? We answer this question for neural networks with piecewise linear activations by providing an exact calculation of the input Hessian. We use this derivation to understand the impact of including the Hessian term in deep learning interpretation. More specifically, we prove that when the probability of the predicted class is close to 1 and the number of classes is large, the second-order interpretation is similar to the first-order one. We verify this theoretical result experimentally over images in the ImageNet dataset. We also observe that when the confidence in the predicted class is low, the second-order interpretation can be significantly different from the first-order interpretation. Since second-order interpretations take into account the curvature of the model, we conjecture that they are more faithful to the underlying model in these cases.

### 3.1 A Closed-form Hessian Formula for Deep ReLU Networks

We present an abridged version of the exact Hessian calculation here; the details are provided in Appendix Section A.1. Neural network models that use piecewise linear activation functions have class scores (logits) that are locally linear functions of the input (we ignore points at which the function is non-differentiable, as they form a set of measure zero). Around a given input, the network can thus be written as:

$$f_\theta(x) = W^T x + b,$$

where $x$ is the input of dimension $d$, $f_\theta(x)$ are the logits, $W$ are the weights, and $b$ are the biases of the linear function. Note that $W$ combines weights of different layers from the input to the output of the network. Each column of $W$ is the gradient of one logit with respect to the flattened input and can be computed with auto-grad software such as PyTorch (Paszke et al., 2017). We define:

$$p = \mathrm{softmax}(f_\theta(x)), \qquad \ell(f_\theta(x), y) = -\sum_{i=1}^{c} y_i \log(p_i),$$

where $c$ denotes the number of classes, $p$ denotes the class probabilities, and $\ell(f_\theta(x), y)$ is the cross-entropy loss function.

In this case, we have the following result:

###### Proposition 1

$H_x$ is given by:

$$H_x = \nabla_x^2 \ell(p, y) = W \left(\mathrm{diag}(p) - p p^T\right) W^T \tag{6}$$

where $\mathrm{diag}(p)$ is a diagonal matrix whose diagonal elements are equal to $p$.
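Formula (6) can be checked numerically on a small model whose logits are exactly linear in the input. The sketch below builds such a toy model with random $W$ and $b$ (an assumption standing in for the locally linear form of a ReLU network) and compares the closed-form Hessian with central finite differences of the gradient $W(p - y)$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loss_grad(W, b, x, y):
    """Input gradient of the cross-entropy loss: W (p - y)."""
    p = softmax(W.T @ x + b)
    return W @ (p - y)

def input_hessian(W, b, x):
    """Closed-form input Hessian W (diag(p) - p p^T) W^T from formula (6)."""
    p = softmax(W.T @ x + b)
    A = np.diag(p) - np.outer(p, p)
    return W @ A @ W.T

rng = np.random.default_rng(0)
d, c = 6, 4                                  # toy sizes, chosen for illustration
W = rng.normal(size=(d, c))
b = rng.normal(size=c)
x = rng.normal(size=d)
y = np.eye(c)[1]                             # one-hot label

H = input_hessian(W, b, x)

# Column j of the finite-difference Hessian is d(grad)/d(x_j).
eps = 1e-5
H_fd = np.column_stack([
    (loss_grad(W, b, x + eps * e, y) - loss_grad(W, b, x - eps * e, y)) / (2 * eps)
    for e in np.eye(d)
])
print(np.abs(H - H_fd).max())
```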

The first observation from Proposition 1 is as follows:

###### Theorem 2

$H_x$ is a positive semidefinite matrix.

These two results allow an extremely efficient computation of the Hessian’s eigenvectors and eigenvalues using the Cholesky decomposition of $\mathrm{diag}(p) - pp^T$. See Appendix Section C for full details. Note that the use of the decomposition is critical, as storing the Hessian requires intractable amounts of memory for high-dimensional inputs. The entire calculation of the Hessian’s decomposition for ImageNet using a ResNet-50 (He et al., 2016) runs in approximately 4.2 seconds on an NVIDIA GTX 1080 Ti.
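The idea of decomposing the Hessian without ever storing the $d \times d$ matrix can be sketched as follows. Writing $A = \mathrm{diag}(p) - pp^T$ and $B = W A^{1/2}$, we have $H_x = BB^T$, so the non-zero eigenvalues of $H_x$ are those of the small $c \times c$ matrix $B^T B$. This sketch deliberately swaps the Cholesky factor for a symmetric square root obtained from an eigendecomposition of $A$, because $A$ is singular and a plain Cholesky routine may fail on it; the function name and toy sizes are assumptions, and this is not the procedure of Appendix Section C.

```python
import numpy as np

def hessian_eig_lowrank(W, p, tol=1e-12):
    """Non-zero eigenpairs of W (diag(p) - p p^T) W^T via a c x c problem."""
    A = np.diag(p) - np.outer(p, p)
    w, U = np.linalg.eigh(A)
    w = np.clip(w, 0.0, None)            # remove tiny negative round-off
    B = W @ (U * np.sqrt(w))             # d x c factor with B B^T = W A W^T
    s, V = np.linalg.eigh(B.T @ B)       # small c x c eigenproblem
    keep = s > tol * s.max()             # drop the numerically zero eigenvalue(s)
    s, V = s[keep], V[:, keep]
    vecs = (B @ V) / np.sqrt(s)          # unit-norm eigenvectors of the Hessian
    order = np.argsort(s)[::-1]          # sort by decreasing eigenvalue
    return s[order], vecs[:, order]

rng = np.random.default_rng(0)
d, c = 200, 5                            # toy sizes, chosen for illustration
W = rng.normal(size=(d, c))
logits = rng.normal(size=c)
p = np.exp(logits) / np.exp(logits).sum()

vals, vecs = hessian_eig_lowrank(W, p)
print(vals)
```

Only matrices of size $d \times c$ and $c \times c$ are formed, which is what keeps the memory footprint manageable when $d$ is the number of pixels.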

To the best of our knowledge, this is the first work that derives the exact Hessian decomposition for piecewise linear networks. Yao et al. (2018) also proved a bound on the rank of the Hessian for piecewise linear networks, but did not derive the exact input Hessian.

One advantage of having a closed-form formula for the Hessian matrix (6) is that we can use it to properly set the regularization parameter $\lambda_2$ in CASO’s formulation. To do this, we rely on the following result:

###### Theorem 3

If $L$ is the largest eigenvalue of $H_x$, then for any value of $\lambda_2 > L/2$, the second-order interpretation objective function (5) is strongly concave.

We use Theorem 3 to set the regularization parameter $\lambda_2$ for CASO. We need to set $\lambda_2$ large enough to make the optimization concave, but not so large that it overpowers the Hessian term. In particular, we set , where we choose . For CAFO, we set . We estimate the largest eigenvalue of $H_x$ using the power-iteration method. In our experiments, we found that around 10 iterations are sufficient for convergence of the power-iteration method.
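A minimal sketch of the power-iteration estimate follows, assuming the closed form (6) so that each Hessian-vector product costs two multiplications with $W$ and never forms $H_x$. The toy $W$, the confident probability vector, and the function names are assumptions made for illustration.

```python
import numpy as np

def hvp(W, p, v):
    """Hessian-vector product W (diag(p) - p p^T) W^T v without forming H."""
    u = W.T @ v
    return W @ (p * u - p * (p @ u))

def power_iteration(matvec, dim, num_iters=10, seed=0):
    """Estimate the largest eigenvalue of a PSD operator given only matvecs."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        hv = matvec(v)
        v = hv / np.linalg.norm(hv)
    return v @ matvec(v), v              # Rayleigh quotient and unit vector

rng = np.random.default_rng(0)
d, c = 100, 10                           # toy sizes, chosen for illustration
W = rng.normal(size=(d, c))
p = np.full(c, 0.01)
p[0] = 1.0 - 0.01 * (c - 1)              # confident prediction for class 0

lam_max, v = power_iteration(lambda u: hvp(W, p, u), dim=d)
print(lam_max)
```

The Rayleigh quotient never exceeds the true largest eigenvalue, so when this estimate is used to choose $\lambda_2$ it is prudent to leave some margin above $L/2$.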

### 3.2 Theoretical results on the Hessian impact

In this section, we leverage the exact Hessian calculation to prove that when the probability of the predicted class is close to 1 and the number of classes is large, the Hessian of a piecewise linear neural network is approximately of rank one and its leading eigenvector is approximately parallel to the gradient. Since a constant scaling does not affect the visualization, this causes the two interpretations to be approximately the same.

###### Theorem 4

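If the probability of the predicted class is $1 - (c-1)\epsilon$, where $\epsilon \approx 0$ is the probability of each remaining class, then as $c \to \infty$ such that the predicted-class probability remains close to one, the Hessian is of rank one and its eigenvector is parallel to the gradient.

Consider the optimal solution of the CASO objective (5) and the optimal solution of the CAFO objective (3). We assume $\lambda_1 = 0$ for both objectives.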

###### Theorem 5

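If the probability of the predicted class is $1 - (c-1)\epsilon$, where $\epsilon \approx 0$ is the probability of each remaining class, then as $c \to \infty$ such that the predicted-class probability remains close to one, the CASO solution (5) with $\lambda_1 = 0$ is almost parallel to the CAFO solution (3) with $\lambda_1 = 0$.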
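The regime described by Theorems 4 and 5 can be illustrated numerically; this is a sanity check on a toy model, not part of the proof. The sketch assumes a random Gaussian $W$ as a stand-in for the network's linearization, with the predicted class at probability $1-(c-1)\epsilon$ and every other class at $\epsilon$. It measures how strongly the top eigenvalue of $\mathrm{diag}(p) - pp^T$ dominates, and the cosine between the leading eigenvector of $H_x$ and the gradient $W(p-y)$.

```python
import numpy as np

def confident_probs(c, eps):
    """Class probabilities [1 - (c-1)*eps, eps, ..., eps]."""
    p = np.full(c, eps)
    p[0] = 1.0 - (c - 1) * eps
    return p

c, d, eps = 500, 50, 1e-6                # toy values, chosen for illustration
rng = np.random.default_rng(0)
W = rng.normal(size=(d, c))              # hypothetical linearization weights
p = confident_probs(c, eps)
y = np.zeros(c)
y[0] = 1.0                               # label equals the predicted class

A = np.diag(p) - np.outer(p, p)
eig_A = np.linalg.eigvalsh(A)            # ascending order
# Top eigenvalue squared versus the sum of squares of all the others.
rank_one_ratio = eig_A[-1] ** 2 / np.sum(eig_A[:-1] ** 2)

g = W @ (p - y)                          # input gradient
H = W @ A @ W.T                          # input Hessian from formula (6)
_, vecs = np.linalg.eigh(H)
cosine = abs(vecs[:, -1] @ g) / np.linalg.norm(g)

print(rank_one_ratio, cosine)
```

In this setup the dominance ratio grows roughly linearly with $c$, which matches the role that a large number of classes plays in the statements above.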

### 3.3 Empirical results on the Hessian impact

In this section, we present empirical results on the impact of the second-order loss approximation in deep learning interpretation. In the experiments of this section, to isolate the impact of the Hessian term, we assume $\lambda_1 = 0$ in both the CASO and CAFO optimizations.

A consequence of Theorem 3 is that the gradient descent method with Nesterov momentum converges to the global optimizer of the second-order interpretation objective, at the convergence rate given in Appendix Section B.

To optimize $\tilde{\ell}(\Delta)$, the gradient is given by:

$$\nabla_\Delta \tilde{\ell}(\Delta) = \nabla_x \ell(f_{\theta^*}(x), y) + H_x \Delta - 2\lambda_2 \Delta. \tag{7}$$

The gradient term $\nabla_x \ell(f_{\theta^*}(x), y)$ and the regularization term $2\lambda_2 \Delta$ are straightforward to implement using standard backpropagation.

To compute the Hessian-vector product term $H_x \Delta$, we rely on the result of Pearlmutter (1994): a Hessian-vector product can be computed in time of the same order as the gradient $\nabla_x \ell(f_{\theta^*}(x), y)$. This is handled easily in modern auto-grad software. Moreover, for ReLU networks, our closed-form formula for the Hessian term (Proposition 1) can be used in the computation of the Hessian-vector product as well. In our experiments, we use the closed-form formula for $\lambda_1 = 0$ and proximal gradient descent for $\lambda_1 > 0$.
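The update in (7) can be sketched for the $\lambda_1 = 0$ case as plain gradient ascent that touches the Hessian only through a Hessian-vector product. This sketch omits Nesterov momentum for brevity, uses a random low-rank PSD matrix as a hypothetical stand-in for $H_x$, and picks $\lambda_2$ equal to the largest eigenvalue so that the condition of Theorem 3 holds; all names are assumptions. For $\lambda_1 = 0$ the stationarity condition of (5) gives $\Delta = (2\lambda_2 I - H_x)^{-1} \nabla_x \ell$, which serves as a reference.

```python
import numpy as np

def caso_grad(delta, g, hvp, lam2):
    """Gradient (7) of the CASO objective with lam1 = 0."""
    return g + hvp(delta) - 2.0 * lam2 * delta

def caso_gradient_ascent(g, hvp, lam2, step, num_iters=100):
    """Plain gradient ascent on the strongly concave CASO objective."""
    delta = np.zeros_like(g)
    for _ in range(num_iters):
        delta = delta + step * caso_grad(delta, g, hvp, lam2)
    return delta

rng = np.random.default_rng(0)
d = 20
M = rng.normal(size=(d, 3))
H = M @ M.T                              # hypothetical PSD stand-in for H_x
g = rng.normal(size=d)                   # hypothetical stand-in for the gradient

L = np.linalg.eigvalsh(H)[-1]            # largest eigenvalue of H
lam2 = L                                 # satisfies lam2 > L / 2

delta_gd = caso_gradient_ascent(g, lambda v: H @ v, lam2, step=1.0 / (2.0 * lam2))
delta_ref = np.linalg.solve(2.0 * lam2 * np.eye(d) - H, g)
print(np.abs(delta_gd - delta_ref).max())
```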

We compare second-order interpretations (CASO with $\lambda_1 = 0$) and the first-order variant (CAFO with $\lambda_1 = 0$) empirically. Note that when $\lambda_1 = 0$, $\Delta_{\text{CAFO}} \propto g_x$, where $g_x$ is the gradient and $\Delta_{\text{CAFO}}$ is the interpretation obtained using the CAFO objective.

We compute second-order and first-order interpretations for 1000 random samples on the ImageNet ILSVRC-2013 (Russakovsky et al., 2015) validation set using a ResNet-50 (He et al., 2016) model. Our loss function is the cross-entropy loss. After calculating $\Delta$ for all methods, the values must be normalized for visualization in a saliency map. We apply a normalization technique from existing work, which we describe in Appendix Section D.

We plot the Frobenius norm of the difference between CASO and CAFO in Figure 1. Before taking the difference, we normalize the solutions produced by CASO and CAFO to have the same norm, because a constant scaling of the elements of $\Delta$ does not change the visualization.

The empirical results are consistent with our theoretical results (Figure 1): the second-order interpretation results are similar to the first-order ones when the classification confidence probability is high. However, when the classification confidence probability is small, including the Hessian term can be useful in deep learning interpretation.

To observe the difference between CAFO and CASO interpretations in both regimes qualitatively, we compare them for an image where the confidence probability is high and for one where it is low in Figure 2. When the confidence probability is high, CAFO $\approx$ CASO, and when this probability is low, CASO $\neq$ CAFO.

## 4 Understanding Impact of the group-feature

In this section, we study the impact of the group feature in deep learning interpretation. The group feature has been included as the sparsity constraint in optimization (2).

To obtain an unconstrained concave optimization for the CASO interpretation, we relaxed the sparsity (cardinality) constraint (often called an $L_0$ norm constraint) to a convex $L_1$ norm constraint. Such a relaxation is a core component of popular learning methods such as compressive sensing (Candes & Tao, 2005; Donoho, 2006) and LASSO regression (Tibshirani, 1996). Using results from this literature, we show that this relaxation is tight under certain conditions on the Hessian matrix (see Appendix Section E). In other words, the optimal $\Delta$ of optimization (5) is sparse with the proper choice of regularization parameters.

Note that the regularization term $-\lambda_1 \|\Delta\|_1$ is a concave function for $\lambda_1 \ge 0$. Together with Theorem 3, this implies that the CASO interpretation objective (5) is strongly concave.

One method for optimizing this objective is to apply the gradient descent method used in the second-order interpretation, but with the addition of an $L_1$ regularization penalty. In our early experiments, we found that this procedure leads to poor convergence properties in practice. This is partially due to the non-smoothness of the $L_1$ regularization term.

To resolve this issue, we instead use proximal gradient descent to compute a solution for CAFO and CASO when $\lambda_1 > 0$. Using the Nesterov momentum method and backtracking with proximal gradient descent gives the convergence rate stated in Appendix Section B in terms of the number of gradient updates. Proximal gradient descent has been used in other deep learning problems as well, including adversarial examples (e.g., Chen et al., 2017).

Below we explain how we use the proximal gradient descent to include the group features in deep learning interpretation. First, we write the objective function as the sum of a smooth and non-smooth function:

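$$\tilde{\ell}(\Delta) = \underbrace{\nabla_x \ell(f_{\theta^*}(x), y)^t \Delta + \frac{1}{2} \Delta^t H_x \Delta - \lambda_2 \|\Delta\|_2^2}_{\text{Smooth Part}} \; \underbrace{- \lambda_1 \|\Delta\|_1}_{\text{Non-Smooth Part}}$$

Let $g$ be the smooth part and $h$ be the non-smooth part:

$$g(\Delta) = \nabla_x \ell(f_{\theta^*}(x), y)^t \Delta + \frac{1}{2} \Delta^t H_x \Delta - \lambda_2 \|\Delta\|_2^2, \qquad h(\Delta) = -\lambda_1 \|\Delta\|_1, \qquad \tilde{\ell}(\Delta) = g(\Delta) + h(\Delta)$$

The gradient of the smooth objective is given by:

$$\nabla_\Delta g(\Delta) = \nabla_x \ell(f_{\theta^*}(x), y) + H_x \Delta - 2\lambda_2 \Delta$$

The proximal mapping, applied elementwise, is given by:

$$\mathrm{prox}_\alpha(x) = \arg\min_z \frac{1}{2\alpha} \|x - z\|_2^2 + \lambda_1 \|z\|_1 = \begin{cases} x + \lambda_1 \alpha & x \le -\lambda_1 \alpha \\ 0 & -\lambda_1 \alpha < x \le \lambda_1 \alpha \\ x - \lambda_1 \alpha & x > \lambda_1 \alpha \end{cases}$$

This formula can be understood intuitively as follows. If the magnitude of some elements of $\Delta$ is below a certain threshold ($\lambda_1 \alpha$), the proximal mapping sets those values to zero. This leads to values that are exactly zero in the saliency map. This can be viewed as removing noise by a thresholding procedure.

To optimize $\tilde{\ell}(\Delta)$, we use FISTA (Beck & Teboulle, 2009) with backtracking and Nesterov momentum for 10 iterations, with a fixed initial learning rate and backtracking decay factor. $\Delta$ is initialized to zero.

FISTA takes a gradient step with learning rate $\alpha$ on the smooth part $g$, then applies the proximal mapping to the resulting $\Delta$. Backtracking reduces the learning rate when the update results in higher loss.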
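The procedure described above can be sketched as follows. This is a simplified variant that assumes a fixed step size $\alpha = 1/(2\lambda_2)$ instead of backtracking (valid here because $H_x$ is positive semidefinite and $\lambda_2$ exceeds half its largest eigenvalue), and it uses a random low-rank PSD matrix as a hypothetical stand-in for $H_x$. The function names and toy values are assumptions, not the authors' code.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Elementwise proximal mapping of thresh * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def caso_objective(delta, g, H, lam1, lam2):
    """CASO objective (5) evaluated at delta."""
    return (g @ delta + 0.5 * delta @ H @ delta
            - lam1 * np.abs(delta).sum() - lam2 * (delta @ delta))

def caso_fista(g, H, lam1, lam2, num_iters=200):
    """FISTA with a fixed step for maximizing the CASO objective."""
    alpha = 1.0 / (2.0 * lam2)                    # fixed learning rate
    delta = np.zeros_like(g)
    z = delta.copy()                              # momentum (look-ahead) point
    t = 1.0
    for _ in range(num_iters):
        smooth_grad = g + H @ z - 2.0 * lam2 * z  # gradient of the smooth part
        delta_next = soft_threshold(z + alpha * smooth_grad, alpha * lam1)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = delta_next + ((t - 1.0) / t_next) * (delta_next - delta)
        delta, t = delta_next, t_next
    return delta

rng = np.random.default_rng(0)
d = 20
M = rng.normal(size=(d, 3))
H = M @ M.T                                       # hypothetical stand-in for H_x
g = rng.normal(size=d)                            # hypothetical input gradient
lam2 = np.linalg.eigvalsh(H)[-1]                  # ensures strong concavity
lam1 = 0.5

delta = caso_fista(g, H, lam1, lam2)
print(np.mean(delta == 0.0), caso_objective(delta, g, H, lam1, lam2))
```

The first printed number is the fraction of coordinates that the proximal step set exactly to zero.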

### 4.1 Impact of group features in interpretation

In this section, our goal is to understand the impact of the group features in deep learning interpretation. In our experiments, we focus on the image classification problem because visual interpretations are intuitive and allow for comparison with prior work. We use a ResNet-50 (He et al., 2016) model on the ImageNet ILSVRC-2013 dataset (Russakovsky et al., 2015).

To gain intuition for the effect of $\lambda_1$, we show a sweep over $\lambda_1$ values in Figure 3. We observe that when $\lambda_1$ is set too high or too low, the interpretation breaks down, as the importance values are relatively constant across the image (all high or all zero).

Different approaches to set the regularization parameter have been explored in different problems. For example, in LASSO, one common approach is to use Least Angle Regression (Efron et al., 2004).

In the deep learning interpretation problem, we propose an unsupervised method based on the sparsity ratio of the interpretation solution to set a proper value for $\lambda_1$. We define the sparsity ratio as the number of zero pixels divided by the total number of pixels. We tune $\lambda_1$ in an unsupervised fashion (since we do not know the ground truth interpretation) by increasing $\lambda_1$ until the interpretation reaches all zeros, optimizing over a grid of eight $\lambda_1$ values that starts at 0. Among interpretations with sparsity above a certain threshold, we choose the interpretation with the highest loss on the original model. In practice, we batch different values of $\lambda_1$ to find a reasonable parameter setting efficiently.

This method selects the interpretation marked with a green box in Figures (a) and (b). We observe that adding group-feature terms makes the interpretation less noisy in these examples.
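The selection rule can be sketched on a toy model. Everything concrete below is an assumption made for illustration: the linear-softmax model standing in for the network, the use of the separable first-order (CAFO) solution as the interpretation, the evenly spaced $\lambda_1$ grid, and the sparsity threshold of 0.5. The sketch computes an interpretation for each candidate $\lambda_1$, keeps those whose sparsity ratio reaches the threshold, and returns the one with the highest loss on the model.

```python
import numpy as np

def sparsity_ratio(delta):
    """Fraction of entries that are exactly zero."""
    return np.mean(delta == 0.0)

def select_lam1(candidates, interpret, loss_at, min_sparsity):
    """Pick the sufficiently sparse interpretation with the highest model loss."""
    best = None
    for lam1 in candidates:
        delta = interpret(lam1)
        if sparsity_ratio(delta) < min_sparsity:
            continue
        score = loss_at(delta)
        if best is None or score > best[1]:
            best = (lam1, score, delta)
    return best

rng = np.random.default_rng(0)
d, c = 30, 4
W, b, x = rng.normal(size=(d, c)), rng.normal(size=c), rng.normal(size=d)
y = np.eye(c)[0]

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def loss_at(delta):
    """Cross-entropy of the toy model at the perturbed input x + delta."""
    return -(y @ log_softmax(W.T @ (x + delta) + b))

grad = W @ (np.exp(log_softmax(W.T @ x + b)) - y)   # input gradient at x
lam2 = 1.0

def interpret(lam1):
    """First-order interpretation: soft-thresholded, rescaled gradient."""
    return np.sign(grad) * np.maximum(np.abs(grad) - lam1, 0.0) / (2.0 * lam2)

candidates = np.linspace(0.0, np.abs(grad).max(), 8)  # hypothetical grid
best = select_lam1(candidates, interpret, loss_at, min_sparsity=0.5)
print(best[0], sparsity_ratio(best[2]))
```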

## 5 Qualitative Comparison of Deep Learning Interpretation Methods

In this section, we briefly review prior approaches for the deep learning interpretation and compare their performance qualitatively. The proposed Hessian and group feature terms can be potentially included in these approaches as well.

Vanilla Gradient: Simonyan et al. (2014) propose to compute the gradient of the class score with respect to the input.

SmoothGrad: Smilkov et al. (2017) argue that the input gradient may fluctuate sharply in the region local to the test point. To address this, they average gradient-based importance values generated from many noisy versions of the input.

Integrated Gradients: Sundararajan et al. (2017) define a baseline, which represents an input absent of information (e.g., a completely zero image). Feature importance is determined by integrating gradient information along the path from the baseline to the original input. The integral is approximated by a finite sum.
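The finite-sum approximation used by Integrated Gradients can be sketched with a midpoint Riemann sum along the straight path from the baseline to the input. The scalar function, its hand-written gradient, and the number of steps below are assumptions chosen so that the example is self-contained; in practice the gradient would come from the network via auto-differentiation.

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, num_steps=500):
    """Midpoint-rule approximation of path-integrated gradients."""
    alphas = (np.arange(num_steps) + 0.5) / num_steps   # midpoints in (0, 1)
    avg_grad = np.zeros_like(x)
    for a in alphas:
        avg_grad += grad_fn(baseline + a * (x - baseline))
    avg_grad /= num_steps
    return (x - baseline) * avg_grad

# Toy differentiable "model": f(x) = tanh(w . x), with its analytic gradient.
w = np.array([0.5, -1.0, 2.0, 0.25])

def f(x):
    return np.tanh(w @ x)

def grad_f(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

x = np.array([1.0, -0.5, 0.3, 2.0])
baseline = np.zeros_like(x)                 # all-zero baseline
attributions = integrated_gradients(grad_f, x, baseline)
print(attributions, attributions.sum(), f(x) - f(baseline))
```

The attributions sum (up to discretization error) to the difference between the function values at the input and at the baseline, which is a convenient check on the number of steps.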

The idea of SmoothGrad (Smilkov et al., 2017) is to “smooth” the saliency map by averaging the importance values generated from many noisy versions of the input thereby smoothing the local fluctuations in the gradient. We use a similar idea to define smooth versions of CASO and CAFO. This yields the following interpretation objective.

###### Definition 4 (The Smooth CASO Interpretation)

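For a given sample $(x, y)$, we define the smooth context-aware second-order (Smooth CASO) importance function as follows:

$$\tilde{I}^{\lambda_1,\lambda_2}_{\theta^*}(x,y) := \max_{\Delta} \; \frac{1}{n} \sum_{1}^{n} \left( \nabla_z \ell(f_{\theta^*}(z), y)^t \Delta + \frac{1}{2} \Delta^t H_z \Delta \right) - \lambda_1 \|\Delta\|_1 - \lambda_2 \|\Delta\|_2^2 \tag{8}$$

where each $z$ is a noisy version of the input $x$, and $\lambda_1$ and $\lambda_2$ are defined as before.

In the smoothed versions, we average over $n$ noisy samples. Smooth CAFO is defined similarly, without the Hessian term.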

Since principled quantitative evaluations of saliency maps remain an open problem without properly annotated samples, we focus on some qualitative evaluations of different interpretation methods. Figure 4 shows a comparison between CAFO, CASO and other existing methods while more examples have been presented in Appendix Section G. We observe that including the group-feature in deep learning interpretation leads to a sparse saliency map, helping to eliminate the spurious noise and improving the quality of saliency maps.

## 6 Discussion

In this paper, we studied two aspects of the deep learning interpretation problem. First, by characterizing a closed-form formula for the Hessian matrix of a deep ReLU network, we showed that, if the confidence in the predicted class is high and the number of classes is large, first-order and second-order methods produce similar results. In the process, we also proved that in this regime the Hessian matrix is approximately of rank one and its leading eigenvector is approximately parallel to the gradient. These results can be insightful in other related problems such as adversarial examples. The extent of the Hessian impact for low-confidence predictions and for the case when the number of classes is small are interesting directions for future work. Second, we incorporated high-order feature dependencies in deep learning interpretation using a sparsity regularization term. This extension improves deep learning interpretation significantly.

Although significant progress has been made in tackling the deep learning interpretation problem, some open problems remain. For example, since saliency maps are high-dimensional vectors, they can be sensitive to noise and adversarial perturbations. Moreover, due to the lack of properly annotated datasets for the interpretation problem, the evaluation of interpretation methods is often qualitative and can be subjective. Resolving these issues is an interesting direction for future work.

## Appendix A Proofs

### A.1 Proof of Proposition 1

In this section, we derive the closed-form formula for the Hessian of the loss function of a deep ReLU network. Since a ReLU network is piecewise linear, it is locally linear around an input $x$. Thus the logits can be represented as:

$$f_\theta(x) = W^T x + b,$$

where $x$ is the input of dimension $d$, $f_\theta(x)$ are the logits, $W$ are the weights, and $b$ are the biases of the linear function. In this proof, we use $\hat{y}$ to denote the logits, $p$ to denote the class probabilities, $y$ to denote the label vector and $c$ to denote the number of classes. Each column $W_i$ of $W$ is the gradient of logit $\hat{y}_i$ with respect to the flattened input and can be easily computed with auto-grad software such as PyTorch (Paszke et al., 2017).

Thus

 ∂^yi∂x=Wi (9) p=softmax(^y) ℓ(p,y)=−c∑i=1yilog(pi). ∇^yℓ(p,y)=p−y ⟹∂ℓ(p,y)∂^yi=pi−yi (10) ∇xℓ(p,y)=c∑i=1∂^yi∂x×∂ℓ(p,y)∂^yi Using (???) and (???), ∇xℓ(p,y)=c∑i=1Wi(pi−yi) ⟹∇xℓ(p,y)=W(p−y)

Therefore, we have:

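$$\begin{aligned}
H_x &= \nabla_x\big(\nabla_x \ell(p, y)\big) = \nabla_x\Big(\sum_{i=1}^{c} W_i (p_i - y_i)\Big) \\
H_x &= \sum_{i=1}^{c} W_i \big(\nabla_x (p_i - y_i)\big)^T \\
H_x &= \sum_{i=1}^{c} W_i (\nabla_x p_i)^T
\end{aligned} \tag{11}$$

Deriving $\nabla_x p_i$:

$$\nabla_x p_i = \sum_{j=1}^{c} \frac{\partial \hat{y}_j}{\partial x} \times \frac{\partial p_i}{\partial \hat{y}_j} \implies \nabla_x p_i = \sum_{j=1}^{c} W_j \frac{\partial p_i}{\partial \hat{y}_j} \quad \text{(using (9))} \tag{12}$$

$$\frac{\partial p_i}{\partial \hat{y}_j} = \begin{cases} p_i (1 - p_i) & \text{if } i = j \\ -p_i p_j & \text{if } i \neq j \end{cases} \implies \nabla_{\hat{y}}\, p = \mathrm{diag}(p) - p p^T \tag{13}$$

Substituting (12) into (11),

$$H_x = \sum_{i=1}^{c} \sum_{j=1}^{c} W_i \frac{\partial p_i}{\partial \hat{y}_j} W_j^T \implies H_x = W\big(\mathrm{diag}(p) - p p^T\big) W^T \quad \text{(using (13))}.$$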

Thus we have,

$$\nabla_x \ell(p, y) = g_x = W(p - y) \tag{14}$$

$$H_x = W A W^T \tag{15}$$

where

$$A := \mathrm{diag}(p) - p p^T. \tag{16}$$

This completes the proof.
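The closed forms (14) and (15) can be checked numerically against finite differences. The following is a minimal NumPy sketch on a single linear region; the sizes `d` and `c`, the random `W`, `b`, `x`, and the helper names `softmax`, `loss` and `grad_and_hessian` are illustrative assumptions, not taken from the experiments.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x, W, b, y):
    # Cross-entropy of the locally linear model f(x) = W^T x + b.
    p = softmax(W.T @ x + b)
    return -np.sum(y * np.log(p))

def grad_and_hessian(x, W, b, y):
    # Closed forms: g_x = W (p - y) and H_x = W (diag(p) - p p^T) W^T.
    p = softmax(W.T @ x + b)
    A = np.diag(p) - np.outer(p, p)
    return W @ (p - y), W @ A @ W.T

# Assumed toy problem: d input dimensions, c classes, random linear region.
rng = np.random.default_rng(0)
d, c = 5, 4
W = rng.normal(size=(d, c))
b = rng.normal(size=c)
x = rng.normal(size=d)
y = np.eye(c)[1]  # one-hot label for class 1

g, H = grad_and_hessian(x, W, b, y)

# Central finite differences of the loss (gradient) and of the gradient (Hessian).
h = 1e-5
basis = np.eye(d)
g_fd = np.array([(loss(x + h * e, W, b, y) - loss(x - h * e, W, b, y)) / (2 * h)
                 for e in basis])
H_fd = np.array([(grad_and_hessian(x + h * e, W, b, y)[0]
                  - grad_and_hessian(x - h * e, W, b, y)[0]) / (2 * h)
                 for e in basis])

print("max gradient error:", np.abs(g - g_fd).max())
print("max Hessian error: ", np.abs(H - H_fd).max())
```

For an actual ReLU network, $W$ would be obtained as the Jacobian of the logits with respect to the input, which is valid as long as the perturbation stays inside the same linear region.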

### A.2 Proof of Theorem 2

To simplify notation, define $A$ as in (16). For an arbitrary row $i$ of the matrix $A$, we have

$$\sum_{j \neq i} |A_{ij}| = \sum_{j \neq i} |-p_i p_j| = p_i \sum_{j \neq i} p_j = p_i (1 - p_i), \qquad |A_{ii}| = p_i (1 - p_i).$$

Because $A_{ii} = p_i(1 - p_i) \geq 0$ and $|A_{ii}| \geq \sum_{j \neq i} |A_{ij}|$, the Gershgorin circle theorem implies that all eigenvalues of $A$ are non-negative, so $A$ is a positive semidefinite (PSD) matrix. Since $A$ is PSD, we can write $A = L L^T$. Using (15):

$$H_x = W A W^T = W L L^T W^T = W L (W L)^T.$$

Hence $H_x$ is a positive semidefinite matrix as well.
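The Gershgorin argument can be illustrated numerically. This is a minimal NumPy sketch, assuming a random probability vector `p` drawn from a Dirichlet distribution and a random `W`; both are illustrative choices rather than values from a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
c, d = 6, 8                      # assumed number of classes and input dimension
p = rng.dirichlet(np.ones(c))    # random class probabilities summing to one

A = np.diag(p) - np.outer(p, p)

# Gershgorin radii: off-diagonal absolute row sums equal the diagonal entries.
radii = np.abs(A).sum(axis=1) - np.abs(np.diag(A))

eig_A = np.linalg.eigvalsh(A)    # eigenvalues in ascending order

W = rng.normal(size=(d, c))
H = W @ A @ W.T
eig_H = np.linalg.eigvalsh(H)

print("smallest eigenvalue of A:", eig_A[0])
print("smallest eigenvalue of H:", eig_H[0])
```

Since every row of $A$ sums to zero, the all-ones vector is an eigenvector with eigenvalue zero, so $A$ is positive semidefinite but not positive definite.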

### A.3 Proof of Theorem 3

The second-order interpretation objective function is given by,

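$$\begin{aligned}
\tilde{\ell}(\Delta) &= \nabla_x \ell(f_{\theta^*}(x), y)^T \Delta + \frac{1}{2} \Delta^T H_x \Delta - \lambda_2 \|\Delta\|_2^2 \\
&= \nabla_x \ell(f_{\theta^*}(x), y)^T \Delta + \frac{1}{2} \Delta^T (H_x - 2\lambda_2 I) \Delta
\end{aligned}$$

where $H_x$ is the Hessian of the loss with respect to the input ($x$ is fixed). Therefore, if $2\lambda_2$ exceeds the largest eigenvalue of $H_x$, the matrix $H_x - 2\lambda_2 I$ is negative definite and $\tilde{\ell}$ is strongly concave.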
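The condition on $\lambda_2$ can be checked directly from the spectrum of $H_x - 2\lambda_2 I$. The sketch below uses NumPy with a randomly generated Hessian of the form $W A W^T$; the sizes, the random inputs, and the margin of `0.1` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, c = 8, 6                              # assumed sizes
W = rng.normal(size=(d, c))
p = rng.dirichlet(np.ones(c))
H = W @ (np.diag(p) - np.outer(p, p)) @ W.T

lam_max = np.linalg.eigvalsh(H)[-1]      # largest eigenvalue of H_x

# lambda_2 just above lam_max / 2: the quadratic form is negative definite.
lam2_ok = 0.5 * lam_max + 0.1
eigs_ok = np.linalg.eigvalsh(H - 2 * lam2_ok * np.eye(d))

# lambda_2 below lam_max / 2: at least one direction has positive curvature.
lam2_small = 0.25 * lam_max
eigs_small = np.linalg.eigvalsh(H - 2 * lam2_small * np.eye(d))

print("largest eigenvalue with lam2_ok:   ", eigs_ok[-1])
print("largest eigenvalue with lam2_small:", eigs_small[-1])
```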

### A.4 Proof of Theorem 4

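Let the class probabilities be denoted by $p$, the number of classes by $c$ and the label vector by $y$. We again use $g_x$ and $H_x$ as defined in (14) and (15) respectively. Without loss of generality, assume that the first class is the one with maximum probability. Hence,

$$y = [1, 0, 0, \dots, 0]^T. \tag{17}$$

We assume all other classes have small probability,

$$p_i = \epsilon \approx 0 \quad \forall\, i \in [2, c].$$

Since $\sum_{i=1}^{c} p_i = 1$, we have $p_1 = 1 - (c-1)\epsilon$, and therefore

$$p = [1 - (c-1)\epsilon,\ \epsilon,\ \dots,\ \epsilon]^T. \tag{18}$$

We define $A = \mathrm{diag}(p) - p p^T$, where

$$A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1c} \\ a_{21} & a_{22} & \dots & a_{2c} \\ \vdots & \vdots & \ddots & \vdots \\ a_{c1} & a_{c2} & \dots & a_{cc} \end{bmatrix}$$

$$\begin{aligned}
a_{11} &= 1 - (c-1)\epsilon - \big(1 - (c-1)\epsilon\big)^2 \\
a_{1i} = a_{i1} &= -\big(1 - (c-1)\epsilon\big)\epsilon && \forall\, i \in [2, c] \\
a_{ii} &= \epsilon - \epsilon^2 && \forall\, i \in [2, c] \\
a_{ij} &= -\epsilon^2 && \forall\, i, j \in [2, c],\ i \neq j
\end{aligned}$$

Ignoring $\epsilon^2$ terms:

$$\begin{aligned}
a_{11} &= (c-1)\epsilon \\
a_{1i} = a_{i1} &= -\epsilon && \forall\, i \in [2, c] \\
a_{ii} &= \epsilon && \forall\, i \in [2, c] \\
a_{ij} &= 0 && \forall\, i, j \in [2, c],\ i \neq j
\end{aligned}$$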

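Let $\lambda$ be an eigenvalue of $A$ and $v$ be a corresponding eigenvector, so that $A v = \lambda v$.
Let $v_1, \dots, v_c$ be the individual components of the eigenvector. The equation $A v = \lambda v$ can be rewritten in terms of its individual components as follows:

$$(c-1)\epsilon v_1 - \epsilon \sum_{i=2}^{c} v_i = \lambda v_1 \tag{19}$$

$$-\epsilon v_1 + \epsilon v_i = \lambda v_i \quad \forall\, i \in [2, c]$$

$$\implies v_i = \frac{\epsilon}{\epsilon - \lambda} v_1 \quad \forall\, i \in [2, c], \ \text{for } \lambda \neq \epsilon \tag{20}$$

$$\text{or } v_1 = 0, \ \text{for } \lambda = \epsilon \tag{21}$$

We first consider the case $\lambda \neq \epsilon$.
Substituting (20) in (19):

$$\begin{aligned}
(c-1)\epsilon v_1 - \epsilon \sum_{i=2}^{c} v_i &= (c-1)\epsilon v_1 - \frac{\epsilon^2}{\epsilon - \lambda} \sum_{i=2}^{c} v_1 \\
&= (c-1)\epsilon v_1 - \frac{\epsilon^2}{\epsilon - \lambda} (c-1) v_1 \\
&= (c-1)\epsilon v_1 - (c-1)\epsilon v_1 \frac{\epsilon}{\epsilon - \lambda} = \lambda v_1
\end{aligned}$$

$$(c-1)\epsilon v_1 \Big[1 - \frac{\epsilon}{\epsilon - \lambda}\Big] = \lambda v_1 \implies (c-1)\epsilon v_1 (-\lambda) = \lambda v_1 (\epsilon - \lambda) \implies \lambda v_1 (c\epsilon - \lambda) = 0$$

$$\implies \lambda = 0 \ \text{ or } \ v_1 = 0 \ \text{ or } \ \lambda = c\epsilon.$$

But $v_1 = 0 \implies v_i = \frac{\epsilon}{\epsilon - \lambda} v_1 = 0 \ \ \forall\, i \in [2, c] \implies v = 0$. Since $v$ is an eigenvector, it cannot be zero, so $\lambda = 0$ or $\lambda = c\epsilon$. Let $u_1$ be the corresponding eigenvector for $\lambda = c\epsilon$. By substituting $\lambda = c\epsilon$ in (20),

$$u_1^T \propto [1 - c,\ 1,\ \dots,\ 1]. \tag{22}$$

Dividing by the normalization constant,

$$u_1^T = \frac{1}{\sqrt{c(c-1)}} [1 - c,\ 1,\ \dots,\ 1]. \tag{23}$$

Now we consider the case $\lambda = \epsilon$.
Substituting (21), i.e. $v_1 = 0$, in (19):

$$-\epsilon \sum_{i=2}^{c} v_i = 0 \implies \sum_{i=2}^{c} v_i = 0.$$

The space of eigenvectors for $\lambda = \epsilon$ is therefore a $(c-2)$-dimensional subspace with $v_1 = 0$ and $\sum_{i=2}^{c} v_i = 0$. Let $u_i$, $i \in [2, c-1]$, be the eigenvectors with $\lambda = \epsilon$. Let $u_c$ be the eigenvector with $\lambda = 0$.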

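Writing $A$ in terms of its eigenvalues and eigenvectors,

$$A = c\epsilon\, u_1 u_1^T + \epsilon \sum_{i=2}^{c-1} u_i u_i^T.$$

Let $A_1 = c\epsilon\, u_1 u_1^T$ and $A_2 = \epsilon \sum_{i=2}^{c-1} u_i u_i^T$. Then $\|A_1\|_F = c\epsilon$ and $\|A_2\|_F = \epsilon\sqrt{c-2}$. Hence, as $c \to \infty$, $A = A_1 + A_2 \approx A_1$. Using (15),

$$H_x = W A W^T \approx W A_1 W^T \implies H_x \approx c\epsilon\, W u_1 u_1^T W^T. \tag{24}$$

Using (14), $g_x = \nabla_x \ell(p, y) = W(p - y)$. Let $W_i$ denote the $i$-th column of $W$. Using (17) and (18),

$$g_x = W_1 (1 - c)\epsilon + \sum_{i=2}^{c} W_i \epsilon = \epsilon \Big( W_1 (1 - c) + \sum_{i=2}^{c} W_i \Big).$$

Using (23),

$$g_x = \epsilon \sqrt{c(c-1)}\, W u_1 \implies W u_1 = \frac{g_x}{\epsilon \sqrt{c(c-1)}}. \tag{25}$$

Using (24) and (25),

$$H_x \approx c\epsilon\, W u_1 u_1^T W^T = c\epsilon\, W u_1 (W u_1)^T \implies H_x \approx \frac{g_x g_x^T}{\epsilon (c-1)}. \tag{26}$$

Thus, the Hessian is approximately rank one, and the gradient is parallel to its only eigenvector with a non-negligible eigenvalue.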
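The spectral structure used in this proof can be inspected numerically. The following NumPy sketch builds $p$ as in (18) and checks the leading eigenpair of $A$ and the alignment between $g_x$ and the leading eigenvector of $H_x$. The values `c = 1000`, `eps = 1e-6`, `d = 20` and the random Gaussian `W` are assumptions made for illustration; for a finite number of classes the rank-one approximation of $H_x$ is only approximate and its quality depends on $W$.

```python
import numpy as np

c, d, eps = 1000, 20, 1e-6       # assumed: many classes, tiny off-class probability

p = np.full(c, eps)
p[0] = 1.0 - (c - 1) * eps       # high-confidence prediction for the first class
y = np.zeros(c)
y[0] = 1.0

A = np.diag(p) - np.outer(p, p)
eigvals, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order

# Predicted leading eigenvector (23) and eigenvalue c * eps.
u1 = np.ones(c)
u1[0] = 1.0 - c
u1 /= np.sqrt(c * (c - 1))

top_eigval = eigvals[-1]
alignment_A = abs(eigvecs[:, -1] @ u1)        # |cosine| with the numerical eigenvector
A1 = c * eps * np.outer(u1, u1)
rel_err_A = np.linalg.norm(A - A1) / np.linalg.norm(A)   # Frobenius norms

# Random stand-in for the local linear map of the network.
rng = np.random.default_rng(3)
W = rng.normal(size=(d, c))
g = W @ (p - y)
H = W @ A @ W.T
lead_H = np.linalg.eigh(H)[1][:, -1]          # leading eigenvector of H_x
cos_gH = abs(g @ lead_H) / np.linalg.norm(g)

print("top eigenvalue of A vs c*eps:", top_eigval, c * eps)
print("relative error of A1:", rel_err_A)
print("|cos(g, leading eigenvector of H)|:", cos_gH)
```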

### A.5 Proof of Theorem 5

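We use $g_x = \nabla_x \ell(p, y)$ for simplicity (14).
When $\lambda_1 = 0$ in the CASO and CAFO objectives, the CASO objective becomes:

$$\max_{\Delta} \Big( g_x^T \Delta + \frac{1}{2} \Delta^T H_x \Delta - \lambda_2 \|\Delta\|_2^2 \Big).$$

Taking the derivative with respect to $\Delta$ and solving:

$$\Delta^*_{\mathrm{CASO}} = (2\lambda_2 I - H_x)^{-1} g_x.$$

Similarly, for the CAFO objective we get:

$$\Delta^*_{\mathrm{CAFO}} = \frac{1}{2\lambda_2} g_x.$$

Using (26),

$$H_x \approx \frac{g_x g_x^T}{\epsilon (c-1)} = \frac{\|g_x\|^2}{\epsilon (c-1)} \cdot \frac{g_x g_x^T}{\|g_x\|^2}.$$

Define $\mu = \frac{\|g_x\|^2}{\epsilon (c-1)}$. Thus $\mu$ is the eigenvalue of $H_x$ for the eigenvector $\frac{g_x}{\|g_x\|}$.

Consider the matrix $B = 2\lambda_2 I - H_x$. Let $z_1, \dots, z_d$ be the eigenvectors of $B$, where $z_1 = \frac{g_x}{\|g_x\|}$. The eigenvalue for $z_1$ is $2\lambda_2 - \mu$, and the eigenvalue for $z_i$ is $2\lambda_2$ for all $i \in [2, d]$. Hence

$$B = (2\lambda_2 - \mu)\, z_1 z_1^T + 2\lambda_2 \sum_{i=2}^{d} z_i z_i^T,$$

$$B^{-1} = \frac{1}{2\lambda_2 - \mu} \cdot \frac{g_x g_x^T}{\|g_x\|^2} + \frac{1}{2\lambda_2} \sum_{i=2}^{d} z_i z_i^T.$$

Since $\Delta^*_{\mathrm{CASO}} = B^{-1} g_x$ and each $z_i$, $i \in [2, d]$, is orthogonal to $g_x$,

$$\Delta^*_{\mathrm{CASO}} = \frac{g_x}{2\lambda_2 - \mu} = \frac{2\lambda_2}{2\lambda_2 - \mu}\, \Delta^*_{\mathrm{CAFO}}.$$

Hence $\Delta^*_{\mathrm{CASO}}$ is a scalar multiple of $\Delta^*_{\mathrm{CAFO}}$, and since scaling does not affect the visualization, the two interpretations are equivalent.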
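The proportionality between the two solutions can be reproduced in a few lines. This NumPy sketch takes the rank-one form (26) as exact, which is an assumption, and uses a random `g`, `eps = 1e-4`, `c = 1000` and `lam2 = mu` purely for illustration (any `lam2 > mu / 2` keeps the matrix invertible with positive eigenvalues).

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
g = rng.normal(size=d)            # stand-in for the gradient g_x
eps, c = 1e-4, 1000               # assumed off-class probability and class count

mu = (g @ g) / (eps * (c - 1))    # eigenvalue of the rank-one Hessian along g
H = np.outer(g, g) / (eps * (c - 1))

lam2 = mu                         # assumed regularization weight, satisfies 2*lam2 > mu

delta_caso = np.linalg.solve(2 * lam2 * np.eye(d) - H, g)   # second-order solution
delta_cafo = g / (2 * lam2)                                 # first-order solution

scale = 2 * lam2 / (2 * lam2 - mu)   # predicted ratio between the two solutions
cosine = (delta_caso @ delta_cafo) / (
    np.linalg.norm(delta_caso) * np.linalg.norm(delta_cafo))

print("predicted scale:", scale)
print("cosine similarity:", cosine)
```

With `lam2 = mu` the predicted scale is 2, so the second-order solution is the first-order solution stretched by a constant factor.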

## Appendix B Convergence of Gradient Descent to Solve CASO

A consequence of Theorem 3 is that gradient descent converges to the global optimizer of the second-order interpretation objective with a convergence rate of $O(1/t^2)$. More precisely, we have:

###### Corollary 1

Let $\tilde{\ell}(\Delta)$ be the objective function of the second-order interpretation objective defined in Section 2 (Definition 3). Let $\Delta^{(t)}$ be the value of $\Delta$ in the $t$-th step with a learning rate $\alpha$. We have

$$\tilde{\ell}(\Delta^{(t)}) - \tilde{\ell}(\Delta^*) \leq \frac{2 \|\Delta^{(0)} - \Delta^*\|_2^2}{\alpha (t+1)^2}.$$
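The convergence to the global optimizer can be illustrated on the quadratic objective of Theorem 3. The sketch below runs plain gradient ascent in NumPy and compares the iterate with the closed-form maximizer. The random problem, the choice `lam2 = lam_max`, the step size `alpha = 1 / (2 * lam2)` and the 200 iterations are all assumptions for illustration; the sketch does not attempt to verify the specific bound stated in the corollary.

```python
import numpy as np

rng = np.random.default_rng(5)
d, c = 8, 6                                   # assumed sizes
W = rng.normal(size=(d, c))
p = rng.dirichlet(np.ones(c))
y = np.eye(c)[0]

g = W @ (p - y)                               # gradient (14)
H = W @ (np.diag(p) - np.outer(p, p)) @ W.T   # Hessian (15)

lam_max = np.linalg.eigvalsh(H)[-1]
lam2 = lam_max                                # assumed; satisfies lam2 > lam_max / 2

# Objective: g^T delta + 0.5 * delta^T H delta - lam2 * ||delta||^2.
# Its gradient is g - M @ delta with M = 2 * lam2 * I - H (positive definite).
M = 2 * lam2 * np.eye(d) - H
delta_star = np.linalg.solve(M, g)            # closed-form maximizer

alpha = 1.0 / (2 * lam2)                      # step size: eigenvalues of M are at most 2 * lam2
delta = np.zeros(d)
for _ in range(200):
    delta = delta + alpha * (g - M @ delta)   # gradient ascent step

print("distance to optimum:", np.linalg.norm(delta - delta_star))
```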

## Appendix C Efficient Computation of the Hessian Matrix Using the Cholesky decomposition

By Theorem 2, the Cholesky decomposition of $A$ (defined in (16)) exists. Let $L$ be the Cholesky factor of $A$. Thus, we have

$$A = L L^T.$$

Let $B = W L$. Thus, $H_x$ can be re-written as $H_x = B B^T$.

Let the SVD of $B$ be as follows:

$$B = U \Sigma V^T.$$

Thus, we can write:

$$H_x = U \Sigma^2 U^T.$$

Define