The growing use of deep learning in sensitive applications such as medicine, autonomous driving, and finance raises concerns about human trust in machine learning systems. For trained models, a central question is test-time interpretability: how can humans understand the reasoning behind model predictions? A common interpretation approach is to identify the importance of input features for a model’s prediction. A saliency map can then visualize important pixels of an image (Simonyan et al., 2014; Sundararajan et al., 2017) or words in a sentence (Li et al., 2016).
In the last few years, several approaches have been proposed to tackle this problem. For example, Simonyan et al. (2014) compute the gradient of the class score with respect to the input, while Smilkov et al. (2017) average gradient-based importance values generated from several noisy versions of the input. Sundararajan et al. (2017) define a baseline, which represents an input absent of information, and determine feature importance by accumulating gradient information along the path from the baseline to the original input. Alvarez-Melis & Jaakkola (2018) build interpretable neural networks by learning basis concepts that satisfy an interpretability criterion, while Adebayo et al. (2018) propose methods to assess the scope and quality of saliency maps. Although these methods can produce visually pleasing results, they often make weak model approximations (Adebayo et al., 2018; Nie et al., 2018) and can be sensitive to noise and adversarial perturbations (Kindermans et al., 2017; Ghorbani et al., 2017).
Existing deep learning interpretation methods mainly rely on two key assumptions:
The gradient-based loss surrogate assumption: For computational efficiency, several existing methods (e.g. (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017)) assume that the loss function is almost linear at the test point. Thus, they use variations of the input gradient to compute feature importance.
The isolated feature importance assumption: Current methods evaluate the importance of each feature in isolation, assuming all other features are fixed. Features, however, may have complex inter-dependencies that can be learned by the model.
In this work, we study the impact of relaxing these assumptions in deep learning interpretation. To relax the first assumption, we use the second-order approximation of the loss function by keeping the Hessian term in its Taylor expansion. For a deep ReLU network and the cross-entropy loss function, we compute the Hessian term in closed form. Using the closed-form formula for the Hessian matrix, we prove the following result:
Theorem 1 (informal version)
If the probability of the predicted class is close to one and the number of classes is large, solutions of the first-order and second-order interpretation methods are sufficiently close to each other.
We present a formal version of this result in Theorem 5 and validate it empirically as well. For example, on ImageNet, which has more than 1,000 classes, we show that incorporating the Hessian term in deep learning interpretation has a small impact for most images. This is consistent with our theory.
The key proof idea follows from the fact that when the number of classes is large and the confidence in the predicted class is high, the Hessian of the loss function is approximately of rank one. More specifically, the square of the largest eigenvalue is significantly larger than the sum of the squares of the remaining eigenvalues. Moreover, the corresponding eigenvector is approximately parallel to the gradient vector (Theorem 4). This causes the first-order and second-order methods to perform similarly. Note that this result can be extended to related problems such as adversarial examples, where the most common methods are based on the first-order approximation of the loss function (Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Carlini & Wagner, 2016).
In the second part of the paper, we relax the second assumption of current interpretation approaches (i.e., the isolated feature importance assumption). To incorporate feature inter-dependencies in deep learning interpretation, we define the importance function over subsets of features, referred to as group-features. We adjust the subset size on a per-example basis using an unsupervised approach, making the interpretation method context-aware. Including the group-feature in deep learning interpretation makes the optimization combinatorial. To circumvent computational issues, we use an $\ell_1$ relaxation, as is common in compressive sensing (Candes & Tao, 2005; Donoho, 2006), LASSO regression (Tibshirani, 1996), etc. To efficiently compute a solution for the relaxed optimization, we employ proximal gradient descent (Parikh & Boyd, 2014). Our empirical results indicate that incorporating group-features significantly improves the quality of interpretation results.
Below we summarize our contributions in this paper:
We study the impact of the second-order approximation of the loss function in deep learning interpretation. We prove that, under certain conditions, solutions of the first-order and second-order interpretation methods are sufficiently close to each other (Theorems 4 and 5). This result can be insightful in other related problems such as adversarial examples. Our empirical results on ImageNet samples are consistent with our theory.
Finally, we include inter-dependencies among features in deep learning interpretation by computing the importance of group-features. Borrowing some results from compressive sensing (Section 4 and Appendix Section E), we develop a computationally efficient approach to solve the underlying optimization. Our empirical results indicate that considering group features significantly improves deep learning interpretation.
In what follows, we explain these results in more detail. All proofs are presented in the Supplementary Materials.
2 Problem Setup and Notation
Consider a prediction problem from input variables (features) to an output variable . For example, in the image classification problem, is the space of images and is the set of labels . We observe samples from these variables, namely . Let be the observed empirical distribution (for simplicity, we hide the dependency of on ). The empirical risk minimization (ERM) approach computes the optimal predictor for a loss function using the following optimization:
Let be a subset of with cardinality . For a given sample , let indicate the features of in positions . We refer to as a group-feature of . The importance of a group-feature is proportional to the change in the loss function when is perturbed. We select the group-feature with maximum importance and visualize that subset in a saliency map.
Definition 1 (Group-Feature Importance Function)
Let be the optimizer of the ERM problem (1). For a given sample , we define the group-feature importance function as follows:
where counts the number of non-zero elements of its argument (known as the $\ell_0$ norm). The parameter characterizes an upper bound on the cardinality of the group-features. The parameter characterizes an upper bound on the norm of feature perturbations.
If is the solution of optimization (2), then the vector contains the feature importance values that are visualized in the saliency map. Note, when this definition simplifies to current feature importance formulations which consider features in isolation. When , our formulation can capture feature interdependencies. Parameters and in general depend on the test sample (i.e., the size of the group-feature differs for each image and model). We introduce an unsupervised metric to determine these parameters in Section 4.1, but assume these parameters are given for the time being.
The cardinality constraint (i.e., the constraint on the group-feature size) leads to a combinatorial optimization problem in general. Such a sparsity constraint has appeared in different problems such as compressive sensing (Candes & Tao, 2005; Donoho, 2006) and LASSO regression (Tibshirani, 1996). Under certain conditions, one can show that, without loss of generality, the $\ell_0$ norm can be relaxed to the (convex) $\ell_1$ norm (Appendix Section E).
Our goal is to solve optimization (2) which is non-linear and non-concave in . Current approaches do not consider the cardinality constraint and optimize by linearizing the objective function (i.e., using the gradient). To incorporate group features into the current methods, we can add the constraints of optimization (2) to the objective function using Lagrange multipliers. This yields the following Context-Aware First-Order (CAFO) interpretation function.
Definition 2 (The CAFO Interpretation)
For a given sample , we define the Context-Aware First-Order (CAFO) importance function as follows:
where and are non-negative regularization parameters. We refer to the objective of this optimization as , hiding its dependency on and to simplify notation.
Large values of regularization parameters and in optimization (3) correspond to small values of parameters and in optimization (2). Incorporating group-features naturally leads to a sparsity regularizer through the $\ell_1$ penalty. Note, this is not a hard constraint which forces a sparse interpretation. Instead, given proper choice of the regularization coefficients, the interpretation will reflect the sparsity used by the underlying model. In Section 4.1, we detail our method for setting for a given test sample (context-aware) based on the sparsity ratio of CAFO’s optimal solution. Moreover, in Appendix Section E, we show that under some general conditions, optimization (3) can be solved efficiently and its solution matches that of the original optimization (2).
To have a better approximation of the loss function, we use the second-order Taylor expansion of the loss function around point as follows:
where and is the Hessian of the loss function on the input features (note is fixed). This second-order expansion of the loss function decreases the interpretation’s model approximation error.
We show that by choosing proper values for regularization parameters, the resulting optimization using the second-order surrogate loss is strictly a convex minimization (or equivalently concave maximization) problem, allowing for efficient optimization using gradient descent (Theorem 3). Moreover, even though the Hessian matrix can be expensive to compute for large neural networks, gradient updates of our method only require the Hessian-vector product (i.e., ) which can be computed efficiently (Pearlmutter, 1994). This yields the following Context-Aware Second-Order (CASO) interpretation function.
Definition 3 (The CASO Interpretation)
For a given sample , we define the Context-Aware Second-Order (CASO) importance function as follows:
We refer to the objective of this optimization as . and are defined as in (3).
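The Hessian-vector product mentioned above can be illustrated without ever forming the Hessian. Pearlmutter's method obtains the product exactly through automatic differentiation; the minimal sketch below swaps in a central finite difference of the gradient instead, because it needs nothing more than a gradient function. The names `hvp_finite_difference`, `grad_fn`, and the toy quadratic loss are assumptions made for illustration and are not part of the original method.

```python
import numpy as np

def hvp_finite_difference(grad_fn, x, v, eps=1e-4):
    """Approximate H(x) @ v from two gradient evaluations (central difference)."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

# Toy loss 0.5 * x.T @ Q @ x, whose gradient is Q @ x and whose Hessian is Q.
Q = np.array([[2.0, 0.5], [0.5, 1.0]])

def grad_fn(x):
    return Q @ x

x0 = np.array([1.0, -1.0])
v = np.array([0.3, 0.7])
hv = hvp_finite_difference(grad_fn, x0, v)  # should agree with Q @ v
```

The cost is two gradient evaluations per product, which mirrors the point that a Hessian-vector product costs about as much as a gradient.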
3 Understanding Impact of the Hessian in Deep Learning Interpretation
The Hessian is, by definition, useful when the loss function at the test point has high curvature. However, given the linear nature of popular network architectures with piecewise linear activations (e.g., ReLU (Glorot et al., 2011), Maxout (Goodfellow et al., 2013)), do these regions of high curvature even exist? We answer this question for neural networks with piecewise linear activations by providing an exact calculation of the input Hessian. We use this derivation to understand the impact of including the Hessian term in deep learning interpretation. More specifically, we prove that when the probability of the predicted class is close to 1 and the number of classes is large, the second-order interpretation is similar to the first-order one. We verify this theoretical result experimentally over images in the ImageNet dataset. We also observe that when the confidence in the predicted class is low, the second-order interpretation can be significantly different from the first-order interpretation. Since second-order interpretations take into account the curvature of the model, we conjecture that they are more faithful to the underlying model in these cases.
3.1 A Closed-form Hessian Formula for Deep ReLU Networks
We present an abridged version of the exact Hessian calculation here; the details are provided in Appendix Section A.1. (We ignore points at which the function is non-differentiable, as they form a measure-zero set.) Because a ReLU network is piecewise linear, it is locally linear around an input, and the network can thus be written as:
where is the input of dimension , are the logits, are the weights, and are the biases of the linear function. Note that combines weights of different layers from the input to the output of the network. Each row of is the gradient of logit with respect to the flattened input and can be handled in auto-grad software such as PyTorch (Paszke et al., 2017). We define:
where denotes the number of classes, denotes the class probabilities, and is the cross-entropy loss function.
In this case, we have the following result:
Proposition 1
The input Hessian is given by:
where is a diagonal matrix whose diagonal elements are equal to .
The first observation from Proposition 1 is as follows:
Theorem 2
The input Hessian is a positive semidefinite matrix.
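As a sanity check on the closed-form result, the sketch below builds the input Hessian of the cross-entropy loss for a purely linear logit map and compares it with a numerical Hessian. It assumes the standard form W^T (diag(p) - p p^T) W, with each row of W holding the gradient of one logit, which follows from the chain rule for softmax cross-entropy; the function names and toy sizes are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def input_gradient(W, b, x, y):
    """Gradient of cross-entropy w.r.t. x for logits z = W @ x + b (W is c x d)."""
    return W.T @ (softmax(W @ x + b) - y)

def input_hessian(W, b, x):
    """Closed form W.T @ (diag(p) - p p.T) @ W for a locally linear network."""
    p = softmax(W @ x + b)
    A = np.diag(p) - np.outer(p, p)
    return W.T @ A @ W

rng = np.random.default_rng(0)
c, d = 4, 6                      # toy sizes: number of classes, input dimension
W = rng.standard_normal((c, d))
b = rng.standard_normal(c)
x = rng.standard_normal(d)
y = np.eye(c)[1]                 # one-hot label
H = input_hessian(W, b, x)

# Numerical Hessian: central differences of the gradient, one column per coordinate.
eps = 1e-5
H_num = np.column_stack([
    (input_gradient(W, b, x + eps * e, y) - input_gradient(W, b, x - eps * e, y)) / (2 * eps)
    for e in np.eye(d)
])
```

The smallest eigenvalue of `H` is non-negative up to round-off, in line with the positive semidefiniteness stated above.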
These two results allow an extremely efficient computation of the Hessian’s eigenvectors and eigenvalues using the Cholesky decomposition of . See Appendix Section C for full details. Note the use of decomposition is critical as storing the Hessian requires intractable amounts of memory for high dimensional inputs. The entire calculation of the Hessian’s decomposition for ImageNet using a ResNet-50 (He et al., 2016) runs in approximately 4.2 seconds on an NVIDIA GTX 1080 Ti.
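The memory argument can be made concrete with a small sketch. Assuming the Hessian has the form W^T A W with A = diag(p) - p p^T, one can factor A = S S^T, set B = W^T S, and recover the non-zero eigenpairs of the d x d Hessian from the c x c matrix B^T B. Because A is singular (its rows sum to zero), this sketch uses a symmetric eigendecomposition for the factor S rather than a Cholesky routine; that substitution, the function name `hessian_eig_lowrank`, and the tolerance are assumptions for illustration.

```python
import numpy as np

def hessian_eig_lowrank(W, p, tol=1e-12):
    """Eigenpairs of H = W.T @ (diag(p) - p p.T) @ W without forming the d x d matrix."""
    A = np.diag(p) - np.outer(p, p)            # c x c, positive semidefinite
    lam, U = np.linalg.eigh(A)
    S = U * np.sqrt(np.clip(lam, 0.0, None))   # column scaling, so A = S @ S.T
    B = W.T @ S                                # d x c factor with H = B @ B.T
    mu, V = np.linalg.eigh(B.T @ B)            # small c x c eigenproblem
    keep = mu > tol * mu.max()                 # drop the numerically zero eigenvalue(s)
    mu, V = mu[keep], V[:, keep]
    vecs = (B @ V) / np.sqrt(mu)               # unit-norm eigenvectors of H
    return mu[::-1], vecs[:, ::-1]             # sorted by decreasing eigenvalue

rng = np.random.default_rng(1)
c, d = 5, 40
W = rng.standard_normal((c, d))                # rows: gradients of the logits
p = rng.dirichlet(np.ones(c))                  # hypothetical class probabilities
eigvals, eigvecs = hessian_eig_lowrank(W, p)
```

Only matrices of size c x c and d x c are stored, so the cost scales with the number of classes rather than with the square of the input dimension.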
To the best of our knowledge, this is the first work which derives the exact Hessian decomposition for piecewise linear networks. Yao et al. (2018) also proved the Hessian for piecewise linear networks is at most rank but did not derive the exact input Hessian.
One advantage of having a closed-form formula for the Hessian matrix (6) is that we can use it to properly set the regularization parameter in CASO’s formulation. To do this, we rely on the following result:
Theorem 3
If is the largest eigenvalue of , for any value of , the second-order interpretation objective function (5) is strongly concave.
We use Theorem 3 to set the regularization parameter for CASO. We need to set large enough to make the optimization convex, but not so large that it overpowers . In particular, we set , where we choose . For CAFO, we set . We estimate using the power iteration method. In our experiments, we found that around 10 iterations are sufficient for convergence of the power iteration method.
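A minimal sketch of the power iteration step follows. It assumes access to a function that returns the product of the (positive semidefinite) Hessian with a vector, and it reports the Rayleigh quotient of the final iterate as the eigenvalue estimate. The helper name `power_iteration` and the diagonal toy matrix are illustrative assumptions.

```python
import numpy as np

def power_iteration(hvp, dim, num_iters=10, seed=0):
    """Estimate the largest eigenvalue of a PSD matrix given only v -> H @ v."""
    v = np.random.default_rng(seed).standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)
    return float(v @ hvp(v))   # Rayleigh quotient at the final iterate

# Toy PSD matrix with a well-separated top eigenvalue of 5.
H = np.diag([5.0, 1.0, 0.5, 0.1])
L_est = power_iteration(lambda v: H @ v, dim=4, num_iters=10)
```

How quickly the estimate settles depends on the gap between the two largest eigenvalues; the roughly 10 iterations reported above correspond to a well-separated top eigenvalue.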
3.2 Theoretical results on the Hessian impact
In this section, we leverage the exact Hessian calculation to prove that when the probability of predicted class is 1 and number of classes is large, the Hessian of a piecewise linear neural network is approximately of rank one and its eigenvector is approximately parallel to the gradient. Since a constant scaling does not affect the visualization, this causes the two interpretations to be approximately similar to each other.
Theorem 4
If the probability of the predicted class is 1-(c-1) , where , then as c such that , the Hessian is of rank one and its eigenvector is parallel to the gradient.
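The rank-one behaviour can be observed numerically in logit space. The sketch below assumes every non-predicted class receives the same small probability `eps`, builds diag(p) - p p^T (the Hessian of cross-entropy with respect to the logits), and measures two quantities: how strongly the top eigenvalue dominates, and how well the top eigenvector aligns with the logit-space gradient p - y. The specific values of `c` and `eps`, and the choice of the predicted class as the label, are illustrative assumptions.

```python
import numpy as np

c, eps = 1000, 1e-6
p = np.full(c, eps)
p[0] = 1.0 - (c - 1) * eps            # confident prediction for class 0
y = np.zeros(c)
y[0] = 1.0                            # assume the predicted class is the label

A = np.diag(p) - np.outer(p, p)       # Hessian of cross-entropy w.r.t. the logits
lam, U = np.linalg.eigh(A)            # eigenvalues in ascending order
top, rest = lam[-1], lam[:-1]
dominance = top ** 2 / np.sum(rest ** 2)

g = p - y                             # gradient of cross-entropy w.r.t. the logits
cosine = abs(U[:, -1] @ g) / np.linalg.norm(g)
```

Under these assumptions `dominance` grows roughly in proportion to the number of classes and `cosine` is essentially 1; the input-space Hessian inherits this structure through the locally linear map from inputs to logits.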
3.3 Empirical results on the Hessian impact
In this section, we present empirical results on the impact of the second-order loss approximation in deep learning interpretation. In experiments of this section, to isolate the impact of the Hessian term, we assume in both CASO and CAFO optimizations.
A consequence of Theorem 3 is that the gradient descent method with Nesterov momentum converges to the global optimizer of the second-order interpretation objective with a convergence rate of (Appendix Section B).
To optimize , the gradient is given by:
The gradient term and the regularization term are straightforward to implement using standard backpropagation.
To compute the Hessian-vector product term , we rely on the result of Pearlmutter (1994): a Hessian-vector product can be computed in the same time as the gradient . This is handled easily in modern auto-grad software. Moreover, for ReLU networks, our closed-form formula for the Hessian term (Proposition 1) can be used in the computation of the Hessian-vector product as well. In our experiments, we use the closed-form formula for and proximal gradient descent for .
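For a locally linear network, the closed-form Hessian also gives a matrix-free Hessian-vector product. The sketch below assumes the form W^T (diag(p) - p p^T) W and applies the three factors from right to left, so that only vectors of length c and d are created. The function name `relu_net_hvp` and the toy sizes are hypothetical.

```python
import numpy as np

def relu_net_hvp(W, p, v):
    """Compute W.T @ (diag(p) - p p.T) @ W @ v without building the d x d Hessian."""
    u = W @ v                       # project the direction onto the logits (length c)
    Au = p * u - p * (p @ u)        # apply diag(p) - p p.T without forming it
    return W.T @ Au                 # map back to input space (length d)

rng = np.random.default_rng(2)
c, d = 3, 8
W = rng.standard_normal((c, d))     # rows: gradients of the logits w.r.t. the input
p = rng.dirichlet(np.ones(c))       # hypothetical class probabilities
v = rng.standard_normal(d)
hv = relu_net_hvp(W, p, v)
```

Each product costs two matrix-vector multiplications with W, which is what makes gradient-based optimization of the second-order objective practical for high-dimensional inputs.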
We compare second-order interpretations (CASO with ) and the first-order variant (CAFO with ) empirically. Note that when , where is the gradient and is the interpretation obtained using the CAFO objective.
We compute second-order and first-order interpretations for 1000 random samples on the ImageNet ILSVRC-2013 (Russakovsky et al., 2015) validation set using a ResNet-50 (He et al., 2016) model. Our loss function is the cross-entropy loss. After calculating for all methods, the values must be normalized for visualization in a saliency map. We apply a normalization technique from existing work which we describe in Appendix Section D.
We plot the Frobenius norm of the difference between CASO and CAFO in Figure 1. Before taking the difference, we normalize the solutions produced by CASO and CAFO to have the same norm because a constant scaling of elements of does not change the visualization.
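The comparison metric described here can be sketched in a few lines: scale both interpretations to unit norm, then take the Frobenius norm of their difference. The function name and the toy arrays below are illustrative assumptions, not the paper's data.

```python
import numpy as np

def normalized_difference(delta_a, delta_b):
    """Frobenius norm of the difference after scaling both maps to unit norm."""
    a = delta_a / np.linalg.norm(delta_a)
    b = delta_b / np.linalg.norm(delta_b)
    return float(np.linalg.norm(a - b))

first_order = np.array([[1.0, 2.0], [0.0, -1.0]])
same_up_to_scale = 3.0 * first_order            # differs only by a constant factor
different = np.array([[0.0, 0.0], [1.0, 0.0]])  # highlights a different pixel

d_same = normalized_difference(first_order, same_up_to_scale)
d_diff = normalized_difference(first_order, different)
```

Two maps that differ only by a positive constant factor give a difference of zero, which is why the normalization is applied before comparing.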
The empirical results are consistent with our theoretical results (Figure 1): the second-order interpretation results are similar to the first-order ones when the classification confidence probability is high. However, when the classification confidence probability is small, including the Hessian term can be useful in deep learning interpretation.
To observe the difference between CAFO and CASO interpretations in both regimes qualitatively, we compare them in Figure 2 for an image where the confidence probability is high and for one where it is low. When the confidence probability is high, CAFO $\approx$ CASO, and when this probability is low, CASO $\neq$ CAFO.
4 Understanding Impact of the Group-Feature
In this section, we study the impact of the group feature in deep learning interpretation. The group feature has been included as the sparsity constraint in optimization (2).
To obtain an unconstrained concave optimization for the CASO interpretation, we relaxed the sparsity (cardinality) constraint (often called an $\ell_0$ norm constraint) to a convex $\ell_1$ norm constraint. Such a relaxation is a core component for popular learning methods such as compressive sensing (Candes & Tao, 2005; Donoho, 2006) or LASSO regression (Tibshirani, 1996). Using results from this literature, we show this relaxation is tight under certain conditions on the Hessian matrix (see Appendix Section E). In other words, the optimal of optimization (5) is sparse with the proper choice of regularization parameters.
One method for optimizing this objective is to apply the gradient descent method used in the second-order interpretation but with the addition of an $\ell_1$ regularization penalty. In our early experiments, we found that this procedure leads to poor convergence properties in practice. This is partially due to the non-smoothness of the $\ell_1$ regularization term.
To resolve this issue, we instead use proximal gradient descent to compute a solution for CAFO and CASO when . Using the Nesterov momentum method and backtracking with proximal gradient descent gives a convergence rate of where is the number of gradient updates (Appendix Section B). Proximal gradient descent has been used in other deep learning problems, including adversarial examples, as well (e.g., Chen et al., 2017).
Below we explain how we use the proximal gradient descent to include the group features in deep learning interpretation. First, we write the objective function as the sum of a smooth and non-smooth function:
Let be the smooth, be the non-smooth part:
The gradient of the smooth objective is given by:
The proximal mapping is given by:
This formula can be understood intuitively as follows. If the magnitude of some elements of is below a certain threshold (), proximal mapping sets those values to zero. This leads to values that are exactly zero in the saliency map. This can be viewed as removing noise by a certain thresholding procedure.
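The thresholding behaviour described here is that of the standard soft-thresholding operator, which is the proximal map of a scaled $\ell_1$ norm. The sketch below is a minimal version; the function name `prox_l1` and the example values are assumptions for illustration.

```python
import numpy as np

def prox_l1(delta, threshold):
    """Proximal map of threshold * ||.||_1: shrink entries toward zero, zero out small ones."""
    return np.sign(delta) * np.maximum(np.abs(delta) - threshold, 0.0)

delta = np.array([0.8, -0.05, 0.3, -0.6, 0.02])
sparse = prox_l1(delta, threshold=0.1)
```

Entries whose magnitude is below the threshold become exactly zero, while the remaining entries are shrunk toward zero by the threshold.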
To optimize , we use FISTA (Beck & Teboulle, 2009) with backtracking and the Nesterov momentum optimizer with a learning rate of for 10 iterations and decay factor of . is initialized to zero.
FISTA takes a step with learning rate to reduce the smooth objective loss , then applies a proximal mapping to the resulting . Backtracking reduces the learning rate when the update results in higher loss.
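The update loop can be sketched on a toy problem. The code below assumes the second-order objective has the form g^T d + 0.5 d^T H d - lam1 ||d||_1 - lam2 ||d||_2^2 and minimizes its negative with FISTA. For brevity it uses a fixed step size of 1 / (2 * lam2) in place of backtracking, which is valid when H is positive semidefinite and 2 * lam2 exceeds its largest eigenvalue. The function names, the diagonal toy Hessian, and the parameter values are illustrative assumptions.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: proximal map of t * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fista_caso(g, H, lam1, lam2, num_iters=200):
    """Minimize -(g.T d + 0.5 d.T H d) + lam2 * ||d||^2 + lam1 * ||d||_1 with FISTA."""
    step = 1.0 / (2.0 * lam2)     # 2 * lam2 bounds the gradient Lipschitz constant
    d = np.zeros_like(g)          # current iterate, initialized to zero
    z, t = d.copy(), 1.0          # extrapolated point and momentum scalar
    for _ in range(num_iters):
        grad_smooth = -g - H @ z + 2.0 * lam2 * z
        d_next = prox_l1(z - step * grad_smooth, step * lam1)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = d_next + ((t - 1.0) / t_next) * (d_next - d)
        d, t = d_next, t_next
    return d

# Diagonal toy Hessian: the problem separates per coordinate, so the answer is
# soft_threshold(g, lam1) / (2 * lam2 - diag(H)).
H = np.diag([2.0, 1.0, 0.5, 0.0])
g = np.array([1.0, -0.2, 0.6, 0.05])
delta = fista_caso(g, H, lam1=0.3, lam2=2.0)
```

In this example the two coordinates with small gradient magnitude end up exactly zero, which is the sparsity the proximal step is meant to produce.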
4.1 Impact of group features in interpretation
In this section, our goal is to understand the impact of the group features in deep learning interpretation. In our experiments, we focus on the image classification problem because visual interpretations are intuitive and allow for comparison with prior work. We use a ResNet-50 (He et al., 2016) model on the ImageNet ILSVRC-2013 dataset (Russakovsky et al., 2015).
To gain intuition for the effect of , we show a sweep over values in Figure 3. We observe that when is set too high or too low, the interpretation breaks down as the importance values are relatively constant across the image (all high or all zero).
Different approaches to set the regularization parameter have been explored in different problems. For example, in LASSO, one common approach is to use Least Angle Regression (Efron et al., 2004).
In the deep learning interpretation problem, we propose an unsupervised method based on the sparsity ratio of the interpretation solution to set a proper value for . We define , the sparsity ratio, as the number of zero pixels divided by the total number of pixels. We tune in an unsupervised fashion (since we do not know the ground truth interpretation) by increasing until reaches all zeros. We optimize with = [0, 10, 10, 10, 6.2510, 1.2510, 2.510, 510]. For interpretations with sparsity above a certain threshold (e.g. in our examples), we choose the interpretation with the highest loss on the original model. In practice, we batch different values of to find a reasonable parameter setting efficiently.
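The sparsity ratio and the sweep over the sparsity regularization parameter can be sketched as follows. The example assumes the first-order objective has the form grad^T d - lam1 ||d||_1 - lam2 ||d||_2^2, whose maximizer is a rescaled soft-thresholding of the gradient. The grid of values, the random 8 x 8 "gradient", and the function names are purely illustrative and are not the settings used in the experiments.

```python
import numpy as np

def cafo_closed_form(grad, lam1, lam2):
    """Maximizer of grad.T d - lam1 * ||d||_1 - lam2 * ||d||_2^2 (assumed CAFO form)."""
    return np.sign(grad) * np.maximum(np.abs(grad) - lam1, 0.0) / (2.0 * lam2)

def sparsity_ratio(delta):
    """Fraction of entries that are exactly zero."""
    return float(np.mean(delta == 0.0))

rng = np.random.default_rng(4)
grad = rng.standard_normal((8, 8)) * 1e-2        # hypothetical 8 x 8 input gradient
lam1_grid = [0.0, 1e-3, 5e-3, 1e-2, 2e-2, 1.0]   # illustrative values only
ratios = [sparsity_ratio(cafo_closed_form(grad, l1, lam2=1.0)) for l1 in lam1_grid]
```

The ratio rises monotonically from 0 (dense map) to 1 (all zeros) as the parameter increases; the selection rule described above would then pick, among candidates whose ratio exceeds the chosen threshold, the one with the highest loss on the original model.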
5 Qualitative Comparison of Deep Learning Interpretation Methods
In this section, we briefly review prior approaches for the deep learning interpretation and compare their performance qualitatively. The proposed Hessian and group feature terms can be potentially included in these approaches as well.
Vanilla Gradient: Simonyan et al. (2014) propose to compute the gradient of the class score with respect to the input.
SmoothGrad: Smilkov et al. (2017) argue that the input gradient may fluctuate sharply in the region local to the test point. To address this, they average gradient-based importance values generated from many noisy versions of the input.
Integrated Gradients: Sundararajan et al. (2017) define a baseline, which represents an input absent of information (e.g., a completely zero image). Feature importance is determined by accumulating gradient information along the path from the baseline to the original input: . The integral is approximated by a finite sum.
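The three baselines can be contrasted on a toy differentiable score. The sketch below uses a softplus of a linear function as a stand-in for a class score, with an analytic gradient, and approximates the integrated-gradients path integral with a midpoint Riemann sum along the straight line from the baseline to the input. The weights, noise level, sample counts, and function names are illustrative assumptions.

```python
import numpy as np

w = np.array([1.5, -2.0, 0.5])           # hypothetical model parameters

def score(x):
    """Toy differentiable class score: softplus of a linear function."""
    return np.log1p(np.exp(w @ x))

def score_grad(x):
    return w / (1.0 + np.exp(-(w @ x)))   # sigmoid(w . x) * w

def vanilla_gradient(x):
    return score_grad(x)

def smoothgrad(x, sigma=0.1, n_samples=50, seed=0):
    rng = np.random.default_rng(seed)
    noisy = x + sigma * rng.standard_normal((n_samples, x.size))
    return np.mean([score_grad(xn) for xn in noisy], axis=0)

def integrated_gradients(x, baseline, n_steps=200):
    # Midpoint Riemann sum of the gradient along the straight path baseline -> x.
    alphas = (np.arange(n_steps) + 0.5) / n_steps
    avg_grad = np.mean([score_grad(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

x = np.array([0.4, -0.3, 1.0])
vg = vanilla_gradient(x)
sg = smoothgrad(x)
ig = integrated_gradients(x, baseline=np.zeros(3))
```

A useful check for the finite-sum approximation is completeness: the integrated-gradients attributions should sum to approximately the difference between the score at the input and the score at the baseline.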
The idea of SmoothGrad (Smilkov et al., 2017) is to “smooth” the saliency map by averaging the importance values generated from many noisy versions of the input thereby smoothing the local fluctuations in the gradient. We use a similar idea to define smooth versions of CASO and CAFO. This yields the following interpretation objective.
Definition 4 (The Smooth CASO Interpretation)
For a given sample , we define the smooth context-aware second-order (the Smooth CASO) importance function as follows:
where and and are defined similarly as before.
In smoothed versions, we average over number of noisy samples with set to . Smooth CAFO is defined similarly without the Hessian term.
Since principled quantitative evaluations of saliency maps remain an open problem without properly annotated samples, we focus on some qualitative evaluations of different interpretation methods. Figure 4 shows a comparison between CAFO, CASO and other existing methods while more examples have been presented in Appendix Section G. We observe that including the group-feature in deep learning interpretation leads to a sparse saliency map, helping to eliminate the spurious noise and improving the quality of saliency maps.
In this paper, we studied two aspects of the deep learning interpretation problem. First, by characterizing a closed-form formula for the Hessian matrix of a deep ReLU network, we showed that, if the confidence in the predicted class is high and the number of classes is large, first-order and second-order methods produce similar results. In the process, we also proved that, under these conditions, the Hessian matrix is approximately of rank one and its eigenvector is approximately parallel to the gradient. These results can be insightful in other related problems such as adversarial examples. The extent of the Hessian impact for low-confidence predictions and/or a small number of classes is among the interesting directions for future work. Second, we incorporated high-order feature dependencies in deep learning interpretation using a sparsity regularization term. This extension improves the deep learning interpretation significantly.
Although significant progress has been made in tackling the deep learning interpretation problem, some open problems remain. For example, since saliency maps are high-dimensional vectors, they can be sensitive to noise and adversarial perturbations. Moreover, due to the lack of properly annotated datasets for the interpretation problem, the evaluation of interpretation methods is often qualitative and can be subjective. Resolving these issues is among the interesting directions for future work.
- Adebayo et al. (2018) Adebayo, J., Gilmer, J., Goodfellow, I., and Kim, B. Local explanation methods for deep neural networks lack sensitivity to parameter values. In ICLR Workshop, 2018.
- Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity Checks for Saliency Maps. arXiv e-prints, October 2018.
- Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. In Proceedings of Advances in Neural Information Processing Systems, 2018.
- Alvarez-Melis & Jaakkola (2018) Alvarez-Melis, D. and Jaakkola, T. S. Towards Robust Interpretability with Self-Explaining Neural Networks. arXiv e-prints, June 2018.
- Bach et al. (2015) Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., and Suárez, Ó. D. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. In PloS one, 2015.
- Beck & Teboulle (2009) Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2009.
- Bickel et al. (2009) Bickel, P. J., Ritov, Y., Tsybakov, A. B., et al. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 2009.
- Candes et al. (2007) Candes, E., Tao, T., et al. The dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2007.
- Candes & Tao (2005) Candes, E. J. and Tao, T. Decoding by linear programming. IEEE Transactions on Information Theory, 2005.
- Carlini & Wagner (2016) Carlini, N. and Wagner, D. Towards Evaluating the Robustness of Neural Networks. arXiv e-prints, August 2016.
- Chen et al. (2017) Chen, P.-Y., Sharma, Y., Zhang, H., Yi, J., and Hsieh, C.-J. Ead: elastic-net attacks to deep neural networks via adversarial examples. arXiv preprint arXiv:1709.04114, 2017.
- Donoho (2006) Donoho, D. L. Compressed sensing. In IEEE Transactions on Information Theory, 2006.
- Efron et al. (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. Least Angle Regression. arXiv Mathematics e-prints, June 2004.
- Ghorbani et al. (2017) Ghorbani, A., Abid, A., and Zou, J. Y. Interpretation of neural networks is fragile. arXiv preprint arXiv: 1710.10547, 2017.
- Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. Deep sparse rectifier neural networks. Proceedings of Artificial Intelligence and Statistics, 2011.
- Goodfellow et al. (2013) Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. In Proceedings of the International Conference of Machine Learning, 2013.
- Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv e-prints, December 2014.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
- Kindermans et al. (2016) Kindermans, P.-J., Schütt, K., Müller, K.-R., and Dähne, S. Investigating the influence of noise and distractors on the interpretation of neural networks. In NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016.
- Kindermans et al. (2017) Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. arXiv preprint arXiv: 1711.00867, 2017.
- Li et al. (2016) Li, J., Monroe, W., and Jurafsky, D. Understanding neural networks through representation erasure. arXiv preprint arXiv: 1612.08220, 2016.
- Moosavi-Dezfooli et al. (2015) Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P. DeepFool: a simple and accurate method to fool deep neural networks. arXiv e-prints, November 2015.
- Nie et al. (2018) Nie, W., Zhang, Y., and Patel, A. A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. arXiv preprint arXiv:1805.07039, 2018.
- Parikh & Boyd (2014) Parikh, N. and Boyd, S. Proximal algorithms. Found. Trends Optim., 1(3):127–239, January 2014. ISSN 2167-3888. doi: 10.1561/2400000003. URL http://dx.doi.org/10.1561/2400000003.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, 2017.
- Pearlmutter (1994) Pearlmutter, B. A. Fast exact multiplication by the hessian. In Neural Computation, 1994.
- Raskutti et al. (2010) Raskutti, G., Wainwright, M. J., and Yu, B. Restricted eigenvalue properties for correlated gaussian designs. Journal of Machine Learning Research, 11(Aug):2241–2259, 2010.
- Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
- Shrikumar et al. (2017) Shrikumar, A., Greenside, P., and Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference of Machine Learning, 2017.
- Simonyan et al. (2014) Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations, 2014.
- Smilkov et al. (2017) Smilkov, D., Thorat, N., Kim, B., Viégas, F. B., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv: 1706.03825, 2017.
- Sundararajan et al. (2017) Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the International Conference of Machine Learning, 2017.
- Tibshirani (1996) Tibshirani, R. Regression shrinkage and selection via the lasso. In Journal of the Royal Statistical Society, 1996.
- Yao et al. (2018) Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. arXiv preprint arXiv:1802.08241, 2018.
Appendix A Proofs
A.1 Proof of Proposition 1
In this section, we derive the closed-form formula for the Hessian of the loss function of a deep ReLU network. Since a ReLU network is piecewise linear, it is locally linear around an input . Thus the logits can be represented as:
where is the input of dimension , are the logits, are the weights, and are the biases of the linear function. In this proof, we use to denote the logits, to denote the class probabilities, to denote the label vector and c to denote the number of classes. Each column of is the gradient of logit with respect to flattened input and can be easily handled in auto-grad software such as PyTorch (Paszke et al., 2017).
Therefore, we have:
Thus we have,
This completes the proof.
A.2 Proof of Theorem 2
To simplify notation, define as in (16). For any arbitrary row of the matrix , we have
Because , by the Gershgorin Circle theorem, we have that all eigenvalues of are non-negative and is a positive semidefinite matrix. Since is positive semidefinite, we can write . Using (15):
Hence is a positive semidefinite matrix as well.
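The Gershgorin argument can be checked numerically. The sketch below assumes the matrix in question is diag(p) - p p^T for a probability vector p, in which case each diagonal entry p_i (1 - p_i) equals the sum of the absolute off-diagonal entries in its row, so every Gershgorin disc lies in the non-negative half line. The example probabilities are arbitrary.

```python
import numpy as np

p = np.array([0.6, 0.25, 0.1, 0.05])               # example class probabilities
A = np.diag(p) - np.outer(p, p)

diag = np.diag(A)                                   # p_i * (1 - p_i)
off_diag = np.abs(A).sum(axis=1) - np.abs(diag)     # p_i * sum over j != i of p_j
# Each Gershgorin disc is centered at diag[i] with radius off_diag[i];
# center >= radius keeps every disc in the non-negative half line.
discs_nonnegative = bool(np.all(diag - off_diag >= -1e-12))
min_eig = np.linalg.eigvalsh(A).min()
```

The smallest eigenvalue is zero up to round-off (the all-ones vector lies in the null space), so the matrix is positive semidefinite rather than positive definite.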
A.3 Proof of Theorem 3
The second-order interpretation objective function is given by,
where ( is fixed). Therefore if , is negative definite and is strongly concave.
A.4 Proof of Theorem 4
Let the class probabilities be denoted by , the number of classes by c and the label vector by . We again use and as defined in (14) and (15) respectively. Without loss of generality, assume that the first class is the one with maximum probability.
We assume all other classes have small probability,
Let be an eigenvalue of and be an eigenvector of , then .
Let be the individual components of the eigenvector. The equation can be rewritten in terms of its individual components as follows:
We first consider the case .
Substituting in :
Dividing by the normalization constant, (23)
Now we consider the case ,
Substituting in :
Writing in terms of its eigenvalues and eigenvectors,
Thus, the Hessian is approximately rank one and the gradient is parallel to the Hessian’s only eigenvector.
A.5 Proof of Theorem 5
We use for simplicity (14).
When = 0 in the CASO and CAFO objectives:
The CASO objective becomes:
Consider the matrix :
Hence and since scaling does not affect the visualization, the two interpretations are equivalent.
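The conclusion can be illustrated numerically. The sketch below assumes that, without the sparsity penalty, the first-order maximizer is g / (2 * lam2) and the second-order maximizer solves (2 * lam2 * I - H) d = g, which follows from setting the gradient of g^T d + 0.5 d^T H d - lam2 ||d||_2^2 to zero. With a rank-one Hessian whose eigenvector is parallel to g, the two solutions differ only by a scalar. The dimension and the values of `sigma` and `lam2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
g = rng.standard_normal(10)                 # stand-in for the input gradient
sigma = 3.0                                 # the single non-zero Hessian eigenvalue
u = g / np.linalg.norm(g)
H = sigma * np.outer(u, u)                  # rank one, eigenvector parallel to g

lam2 = sigma                                # any lam2 > sigma / 2 keeps the problem concave
delta_cafo = g / (2.0 * lam2)
delta_caso = np.linalg.solve(2.0 * lam2 * np.eye(10) - H, g)

cosine = delta_cafo @ delta_caso / (np.linalg.norm(delta_cafo) * np.linalg.norm(delta_caso))
```

Here the second-order solution equals g / (2 * lam2 - sigma), a rescaled copy of the first-order one, so the cosine similarity is 1 and the normalized saliency maps coincide.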
Appendix B Convergence of Gradient Descent to Solve CASO
A consequence of Theorem 3 is that gradient descent converges to the global optimizer of the second-order interpretation objective with a convergence rate of . More precisely, we have:
Appendix C Efficient Computation of the Hessian Matrix Using the Cholesky Decomposition
Let . Thus, can be re-written as .
Let the SVD of be as the following:
Thus, we can write: