Deep neural networks have achieved phenomenal success during the past decade. However, many challenges remain, among which interpretability and the “vanishing gradient” are two of the best-known and most critical problems [13, 30].
It is difficult to interpret how a neural network arrives at its output, especially for deep networks with many layers and an enormous number of parameters. Users often view a neural network as a “black box” that magically works. Lack of interpretability is one of the top concerns [4, 40, 2, 34, 27, 28, 20, 3, 36, 42] among users who have to make real-world decisions based on neural networks’ outputs.
When a neural network grows deep and large, many of its neurons inevitably become dormant and stop contributing to the final output, causing the problem of the “vanishing gradient”. As a result, an individual training sample may update only a small fraction of the neurons, thus prolonging the training process. The vanishing gradient may also cause an optimizer to stop prematurely. Therefore, large neural networks usually require multiple training runs to increase the chance of converging to a good solution.
Several techniques have been developed to mitigate the vanishing gradient problem. One of them is the ReLU activation function and its variants, such as leaky ReLU, noisy ReLU and SELU. The ReLU family of activation functions is known for preserving gradients better over deep networks than other types of activation functions. Preserving gradients is also a flagship feature of many popular neural network architectures, such as ResNet. Batch normalization [17] is a recent breakthrough that has proven effective for preserving gradients in deep networks. Despite these mitigations, the vanishing gradient remains a challenge and continues to impose a practical limit on the size of a neural network, beyond which training becomes infeasible.
In this paper, we propose a Shapley value approach for better interpreting a neural network’s predictions and improving its training by preserving gradients. The Shapley value is a well-established method in cooperative game theory, and it forms a solid theoretical foundation to analyze and address both problems. Due to the great success and ubiquity of the ReLU activation function, we focus our efforts on finding a solution that can be viewed as a variant of or an approximation to the ReLU activation.
In a typical neural network, a neuron with $n$ inputs can be written as $y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$, where $f$ is a nonlinear activation function, $x_i$ are the inputs to the neuron, and $w_i$ and $b$ are the internal parameters of the neuron.
2.1 Shapley Value Interpretation
Shapley value was originally developed for fairly distributing shared gain or loss of a team among its team members. It has since been widely used in many different fields, such as the allocation of shared sale proceeds of package deals among participating service providers.
If we are interested in the contribution (a.k.a. relevance) of a given input to the output of a neuron, the Shapley value is, in theory, the only correct answer. The Shapley value of an input is defined as the average of its incremental contributions to the output over all possible permutations of the inputs. The Shapley value is conservative by construction: the Shapley values of all the inputs add up to the total output being attributed.
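For reference, the permutation-based definition described above can be written out explicitly. The notation is assumed here for illustration and may differ from the symbols used elsewhere in the paper: $\phi_i$ is the Shapley value of input $i$, $v(S)$ is the neuron's output when only the inputs in the set $S$ are switched on, $\Pi$ is the set of all $n!$ orderings of the inputs, and $P_i^{\pi}$ is the set of inputs that precede $i$ in the ordering $\pi$.

```latex
\phi_i \;=\; \frac{1}{n!} \sum_{\pi \in \Pi}
  \Bigl[\, v\bigl(P_i^{\pi} \cup \{i\}\bigr) - v\bigl(P_i^{\pi}\bigr) \Bigr],
\qquad
\sum_{i=1}^{n} \phi_i \;=\; v\bigl(\{1,\dots,n\}\bigr) - v(\emptyset).
```

The second identity is the conservation property: within each ordering the incremental contributions telescope, so their average does as well. When the output with no inputs switched on is zero, the Shapley values sum to the neuron's output.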
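|= -1||= 2||= -1|
|Permutation||ReLU inputs||ReLU outputs||Contribution of input 1||Contribution of input 2||Contribution of input 3|
|1, 2, 3||-2, 2, -1||0, 2, 0||0||2||-2|
|1, 3, 2||-2, -5, -1||0, 0, 0||0||0||0|
|2, 1, 3||3, 2, -1||3, 2, 0||-1||3||-2|
|2, 3, 1||3, 0, -1||3, 0, 0||0||3||-3|
|3, 1, 2||-4, -5, -1||0, 0, 0||0||0||0|
|3, 2, 1||-4, 0, -1||0, 0, 0||0||0||0|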
Table 1 is an example of computing the Shapley values of a neuron with 3 inputs. The second and third columns are the inputs and outputs of the ReLU when one, two or three inputs are activated in the order specified in the column “Permutation”. In this example, there are 6 possible permutations of the 3 inputs, thus the Shapley value of any input is the average of its incremental contributions over all six permutations. It is important to observe that the Shapley values of all 3 inputs are nonzero, even though the ReLU is currently deactivated with an overall output of 0. The reason for the nonzero Shapley values is that the neuron could be activated by two combinations of inputs in this example.
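The permutation average behind Table 1 can be checked by brute force. The following is a minimal sketch rather than the authors' implementation. The per-input contributions (-1, 4, -3) and the offset of -1 are assumptions inferred from the partial sums listed in the rows of Table 1, and the function name `exact_shapley` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def relu(z):
    """ReLU activation."""
    return max(z, 0)


def exact_shapley(contributions, bias, activation=relu):
    """Average each input's incremental effect on the output over all orderings."""
    n = len(contributions)
    totals = [Fraction(0)] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        pre_activation = bias  # no input switched on yet
        previous_output = activation(pre_activation)
        for i in order:
            pre_activation += contributions[i]  # switch input i on
            output = activation(pre_activation)
            totals[i] += Fraction(output - previous_output)
            previous_output = output
    return [total / len(orderings) for total in totals]


# Assumed toy neuron: values chosen to reproduce the partial sums in Table 1.
contributions = [-1, 4, -3]
bias = -1
shapley_values = exact_shapley(contributions, bias)
print(shapley_values)
```

Under these assumptions the three values are -1/6, 4/3 and -7/6. All are nonzero, yet they sum to 0, the output of the deactivated neuron.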
There is some similarity between the computation of the Shapley value and the random dropout neural network, in the sense that a random portion of the inputs is removed. However, it is worth pointing out their difference: a random dropout neural network turns off each input randomly with an independent probability, which concentrates the number of active inputs around its expected value. In contrast, random permutation works in two steps: first, a single uniform random integer between 0 and the number of inputs is drawn; then a random ordering of the inputs is drawn and only that many inputs at the front of the ordering are kept on. Random permutation therefore gives equal probability to activating any number of inputs between 0 and the total number of inputs, yielding a better chance of activating the neuron than random dropout.
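The difference between the two sampling schemes can be made concrete by comparing the distribution of the number of active inputs. This is a small sketch under assumed values (10 inputs and a keep probability of 0.5, neither taken from the paper); the function names are hypothetical.

```python
from math import comb


def dropout_active_count_pmf(n, keep_prob):
    """P(k inputs active) when each input is kept independently (binomial)."""
    return [comb(n, k) * keep_prob**k * (1 - keep_prob) ** (n - k)
            for k in range(n + 1)]


def permutation_active_count_pmf(n):
    """P(k inputs active) when k is drawn uniformly from 0..n (two-step scheme)."""
    return [1.0 / (n + 1)] * (n + 1)


n = 10  # assumed number of inputs
dropout_pmf = dropout_active_count_pmf(n, keep_prob=0.5)  # assumed keep probability
permutation_pmf = permutation_active_count_pmf(n)

for k in range(n + 1):
    print(f"{k:2d} active: dropout {dropout_pmf[k]:.4f}, permutation {permutation_pmf[k]:.4f}")
```

With these assumed values, dropout places most of its mass near 5 active inputs, while the permutation scheme gives every count from 0 to 10 the same probability, so extreme subsets are sampled far more often.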
For a generic activation function, the Shapley value can only be evaluated numerically, for example using Monte Carlo simulation. Such a numerical implementation is computationally expensive and not conducive to analysis. Recently, an accurate analytical approximation to the Shapley value of a gain/loss function was discovered and verified in prior work. The same approach can be adapted to the ReLU activation function, resulting in the following analytical approximation:
is the standard normal distribution function, and is the Shapley value of the k-th input.
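For activation functions without such a closed form, the Monte Carlo route mentioned above amounts to averaging incremental contributions over randomly sampled orderings. The sketch below is illustrative only: the function name, the choice of `tanh` as the generic activation, and all numerical values are assumptions, not taken from the paper.

```python
import math
import random


def monte_carlo_shapley(contributions, bias, activation, num_samples, seed=0):
    """Estimate Shapley values by averaging over randomly sampled orderings."""
    rng = random.Random(seed)
    n = len(contributions)
    totals = [0.0] * n
    order = list(range(n))
    for _ in range(num_samples):
        rng.shuffle(order)  # draw a uniformly random ordering of the inputs
        pre_activation = bias
        previous_output = activation(pre_activation)
        for i in order:
            pre_activation += contributions[i]
            output = activation(pre_activation)
            totals[i] += output - previous_output
            previous_output = output
    return [total / num_samples for total in totals]


# Assumed toy values for a neuron with four inputs.
contributions = [0.5, -1.2, 2.0, 0.3]
bias = -0.4
estimates = monte_carlo_shapley(contributions, bias, math.tanh, num_samples=2000)
full_output = math.tanh(bias + sum(contributions))
baseline = math.tanh(bias)
print(estimates, sum(estimates), full_output - baseline)
```

Each sampled ordering telescopes to the same total, so the estimates are conservative up to floating-point error even though the individual values carry sampling noise. The cost grows with the number of sampled orderings, which is what makes an analytical approximation attractive.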
The relevance of a neuron’s output is defined to be its contribution to the final output of the entire neural network, which is typically a prediction or a probability (e.g., in classification problems). Given the relevance of a neuron’s output, a simple method to propagate it to the neuron’s inputs is to take advantage of the linearity of the Shapley value and multiply both sides of the approximation above by a common factor; after some rearrangement, we arrive at the following propagation formula:
Formula (2) gives the relevance propagated from the neuron’s output to each of its inputs. Total relevance is conserved between layers, because the relevance of a neuron’s output is the sum of the relevance propagated to it from all the connected neurons in the following layer. Since the common factor cancels, the layer-wise relevance propagation (LRP) formula (2) is identical for linear and linear + ReLU layers. If the output relevance is initialized to the Shapley value of the neuron’s output, then the propagated relevance retains the interpretation of being the (approximated) Shapley value of the neuron’s input after applying the LRP in (2). A small stabilizing term can be added to the denominator of (2) to prevent it from vanishing, similar to the epsilon-variant formula in the LRP literature.
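The mechanics of a conservative, stabilized redistribution can be sketched generically. The code below is not formula (2): it is a plain proportional split in the style of an epsilon-stabilized LRP rule, where `scores` is a hypothetical stand-in for whatever per-input quantity the propagation formula uses, and the numerical values are assumed.

```python
def propagate_relevance(scores, output_relevance, epsilon=1e-9):
    """Split output_relevance across inputs in proportion to their scores.

    A small epsilon with the sign of the total keeps the denominator away
    from zero, at the cost of a tiny loss of exact conservation.
    """
    total = sum(scores)
    stabilizer = epsilon if total >= 0 else -epsilon
    return [score / (total + stabilizer) * output_relevance for score in scores]


# Assumed per-input scores and output relevance for one neuron.
scores = [-1.0, 4.0, -2.0]
input_relevance = propagate_relevance(scores, output_relevance=0.8)
print(input_relevance, sum(input_relevance))

# A vanishing denominator no longer raises an error.
degenerate = propagate_relevance([1.0, -1.0], output_relevance=0.5)
```

In the first call the input relevances sum to the output relevance up to the stabilizer, which is the layer-to-layer conservation described above.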
The output layer of a neural network is often a nonlinear function, such as the softmax for classification. Before we can start the LRP via (2), the relevance of the output layer has to be initialized to its Shapley value, which requires numerical evaluation (such as Monte Carlo) for most output functions, with a few notable exceptions such as Linear and Linear+ReLU output layers. The Shapley value of a Linear+ReLU output layer is given by (1).
An implicit assumption behind the LRP formula (2) is that the neurons’ activations are independent of each other, which generally does not hold across neural network layers. (As an example of the effect of correlation, consider two neurons in successive layers that can never activate together: no relevance should propagate through them, yet formula (2) propagates some.) Therefore the Shapley values computed from the LRP (2) are only a crude approximation for deep neural networks. A Monte Carlo simulation is required to compute the exact Shapley values of a neural network’s inputs. However, the LRP (2) has the advantages of being very fast and producing the (approximated) relevance of all hidden layers as well as the input layer in one shot, while a MC approach would require a separate simulation for each neural network layer. In practice, a crude approximation like (2) may often be good enough to give users intuition and confidence in using a neural network’s output.
The LRP formula (2) is similar to the native LRP algorithms given in , but it replaces by . The benefit of such a replacement is rather intuitive by considering the limiting case of for all , in which case individual no longer makes much difference to the output, thus all inputs’ relevance propagation should be approximately equal. By including the , (2) produces more sensible results for this limiting case than the known LRP formulae in the literature.
The approximation (1) offers a straightforward explanation of why the same propagation formula applies to both Linear and Linear+ReLU layers, which is a common feature of existing LRP algorithms. More generic approximations to Shapley values have been developed in [28, 1] for interpreting neural networks. In comparison, the analytical approximation in (1) is faster and more convenient for ReLU layers.
2.2 Shapley Gradient
As shown in Table 1, the Shapley values are nonzero for a neuron with ReLU as long as at least one of the input combinations can activate the neuron. This is much more likely than the neuron itself being active, which requires the much stronger condition that its total input be positive. This observation motivated the following approach to prevent the neuron’s gradients from vanishing: we use a gradient based on the Shapley values to replace the true gradient during the back-propagation stage of the training. The result of this replacement is similar to a training procedure using random permutations; as mentioned earlier, random permutation is quite different from typical random dropout.
In mathematical terms, this alternative gradient is an approximation of:
is the full Jacobian matrix and is a matrix with only the diagonal elements of , the is element wise matrix product. The last step is because by construction, thus
is a vector of 1s. Similar approximations are also applied toand for back propagation:
where , and is standard normal distribution density function. The last terms of the first two equations in (2.2) are the contribution to the gradient from the factor, which is usually small in most practical situations and thus can be safely ignored. We subsequently refer to (2.2) as the Shapley gradients.
The training process using Shapley gradients is similar to that of typical neural networks, except that (2.2) is used during the back-propagation stage for any layer with ReLU activation; the feed-forward calculation of the neural network remains unchanged with the ReLU activation. We subsequently use the term “Shapley Linear Unit” (ShapLU) to refer to the training scheme that mixes the Shapley gradient in backward propagation with the ReLU activation in the feed-forward stage.
Even though the Shapley gradient is inconsistent with the ReLU forward function, it is arguably a better choice for training neural networks, as it is more robust to descend along the average of the steepest-descent directions over all possible permutations of a neuron’s inputs.
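The idea of pairing a ReLU forward pass with a different backward gradient can be illustrated on a single neuron. This sketch shows only the mechanism. The smooth gate below, a standard normal distribution function of the pre-activation with an arbitrary scale, is a placeholder and is not the Shapley gradient of (2.2); all names and values are assumptions.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def forward(weights, bias, inputs):
    """Forward pass: the output is always the plain ReLU of the pre-activation."""
    pre_activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return pre_activation, max(pre_activation, 0.0)


def relu_gate(pre_activation):
    """True ReLU derivative: zero whenever the neuron is off."""
    return 1.0 if pre_activation > 0 else 0.0


def surrogate_gate(pre_activation, scale=1.0):
    """Placeholder smooth gate used only in the backward pass (NOT equation (2.2))."""
    return normal_cdf(pre_activation / scale)


def weight_gradients(weights, bias, inputs, upstream_grad, gate):
    """Backward pass with a pluggable gate in place of the activation derivative."""
    pre_activation, _ = forward(weights, bias, inputs)
    gated = upstream_grad * gate(pre_activation)
    return [gated * x for x in inputs], gated  # (dL/dw, dL/db)


# Assumed toy neuron that is switched off (pre-activation of -1).
weights, bias, inputs = [-1.0, 2.0, -1.0], -1.0, [1.0, 2.0, 3.0]
_, output = forward(weights, bias, inputs)
relu_grads, _ = weight_gradients(weights, bias, inputs, 1.0, relu_gate)
surrogate_grads, _ = weight_gradients(weights, bias, inputs, 1.0, surrogate_gate)
print(output, relu_grads, surrogate_grads)
```

With the true ReLU derivative the dormant neuron receives no update at all, whereas the smooth gate lets some gradient through while leaving the forward output unchanged. In an automatic-differentiation framework the same effect requires a custom gradient definition.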
The main advantage of Shapley gradient is that it is globally continuous and never vanishes. Even when a neuron is deep in the off state with , significant gradient could still flow through when in (1) is large. For example, when a single signal become very strong in either positive or negative direction, the resulting increase in would open the gradient flow. This is a very nice property as it is exactly the right time to update a neuron’s parameters when any input signal is way out of line comparing to its peers. We call this property “attention to exception”. Figure 2 is a numerical illustration of this property, where we vary one signal to a neuron and keep other inputs unchanged. The vertical axis is the factor, which controls the rate of gradient flow from the output to the input of a neuron in (1).
2.3 Shapley Activation
Implementing ShapLU in most machine learning frameworks requires some additional effort, because a customized gradient function has to be used instead of automatic differentiation. To ease the implementation of the Shapley gradient, we set out to construct an activation function whose gradient matches the Shapley gradient while maintaining full consistency between the forward calculation and the backward gradient. The downside of such an activation function is that it can only be an approximation to ReLU in the forward calculation.
Observe that in (1), the cross dependency between on when is only through the factor , which is a very smooth function. The cross sensitivities are usually small in practical settings thus can be safely ignored. Therefore, we can construct an (implied) activation function as the sum of all Shapley value s in (1), which is a close approximation to ReLU:
Given the cross sensitivities are usually small, the gradient of (5) closely matches the Shapley gradient in (2.2). We subsequently call (5) the Shapley Activation (SA), which is much easier to implement in existing machine learning frameworks via automatic differentiation.
Unlike typical activation functions that only depends on the aggregation of , the activation function defined in (5) depends on , thus having a much more sophisticated activation profile. Typical activation functions can be plotted on a 2-D chart, but not so for (5). In Figure 2, we instead show a 2-D scatter plots of 1000 samples of (5) against for and
being independent uniform random variables between -1 and 1.
Figure 2 bears some resemblance to leaky ReLU or noisy ReLU; however, the resemblance is only superficial. Both leaky ReLU and noisy ReLU are functions of the aggregated input only, and they can have discontinuities in gradient, while the gradient of (5) is globally continuous, which is important for improving training convergence. (5) is also deterministic: the apparent noise in Figure 2 comes from the projection of high-dimensional inputs onto a single scalar. (5) preserves the unique “attention to exception” property and allows significant gradient flow even if the neuron is deep in the off state.
3 Numerical Results
3.1 Training using Shapley gradient
Though ShapLU and SA are close approximations to each other conceptually, they might exhibit different convergence behaviors when used in practice.
Our first test is to train a fully connected neural network to classify handwritten digits using the MNIST data set. We implemented ShapLU and SA in Julia using Flux.jl, which is a flexible machine learning framework that allows a customized gradient function to be inconsistent with the forward calculation. In our ShapLU implementation, we neglected the last terms in the first two equations of (2.2) for simplicity and faster execution. The baseline configuration for numerical testing is a fully connected neural network of 784 inputs (28x28 gray scale image pixels) with two hidden layers of 100 and 50 neurons, and an output layer with 10 neurons and a softmax classifier. Both hidden layers use ReLU activation, and a cross entropy loss function is used for training.
We compared the convergence of training this neural network using 4 epochs of 1,000 unique images with a batch size of 10 and random re-ordering of batches between epochs. The entire training is repeated 10 times with different initializations to obtain the mean and standard deviation of the training accuracy. The stochastic gradient descent (SGD) optimizer [35, 19, 7] was tested with various learning rates (LR), as well as the adaptive ADAM optimizer [21, 7]. In this test, the absolute classification accuracy is not the main concern; our focus is instead on comparing the relative performance of ReLU, ShapLU and SA (5) under identical settings.
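The batching protocol described above can be sketched as follows. This is one possible reading, stated as an assumption: batch membership is fixed and only the order of the batches is shuffled each epoch. The function names are hypothetical, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
import random
import statistics


def epoch_batches(num_samples, batch_size, rng):
    """Split sample indices into fixed batches, then shuffle the batch order."""
    indices = list(range(num_samples))
    batches = [indices[start:start + batch_size]
               for start in range(0, num_samples, batch_size)]
    rng.shuffle(batches)  # random re-ordering of batches for this epoch
    return batches


def summarize(accuracies):
    """Mean and sample standard deviation across repeated training runs."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


rng = random.Random(0)
# 4 epochs of 1,000 images with a batch size of 10, as in the text.
schedule = [epoch_batches(1000, 10, rng) for _ in range(4)]
updates_per_run = sum(len(epoch) for epoch in schedule)
print(updates_per_run)
```

Each run therefore performs 400 parameter updates, and `summarize` would be applied to the 10 accuracy values obtained from the 10 differently initialized runs.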
First and foremost, it is remarkable that ShapLU training actually converges. Figure 4 shows the convergence of training accuracy using the ADAM optimizer, where the standard deviations are shown as color shades. To the best of our knowledge, ShapLU is the first neural network training scheme in which the back propagation uses a gradient that is “inconsistent” with the forward calculation. When such consistency is broken, training usually fails. However, ShapLU outperformed ReLU in convergence with the ADAM optimizer and with SGD at a large learning rate (LR), and the two have similar convergence when a smaller LR is used with SGD. This result matches our expectation that the continuous and non-vanishing Shapley gradient would lead to smoother and more stable stochastic descent. ShapLU’s continuous gradient works particularly well with the ADAM optimizer, resulting in visible improvements over ReLU in convergence speed during the initial phase of training, as shown in Figure 4. The SA performed similarly to ShapLU in this MNIST test, which is not surprising as they are close approximations. The terminal validation accuracies of all three methods are similar: they all converge to about 86% at the end of four epochs, as measured using 10,000 MNIST test images that do not include the 1,000 training images.
We then tested the SA on the CIFAR-10 image classification data set using a Keras custom layer with the TensorFlow backend. There are 50,000 training images and 10,000 test images with an input shape of (32, 32, 3). The test neural network starts with an input layer of 3072 neurons (32x32x3), then includes three hidden layers of 1024, 512 and 512 neurons, and terminates with a classification layer of 10 neurons. Each hidden layer has an activation function of either ReLU or SA, followed by a dropout layer. The default Glorot uniform method was used to initialize the kernel and bias. With nearly 4 million parameters, this MLP is not a trivial neural network. We trained this neural network using different optimizers with a batch size of 128, to compare the performance between ReLU and SA.
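The parameter count quoted above follows directly from the layer sizes. The short check below assumes plain fully connected layers with one bias per neuron and no other trainable parameters; the helper name is hypothetical.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


# Layer sizes of the CIFAR-10 MLP described in the text.
cifar_mlp_layers = [3072, 1024, 512, 512, 10]
cifar_mlp_parameters = dense_parameter_count(cifar_mlp_layers)
print(cifar_mlp_parameters)
```

Under these assumptions the total is 3,939,338 parameters, consistent with “nearly 4 million”, and about 80% of them sit in the first hidden layer.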
|Optimizer (lr)||Shapley Activation (SA)||ReLU|
Figure 4 shows the training accuracy using the Adam optimizer, where the color shadows show standard deviations computed from 20 repetitions of identical training runs. Figure 4 shows that SA results in a significant improvement in training accuracy, convergence and stability (i.e., smaller standard deviation) over those of ReLU. Table 2 is a summary of validation accuracy at the end of training using different optimizers and learning rates [35, 19, 39, 21, 7], where the standard deviation is computed from 8 repetitions of identical training runs. Table 2 shows that the Shapley activation outperforms ReLU in validation accuracy in almost every optimizer configuration, in many cases by wide margins. The SA tends to perform better with higher learning rates, and it shows much less variation in validation accuracy between optimizer types and learning rates, which could be explained by its continuous and non-vanishing Shapley gradient. This example also shows that the SA works well in conjunction with dropout layers. The SA function is very efficient: we observed only a 20% increase in CIFAR-10 training time for the same number of epochs when switching from ReLU to SA.
In addition, we implemented a convolution layer using Shapley Activation (SA) in Keras and TensorFlow and compared the convergence of ReLU and SA using ResNet-20 v2, which is a state-of-the-art deep neural network configuration with 20 layers of CNN and ResNet. We used exactly the same configuration as the original setup for this test, except that we moved the batch normalization after the ReLU activation. The reason for this change is to ensure a fair comparison with SA, because we chose to apply batch normalization after the SA in order not to undermine its “attention to exception” property. In our testing, moving batch normalization after the ReLU activation results in a small improvement in training and validation accuracy compared to the original setup.
Figure 5 shows the training accuracy and its standard deviation (as color shadows around the line) of ResNet-20 from 8 identical training runs using the CIFAR-10 data set. SA has a small but consistent and statistically significant edge in training accuracy over ReLU during the entire training process. The jumps in training accuracy at epochs 80 and 120 are due to scheduled reductions in learning rate. The right panel in Figure 5 is a zoomed view of the training accuracy at the later stage of the training. In this test, the terminal validation accuracies of SA and ReLU are both around 92.5% and are not significantly different, presumably because the variation from the selection of the validation data set is greater than any real difference in validation accuracy between ReLU and SA. Nonetheless, because of the faster convergence and better training accuracy, ResNet-20 with SA could be trained using fewer epochs to reach a similar or higher level of training accuracy than ReLU. This example shows that even a state-of-the-art convolutional neural network that is highly tuned for ReLU can further improve its training convergence and accuracy by switching from ReLU to SA, without any additional tuning. We do expect SA performance to improve further with careful tuning of its training parameters. This example also shows that SA can be used successfully in conjunction with batch normalization, and leads to overall better results.
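A step schedule of the kind that produces the jumps at epochs 80 and 120 can be sketched as below. Only the two milestone epochs come from the text. The base rate, the reduction factor, the exact epoch at which each reduction takes effect, and the function name are assumptions for illustration.

```python
def step_learning_rate(epoch, base_rate=1e-3, milestones=(80, 120), factor=0.1):
    """Multiply the learning rate by `factor` once for every milestone reached."""
    rate = base_rate
    for milestone in milestones:
        if epoch >= milestone:
            rate *= factor
    return rate


for epoch in (0, 79, 80, 119, 120):
    print(epoch, step_learning_rate(epoch))
```

Each reduction lets the optimizer settle into a narrower region of the loss surface, which shows up as an abrupt gain in training accuracy for both activations.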
These preliminary results validate the theory and benefits of Shapley gradients. The results from our preliminary test suggest that SA performs generally better than ReLU, and by a significant margin in certain large MLP cases. We also believe that the benefit of SA should carry over to other types of neural network architectures and applications, and more studies are needed to quantify its benefits in different network configurations.
3.2 Interpretation using Shapley value
The neural network in this example has the same configuration as the previous MNIST MLP test, but is fully trained with the MNIST data set. The Shapley values of the output softmax layer are computed using a 1000-path Monte Carlo simulation. The gray scale images are the averages, over 1,000 MNIST images, of each input pixel’s positive or negative Shapley values per unit gray scale. Given that the final output of this neural network is a probability, the brightness in the top panel of Figure 7 is proportional to the increase in probability for a given digit if the pixel’s brightness in the input image increases by 1. The bottom panel shows the same for the decrease in probability. Thus bright pixels in the top panel of Figure 7 are relevant pixels that increase the likelihood of a given image being classified as a certain digit; those in the bottom panel are important pixels that decrease such likelihood.
We also show the result of a sensitivity-based interpretation in Figure 7 with an identical setup for comparison. The pixel brightness in Figure 7 corresponds to the average magnitude of the positive (for Approve) or negative (for Reject) gradient with respect to individual pixels of the input image. It is evident that the Shapley value based interpretation is far superior and much more intuitive. Despite being a crude approximation, interpretation results like those in Figure 7 are sufficient to give users much-needed comfort and confidence in using neural network results for real-world applications.
Based on an accurate analytical approximation to the Shapley value of ReLU, we established a novel and consistent theoretical framework to help address two critical problems in neural networks: interpretability and vanishing gradients. Preliminary numerical tests confirmed improvements in both areas. The same analytical approach can be applied to other activation functions than ReLU, if fast approximations to their Shapley values are known.
It is a new finding that the gradient used for stochastic descent does not have to be consistent with a neural network’s forward calculation. Better training convergence and accuracy could be achieved by breaking such consistency, as shown in the example of ShapLU. Following this general direction, other inconsistent “training gradients” could be formulated to improve the training and/or regulate the parameterization of neural networks.
In our opinion, the Shapley value based approach is promising and future research is needed to fully understand and quantify its effects for different network architectures and applications.
-  (2019) Explaining deep neural networks with a polynomial time algorithm for shapley value approximation. arXiv. Cited by: §2.1.
-  (2016) Explaining predictions of non-linear classifiers in nlp. arXiv preprint arXiv:1606.07298. Cited by: §1.
-  (2017) . arXiv preprint arXiv:1706.07206. Cited by: §1.
-  (2010) How to explain individual classification decisions. Journal of Machine Learning Research 11 (Jun), pp. 1803–1831. Cited by: §1.
-  (1994) Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5 (2), pp. 157–166. Cited by: §1.
-  (2015) Layer-wise relevance propagation for deep neural network architecture. PLOS ONE. Cited by: §2.1, §2.1.
-  (2018) Optimization methods for large-scale machine learning. Siam Review 60 (2), pp. 223–311. Cited by: §3.1, §3.1.
-  (2015) Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §1.
-  (2000) Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature. Cited by: §1.
-  (2018) Which neural net architectures give rise to exploding and vanishing gradients?. In Advances in Neural Information Processing Systems, pp. 582–591. Cited by: §1.
-  Identity mappings in deep residual networks. In European Conference on Computer Vision. Cited by: §3.1.
-  (2015) Deep residual learning for image recognition. arXiv. Cited by: §1.
-  (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A field guide to dynamical recurrent neural networks, Cited by: §1.
-  (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6 (02), pp. 107–116. Cited by: §1.
-  (2018) Overcoming the vanishing gradient problem in plain recurrent networks. arXiv preprint arXiv:1801.06105. Cited by: §1.
-  (2018) Flux: elegant machine learning with julia. Journal of Open Source Software. External Links: Cited by: §3.1.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv. Cited by: §1.
-  (2015) An empirical exploration of recurrent network architectures. In International Conference on Machine Learning, pp. 2342–2350. Cited by: §1.
-  Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics 23 (3), pp. 462–466. Cited by: §3.1, §3.1.
-  (2017) Learning how to explain neural networks: patternnet and patternattribution. arXiv preprint arXiv:1705.05598. Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1, §3.1.
-  (2017) Self-normalizing neural networks. arXiv. Cited by: §1.
-  (2017) Hexpo: a vanishing-proof activation function. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2562–2567. Cited by: §1.
-  (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §3.1.
-  The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: §3.1.
-  (2019) Reduced form capital optimization. arXiv. Cited by: §2.1.
-  (2016) The mythos of model interpretability. arXiv preprint arXiv:1606.03490. Cited by: §1.
-  (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pp. 4765–4774. Cited by: §1, §2.1.
-  (2013) Rectifier nonlinearities improve neural network acoustic models. Proceedings of the 30th International Conference on Machine Learning. Cited by: §1.
-  (2017) Methods for interpreting and understanding deep neural networks. arXiv. Cited by: §1.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research. Cited by: §2.1.
-  Understanding the exploding gradient problem. CoRR abs/1211.5063. Cited by: §1.
-  (2013) On the difficulty of training recurrent neural networks. In International conference on machine learning, pp. 1310–1318. Cited by: §1.
-  (2016) Why should i trust you?: explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 1135–1144. Cited by: §1.
-  (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §3.1, §3.1.
-  (2017) . arXiv preprint arXiv:1708.08296. Cited by: §1.
-  (1953) A value for n-person games. Annals of Mathematical Studies. Cited by: §1, §2.1.
-  (2019) Keras examples. Note: https://github.com/keras-team/keras/blob/master/examples/cifar10_resnet.py Cited by: §3.1.
-  (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §3.1.
-  (2012) Making machine learning models interpretable.. In ESANN, Vol. 12, pp. 163–172. Cited by: §1.
-  (2010) Rectified linear units improve restricted boltzmann machines. Proceedings of the 30th International Conference on Machine Learning. Cited by: §1.
-  Interpretable convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8827–8836. Cited by: §1.