I Introduction
† This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Assistant Secretary of Defense for Research and Engineering.

A pretrained Deep Neural Network (DNN) accepts an input vector $\mathbf{x}$ and outputs a vector $\mathbf{y}$. Uncertainty in $\mathbf{x}$ propagates through the DNN, resulting in uncertainty in $\mathbf{y}$, but it remains unclear exactly how the input uncertainty translates into output uncertainty, and what role model error plays in the result. This question arises during confidence scoring in areas such as automatic speech recognition, where background noise can distort the input signal [1]. More precisely, suppose $\mathbf{x}$ is the mean of a multivariate normal distribution with covariance matrix $\Sigma_{\mathbf{x}}$. As the DNN acts nonlinearly on $\mathbf{x}$, it is unlikely that the output distribution will be exactly multivariate normal (Gaussian). However, it can be approximated by a Gaussian and modified later if necessary [2]. So, assuming that the output is a multivariate Gaussian with mean $\mathbf{y}$, we want to find the output covariance matrix $\Sigma_{\mathbf{y}}$ corresponding to that distribution.

Previous approaches for propagating the uncertainty include finding closed-form solutions and then numerically integrating probability distributions in as many dimensions as there are hidden nodes [3], which is unrealistic to compute when the DNN has large or many hidden layers. More recently, Monte Carlo sampling and the unscented transform have been used to take a set of samples from the input distribution, propagate them through the DNN, and approximate the first and second moments of the output distribution from them [1]. This can be done to propagate through the DNN as a whole or layer by layer, approximating the activation function with a piecewise exponential [4]. This method requires sending at least $\dim(\mathbf{x})$ samples (where $\dim$ gives the dimension of a vector) [5] through the DNN for each input whose error we wish to propagate, which is also computationally expensive. Additionally, current methods only find the error in the output that originates directly from the input, not accounting for the inherent error in the DNN itself. In other words, they assume that the DNN is a perfect model, which is rarely the case. Extended Kalman Filtering (EKF) [6] has already been applied to DNNs, but only as part of the model training process [7]. Using EKF for uncertainty propagation through DNNs, we can replicate the results yielded by current methods with much less computation, and also account for the model error of the DNN.

II Approach
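EKF examines a nonlinear system with a discretized time domain. At each time step it makes a prediction using the process noise, the control input, and the previous step's state. Using this prediction along with the observation noise, EKF then estimates the system's current state along with the accuracy of that estimate. By treating the layers of the DNN as discrete time steps and their values as states, EKF may be applied. Our system then has no control input, and has observation noise only in the first layer, in the form of the input covariance matrix $\Sigma_{\mathbf{x}}$. As such, only the prediction step of EKF need be applied.

Take our DNN (Fig. 1) to have $n$ hidden layers, and say $\mathbf{x}_k$ is the vector representing the state estimate at layer $k$ and $P_k$ is the covariance matrix such that

$$P_k = E\left[(\mathbf{x}_k - E[\mathbf{x}_k])(\mathbf{x}_k - E[\mathbf{x}_k])^T\right],$$

where $E[\mathbf{v}]$ is the expected value of $\mathbf{v}$. The nonlinear operation that takes us from state $\mathbf{x}_k$ to $\mathbf{x}_{k+1}$ is given by

$$\mathbf{x}_{k+1} = f_k(\mathbf{x}_k) = h(W_k \mathbf{x}_k + \mathbf{b}_k),$$

where $h$ is the activation function applied componentwise, $W_k$ is the matrix such that $W_k(i,j)$ is the weight of the edge connecting the $j$th node in layer $k$ to the $i$th node in layer $k+1$, and $\mathbf{b}_k$ is the bias vector for layer $k$.

Note that we specifically assume the use of the Rectified Linear Unit (ReLU) as our activation function, $h(z) = \max(0, z)$. This is well suited for EKF, which linearizes about the current state's estimate: everywhere except at the point where ReLU is nondifferentiable, all the terms of its Taylor expansion after the linear term are 0 anyway.

The process noise of our system comes from the error in the weights and biases of the pretrained DNN, and for the step from layer $k$ to layer $k+1$ is represented by $Q_k$. It can be approximated by a sample covariance matrix, found by taking a sufficiently large data set of $N$ inputs (separate from the training and testing data sets) and running them through the DNN to obtain the layer values $\mathbf{x}_{k+1}^{(1)}, \dots, \mathbf{x}_{k+1}^{(N)}$, so that

$$Q_k \approx \frac{1}{N-1} \sum_{i=1}^{N} \left(\mathbf{x}_{k+1}^{(i)} - \bar{\mathbf{x}}_{k+1}\right)\left(\mathbf{x}_{k+1}^{(i)} - \bar{\mathbf{x}}_{k+1}\right)^T,$$

where $\bar{\mathbf{x}}_{k+1} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_{k+1}^{(i)}$. And equivalently, writing $X_{k+1}$ for the matrix whose $i$th column is $\mathbf{x}_{k+1}^{(i)} - \bar{\mathbf{x}}_{k+1}$,

$$Q_k \approx \frac{1}{N-1} X_{k+1} X_{k+1}^T.$$

Let $F_k$ be the Jacobian matrix such that

$$F_k = \left.\frac{\partial f_k}{\partial \mathbf{x}}\right|_{\mathbf{x} = \mathbf{x}_k}.$$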
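The sample-covariance estimate of the process noise described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy network, the function names `layer_values` and `estimate_process_noise`, and the choice to index each covariance by the layer it is added to are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def layer_values(weights, biases, inputs):
    """Run a batch of inputs (one per column) through the network.

    Returns a list whose k-th entry holds the values produced by the
    k-th weight matrix, as an array of shape (nodes in that layer, N).
    """
    values = []
    x = inputs
    for W, b in zip(weights, biases):
        x = relu(W @ x + b[:, None])
        values.append(x)
    return values

def estimate_process_noise(weights, biases, inputs):
    """Sample covariance of each layer's values over the held-out inputs."""
    # np.cov treats rows as variables and columns as observations,
    # and normalizes by N - 1 by default.
    return [np.cov(v) for v in layer_values(weights, biases, inputs)]

# Hypothetical toy network: 4 inputs -> 3 hidden nodes -> 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [rng.normal(size=3), rng.normal(size=2)]
held_out = rng.uniform(0.0, 1.0, size=(4, 500))  # 500 inputs as columns

Q = estimate_process_noise(weights, biases, held_out)
print([q.shape for q in Q])  # [(3, 3), (2, 2)]
```

Each estimated matrix has the size of the layer it belongs to, which is what allows it to be added to the propagated covariance at that layer.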
In most applications of EKF, finding Jacobians dominates the computation time [8]. Here, however, this is not the case, since the $F$'s can be computed layer by layer from the weight matrices alongside the layer values themselves.
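To see why the Jacobian comes almost for free with a ReLU layer, note that the derivative of ReLU is 1 at active nodes and 0 at inactive ones, so the Jacobian of the layer map is the weight matrix with the rows of inactive nodes zeroed out. The sketch below assumes a column-vector convention (the next layer is ReLU applied to `W @ x + b`); the function name `layer_step` and the small example values are hypothetical.

```python
import numpy as np

def layer_step(W, b, x):
    """Return the next layer's values and the Jacobian of the step at x."""
    z = W @ x + b
    active = (z > 0).astype(float)   # derivative of ReLU at each node
    x_next = np.maximum(z, 0.0)
    F = active[:, None] * W          # diag(active) @ W without forming diag
    return x_next, F

W = np.array([[1.0, -2.0],
              [0.5, 0.5],
              [-1.0, 1.0]])
b = np.array([0.0, -1.0, 0.5])
x = np.array([1.0, 0.25])

x_next, F = layer_step(W, b, x)
```

Here only the first node is active, so `F` keeps the first row of `W` and zeroes the other two. The pre-activation `z` needed for the forward pass is the only quantity required to build `F`.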
Finally, we can use the prediction-step EKF equations to find the state estimates and covariances for each layer $k$. These are simply

$$\mathbf{x}_{k+1} = f_k(\mathbf{x}_k), \qquad P_{k+1} = F_k P_k F_k^T + Q_k.$$
Iteratively applying these through the final layer results in the output vector $\mathbf{y}$ and its covariance matrix $\Sigma_{\mathbf{y}}$. $\Sigma_{\mathbf{y}}$ can then be used to find the hyperellipsoid centered at $\mathbf{y}$ for a given confidence level. Alternatively, assuming the components of $\mathbf{y}$ are relatively uncorrelated, just the main diagonal of $\Sigma_{\mathbf{y}}$ can be used to find error bars at a given confidence level for each component of the output vector independently.
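The whole procedure, from the input covariance to per-component error bars, can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the names `ekf_predict` and `error_bars`, the toy two-layer network, and the process noise values are hypothetical, and the error bars use the normal-distribution quantile 1.96 for a two-sided 95% interval.

```python
import numpy as np

def ekf_predict(weights, biases, Q, x0, P0):
    """Apply the EKF prediction step once per layer.

    Q[k] is the process noise added after the k-th weight matrix, so its
    size must match the layer that step produces.
    """
    x, P = x0, P0
    for W, b, Qk in zip(weights, biases, Q):
        z = W @ x + b
        F = (z > 0).astype(float)[:, None] * W   # Jacobian of ReLU(Wx + b)
        x = np.maximum(z, 0.0)
        P = F @ P @ F.T + Qk
    return x, P

def error_bars(P, z_score=1.96):
    """Per-component half-widths from the main diagonal of P.

    Off-diagonal terms are ignored, i.e. the components are treated
    as uncorrelated.
    """
    return z_score * np.sqrt(np.diag(P))

# Hypothetical toy network: 2 inputs -> 2 hidden nodes -> 1 output.
weights = [np.array([[1.0, 0.0], [1.0, 1.0]]), np.array([[1.0, 1.0]])]
biases = [np.zeros(2), np.zeros(1)]
Q = [np.zeros((2, 2)), np.array([[0.04]])]
x0 = np.array([1.0, 2.0])
P0 = 0.01 * np.eye(2)

y, P_y = ekf_predict(weights, biases, Q, x0, P0)
bars = error_bars(P_y)
```

In this example the propagated input uncertainty contributes 0.05 to the output variance and the process noise contributes 0.04, so the two sources can be read off separately by rerunning with the process noise set to zero.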
III Experimental Results
We use the MNIST handwritten digit data [9], where 28x28 pixel input images (Fig. 2) are converted into 784-dimensional input vectors in which each component is between 0 and 1. The output vectors are 10-dimensional; each component is nonnegative and represents how likely it is that the corresponding digit was the one written. A DNN with 5 hidden layers of 256 nodes each was trained on 50000 images to 92.8% accuracy, and another 10000 images were used to compute the process noise covariance matrices. The input $\mathbf{x}$ (Fig. 2), another image vector distinct from the training and testing image vectors, whose digit label is 9, is assigned the diagonal covariance matrix

$$\Sigma_{\mathbf{x}} = 0.0025\, I_{784}$$

(so that the components are independent and each has a standard deviation of 0.05).

Using EKF, $\mathbf{y}$ and $\Sigma_{\mathbf{y}}$ (Fig. 3) are found, and because the dominant terms of $\Sigma_{\mathbf{y}}$ are along the diagonal, the components of the predicted $\mathbf{y}$ can be approximated as uncorrelated, with variances $\sigma_i^2$ given by the entries on the main diagonal. Then error bars can be plotted against the predicted values to show the confidence region accounting for the original input uncertainty as well as the error contributed by the model itself (Fig. 4).

For this specific data set, since the components of the output vector must all be nonnegative, each variance was scaled to be that of a truncated normal distribution on $[0, \infty)$ instead of an unbounded normal distribution and, correspondingly, the error bars below 0 were cut off.
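The text does not spell out how the variances were rescaled, so the following is only one plausible reading: replace each unbounded normal variance with the variance of the same normal distribution truncated to nonnegative values, using the standard closed-form moments. The function name `truncated_normal_moments` is hypothetical, and the sketch assumes a positive standard deviation and a mean that is not far below zero (otherwise the tail probability underflows).

```python
import math

def truncated_normal_moments(mu, sigma):
    """Mean and variance of N(mu, sigma^2) truncated to [0, inf).

    Assumes sigma > 0 and mu not far below 0.
    """
    alpha = (0.0 - mu) / sigma                       # standardized lower bound
    pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2.0 * math.pi)
    tail = 0.5 * math.erfc(alpha / math.sqrt(2.0))   # P(Z > alpha)
    lam = pdf / tail
    mean = mu + sigma * lam
    var = sigma * sigma * (1.0 + alpha * lam - lam * lam)
    return mean, var

# A prediction of 0 with unit variance becomes a half-normal distribution.
half_mean, half_var = truncated_normal_moments(0.0, 1.0)

# A prediction far above 0 is essentially unaffected by the truncation.
far_mean, far_var = truncated_normal_moments(10.0, 1.0)
```

The truncation matters mostly for output components predicted near 0, where roughly half of an unbounded normal distribution would fall on impossible negative values.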
Repeating this procedure but without adding the $Q$ term at each layer, so that

$$P_{k+1} = F_k P_k F_k^T,$$

we get error bars that depend only on the input uncertainty, effectively assuming that the model is perfect. We can test this by taking a sample of 5000 input image vectors whose components are drawn from independent normal distributions with variance 0.0025 centered at the components of $\mathbf{x}$, finding the model prediction for each of these samples, and then comparing our computed standard deviations for each component of the prediction with the sample standard deviations. We find that the EKF method gives a result very similar to that of the Monte Carlo simulation (Fig. 5).
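The comparison between the perfect-model EKF estimate and Monte Carlo sampling can be reproduced in miniature. The sketch below is not the MNIST experiment: it uses a hypothetical two-layer toy network chosen so that the input noise keeps every node well away from the ReLU kink, with the same input standard deviation (0.05) and sample count (5000) as in the text. The function names are illustrative.

```python
import numpy as np

def forward_batch(weights, biases, X):
    """Run inputs stored as columns of X through a ReLU network."""
    for W, b in zip(weights, biases):
        X = np.maximum(W @ X + b[:, None], 0.0)
    return X

def ekf_perfect_model(weights, biases, x0, P0):
    """EKF prediction steps with the process noise omitted."""
    x, P = x0, P0
    for W, b in zip(weights, biases):
        z = W @ x + b
        F = (z > 0).astype(float)[:, None] * W
        x, P = np.maximum(z, 0.0), F @ P @ F.T
    return x, P

# Hypothetical toy network: 2 inputs -> 3 hidden nodes -> 2 outputs.
weights = [np.array([[1.0, 0.5], [0.2, 1.0], [-1.0, -1.0]]),
           np.array([[1.0, -0.5, 2.0], [0.5, 1.0, 1.0]])]
biases = [np.array([0.5, 0.5, -0.5]), np.array([0.1, 0.0])]
x0 = np.array([0.5, 0.5])
sigma = 0.05

# One pass through the network gives the EKF estimate.
_, P = ekf_perfect_model(weights, biases, x0, sigma**2 * np.eye(2))
ekf_std = np.sqrt(np.diag(P))

# Monte Carlo needs one pass per sample.
rng = np.random.default_rng(0)
samples = x0[:, None] + sigma * rng.standard_normal((2, 5000))
mc_std = forward_batch(weights, biases, samples).std(axis=1, ddof=1)

print("EKF:", ekf_std)
print("MC: ", mc_std)
```

The two estimates agree up to sampling error here because the network is locally linear around `x0`. When the noise is large enough to push nodes across the ReLU kink, the linearization becomes an approximation and the two can diverge.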
Without assuming a perfect model, it is difficult to test the accuracy of the resulting error bars: because of the inherent error of the model, the actual standard deviations resulting from a single sample cannot be verified. However, the accuracy can be estimated using an aggregate Root Mean Squared Error (RMSE) calculated by inference testing labeled images with the same label as $\mathbf{x}$. This RMSE can be compared with the estimated standard deviation calculated by EKF with the process noise included (Fig. 6). As the EKF-estimated standard deviations represent the accumulation of error through all of the layers, while the RMSE only indicates the average error in the final layer, the RMSE will generally be less than the EKF-estimated standard deviations. The effect of a single hidden layer on the error cannot be directly tested because there is no way of knowing what the output of a hidden layer should be. Additionally, the fact that no single image can correctly serve as the 'typical' image for a given label makes the RMSE an even rougher approximation of the real standard deviation.
Varying the diagonal entries of the input covariance $\Sigma_{\mathbf{x}}$ and comparing the EKF output to the actual standard deviations (assuming a perfect model) or RMSE (without assuming a perfect model) illustrates the relationship between the input covariance $\Sigma_{\mathbf{x}}$ and the output covariance $\Sigma_{\mathbf{y}}$ under those disjoint hypotheses (Fig. 7). Note that here, the higher variances are used for illustrative purposes only and are not likely to reflect actual use cases, as the DNN was trained to expect the components of $\mathbf{x}$ to lie strictly in the range [0,1].
IV Discussion
Fig. 7 indicates that when assuming a perfect model, higher input error gives higher output error where ReLU doesn’t vanish, and 0 where it does. Additionally, when the input vector component distributions are independent (as assumed in our calculations), the output error plot has the same shape but scales according to the average of the input error. When not assuming a perfect model, the input error plays a very small role in the output error. While the 6 overlapping curves in Fig. 7 are not exactly identical, they only differ from each other by around —. This is because in our model, the F and Q matrix entries were very roughly on the orders of around — and — respectively, so iteratively scaling by F’s and adding Q’s made the ’s tend toward the same values.
Additionally, running this experiment on DNNs with the same topology but trained to different accuracies, we found that the results could be drastically influenced when using a poorly trained model (Fig. 8).
In the model trained to 56.1% accuracy, whose only nonzero prediction value was on digit 4 (as well as other models trained to relatively low accuracies), the variances for some digits are always 0 regardless of input. This is because if the weights or biases are too small, the values of some nodes vanish identically after applying ReLU, zeroing out the F and Q terms there as well.
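The vanishing-variance effect is easy to reproduce on a single layer. In the hypothetical example below, the second node has negative weights and a negative bias, so for any input in [0,1] its value is 0 after ReLU. Its row of F is then zero, its estimated Q entry is zero as well, and the resulting variance is 0 no matter how large the input variance is. All names and values are illustrative.

```python
import numpy as np

# One layer, two nodes; the second node can never activate on [0, 1] inputs.
W = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, -0.1])

# Process noise estimated from held-out inputs, as a sample covariance.
rng = np.random.default_rng(1)
held_out = rng.uniform(0.0, 1.0, size=(2, 200))
values = np.maximum(W @ held_out + b[:, None], 0.0)
Q = np.cov(values)

# Jacobian at one particular input.
x = np.array([0.3, 0.6])
z = W @ x + b
F = (z > 0).astype(float)[:, None] * W

# Output variances for increasingly large input variances.
variances = np.array([
    np.diag(F @ (input_var * np.eye(2)) @ F.T + Q)
    for input_var in (0.0025, 0.25, 25.0)
])
print(variances)
```

The first column grows with the input variance, while the second column stays at exactly 0 in every row.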
V Conclusion
When assuming a perfect model, using EKF for uncertainty propagation through a DNN gives results comparable to those of previous methods, but requires fewer and simpler computations, which can be performed alongside inference tests. Additionally, EKF provides information in the case of an imperfect model, combining both the input uncertainty and the error from the DNN itself to give a more accurate representation of the total uncertainty of the output. Future work in this area will explore applying EKF to sparse deep neural networks. Sparsification methods include Hessian-based pruning [10, 11], Hebbian pruning [12], matrix decomposition [13], and graph techniques [14, 15, 16, 17, 18], which should be amenable to the EKF approach.
Acknowledgments
The authors wish to acknowledge the following individuals for their contributions and support: William Arcand, Bill Bergeron, David Bestor, Bob Bond, Chansup Byun, Alan Edelman, Vijay Gadepally, Chris Hill, Michael Houle, Matthew Hubbell, Michael Jones, Anna Klein, Charles Leiserson, Dave Martinez, Peter Michaleas, Lauren Milechin, Paul Monticciolo, Julia Mullen, Andrew Prout, Antonio Rosa, Albert Reuther, Siddharth Samsi, and Charles Yee.
References
 [1] A. H. Abdelaziz, S. Watanabe, J. R. Hershey, E. Vincent, and D. Kolossa, “Uncertainty propagation through deep neural networks,” 2015.
 [2] P. H. Garthwaite, J. B. Kadane, and A. O’Hagan, “Statistical methods for eliciting probability distributions,” Journal of the American Statistical Association, vol. 100, no. 470, pp. 680–701, 2005.
 [3] Y. Lee and S.-H. Oh, “Input noise immunity of multilayer perceptrons,” pp. 35–43, 1994.
 [4] R. F. Astudillo and J. P. d. S. Neto, “Propagation of uncertainty through multilayer perceptrons for robust automatic speech recognition,” 2011.
 [5] S. J. Julier and J. K. Uhlmann, “Reduced sigma point filters for the propagation of means and covariances through nonlinear transformations,” IEEE, pp. 887–892, 2002.
 [6] S. J. Julier and J. K. Uhlmann, “New extension of the Kalman filter to nonlinear systems,” International Society for Optics and Photonics, pp. 182–194, 1997.
 [7] S. Haykin, “Kalman filtering and neural networks,” 2004.
 [8] S. J. Julier, J. K. Uhlmann, and H. F. Durrant-Whyte, “A new approach for filtering nonlinear systems,” IEEE, pp. 1628–1632, 1995.
 [9] Y. LeCun, “The MNIST database of handwritten digits,” 1998.
 [10] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in neural information processing systems, 1990, pp. 598–605.
 [11] B. Hassibi and D. G. Stork, “Second order derivatives for network pruning: Optimal brain surgeon,” in Advances in neural information processing systems, 1993, pp. 164–171.
 [12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [13] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 806–814.
 [14] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra. SIAM, 2011.
 [15] J. Kepner, M. Kumar, J. Moreira, P. Pattnaik, M. Serrano, and H. Tufo, “Enabling massive deep neural networks with the graphblas,” in High Performance Extreme Computing Conference (HPEC). IEEE, 2017.
 [16] J. Kepner, V. Gadepally, H. Jananthan, L. Milechin, and S. Samsi, “Sparse deep neural network exact solutions,” in High Performance Extreme Computing Conference (HPEC). IEEE, 2018.
 [17] M. Kumar, W. Horn, J. Kepner, J. Moreira, and P. Pattnaik, “Ibm power9 and cognitive computing,” IBM Journal of Research and Development, 2018.
 [18] J. V. Kepner and H. Jananthan, “Mathematics of big data: Spreadsheets, databases, matrices, and graphs,” 2018.
 [19] S. J. Julier and J. K. Uhlmann, “A general method for approximating nonlinear transformations of probability distributions,” 1996.
 [20] J. Kepner, V. Gadepally, H. Jananthan, L. Milechin, and S. Samsi, “Sparse deep neural network exact solutions,” 2018.
 [21] T. Amemiya, “Regression analysis when the dependent variable is truncated normal,” pp. 997–1016, 1973.
 [22] R. E. Kalman, “A new approach to linear filtering and prediction problems,” pp. 35–45, 1960.
 [23] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” pp. 82–97, 2012.