1 Introduction
Convolutional layers continue to form a key component of current neural network designs. Even though the computational demands during the forward evaluation are relatively modest, significant computational resources are needed during training, which typically requires storage of the state variables (activations) and dense operations between the input and output. With accelerators (e.g., GPUs, TPUs, Inferentia), the arithmetic demands of training can be met as long as memory usage is kept under control.
Unfortunately, restricting memory usage without introducing significant computational overhead remains a challenge and can introduce additional complexity that is difficult to manage. Examples include (optimal) checkpointing [9, 3], where the state is periodically stored and recomputed during the backward pass; invertible networks [12, 21, 14], where the state can be derived from the output; and approximation methods where computations are carried out in limited-precision arithmetic [10] or where unbiased estimates of the gradient are made using certain approximations [38, 27], e.g., via randomized automatic differentiation (RAD, [29]) or via direct feedback alignment (DFA, [28, 13, 8]).

Our work is based on the premise that exact computations are often not needed, an approach advocated in the field of randomized linear algebra [34, 24] and, more recently, in the field of parametric machine learning [29]. There, the argument has been made that it is unnecessary to spend computational resources on exact gradients when stochastic optimization is used. A similar argument was made earlier in the context of parameter estimation with partial-differential equation constraints [11, 1, 35]. However, contrary to intervening in computational graphs as in RAD, our approach exploits the underlying linear algebra structure exhibited by the gradient of convolutional layers.

By means of relatively straightforward algebraic manipulations, we write the gradient with respect to a convolution weight in terms of the matrix trace of the outer product between the convolutional layer input, the backpropagated residual, and a shift. Next, we approximate this trace with an unbiased randomized trace estimation technique [2, 24, 17, 25, 32], for which we prove convergence and derive theoretical error bounds by extending recent theoretical results [7].
To meet the challenges of training the most popular convolutional neural networks (CNNs), we present a randomized probing technique capable of handling multiple input/output channels. We validate our approach on the MNIST and CIFAR10 datasets, on which we achieve substantial overall network memory savings (the savings for individual convolutional layers are much larger). Our results are reproducible at: Anonymous.

2 Theory
To arrive at our low-memory convolutional layer, we start by casting the action of these layers into a framework that exposes the underlying linear algebra. In doing so, gradients with respect to the convolution weights can be identified as traces of a matrix. By virtue of this identification, these traces can be approximated by randomized trace estimation [2], which greatly reduces the memory footprint at negligible, or even negative (i.e., a speedup), computational overhead. We start by deriving expressions for the single channel case, followed by a demonstration that randomized trace estimation leads to unbiased estimates for the gradient with respect to the weights. Next, we justify the use of randomized trace estimation by proving that its validity extends to arbitrary matrices. Aside from proving convergence as the number of probing vectors increases, we also provide error bounds before extending the proposed technique to the multichannel case. The latter calls for a new type of probing that minimizes crosstalk between the channels via orthogonalization. We derive accuracy bounds for this case as well.
Single channel case
Let us start by writing the action of a single-channel convolutional layer as follows:

$$\mathbf{y} = \sum_{i=1}^{K} w_i \mathbf{S}_i \mathbf{x}, \quad \mathbf{x}, \mathbf{y} \in \mathbb{R}^{N \times B}, \qquad (1)$$

where $N$, $B$, and $K$ are the number of pixels, the batch size, and the number of convolution weights ($K = k^2$ for a $k$-by-$k$ kernel), respectively. For the $i$th weight $w_i$, the convolutions themselves correspond to applying a circular shift with offset $i$, denoted by $\mathbf{S}_i$
, followed by multiplication with the weight. Given this expression for the action of a single-channel convolutional layer, expressions for the gradient with respect to the weights can easily be derived by using the chain rule and standard linear algebra manipulations [31]—i.e., we have

$$\frac{\partial \mathcal{L}}{\partial w_i} = \operatorname{tr}\!\left(\mathbf{x}\, \Delta\mathbf{y}^\top \mathbf{S}_i\right), \quad i = 1, \ldots, K. \qquad (2)$$

This expression for the gradient with respect to the convolution weights corresponds to computing the trace—i.e., the sum along the diagonal elements, denoted by $\operatorname{tr}(\cdot)$—of the outer product between the residual collected in $\Delta\mathbf{y} \in \mathbb{R}^{N \times B}$ and the layer's input $\mathbf{x}$, after applying the shift. The latter corresponds to a right circular shift along the columns.
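This identity can be checked numerically. The sketch below (with hypothetical sizes, using `np.roll` for the circular shift and the cyclic property of the trace) compares the trace of the shifted outer product against the direct Frobenius inner product between the shifted input and the residual:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, K = 64, 8, 3                 # pixels, batch size, number of weights (hypothetical)
x  = rng.standard_normal((N, B))   # layer input
dy = rng.standard_normal((N, B))   # backpropagated residual

def shift(a, i):
    """Circular shift S_i with offset i, applied along the pixel axis."""
    return np.roll(a, i, axis=0)

# Direct gradient: dL/dw_i = <S_i x, dy>, the Frobenius inner product.
g_direct = np.array([np.sum(shift(x, i) * dy) for i in range(K)])

# Trace form: tr(x dy^T S_i) = tr(S_i x dy^T) by the cyclic property of the trace.
g_trace = np.array([np.trace(shift(x, i) @ dy.T) for i in range(K)])

assert np.allclose(g_direct, g_trace)
```

The two expressions agree to machine precision; the trace form is the one amenable to randomized probing.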
Computing estimates for the trace through the action of matrices—i.e., without access to the entries of the diagonal—is common practice in the emerging field of randomized linear algebra [34, 24]. Going back to the seminal work by Hutchinson [17, 25], unbiased matrix-free estimates for the trace of a matrix $\mathbf{A}$ exist that involve probing with random vectors $\mathbf{z}_k$, $k = 1, \ldots, M$, with $M$ the number of probing vectors and with $\mathbb{E}[\mathbf{z}_k \mathbf{z}_k^\top] = \mathbf{I}$, the identity matrix. Under this assumption, unbiased randomized trace estimates can be derived from

$$\operatorname{tr}(\mathbf{A}) \approx \frac{1}{M} \sum_{k=1}^{M} \mathbf{z}_k^\top \mathbf{A}\, \mathbf{z}_k. \qquad (3)$$
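As a quick sanity check of this estimator, the following sketch estimates the trace of a random, indefinite matrix without touching its diagonal; the function name and sizes are illustrative:

```python
import numpy as np

def hutchinson(A, M, rng):
    """Matrix-free trace estimate (1/M) * sum_k z_k^T A z_k with Gaussian probes."""
    Z = rng.standard_normal((A.shape[0], M))
    return np.einsum('nk,nk->', Z, A @ Z) / M

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 200))   # a general square matrix, not PSD
est = hutchinson(A, 20000, rng)
# est approaches tr(A) as the number of probes M grows.
```

Only products of the form $\mathbf{A}\mathbf{z}$ are needed, which is what enables the matrix-free, memory-frugal gradient estimates that follow.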
By combining (2) with the above unbiased estimator for the trace, we arrive at the following approximation for the gradient with respect to the convolution weights:

$$\frac{\partial \mathcal{L}}{\partial w_i} \approx \frac{1}{M} \sum_{k=1}^{M} \mathbf{z}_k^\top\, \mathbf{x}\, \Delta\mathbf{y}^\top \mathbf{S}_i\, \mathbf{z}_k. \qquad (4)$$
From this expression, the memory savings during the forward pass are obvious: instead of the full state $\mathbf{x} \in \mathbb{R}^{N \times B}$, only the probed state $\mathbf{Z}^\top \mathbf{x} \in \mathbb{R}^{M \times B}$ needs to be stored, with $M \ll N$. However, convergence rate guarantees were only established under the additional assumption that the matrix is positive semidefinite (PSD, [22]). While the outer product we aim to probe here is not necessarily PSD, improving upon recent results by [7], we show that the PSD condition can be relaxed to asymmetric matrices by a symmetrization procedure that does not change the trace. More precisely, we show in the following proposition that the gradient estimator in (4) is unbiased and converges to the true gradient as $M$ increases (for details of the proof, we refer to appendix A.2).
Proposition 1.
Let $\mathbf{A}$ be a square matrix and let the probing vectors be i.i.d. Gaussian with zero mean and unit variance. Then, for any small number $\delta$, with probability $1 - \delta$, we have

Imposing a small probability of failure means the failure-probability term in the upper bound is large, which implies that neither term in the upper bound dominates for all values of $M$. Depending on which term is dominant, the range of $M$ can be divided into two regimes: the small-$M$ regime and the large-$M$ regime. In the small-$M$ regime, the first term dominates and the error decays linearly in $1/M$. In the large-$M$ regime, the second term dominates and the error decays as $1/\sqrt{M}$. The phase transition happens when $M$ is about the effective rank of $\mathbf{A}$, which reflects the rate of decay of its singular values. We see that the larger the effective rank is, the earlier the phase transition occurs, after which the decay rate of the error slows down. Before discussing details of the proposed algorithm, let us first extend the above randomized trace estimator to multichannel convolutions.

Multichannel case
In general, convolutional layers involve several input and output channels. In that case, the output of the $j$th channel can be written as

$$\mathbf{y}_j = \sum_{i=1}^{c_i} \sum_{k=1}^{K} w_{ijk}\, \mathbf{S}_k\, \mathbf{x}_i, \qquad (5)$$

for $j = 1, \ldots, c_o$, with $c_i$ and $c_o$ the number of input and output channels and $w_{ijk}$ the $k$th weight between the $i$th input and $j$th output channel. In this multichannel case, the gradients consist of the single channel gradient (2) for each input/output channel pair.
While randomized trace estimation can in principle be applied to each input/output channel pair independently, we propose to treat all channels simultaneously to further improve computational performance and memory use. Let the outer product between the residual of the $j$th output channel and the input of the $i$th input channel be $\mathbf{A}_{ij} = \Delta\mathbf{y}_j\, \mathbf{x}_i^\top$; computing the gradient for this channel pair amounts to estimating traces of shifted versions of $\mathbf{A}_{ij}$. To save memory, instead of probing each $\mathbf{A}_{ij}$ separately, we probe the stacked matrix with probing vectors of length $c_i N$ stored in $\mathbf{Z}$, and estimate each trace via the following estimators

(6)

where the operator in (6) extracts the corresponding block from the input vector. That is to say, we simply stack the input and residual, yielding matrices of size $c_i N \times B$ and $c_o N \times B$, whose outer product (i.e., the stacked matrix in (6)) is no longer necessarily square. To estimate the trace of each subblock in (6), we (i) probe the full outer product from the right with probing vectors of length $c_i N$; (ii) reshape the resulting matrix into a tensor of size $c_o \times N \times M$, while the probing matrix is reshaped into a tensor of size $c_i \times N \times M$ (i.e., we separate each block of $\mathbf{Z}$); and (iii) probe each individual block again from the left. This leads to the desired gradient, collected in a $c_i \times c_o$ matrix. We refer to Figure 1, which illustrates this multichannel randomized trace estimation. After (i), we only need to keep an $M \times B$ matrix in memory rather than the $c_i N \times B$ stacked state, which leads to a memory reduction by a factor of $c_i N / M$.

Unfortunately, the improved memory use and computational performance of the above multichannel probing reduce the accuracy of the randomized trace estimation because of crosstalk amongst the channels. Since this crosstalk is random, the induced error can be reduced by increasing the number of probing vectors $M$, but this comes at the expense of more memory use and increased computation. To avoid this unwanted overhead, we introduce a new type of random probing vector that minimizes the crosstalk by again imposing $\mathbb{E}[\mathbf{z}\mathbf{z}^\top] = \mathbf{I}$, but now on the multichannel probing vectors that consist of multiple blocks corresponding to the number of input/output channels.
Explicitly, we draw each block $\mathbf{z}^{(i)}$ of the probing vector according to

$$\mathbf{z}^{(i)} = \begin{cases} \mathbf{g}, \quad \mathbf{g} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_N), & \text{with probability } p,\\ \mathbf{0}, & \text{with probability } 1 - p. \end{cases} \qquad (7)$$

For different values of $i$, the $\mathbf{z}^{(i)}$'s are drawn independently, with a predefined probability $p$ of generating a nonzero block. Compared to conventional (Gaussian) probing vectors (see Figure 2, top left), these multichannel probing vectors contain sparse nonzero blocks (see Figure 2, top right), which reduces the crosstalk (juxtapose with the second row of Figure 2). It can be shown that the crosstalk becomes smaller as the probability of overlapping nonzero blocks decreases.
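The block-sparse draw in (7) might be sketched as follows (the function name and sizes are illustrative assumptions; each length-$N$ block is Gaussian with probability $p$ and identically zero otherwise):

```python
import numpy as np

def block_probing_matrix(c, N, M, p, rng):
    """Draw M probing vectors of length c*N whose c blocks of length N are
    i.i.d. Gaussian with probability p and identically zero otherwise."""
    Z = rng.standard_normal((c, N, M))
    keep = rng.random((c, 1, M)) < p      # keep each block with probability p
    return (Z * keep).reshape(c * N, M)

rng = np.random.default_rng(2)
Z = block_probing_matrix(c=4, N=16, M=8, p=0.5, rng=rng)
```

Because the zeroed blocks cannot pick up contributions from other channels, sparser blocks mean fewer overlapping nonzero blocks and hence less crosstalk.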
Given probing vectors drawn from (7), we have to modify the scaling factor of the multichannel randomized trace estimator (6) to ensure it remains unbiased,

(8)

where $m_i$ is the number of nonzero columns in the $i$th block of the probing matrix. We prove the following convergence result for this estimator (the proof can be found in appendix A.2).
Theorem 1 (Succinct version).
Let $M$ be the number of probing vectors. For any small number $\delta$, with probability $1 - \delta$ over the draw of the probing vectors, we have, for any $i$ and $j$,

where the constant in the bound is absolute and $c_i$ and $c_o$ are the numbers of input and output channels.
Theorem 1 provides a convergence guarantee for our special multichannel simultaneous probing procedure. Similar to Proposition 1, Theorem 1 in its original form (supplementary material) also exhibits a two-phase behavior, so the discussion under Proposition 1 applies here. To simplify the presentation, we only present the bound for the large-$M$ regime in this succinct version. Still, we can see that the error bound for estimating a given block depends not only on the norm of the current block, but also on the other blocks in that row, which is expected since we simultaneously probe the entire row instead of each block individually for memory efficiency. Admittedly, due to technical difficulties, we cannot theoretically show that decreasing the sampling probability decreases the error. Nevertheless, we observe better performance in the numerical experiments.
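To make the multichannel procedure concrete, here is a minimal numpy sketch of steps (i)–(iii) with block-sparse probes in the spirit of (7) and a per-block rescaling in the spirit of (8); all sizes, the correlated residual, and the variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
c_i, c_o, N, B, M, p = 3, 2, 64, 4, 4000, 0.5
X  = rng.standard_normal((c_i, N, B))   # per-channel layer inputs
dY = X[:c_o].copy()                     # residual correlated with the input so the
                                        # traces sit well above the estimator noise

# Block-sparse probing vectors of length c_i*N (each block kept with probability p):
Zt   = rng.standard_normal((c_i, N, M))
keep = rng.random((c_i, M)) < p
Zt  *= keep[:, None, :]
Z    = Zt.reshape(c_i * N, M)

# (i) probe the stacked outer product from the right, never forming it explicitly;
#     in the forward pass only the small B x M matrix Xbar^T Z would be stored.
W = dY.reshape(c_o * N, B) @ (X.reshape(c_i * N, B).T @ Z)    # (c_o*N, M)

# (ii)-(iii) reshape into channel blocks and probe each block again from the left,
# rescaling block i by its number of nonzero probes to keep the estimate unbiased.
Wt  = W.reshape(c_o, N, M)
m   = keep.sum(axis=1)                                        # nonzero probes per block
est = np.einsum('inm,jnm->ij', Zt, Wt) / m[:, None]           # est[i,j] ~ tr(dy_j x_i^T)

true = np.einsum('inb,jnb->ij', X, dY)                        # exact traces
```

The full $c_o N \times c_i N$ outer product is never formed; the crosstalk shows up as zero-mean noise in the off-diagonal estimates and shrinks as $M$ grows or as the blocks become sparser.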
3 Stochastic optimization with multichannel randomized trace estimation
Given the expressions for the approximate gradient calculations of convolutional layers and bounds on their error, we are now in a position to introduce our algorithm and analyze its performance on stylized examples and on the MNIST and CIFAR10 datasets in the Experiments section (Section 4). We will demonstrate that, for fixed memory usage, the errors in the gradient are of the same order as the errors induced by selecting different mini-batches. This confirms similar observations made by [29]. We conclude this section by comparing memory usage and speed for an actual neural network.
Low-memory stochastic backpropagation
The key point of the randomized trace estimator in Equation (8) is that it allows for on-the-fly compression of the state variables during the forward pass. For a single convolutional layer with input $\mathbf{x}$ and convolution weights $\mathbf{w}$, our approximation involves three simple steps, namely (1) probing of the state variable, (2) matrix-free formation of the outer product, and (3) approximation of the gradient via randomized trace estimation. These three steps lead to major memory reductions, even for relatively small image sizes. Because the probing vectors are generated on the fly, we only need to allocate memory for the probed state during the forward pass, as long as we also store the state of the random generator. During backpropagation, we initialize the state, generate the probing vectors, and then apply the shift and the product with the backpropagated residual. These steps are summarized in Algorithm 3. This simple yet powerful algorithm provides a virtually memory-free estimate of the true gradient with respect to the weights.
Forward pass:
1. Forward convolution
2. Draw a new random seed and probing matrix
3. Compute and save
4. Store
Backward pass:
1. Load random seed and probed forward
2. Redraw probing matrix from
3. Compute backward probe
4. Compute gradient
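A minimal single-channel sketch of these forward/backward steps (hypothetical function names and sizes; the probing matrix is redrawn from the stored seed rather than kept in memory):

```python
import numpy as np

def forward(x, w, M, seed):
    """Forward convolution y = sum_i w_i S_i x; store only the seed and
    the probed state Z^T x of size M x B instead of the full N x B input."""
    N, B = x.shape
    y = sum(w_i * np.roll(x, i, axis=0) for i, w_i in enumerate(w))
    Z = np.random.default_rng(seed).standard_normal((N, M))
    return y, (seed, Z.T @ x)

def grad_w(dy, saved, M, K):
    """Gradient estimate dL/dw_i ~ (1/M) sum_k (z_k^T x)(dy^T S_i z_k),
    with Z redrawn from the stored seed during the backward pass."""
    seed, xp = saved                       # xp: probed state, M x B
    N = dy.shape[0]
    Z = np.random.default_rng(seed).standard_normal((N, M))
    return np.array([np.sum(xp.T * (dy.T @ np.roll(Z, i, axis=0))) / M
                     for i in range(K)])

rng = np.random.default_rng(0)
x  = rng.standard_normal((256, 4))
dy = np.roll(x, 1, axis=0)                 # residual aligned with shift i = 1
y, saved = forward(x, [0.5, -1.0, 0.25], M=5000, seed=123)
g_est  = grad_w(dy, saved, M=5000, K=3)
g_true = np.array([np.sum(np.roll(x, i, axis=0) * dy) for i in range(3)])
```

Only the seed and the $M \times B$ probed state persist between the two passes; the full input is free to be discarded as soon as the next layer has consumed it.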
Mini-batch versus randomized trace estimation errors
Simply stated, stochastic optimization involves gradients that contain random errors known as gradient noise. As long as this noise is not too large and is independent across different gradient calculations, algorithms such as stochastic gradient descent, where gradients are computed for randomly drawn mini-batches, converge under certain conditions. In addition, the presence of gradient noise helps the algorithm avoid bad local minima, which arguably leads to better generalization of the trained network [26, 16]. Therefore, as long as the batch size is not too large, one can expect the trained network to perform well.
We argue that the same applies to stochastic optimization with gradients approximated by (multichannel) randomized trace estimation, as long as the errors behave similarly. In a setting where memory comes at a premium, this means that we can expect training to be successful for gradient noise with similar variability. To this end, we conduct an experiment where the variability of the gradients of the convolution weights is calculated for the true gradient for different randomly drawn mini-batches. We do this for a randomly initialized image classification network designed for the CIFAR10 dataset (for network details, see Table 4 in appendix A.3).
For comparison, approximate gradients are also calculated for randomized trace estimates obtained by probing independently ("Indep." in blue), multichannel ("Multi" in orange), and multichannel with orthogonalization ("Multi-Ortho" in green). The batch sizes are selected, for a fixed probing size, such that the total memory use is the same as for the true gradient calculations. From the plots in Figure 3, we observe that, as expected, the independent probing is closest to the true gradient, followed by the more memory efficient multichannel probing with and without orthogonalization. While all approximate gradients are within the 99% confidence interval, the orthogonalization has a big effect when the gradients are small (see conv3).
To better understand the interplay between different batch sizes and numbers of probing vectors, we also computed estimates for the standard deviation from randomly drawn mini-batches. As expected, the standard deviations of the gradients of the network weights increase for smaller batch sizes and numbers of probing vectors. Moreover, the variability of the approximations obtained with randomized trace estimation is larger for the deeper convolutional layers. However, since we can afford larger batch sizes for similar memory usage, we can control the variability for a given memory budget by using a larger batch size.
Overall effective memory savings
Approximate gradient calculations with multichannel randomized trace estimation can lead to significant memory savings within convolutional layers. Because these layers operate in conjunction with other network layers, such as ReLUs and batch norms, the overall effective memory savings depend on the ratio of convolutional to other layers and on the interaction between them. This is especially important for layers such as ReLU, which rely on the next layer to store the state variable during backpropagation. Unfortunately, that approach no longer works because our low-memory convolutional layer does not store the state variable. However, this situation can be remedied easily by only keeping track of the signs [29].
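For the ReLU remedy just mentioned, keeping only the signs can be as simple as packing a boolean mask, one bit per activation (a hedged numpy sketch, not the actual implementation):

```python
import numpy as np

x = np.random.default_rng(4).standard_normal(1024).astype(np.float32)

mask   = x > 0                      # all ReLU needs for its backward pass
packed = np.packbits(mask)          # 1 bit per activation instead of 32

y = np.where(mask, x, 0.0)          # ReLU forward

# Backward: recover the mask from the packed bits and gate the residual.
dy = np.ones_like(x)
unpacked = np.unpackbits(packed)[: x.size].astype(bool)
dx = np.where(unpacked, dy, 0.0)
```

Relative to storing 32-bit activations, this reduces the ReLU state by a factor of 32, so the layer no longer depends on the convolutional layer storing its input.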
To assess the effective memory savings of the multichannel trace estimation, we include in Figure 9 layer-by-layer comparisons of memory usage for different versions of the popular SqueezeNet [18] and ResNet [15]. The memory use of the conventional implementation is plotted in blue and that of our implementation in orange. The results indicate that memory savings by a factor of two or more are certainly achievable, which allows for a doubling of the batch size or an increase in the width/depth of the network. As expected, the savings depend on the ratio of convolutional to other layers.
Wall-clock benchmarks
Ideally, reducing the memory footprint during training should not come at the expense of computational overhead that slows things down. To ensure this is indeed the case, we implemented the multichannel randomized trace estimation optimized for CPUs in Julia [4] and for GPUs in PyTorch [30]. Implementation details and benchmarks are included in appendix A.5.

Our extensive benchmarking experiments demonstrate highly competitive performance both on CPUs, against the state-of-the-art NNlib [19, 20], and on GPUs, against the highly optimized implementations of convolutional layers in CUDA. On CPUs, for large images and large batch sizes, we even outperform the standard im2col [5] implementation, as long as the number of probing vectors remains relatively small. We observe similar behavior on GPUs, where we remain competitive and at times even outperform highly optimized cuDNN kernels [6], with room for further improvement. In all cases, there is a slight decrease in performance as the number of channels increases. Overall, approximate gradient calculations with multichannel randomized trace estimation replace expensive convolutions between the input and output channels by a relatively simple combination of matrix-free actions of the outer product on random probing vectors on the right and dense linear matrix operations on the left (cf. (8) and Algorithm 3).
4 Experiments
Even though the memory and computational gains of our proposed method can be significant during backpropagation, the accuracy of the trained networks needs to be verified. To this end, we conduct a number of experiments on the MNIST and CIFAR10 datasets. In these experiments, we vary the batch size and the number of probing vectors. Implementations in both Julia and Python are evaluated.
MNIST dataset
We start by training two "MNIST networks" (detailed in Tables 2 and 3 of appendix A.3 for Julia and PyTorch, with training parameters listed in appendix A.4) for varying batch sizes and numbers of probing vectors. The network test accuracies for the Julia implementation, where the default convolutional layer implementation is replaced by XConv.jl, are listed in Table 1, both for the default implementation and for our implementation where the gradients of the convolutional layers are replaced by our approximations. The results show that our low-memory implementation remains competitive (compare numbers in bold) even for a small number of probing vectors, yielding substantial memory savings.
We obtained the results listed in Table 1 with the ADAM [23] optimization algorithm. In an effort to add robustness when training overparameterized deep neural networks, we switch in the next example to stochastic line searches (SLS, [36]), which remove the need to set hyperparameters manually. With this algorithm, the line search parameters are set automatically at the cost of an extra gradient calculation. Figure 10 shows the test accuracies as a function of the number of epochs, the batch size, and the number of probing vectors. Because the randomized trace estimation is unbiased, we observe convergence as the number of probing vectors increases. Despite relatively large approximation errors for small numbers of probing vectors, we also notice that the randomness induced by our approximate gradient calculations does not adversely affect the line searches. As in the previous example, we achieve competitive results, with slight random fluctuations, for all batch sizes, resulting in a significant reduction in memory use.

CIFAR10 dataset
To conclude our empirical validation of approximate gradient calculations with multichannel randomized trace estimation, we train a network on the CIFAR10 dataset. Compared to the previous examples, this is a larger and more realistic training problem. To mimic an actual training scenario, memory usage is fixed between the regular gradient and the approximate gradients obtained by probing independently ("Indep." in blue), multichannel ("Multi." in green), and multichannel with orthogonalization ("Multi-Ortho" in red). The batch size for the approximate gradient examples is increased to reflect the smaller memory footprint. Results for the training/testing loss and accuracy are included in Figure 11. The following observations can be made from these plots. First, there is a clear gap between the training/testing loss for the true and approximate gradients. This gap is also present in the training/testing accuracy, albeit relatively small. However, because of the doubling of the batch size, the runtime for the training is effectively halved.
5 Related work
The continued demand to train larger and larger networks, for tasks such as video compression and classification in 3D, puts pressure on the memory of accelerators (GPUs, etc.), which is in short supply. This memory pressure is exacerbated when training relies on backpropagation, which in its mundane form calls for storage of the state variables during the forward pass. Several attempts have been made to relieve this memory pressure, ranging from optimal checkpointing [9, 3] to invertible neural networks [12, 21, 14]. While these approaches can reduce the memory footprint during training, they introduce significant computational overhead and algorithmic complexity, and invertible neural network implementations may lack expressibility.
Alternatively, people have relied on approximate arithmetic [10], on replacing symmetric backpropagation by direct feedback alignment [28, 13, 8], where random projections of the residual are used, or on approximations of gradients and Jacobians with techniques borrowed from randomized linear algebra [2, 25, 24]. Compared to the other approaches, the latter is capable of producing approximations that are unbiased, a highly desirable feature when training neural networks with stochastic optimization [26]. Instead of randomizing the forward pass in stochastic computational graphs, we propose to approximate gradients by exploiting the special structure of the gradients of convolutional layers. This structure allows us to use the relatively simple method of randomized trace estimation to approximate the gradient while significantly reducing the memory footprint. While perhaps less versatile than the recently proposed method of randomized automatic differentiation [29], our approach does not need to intervene in the computational graph and acts as a drop-in replacement for the 2D and 3D convolutional layers in existing machine learning frameworks.
6 Conclusion and Future work
We introduced a novel take on convolutional layers, grounded in recent work in randomized linear algebra, that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient—i.e., the state variable only needs to be stored in a heavily compressed form—the proposed approach, in which gradients with respect to the convolution weights are approximated by traces, also has computational advantages, outperforming state-of-the-art neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvement, networks trained with approximate gradients calculated via randomized probing achieve performance very close to that of the most advanced training methods, with an error that decreases with the number of probing vectors. The latter opens enticing perspectives given recent developments in specialized photonic hardware, where the speed of randomized probing is drastically increased [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? See section 2 for the main theoretical result and hypothesis

Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 in the appendix

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition all results presented here are reproducible with individual script in the linked open repository.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the benchmark result caption on Figure 27, 22. Hardware details for the networks training are described in the appendix.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets? The License (MIT) is in the code repository

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References
 [1] (2012) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
 [2] (2011) Randomized algorithms for estimating the trace of an implicit symmetric positive semidefinite matrix. J. ACM 58 (2). External Links: ISSN 0004-5411, Link, Document Cited by: §1, §2, §5.
 [3] (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
 [4] (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
 [5] (2006) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.

 [6] (2014) cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.
 [7] (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
 [8] (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662453X Cited by: §1, §5.
 [9] (2000) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 0098-3500, Link, Document Cited by: §1, §5.
 [10] (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
 [11] (2012) An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
 [12] (2017) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
 [13] (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.01986. External Links: 1901.01986, Link Cited by: §1, §5.
 [14] (2019) Reversible designs for extreme memory cost reduction of cnn training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.

 [15] (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, Link Cited by: §3.
 [16] (2020) Understanding Generalization Through Visualizations. In Proceedings on "I Can't Believe It's Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
 [17] (1989) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics  Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link, https://doi.org/10.1080/03610918908812806 Cited by: §1, §2.
 [18] (2016) SqueezeNet: AlexNetlevel accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
 [19] (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.

 [20] (2018) Flux: Elegant Machine Learning with Julia. Journal of Open Source Software. External Links: Document, Link Cited by: §3.
 [21] (2018) i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
 [22] (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
 [23] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.
 [24] (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
 [25] (202010) Hutch++: Optimal Stochastic Trace Estimation. arXiv eprints, pp. arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
 [26] (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
 [27] (201909–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
 [28] (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29, pp. . External Links: Link Cited by: §1, §5.
 [29] (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
 [30] (2019) PyTorch: An Imperative Style, HighPerformance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
 [31] (200810) The Matrix Cookbook. Technical University of Denmark. Note: Version 20081110 External Links: Review Matrix Cookbook, Link Cited by: §2.
 [32] (201510) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 16153375, Link, Document Cited by: §1.
 [33] (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6215–6219. External Links: Document Cited by: §6.
 [34] (201907) Streaming LowRank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link, https://doi.org/10.1137/18M1201068 Cited by: §1, §2.

[35]
(201410)
3D frequencydomain seismic inversion with controlled sloppiness
. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1. 
[36]
(2019)
Painless Stochastic Gradient: Interpolation, LineSearch, and Convergence Rates
. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: 3rd item, Figure 10, §4. 
[37]
(201811)
HighDimensional Probability: An Introduction with Applications in Data Science
. Cambridge University Press. Cited by: §A.2.2.  [38] (201917 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Vol. , pp. 31–35. External Links: Document, Link Cited by: §1.
4 Experiments

Even though the memory and computational gains of our proposed method can be significant during backpropagation, the accuracy of the trained networks needs to be verified. To this end, we conduct a number of experiments on the MNIST and CIFAR-10 datasets. In these experiments, we vary the batch size and the number of probing vectors. Implementations in both Julia and Python are evaluated.
MNIST dataset
We start by training two "MNIST networks" (detailed in Tables 2 and 3 of Appendix A.3 for Julia and PyTorch, with training parameters listed in Appendix A.4) for varying batch sizes and numbers of probing vectors. The network test accuracies for the Julia implementation, where the default convolutional layer is replaced by XConv.jl, are listed in Table 1 for the default implementation and for our implementation, in which the gradients of the convolutional layers are replaced by our approximations. The results show that our low-memory implementation remains competitive (compare the numbers in bold) even for a small number of probing vectors, yielding a memory saving of about .
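For reference, the unbiased randomized trace estimation underlying these approximate gradients is Hutchinson's method: with Rademacher probe vectors z satisfying E[zzᵀ] = I, the average of zᵀAz converges to tr(A). The following is a minimal NumPy sketch of the estimator itself, not the XConv.jl implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(A, num_probes):
    """Unbiased estimate of tr(A) from num_probes Rademacher probes."""
    n = A.shape[0]
    # Rademacher probes: entries +/-1 with equal probability, so E[z z^T] = I.
    Z = rng.choice([-1.0, 1.0], size=(n, num_probes))
    # Each column k contributes z_k^T A z_k; the mean is unbiased for tr(A).
    return np.mean(np.einsum("ik,ij,jk->k", Z, A, Z))

A = np.array([[4.0, 0.5, 0.2, 0.0],
              [0.5, 3.0, 0.1, 0.3],
              [0.2, 0.1, 2.0, 0.4],
              [0.0, 0.3, 0.4, 1.0]])
estimate = hutchinson_trace(A, 200_000)
```

With many probes the estimate clusters tightly around tr(A) = 10; the variance, and hence the required number of probes, is governed by the off-diagonal mass of A.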
We obtained the results listed in Table 1 with the ADAM [23] optimization algorithm. In an effort to add robustness when training over-parameterized deep neural networks, we switch in the next example to stochastic line searches (SLS, [36]) that remove the need to set hyperparameters manually. With this algorithm, the line-search parameters are set automatically at the cost of an extra gradient calculation. Figure 10 shows the test accuracies as a function of the number of epochs, the batch size, and the number of probing vectors. Because the randomized trace estimation is unbiased, we observe convergence as the number of probing vectors increases. Despite relatively large approximation errors for small numbers of probing vectors, we also notice that the randomness induced by our approximate gradient calculations does not adversely affect the line searches. As in the previous example, we achieve competitive results with slight random fluctuations for all batch sizes, resulting in a reduction in memory use by a factor of about .
CIFAR-10 dataset
To conclude our empirical validation of approximate gradient calculations with multi-channel randomized trace estimation, we train a network on the CIFAR-10 dataset. Compared to the previous examples, this is a larger, more challenging, and more realistic training problem. To mimic an actual training scenario, memory usage is fixed between the regular gradient and the approximate gradients obtained by probing independently ("Indep.", in blue), multi-channel ("Multi.", in green), and multi-channel with orthogonalization ("Multi-Ortho", in red). The batch size for the approximate gradient examples is increased to reflect the smaller memory footprint. Results for the training/testing loss and accuracy are included in Figure 11. The following observations can be made from these plots. First, there is a clear gap between the training/testing loss for the true and approximate gradients. This gap is also present in the training/testing accuracy, albeit it is relatively small. However, because the batch size is doubled, the runtime for the training is effectively halved.
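The intuition behind the orthogonalized variant can be illustrated in isolation (this is a generic sketch of orthogonalized probing, not the exact "Multi-Ortho" algorithm): orthonormalizing the probe matrix with a QR factorization removes redundant directions among the probes, and in the limiting case where the number of probes equals the dimension, the probe-based "estimate" recovers the trace exactly, since tr(QᵀAQ) = tr(AQQᵀ) = tr(A) when QQᵀ = I.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
A = rng.standard_normal((n, n))  # any square matrix; symmetry is not required

# Draw Gaussian probes and orthonormalize them with a QR factorization.
Z = rng.standard_normal((n, n))
Q, _ = np.linalg.qr(Z)

# The columns of Q are exactly orthonormal ...
gram = Q.T @ Q
# ... so with as many orthonormal probes as dimensions, the sum of the
# quadratic forms q_k^T A q_k equals tr(Q^T A Q) = tr(A Q Q^T) = tr(A).
ortho_estimate = np.sum(np.einsum("ik,ij,jk->k", Q, A, Q))
```

In the practical regime, with fewer probes than dimensions, the recovery is no longer exact, but the orthogonalized probes avoid sampling the same direction twice.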
5 Related work
The continued demand to train ever larger networks, for tasks such as video compression and classification in 3D, puts pressure on the memory of accelerators (GPUs, etc.), which is in short supply. This memory pressure is exacerbated when training relies on backpropagation, which in its standard form calls for storage of the state variables during the forward pass. To relieve this memory pressure, several attempts have been made, ranging from the use of optimal checkpointing [9, 3] to the use of invertible neural networks [12, 21, 14]. While these approaches can reduce the memory footprint during training, they introduce significant computational overhead and algorithmic complexity, and invertible neural network implementations may lack expressibility.
Alternatively, practitioners have relied on approximate arithmetic [10], on replacing symmetric backpropagation by direct feedback alignment [28, 13, 8], where random projections of the residual are used, or on approximations of gradients and Jacobians with techniques borrowed from randomized linear algebra [2, 25, 24]. Compared to the other approaches, the latter is capable of producing unbiased approximations, a highly desirable feature when training neural networks with stochastic optimization [26]. Instead of randomizing the forward pass in stochastic computational graphs, we propose to approximate gradients by exploiting the special structure of the gradients of convolutional layers. This structure allows us to use the relatively simple method of randomized trace estimation to approximate the gradient while reducing the memory footprint significantly. While perhaps less versatile than the recently proposed method of randomized automatic differentiation [29], our approach does not need intervention in the computational graph and acts as a drop-in replacement for the 2D and 3D convolutional layers in existing machine learning frameworks.
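Concretely, for a one-dimensional correlation y_i = Σ_s w_s x_{i+s}, the gradient of the loss with respect to w_s is the inner product of the backpropagated residual r with the shifted input x_s = x[s:s+m], i.e., the trace of the shifted outer product x_s rᵀ. A probe z then only touches this matrix through the cheap factored products zᵀ(x_s rᵀ)z = (zᵀx_s)(rᵀz), which is what makes the randomized estimate memory efficient: only the probe inner products with input and residual are needed, never the outer product itself. A hypothetical NumPy illustration (not the paper's XConv implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # layer input
r = np.array([1.0, 1.0, 1.0, 1.0])            # backpropagated residual dL/dy
kernel_size = 3
m = r.size  # output length of the "valid" correlation

# Exact gradient: dL/dw_s = sum_i r_i x_{i+s} = r . x[s:s+m].
exact_grad = np.array([r @ x[s:s + m] for s in range(kernel_size)])

# Randomized estimate: dL/dw_s = tr(x_s r^T); for each Rademacher probe z,
# z^T (x_s r^T) z = (z^T x_s)(r^T z), so the estimator only needs the
# inner products of the probes with the input and with the residual.
K = 50_000
Z = rng.choice([-1.0, 1.0], size=(m, K))
zr = Z.T @ r  # probes against the residual, shared across all shifts s
est_grad = np.array([np.mean((Z.T @ x[s:s + m]) * zr)
                     for s in range(kernel_size)])
```

With the unit residual above, the exact gradient is [10, 14, 18], and the unbiased estimate concentrates around it as the number of probes grows.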
6 Conclusion and Future work
We introduced a novel take on convolutional layers, grounded in recent work in randomized linear algebra, that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient (the state variables only need to be stored in a heavily compressed form), the proposed approach, in which gradients with respect to convolution weights are approximated by traces, also has computational advantages, outperforming state-of-the-art neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvement, networks trained with approximate gradients calculated via randomized probing achieve performance very close to that of the most advanced training methods, with an error that decreases with the number of randomized probing vectors. The latter opens enticing perspectives given recent developments in specialized photonic hardware, where the speed of randomized probing is drastically increased [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? See Section 2 for the main theoretical result and its hypotheses.

Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 of the appendix.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition, all results presented here are reproducible with the individual scripts in the linked open repository.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the benchmark result captions of Figures 27 and 22. Hardware details for the network training are described in the appendix.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets? The license (MIT) is included in the code repository.

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References
 [1] (2012-08) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
 [2] (2011-04) Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM 58 (2). External Links: ISSN 0004-5411, Link, Document Cited by: §1, §2, §5.
 [3] (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
 [4] (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
 [5] (2006-10) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.
 [6] (2014) cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.
 [7] (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
 [8] (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662-453X Cited by: §1, §5.
 [9] (2000-03) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 0098-3500, Link, Document Cited by: §1, §5.
 [10] (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
 [11] (2012-07) An effective method for parameter estimation with PDE constraints with multiple right-hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
 [12] (2017-12) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
 [13] (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.01986. External Links: 1901.01986, Link Cited by: §1, §5.
 [14] (2019) Reversible designs for extreme memory cost reduction of CNN training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.
 [15] (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, Link Cited by: §3.
 [16] (2020-12) Understanding Generalization Through Visualizations. In Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
 [17] (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link Cited by: §1, §2.
 [18] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
 [19] (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.
 [20] (2018) Flux: Elegant Machine Learning with Julia. Journal of Open Source Software. External Links: Document, Link Cited by: §3.
 [21] (2018) i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations. External Links: Link Cited by: §1, §5.
 [22] (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
 [23] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. External Links: Link Cited by: §4.
 [24] (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
 [25] (2020-10) Hutch++: Optimal Stochastic Trace Estimation. arXiv e-prints, arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
 [26] (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
 [27] (2019, 9–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
 [28] (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29. External Links: Link Cited by: §1, §5.
 [29] (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations. External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
 [30] (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
 [31] (2008-10) The Matrix Cookbook. Technical University of Denmark. Note: Version 2008-11-10 External Links: Link Cited by: §2.
 [32] (2015-10) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 1615-3375, Link, Document Cited by: §1.
 [33] (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6215–6219. External Links: Document Cited by: §6.
 [34] (2019-07) Streaming Low-Rank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link Cited by: §1, §2.
 [35] (2014-10) 3D frequency-domain seismic inversion with controlled sloppiness. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1.
 [36] (2019) Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates. In Advances in Neural Information Processing Systems, Vol. 32. External Links: Link Cited by: 3rd item, Figure 10, §4.
 [37] (2018-11) High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press. Cited by: §A.2.2.
 [38] (2019, 17 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pp. 31–35. External Links: Document, Link Cited by: §1.
5 Related work
The continued demand to train larger and larger networks, for tasks such as video compression and classification in 3D, puts pressure on the memory of accelerators (GPUs, etc.), which is in short supply. This memory pressure is exacerbated when training relies on backpropagation that in its mundane form calls for storage of the state variables during the forward pass. To relieve this memory pressure several attempts have been made, ranging from the use of optimal checkpointing [9, 3] to the use of invertible neural networks [12, 21, 14]. While these approaches can reduce the memory footprint during training, they introduce significant computational overhead, algorithmic complexity, and invertible neural network implementations that may lack in expressibility.
Alternatively, people have relied on approximate arithmetic [10], on replacing symmetric backpropagation by direct feedback alignment [28, 13, 8], where random projections on the residual are used, or on approximations of gradients and Jacobians with techniques borrowed from randomized linear algebra [2, 25, 24]. Compared to these other approaches, the latter is capable of producing approximations that are unbiased, a highly desirable feature when training neural networks with stochastic optimization [26]. Instead of randomizing the forward pass in stochastic computational graphs, we propose to approximate gradients by exploiting the special structure of gradients of convolutional layers. This structure allows us to use the relatively simple method of randomized trace estimation to approximate the gradient while reducing the memory footprint significantly. While perhaps less versatile than the recently proposed method of randomized automatic differentiation [29], our approach does not need intervention in the computational graph and acts as a dropin replacement for the 2D and 3D convolutional layers in existing machine learning frameworks.
6 Conclusion and Future work
We introduced a novel take on convolutional layers grounded on recent work in randomized linear algebra that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient—i.e., the state variable only needs to be stored in a majorly compressed form, the proposed approach, where gradients with respect to convolution weights are approximated by traces, also has computational advantages outperforming stateoftheart neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvements, networks trained with approximate gradients calculated with randomized probing have a performance that is very close to that of the most advanced training methods with an error that decreases with the number of randomized probing vectors. The latter opens enticing perspectives given recent developments in specialized photonic hardware where the speed of randomized probing is drastically increased [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? See section 2 for the main theoretical result and hypothesis

Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 in the appendix

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition all results presented here are reproducible with individual script in the linked open repository.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the benchmark result caption on Figure 27, 22. Hardware details for the networks training are described in the appendix.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets? The License (MIT) is in the code repository

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References
 [1] (201208) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
 [2] (201104) Randomized algorithms for estimating the trace of an implicit symmetric positive semidefinite matrix. J. ACM 58 (2). External Links: ISSN 00045411, Link, Document Cited by: §1, §2, §5.
 [3] (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
 [4] (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
 [5] (200610) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.

[6]
(2014)
cuDNN: Efficient Primitives for Deep Learning
. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.  [7] (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009 abs/2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
 [8] (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662453X Cited by: §1, §5.
 [9] (200003) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 00983500, Link, Document Cited by: §1, §5.
 [10] (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
 [11] (201207) An effective method for parameter estimation with pde constraints with multiple right hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
 [12] (201712) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
 [13] (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.0198. External Links: 1901.01986, Link Cited by: §1, §5.
 [14] (2019) Reversible designs for extreme memory cost reduction of cnn training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.

[15]
(2016)
Deep Residual Learning for Image Recognition.
In
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, Vol. , pp. 770–778. External Links: Document, Link Cited by: §3.  [16] (202012 Dec) Understanding Generalization Through Visualizations. In Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
 [17] (1989) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics  Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link, https://doi.org/10.1080/03610918908812806 Cited by: §1, §2.
 [18] (2016) SqueezeNet: AlexNetlevel accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
 [19] (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.

[20]
(2018)
Flux: Elegant Machine Learning with Julia.
Journal of Open Source Software
. External Links: Document, Link Cited by: §3.  [21] (2018) iRevNet: Deep Invertible Networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
 [22] (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
 [23] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 79, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.
 [24] (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
 [25] (202010) Hutch++: Optimal Stochastic Trace Estimation. arXiv eprints, pp. arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
 [26] (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
 [27] (201909–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
 [28] (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29, pp. . External Links: Link Cited by: §1, §5.
 [29] (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
 [30] (2019) PyTorch: An Imperative Style, HighPerformance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
 [31] (200810) The Matrix Cookbook. Technical University of Denmark. Note: Version 20081110 External Links: Review Matrix Cookbook, Link Cited by: §2.
 [32] (201510) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 16153375, Link, Document Cited by: §1.
 [33] (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6215–6219. External Links: Document Cited by: §6.
 [34] (201907) Streaming LowRank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link, https://doi.org/10.1137/18M1201068 Cited by: §1, §2.

[35]
(201410)
3D frequencydomain seismic inversion with controlled sloppiness
. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1. 
[36]
(2019)
Painless Stochastic Gradient: Interpolation, LineSearch, and Convergence Rates
. In Advances in Neural Information Processing Systems, Vol. 32, pp. . External Links: Link Cited by: 3rd item, Figure 10, §4. 
[37]
(201811)
HighDimensional Probability: An Introduction with Applications in Data Science
. Cambridge University Press. Cited by: §A.2.2.  [38] (201917 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Vol. , pp. 31–35. External Links: Document, Link Cited by: §1.
6 Conclusion and Future Work
We introduced a novel take on convolutional layers, grounded in recent work in randomized linear algebra, that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient (the state variable only needs to be stored in a highly compressed form), the proposed approach, in which gradients with respect to convolution weights are approximated by traces, also has computational advantages and outperforms state-of-the-art neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvement, networks trained with approximate gradients computed via randomized probing achieve performance very close to that of the most advanced training methods, with an error that decreases with the number of randomized probing vectors. The latter opens enticing perspectives given recent developments in specialized photonic hardware that drastically increase the speed of randomized probing [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.
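To make the trace viewpoint concrete, the following is a toy 1-D NumPy sketch of our own (an illustration, not the paper's implementation): for a correlation layer, the gradient with respect to filter tap k equals the trace of the outer product between the shifted layer input and the backpropagated residual, and a Gaussian probing estimator recovers that trace in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, N = 256, 3, 2000         # signal length, kernel taps, number of probes

x = rng.standard_normal(T)     # convolutional layer input
dy = rng.standard_normal(T)    # backpropagated residual

def shift(v, k):
    """Zero-padded shift: out[t] = v[t - k] (a toy stand-in for the
    convolution's index shift)."""
    out = np.zeros_like(v)
    out[k:] = v[:len(v) - k]
    return out

# Exact gradient of a 1-D correlation y[t] = sum_k w[k] x[t-k] with
# respect to tap k: dL/dw[k] = sum_t dy[t] x[t-k] = tr(shift_k(x) dy^T).
exact = np.array([dy @ shift(x, k) for k in range(K)])

# Unbiased probing estimate: tr(u v^T) = E[(z^T u)(v^T z)] for a
# standard-Gaussian probe z.  Only the small sketches Z @ shift_k(x)
# and Z @ dy ever need to be formed.
Z = rng.standard_normal((N, T))
est = np.array([(Z @ shift(x, k)) @ (Z @ dy) for k in range(K)]) / N
```

The estimate is unbiased, and its error shrinks as the number of probes N grows, mirroring the behaviour described above.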
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work?

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? See Section 2 for the main theoretical result and its hypotheses

Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 in the appendix.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition, all results presented here are reproducible with individual scripts in the linked open repository.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the benchmark result captions of Figures 27 and 22. Hardware details for the network training are described in the appendix.


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets? The license (MIT) is in the code repository.

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References
 [1] (2012-08) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
 [2] (2011-04) Randomized algorithms for estimating the trace of an implicit symmetric positive semidefinite matrix. J. ACM 58 (2). External Links: ISSN 0004-5411, Link, Document Cited by: §1, §2, §5.
 [3] (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
 [4] (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
 [5] (2006-10) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.
 [6] (2014) cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.
 [7] (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
 [8] (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662-453X Cited by: §1, §5.
 [9] (2000-03) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 0098-3500, Link, Document Cited by: §1, §5.
 [10] (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
 [11] (2012-07) An effective method for parameter estimation with PDE constraints with multiple right hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
 [12] (2017-12) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
 [13] (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.01986. External Links: 1901.01986, Link Cited by: §1, §5.
 [14] (2019) Reversible designs for extreme memory cost reduction of CNN training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.
 [15] (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, Link Cited by: §3.
 [16] (2020, 12 Dec) Understanding Generalization Through Visualizations. In Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
 [17] (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link, https://doi.org/10.1080/03610918908812806 Cited by: §1, §2.
 [18] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
 [19] (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.
 [20] (2018) Flux: Elegant Machine Learning with Julia. Journal of Open Source Software. External Links: Document, Link Cited by: §3.
 [21] (2018) i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations. External Links: Link Cited by: §1, §5.
 [22] (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
 [23] (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015, Conference Track Proceedings. External Links: Link Cited by: §4.
 [24] (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
 [25] (2020-10) Hutch++: Optimal Stochastic Trace Estimation. arXiv e-prints, pp. arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
 [26] (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
 [27] (2019, 09–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
 [28] (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29. External Links: Link Cited by: §1, §5.
 [29] (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations. External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
 [30] (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
 [31] (2008-10) The Matrix Cookbook. Technical University of Denmark. Note: Version 2008-11-10 External Links: Link Cited by: §2.
 [32] (2015-10) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 1615-3375, Link, Document Cited by: §1.
 [33] (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6215–6219. External Links: Document Cited by: §6.
 [34] (2019-07) Streaming Low-Rank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link, https://doi.org/10.1137/18M1201068 Cited by: §1, §2.
 [35] (2014-10) 3D frequency-domain seismic inversion with controlled sloppiness. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1.
 [36] (2019) Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates. In Advances in Neural Information Processing Systems, Vol. 32. External Links: Link Cited by: 3rd item, Figure 10, §4.
 [37] (2018-11) High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press. Cited by: §A.2.2.
 [38] (2019, 17 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), pp. 31–35. External Links: Document, Link Cited by: §1.
Appendix A
A.1 Implementation and code availability
For the anonymous submission, we provide the software as a .zip archive with author information removed and will replace it with the GitHub repository after review. The directory paper contains the scripts to reproduce the figures. However, this software is intended to be a usable package that plugs seamlessly into existing frameworks, rather than only a set of runnable examples. The code is therefore organized to be installed and used as a standard pip and Julia package.
Our probing algorithm is implemented both in Julia, using LinearAlgebra.BLAS on CPU and CUDA.CUBLAS on GPU for the linear algebra computations, and in PyTorch using standard linear algebra utilities. The Julia interface is designed so that preexisting networks can be reused: we overload rrule (see ChainRulesCore.jl) to switch easily between the conventional true gradient (NNlib.jl) and ours. The PyTorch implementation defines a new layer that can be swapped for the conventional convolutional layer, torch.nn.Conv2d or torch.nn.Conv3d, in any network using the convert_net utility function.
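The layer-swap mechanism can be pictured with a short PyTorch sketch. The recursive helper and the factory below are hypothetical illustrations of the pattern; the package's actual utility is convert_net, whose signature we do not reproduce here.

```python
import torch.nn as nn

def swap_conv_layers(module, make_replacement):
    """Recursively replace every nn.Conv2d child of `module` in place.

    `make_replacement` is a hypothetical factory mapping the old layer to
    its drop-in replacement (in the package, a randomized-gradient
    convolution would be built from the old layer's weights and settings).
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, make_replacement(child))
        else:
            swap_conv_layers(child, make_replacement)
    return module

# Demo: replace each Conv2d by an identity; a real replacement would wrap
# a custom convolution layer instead.
net = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                    nn.Sequential(nn.Conv2d(8, 8, 3)))
swap_conv_layers(net, lambda conv: nn.Identity())
remaining = [m for m in net.modules() if isinstance(m, nn.Conv2d)]
```

Because the traversal only touches Conv2d children and leaves everything else intact, the surrounding architecture, parameters, and state dict layout are preserved.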
A.2 Proofs of Proposition 1 and Theorem 1
For a square matrix $A \in \mathbb{R}^{n \times n}$, let $\operatorname{tr}_N(A)$ be the trace estimator
$$\operatorname{tr}_N(A) = \frac{1}{N} \sum_{i=1}^{N} z_i^\top A z_i,$$
where $z_1, \dots, z_N$ are i.i.d. Gaussian vectors. We now prove the proposition and theorem stated in Section 2.
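As a concrete illustration of this estimator (a minimal NumPy sketch under the definition above, not the paper's implementation):

```python
import numpy as np

def hutchinson_trace(A, N=1000, rng=None):
    """Randomized trace estimate tr_N(A) = (1/N) sum_i z_i^T A z_i with
    i.i.d. standard-Gaussian probes z_i; unbiased since E[z^T A z] = tr(A)."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    total = 0.0
    for _ in range(N):
        z = rng.standard_normal(n)
        total += z @ A @ z
    return total / N

A = np.diag(np.arange(1.0, 6.0))   # tr(A) = 1 + 2 + 3 + 4 + 5 = 15
approx = hutchinson_trace(A, N=20000, rng=np.random.default_rng(0))
```

Note that A is only touched through matrix-vector products, so the same estimator applies to implicitly defined matrices.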
A.2.1 Proof of Proposition 1
We restate Proposition 1 here.
Proposition 1.
Let $A$ be a square matrix. Then, for any small number $\delta > 0$, with probability $1 - \delta$,
The proof uses the following result on trace estimation of symmetric matrices.
Lemma 1 (Theorem 5 of [7]).
Let $B$ be symmetric. Then
for all $\varepsilon > 0$.
A.2.2 Preparation lemmas for Theorem 2
Lemma 2.
Let $A$ be a square matrix and let $z_1, \dots, z_N$ be random Gaussian vectors, all independent of each other. Then for any $\delta > 0$, with probability $1 - \delta$,
where $c$ is some absolute constant independent of $N$.
Proof of Lemma 2.
Set . For each summand, we have
(9)
where the first equality used the singular value decomposition; in the second equality, we defined and , which are still Gaussian. In the third equality, we used to denote the diagonal entry of and and to denote the entry of and , respectively. In the last equality, we defined . Since and are i.i.d., so are . And since they are products of independent sub-Gaussian random variables, they obey the sub-exponential distribution, i.e.,
where denotes the sub-exponential norm and the sub-Gaussian norm. We also used the property that there is a constant such that any Gaussian variable has a sub-Gaussian norm bounded by that constant times its standard deviation; this property applies to and , which are both Gaussian variables due to the rotation invariance of Gaussian vectors.
Applying the Bernstein inequality [37], we obtain
where is some absolute constant. Letting denote the right-hand-side probability, the above implies
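For reference, the form of Bernstein's inequality for sums of independent, mean-zero sub-exponential random variables invoked above (cf. [37]) can be stated as follows, with $c$ an absolute constant:

```latex
% Bernstein's inequality for sub-exponential sums (cf. [37]);
% \|\cdot\|_{\psi_1} denotes the sub-exponential norm, c an absolute constant.
\Pr\!\left( \left| \sum_{i=1}^{N} X_i \right| \ge t \right)
  \;\le\; 2 \exp\!\left[ - c \, \min\!\left(
      \frac{t^{2}}{\sum_{i=1}^{N} \| X_i \|_{\psi_1}^{2}},\;
      \frac{t}{\max_{i} \| X_i \|_{\psi_1}} \right) \right],
  \qquad t \ge 0.
```

Setting the right-hand side equal to the target failure probability and solving for $t$ yields the high-probability bound of the lemma.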