Low-memory stochastic backpropagation with multi-channel randomized trace estimation

06/13/2021, by Mathias Louboutin, et al.

Thanks to the combination of state-of-the-art accelerators and highly optimized open software frameworks, there has been tremendous progress in the performance of deep neural networks. While these developments have been responsible for many breakthroughs, progress towards solving large-scale problems, such as video encoding and semantic segmentation in 3D, is hampered because access to on-premise memory is often limited. Instead of relying on (optimal) checkpointing or invertibility of the network layers – to recover the activations during backpropagation – we propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique. Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint. Even though the randomized trace estimation introduces stochasticity during training, we argue that this is of little consequence as long as the induced errors are of the same order as errors in the gradient due to the use of stochastic gradient descent. We discuss the performance of networks trained with stochastic backpropagation and how the error can be controlled while maximizing memory usage and minimizing computational overhead.


1 Introduction

Convolutional layers continue to form a key component of current neural network designs. Even though their computational demands during the forward evaluation are relatively modest, significant computational resources are needed during training, which typically requires storage of the state variables (activations) and dense operations between the input and output. With accelerators (e.g., GPUs, TPUs, Inferentia), the arithmetic demands of training are met as long as memory usage is controlled.

Unfortunately, restricting memory usage without introducing significant computational overhead remains a challenge and can lead to additional complexity that is difficult to manage. Examples include (optimal) checkpointing [9, 3], where the state is periodically stored and recomputed during the backward pass; invertible networks [12, 21, 14], where the state can be derived from the output; and certain approximation methods where computations are carried out in limited-precision arithmetic [10] or where unbiased estimates of the gradient are obtained via certain approximations [38, 27], e.g., via randomized automatic differentiation (RAD, [29]) or via direct feedback alignment (DFA, [28, 13, 8]).

Our work is based on the premise that exact computations are often not needed, an approach advocated in the field of randomized linear algebra [34, 24] and, more recently, in parametric machine learning [29]. There the argument has been made that it is unnecessary to spend computational resources on exact gradients when stochastic optimization is used. A similar argument was made earlier in the context of parameter estimation with partial-differential-equation constraints [11, 1, 35]. However, contrary to intervening in computational graphs as in RAD, our approach exploits the underlying linear algebra structure exhibited by the gradient of convolutional layers.

By means of relatively straightforward algebraic manipulations, we write the gradient with respect to a convolution weight in terms of the matrix trace of the outer product between the convolutional layer input, the backpropagated residual, and a shift. Next, we approximate this trace with an unbiased randomized trace estimation technique [2, 24, 17, 25, 32], for which we prove convergence and derive theoretical error bounds by extending recent theoretical results [7]. To meet the challenges of training the most popular convolutional neural networks (CNNs), we present a randomized probing technique capable of handling multiple input/output channels. We validate our approach on the MNIST and CIFAR10 datasets, for which we achieve overall network memory savings (the savings for individual convolutional layers are much larger) of at least a factor of . Our results are reproducible at: Anonymous.

2 Theory

To arrive at our low-memory-footprint convolutional layer, we start by casting the action of these layers into a framework that exposes the underlying linear algebra. By doing this, gradients with respect to the convolution weights can be identified as traces of a matrix. By virtue of this identification, these traces can be approximated by randomized trace estimation [2], which greatly reduces the memory footprint at negligible or even negative (speedup) computational overhead. We start by deriving expressions for the single-channel case, followed by a demonstration that randomized trace estimation leads to unbiased estimates for the gradient with respect to the weights. Next, we justify the use of randomized trace estimation by proving that its validity can be extended to arbitrary matrices. Aside from proving convergence as the number of probing vectors increases, we also provide error bounds before extending the proposed technique to the multi-channel case. The latter calls for a new type of probing that minimizes cross-talk between the channels via orthogonalization. We derive accuracy bounds for this case as well.

Single-channel case

Let us start by writing the action of a single-channel convolutional layer as follows:

(1)

and are the number of pixels, the batchsize, and the number of convolution weights ( for a by kernel), respectively. For the weight , the convolution itself corresponds to applying a circular shift with offset , denoted by , followed by multiplication with the weight. Given this expression for the action of a single-channel convolutional layer, expressions for the gradient with respect to the weights can easily be derived using the chain rule and standard linear algebra manipulations [31], i.e., we have

(2)

This expression for the gradient with respect to the convolution weights corresponds to computing the trace, i.e., the sum of the diagonal elements, of the outer product between the residual collected in and the layer's input , after applying the shift. The latter corresponds to a right circular shift along the columns.
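To fix notation for what follows (the symbols here are our own placeholders, since the original inline math did not survive extraction), write X for the layer input, ΔY for the backpropagated residual, and S_k for the circular shift associated with the k-th weight; the relation described above then reads, schematically,

\[
\frac{\partial L}{\partial w_k} \;=\; \operatorname{tr}\!\left(S_k\, X\, \Delta Y^\top\right),
\]

which is the trace that the randomized estimator below approximates.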

Computing estimates for the trace through the action of matrices, i.e., without access to the entries of the diagonal, is common practice in the emerging field of randomized linear algebra [34, 24]. Going back to the seminal work by Hutchinson [17, 25], unbiased matrix-free estimates for the trace of a matrix exist that involve probing with random vectors , with the number of probing vectors and with the identity matrix. Under this assumption, unbiased randomized trace estimates can be derived from

(3)
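As a concrete illustration of the Hutchinson estimator behind (3), the following minimal NumPy sketch (sizes and variable names are our own assumptions) estimates the trace of a general square matrix with i.i.d. Gaussian probes satisfying E[z z^T] = I:

import numpy as np

rng = np.random.default_rng(0)
n, M = 256, 64                                  # matrix size and number of probes (assumed values)
A = rng.standard_normal((n, n))                 # a general square matrix, not necessarily PSD
Z = rng.standard_normal((n, M))                 # i.i.d. Gaussian probing vectors, E[z z^T] = I
trace_est = np.einsum('ij,ij->', Z, A @ Z) / M  # (1/M) * sum_i z_i^T A z_i
trace_ref = np.trace(A)                         # exact trace, for comparison

Increasing the number of probes reduces the variance of the estimate at the cost of more memory and computation, which is the trade-off analyzed in the remainder of this section.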

By combining (2) with the above unbiased estimator for the trace, we arrive at the following approximation for the gradient with respect to the convolution weights:

(4)

From this expression the memory savings during the forward pass are obvious since , where with . However, convergence-rate guarantees were only established under the additional assumption that is positive semi-definite (PSD, [22]). While the outer product we aim to probe here is not necessarily PSD, improving upon recent results by [7], we show that the PSD condition can be relaxed to general, possibly non-symmetric, matrices via a symmetrization procedure that does not change the trace. More precisely, we show in the following proposition that the gradient estimator in (4) is unbiased and converges to the true gradient as with a rate of about (for details of the proof, we refer to appendix A.2).

Proposition 1.

Let be a square matrix and let the probing vectors be i.i.d. Gaussian with zero mean and unit variance. Then, for any small number , with probability , we have

Imposing a small probability of failure means that the term in the upper bound is large, which implies that neither term in the upper bound dominates for all values of . Depending on which term is dominant, the range of can be divided into two regimes: a small regime and a large regime. In the small regime, the first term dominates and the error decays linearly in . In the large regime, the second term dominates and the error decays as . The phase transition happens when is about , where is known as the effective rank, which reflects the rate of decay of the singular values of . As increases, so does the effective rank; the larger the effective rank, the earlier the phase transition occurs, after which the decay rate of the error slows down. Before discussing details of the proposed algorithm, we first illustrate the single-channel estimator (4) with a short numerical sketch below and then extend the randomized trace estimator to multi-channel convolutions.
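As announced above, here is a minimal NumPy sketch of the single-channel estimator (4); the circular shift is realized with np.roll, the names X (layer input), dY (backpropagated residual), and Z (probes) are our own placeholders, and the sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
N, B, M, K = 64, 8, 16, 3                   # pixels, batchsize, probes, kernel width (assumed)
X  = rng.standard_normal((N, B))            # single-channel layer input (pixels x batch)
dY = rng.standard_normal((N, B))            # backpropagated residual, same shape
Z  = rng.standard_normal((N, M))            # Gaussian probing vectors

Xz = X.T @ Z                                # only this (B x M) matrix is stored, not X

def grad_exact(k):
    # exact gradient for offset k: tr(S_k X dY^T), with S_k a circular shift
    return np.sum(dY * np.roll(X, k, axis=0))

def grad_probed(k):
    # unbiased estimate: (1/M) sum_i (X^T z_i)^T (dY^T S_k z_i)
    return np.sum(Xz * (dY.T @ np.roll(Z, k, axis=0))) / M

In this toy setting, keeping the B-by-M matrix X^T Z instead of the N-by-B input is where the forward-pass memory reduction comes from.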

Multi-channel case

In general, convolutional layers involve several input and output channels. In that case, the output of the channel can be written as

(5)

for , with the number of input and output channels and the weight between the input and output channel. In this multi-channel case, the gradient consists of the single-channel gradient for each input/output channel pair, i.e., .

While randomized trace estimation can in principle be applied to each input/output channel pair independently, we propose to treat all channels simultaneously to further improve computational performance and memory use. Let the outer product of the input/output channel pair be , i.e., ; computing then means estimating . To save memory, instead of probing each individually, we probe the stacked matrix with length- probing vectors stored in , and estimate each via the following estimator:

(6)

where extracts the block from the input vector. That is to say, we simply stack the input and the residual, yielding matrices of size and whose outer product (i.e., the in (6)) is no longer necessarily square. To estimate the trace of each sub-block in (6), we (i) probe the full outer product from the right with probing vectors of length ; (ii) reshape the resulting matrix into a tensor of size while the probing matrix is reshaped into a tensor of size (i.e., separating each block of ); and (iii) probe each individual block again from the left. This leads to the desired gradient collected in a matrix. We refer to Figure 1, which illustrates this multi-channel randomized trace estimation. After step (i), we only need to keep in memory rather than , which leads to a memory reduction by a factor of .
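The following NumPy sketch shows one plausible arrangement of this stacked, simultaneous probing; the exact stacking order and the side on which the probe is applied cannot be recovered from this extract, so we assume the probe acts on the input stacked over its channels, and all names (X, dY, Z) and sizes are our own placeholders:

import numpy as np

rng = np.random.default_rng(0)
ci, co, N, B, M = 2, 3, 64, 8, 16           # input/output channels, pixels, batch, probes (assumed)
X  = rng.standard_normal((ci, N, B))         # multi-channel layer input
dY = rng.standard_normal((co, N, B))         # multi-channel backpropagated residual
Z  = rng.standard_normal((ci, N, M))         # one Gaussian probe block per input channel

# step (i): probe the stacked input once; only this (B x M) matrix is kept in memory
Xz = sum(X[j].T @ Z[j] for j in range(ci))

def grad_exact(l, j, k):
    # exact gradient block: tr(S_k X_j dY_l^T)
    return np.sum(dY[l] * np.roll(X[j], k, axis=0))

def grad_probed(l, j, k):
    # steps (ii)-(iii): reuse the shared probe block of channel j; unbiased,
    # but crosstalk from the other input channels adds extra variance
    return np.sum(Xz * (dY[l].T @ np.roll(Z[j], k, axis=0))) / M

In this toy arrangement the stored quantity shrinks from (ci, N, B) to (B, M), while the crosstalk discussed next shows up as additional variance in grad_probed.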

Unfortunately, the improved memory use and computational performance of the above multi-channel probing reduce the accuracy of the randomized trace estimation because of crosstalk amongst the channels. Since this crosstalk is random, the induced error can be reduced by increasing the number of probing vectors , but this comes at the expense of more memory use and increased computation. To avoid this unwanted overhead, we introduce a new type of random probing vector that minimizes the crosstalk by again imposing , but now on the multi-channel probing vectors that consist of multiple blocks corresponding to the number of input/output channels.

Explicitly, we draw each , the block of the probing vector, according to

(7)

For different values of , the 's are drawn independently with a predefined probability of generating a nonzero block. Compared to conventional (Gaussian) probing vectors (see Figure 2, top left), these multi-channel probing vectors contain sparse nonzero blocks (see Figure 2, top right), which reduces the crosstalk (compare with the second row of Figure 2). It can be shown that the crosstalk becomes smaller when and .

Given probing vectors drawn from (7), we have to modify the scaling factor of the multi-channel randomized trace estimator (6) to ensure that it remains unbiased,

(8)

where is the number of nonzero columns in block . We prove the following convergence result for this estimator (the proof can be found in appendix A.2).

Theorem 1 (Succinct version).

Let be the number of probing vectors. For any small number , with probability over , we have for any and ,

where is an absolute constant and and are the numbers of input and output channels.

Theorem 1 provides a convergence guarantee for our special multi-channel simultaneous probing procedure. Similar to Proposition 1, Theorem 1 in its original form (supplementary material) also exhibits two-phase behaviour, so the discussion under Proposition 1 applies here. To simplify the presentation, we only state the bound for the large regime in this succinct version. Still, we can see that the error bound for estimating depends not only on the norm of the current block , but also on the other blocks in that row, which is expected since we simultaneously probe the entire row instead of each block individually for memory efficiency. Admittedly, due to technical difficulties, we cannot theoretically show that decreasing the sampling probability decreases the error. Nevertheless, we observe better performance in the numerical experiments.
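A minimal sketch of such block-sparse probes and the rescaled estimator (8), continuing the toy multi-channel setup sketched earlier (rng, X, dY, Z, ci, M as defined there); the activation probability p, the masking scheme, and the per-block rescaling by the number of nonzero columns are our reading of (7)-(8) and may differ in detail from the paper's construction:

p = 0.5                                      # assumed probability of a nonzero block
mask = rng.random((ci, M)) < p               # which probe columns are active in each input-channel block
Zs = Z * mask[:, None, :]                    # zero out inactive blocks to reduce crosstalk
Xz_s = sum(X[j].T @ Zs[j] for j in range(ci))
nnz = np.maximum(mask.sum(axis=1), 1)        # nonzero columns per block, used instead of M as in (8)

def grad_probed_ortho(l, j, k):
    # same estimator as before, but with block-sparse probes and per-block rescaling
    return np.sum(Xz_s * (dY[l].T @ np.roll(Zs[j], k, axis=0))) / nnz[j]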

Figure 1: Multi-channel randomized trace estimation. This figure shows the three steps of the algorithm used to estimate the trace of a sub-block of the outer product.

Figure 2: Probing matrices and corresponding approximations of the identity for , and . We can see that the orthogonalization step of zeroing out blocks (compare the plots in the top row) leads to a near block-diagonal approximation of the identity with much less crosstalk between the different channels (compare the plots in the second row).

3 Stochastic optimization with multi-channel randomized trace estimation

Given the expressions for the approximate gradient calculations of convolutional layers and bounds on their error, we are now in a position to introduce our algorithm and analyze its performance on stylized examples and on the MNIST and CIFAR10 datasets in the Experiments section (Section 4). We will demonstrate that, for fixed memory usage, the errors in the gradient are of the same order as the errors induced by selecting different minibatches. This confirms similar observations made by [29]. We conclude this section by comparing the memory usage and speed of an actual neural network.

Low-memory stochastic backpropagation

The key point of the randomized trace estimator in Equation (8) is that it allows for on-the-fly compression of the state variables during the forward pass. For a single convolutional layer with input and convolution weights , our approximation involves three simple steps, namely (1) probing of the state variable , (2) matrix-free formation of the outer product , and (3) approximation of the gradient via . These three steps lead to major memory reductions even for a relatively small image size of and . In that case, our approach leads to a memory reduction by a factor of for . For this leads to a memory saving of . Because the probing vectors are generated on the fly, we only need to allocate memory for during the forward pass, as long as we also store the state of the random generator. During backpropagation, we initialize the state, generate the probing vectors, and then apply the shift and the product with . These steps are summarized in Algorithm 1. This simple yet powerful algorithm provides a virtually memory-free estimate of the true gradient with respect to the weights.

Forward pass:
1. Forward convolution 
2. Draw a new random seed  and probing matrix 
3. Compute and save  
4. Store 
Backward pass:
1. Load random seed  and probed forward 
2. Redraw probing matrix  from 
3. Compute backward probe 
4. Compute gradient 

Algorithm 1 Low-memory approximate gradient convolutional layer. The random seed and random probing matrix are independently redrawn for each layer and training iteration.
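For concreteness, the sketch below shows how Algorithm 1 could be wired into PyTorch's autograd for a single-channel, circular 1-D convolution; the use of torch.roll for the shifts, the default of 16 probes, and all names are our own assumptions, and a practical implementation (2-D, multi-channel) is more involved:

import torch

class ProbedCircConv1d(torch.autograd.Function):
    # Sketch of Algorithm 1 for a single-channel circular 1-D convolution.

    @staticmethod
    def forward(ctx, x, w, n_probe=16):
        # x: (N, B) pixels-by-batch, w: (K,) convolution weights
        N, _ = x.shape
        y = sum(w[k] * torch.roll(x, k, dims=0) for k in range(w.numel()))  # 1. forward convolution
        seed = int(torch.randint(0, 2**31 - 1, (1,)))                       # 2. draw a random seed
        g = torch.Generator().manual_seed(seed)
        Z = torch.randn(N, n_probe, generator=g)                            #    ... and probing matrix
        ctx.save_for_backward(w, x.t() @ Z)                                 # 3. save the probed state, not x
        ctx.seed, ctx.n_probe, ctx.N = seed, n_probe, N                     # 4. store the seed
        return y

    @staticmethod
    def backward(ctx, dy):
        w, xz = ctx.saved_tensors
        g = torch.Generator().manual_seed(ctx.seed)                         # 1-2. reload seed, redraw probes
        Z = torch.randn(ctx.N, ctx.n_probe, generator=g)
        grad_w = torch.stack([                                              # 3-4. shift, multiply, estimate
            (xz * (dy.t() @ torch.roll(Z, k, dims=0))).sum() / ctx.n_probe
            for k in range(w.numel())])
        grad_x = sum(w[k] * torch.roll(dy, -k, dims=0) for k in range(w.numel()))
        return grad_x, grad_w, None

Calling y = ProbedCircConv1d.apply(x, w) then behaves like a regular layer under loss.backward(), with only the probed state and the seed retained between the passes.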
Minibatch versus randomized trace estimation errors

Simply stated, stochastic optimization involves gradients that contain random errors known as gradient noise. As long as this noise is not too large and is independent across gradient calculations, algorithms such as stochastic gradient descent, where gradients are computed for randomly drawn minibatches, converge under certain conditions. In addition, the presence of gradient noise helps the algorithm avoid bad local minima, which arguably leads to better generalization of the trained network [26, 16]. Therefore, as long as the batchsize is not too large, one can expect the trained network to perform well.

We argue that the same applies to stochastic optimization with gradients approximated by (multi-channel) randomized trace estimation, as long as the errors behave similarly. In a setting where memory comes at a premium, this means that we can expect training to be successful for gradient noise with similar variability. To this end, we conduct an experiment in which the variability of the gradients with respect to the () convolution weights is calculated for the true gradient for different randomly drawn minibatches of size . We do this for a randomly initialized image classification network designed for the CIFAR10 dataset (for network details, see Table 4 in appendix A.3).

For comparison, approximate gradients are also calculated for randomized trace estimates obtained by probing independently ("Indep." in blue), multi-channel ("Multi" in orange), and multi-channel with orthogonalization ("Multi-Ortho" in green). The batchsizes are , for a fixed probing size of , selected such that the total memory use is the same as for the true gradient calculations. From the plots in Figure 3, we observe that, as expected, the independent probing is closest to the true gradient, followed by the more memory-efficient multi-channel probing with and without orthogonalization. While all approximate gradients are within the 99% confidence interval, the orthogonalization has a big effect when the gradients are small (see conv3).

Figure 3: Randomized trace estimation of the gradient of our randomly initialized CNN for the CIFAR10 dataset. While gradient noise is present, its magnitude is reduced by the orthogonalization when the weights are small.

Figure 4: Standard deviation of the gradients w.r.t. the weights for each of the four convolutional layers in the neural network. The standard deviation is computed over 40 minibatches randomly drawn from the CIFAR10 dataset.

To better understand the interplay between different batchsizes and numbers of probing vectors , we also computed estimates of the standard deviation from randomly drawn minibatches. As expected, the standard deviations of the gradients of the network weights increase for smaller batchsizes and numbers of probing vectors. Moreover, the variability of the approximations obtained with randomized trace estimation is larger for the deeper convolutional layers for . However, since we can afford larger batchsizes for similar memory usage, we can control the variability for a given memory budget by using a larger batchsize.
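A short sketch of how such a standard-deviation estimate can be gathered in PyTorch; the model, loss function, and data loader are placeholders, and the 40 minibatches match the caption of Figure 4:

import torch

def grad_std_over_minibatches(model, loss_fn, loader, n_batches=40):
    # per-parameter standard deviation of the gradient across randomly drawn minibatches
    grads = {name: [] for name, _ in model.named_parameters()}
    data = iter(loader)
    for _ in range(n_batches):
        x, y = next(data)
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            grads[name].append(p.grad.detach().clone())
    return {name: torch.stack(g).std(dim=0) for name, g in grads.items()}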

Overall effective memory savings

Approximate gradient calculations with multi-channel randomized trace estimation can lead to significant memory savings within convolutional layers. Because these layers operate in conjunction with other network layers, such as ReLUs and batch norms, the overall effective memory savings depend on the ratio of convolutional to other layers and on the interaction between them. This is especially important for layers such as ReLU, which rely on the next layer to store the state variable during backpropagation. That approach no longer works here, because our low-memory convolutional layer does not store the state variable. However, this situation is easily remedied by keeping track of only the signs [29].
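One way to realize this in PyTorch is sketched below: a ReLU variant that saves only a boolean sign mask (one byte per element rather than the full floating-point activation) for the backward pass; the class name and interface are our own illustration:

import torch

class SignOnlyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        mask = x > 0
        ctx.save_for_backward(mask)        # keep only the sign pattern, not the activation
        return x.clamp_min(0.0)

    @staticmethod
    def backward(ctx, dy):
        (mask,) = ctx.saved_tensors
        return dy * mask                   # pass the residual through where the input was positive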

To assess the effective memory savings of multi-channel trace estimation, we include in Figure 9 layer-by-layer comparisons of memory usage for different versions of the popular SqueezeNet [18] and ResNet [15]. The memory use of the conventional implementation is plotted in blue and that of our implementation in orange. The results indicate that memory savings by a factor of two or more are certainly achievable, which allows for a doubling of the batchsize or an increase in the width/depth of the network. As expected, the savings depend on the ratio of convolutional to other layers.

Figure 9: Network memory usage for a single gradient. We show the memory usage for known networks for low and high probing sizes for a fixed input size.
Wall-clock benchmarks

Ideally, reducing the memory footprint during training should not come at the expense of computational overhead that slows things down. To ensure this is indeed the case, we implemented the multi-channel randomized trace estimation optimized for CPUs in Julia [4] and for GPUs in PyTorch [30]. Implementation details and benchmarks are included in appendix A.5.

Our extensive benchmarking experiments demonstrate highly competitive performance both on CPUs, against the state-of-the-art NNlib [19, 20], and on GPUs, against the highly optimized implementation of convolutional layers in CUDA. On CPUs, for large images and large batchsizes, we even outperform the standard im2col [5] implementation by up to as long as the number of probing vectors remains relatively small. We observe similar behavior on GPUs, where we remain competitive and at times even outperform highly optimized cuDNN kernels [6], with room for further improvement. In all cases, there is a slight decrease in performance when the number of channels increases. Overall, approximate gradient calculations with multi-channel randomized trace estimation substitute the expensive convolutions between the input and output channels with a relatively simple combination of matrix-free actions of the outer product on random probing vectors on the right and dense linear matrix operations on the left (cf. (8) and Algorithm 1).

4 Experiments

Table 1: Training accuracy for varying batchsizes and number of probing vectors on the MNIST dataset.

Even though the memory and computational gains of our proposed method during backpropagation can be significant, the accuracy of the trained networks needs to be verified. To this end, we conduct a number of experiments on the MNIST and CIFAR10 datasets. In these experiments, we vary the batchsize and the number of probing vectors . Implementations in both Julia and Python are evaluated.

MNIST dataset

We start by training two "MNIST networks" (detailed in Tables 2 and 3 of appendix A.3 for Julia and PyTorch, with training parameters listed in appendix A.4) for varying batchsizes and numbers of probing vectors . The network test accuracies for the Julia implementation, where the default convolutional layer implementation is replaced by XConv.jl, are listed in Table 1 for the default implementation and for our implementation, in which the gradients of the convolutional layers are replaced by our approximations. The results show that our low-memory implementation remains competitive (compare the numbers in bold) even for a small number of probing vectors, yielding a memory saving of about .

We obtained the results listed in Table 1 with the ADAM [23] optimization algorithm. In an effort to add robustness when training overparameterized deep neural networks, we switch in the next example to stochastic line searches (SLS, [36]), which remove the need to set hyperparameters manually. With this algorithm, the line-search parameters are set automatically at the cost of an extra gradient calculation. Figure 10 shows the test accuracies as a function of the number of epochs, batchsize , and number of probing vectors . Because the randomized trace estimation is unbiased, we observe convergence as increases. Despite relatively large approximation errors for small , we also notice that the randomness induced by our approximate gradient calculations does not adversely affect the line searches. As in the previous example, we achieve competitive results with slight random fluctuations for for all batchsizes, resulting in a reduction in memory use by a factor of about .

Figure 10: Training for varying batchsizes and probing sizes. This experiment was run with the Stochastic Line Search algorithm (SLS, [36]).
CIFAR10 dataset

To conclude our empirical validation of approximate gradient calculations with multi-channel randomized trace estimation, we train a network on the CIFAR10 dataset. Compared to the previous examples, this is a larger, more challenging, and more realistic training problem. To mimic an actual training scenario, memory usage is fixed between the regular gradient and the approximate gradients obtained by probing independently ("Indep." in blue with ), multi-channel ("Multi." in green with ), and multi-channel with orthogonalization ("Multi-Ortho" in red with ). The batchsize for the approximate gradient examples is increased from to to reflect the smaller memory footprint. Results for the training/testing loss and accuracy are included in Figure 11. The following observations can be made from these plots. First, there is a clear gap between the training/testing losses for the true and approximate gradients. This gap is also present in the training/testing accuracy, albeit a relatively small one. However, because of the doubled batchsize, the runtime of the training is effectively halved.

Figure 11: CIFAR10 training with equivalent memory comparing our approximate method to standard training. The four panels show the training and test loss (top row) and training and test accuracy (bottom row) after 100 epochs.

5 Related work

The continued demand to train larger and larger networks, for tasks such as video compression and classification in 3D, puts pressure on the memory of accelerators (GPUs, etc.), which is in short supply. This memory pressure is exacerbated when training relies on backpropagation, which in its standard form calls for storage of the state variables during the forward pass. To relieve this memory pressure, several attempts have been made, ranging from the use of optimal checkpointing [9, 3] to the use of invertible neural networks [12, 21, 14]. While these approaches can reduce the memory footprint during training, they introduce significant computational overhead and algorithmic complexity, and invertible neural network implementations may lack expressiveness.

Alternatively, people have relied on approximate arithmetic [10], on replacing symmetric backpropagation with direct feedback alignment [28, 13, 8], where random projections of the residual are used, or on approximations of gradients and Jacobians with techniques borrowed from randomized linear algebra [2, 25, 24]. Compared to the other approaches, the latter is capable of producing unbiased approximations, a highly desirable feature when training neural networks with stochastic optimization [26]. Instead of randomizing the forward pass in stochastic computational graphs, we propose to approximate gradients by exploiting the special structure of the gradients of convolutional layers. This structure allows us to use the relatively simple method of randomized trace estimation to approximate the gradient while reducing the memory footprint significantly. While perhaps less versatile than the recently proposed method of randomized automatic differentiation [29], our approach does not require intervention in the computational graph and acts as a drop-in replacement for the 2D and 3D convolutional layers in existing machine learning frameworks.

6 Conclusion and Future work

We introduced a novel take on convolutional layers grounded in recent work in randomized linear algebra that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient, since the state variable only needs to be stored in a heavily compressed form, the proposed approach, where gradients with respect to the convolution weights are approximated by traces, also has computational advantages, outperforming state-of-the-art neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvement, networks trained with approximate gradients calculated with randomized probing achieve performance very close to that of the most advanced training methods, with an error that decreases with the number of randomized probing vectors. The latter opens enticing perspectives given recent developments in specialized photonic hardware that drastically increase the speed of randomized probing [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? See section 2 for the main theoretical result and hypothesis

    2. Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 in the appendix

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition all results presented here are reproducible with individual script in the linked open repository.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the benchmark result captions in appendix A.5. Hardware details for the network training are described in the appendix.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets? The License (MIT) is in the code repository

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References

  • [1] A. Y. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen (2012-08) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
  • [2] H. Avron and S. Toledo (2011-04) Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM 58 (2). External Links: ISSN 0004-5411, Link, Document Cited by: §1, §2, §5.
  • [3] O. Beaumont, L. Eyraud-Dubois, J. Herrmann, A. Joly, and A. Shilova (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
  • [4] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
  • [5] K. Chellapilla, S. Puri, and P. Simard (2006-10) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.
  • [6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer (2014) cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.
  • [7] A. Cortinovis and D. Kressner (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009 abs/2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
  • [8] C. Frenkel, M. Lefebvre, and D. Bol (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662-453X Cited by: §1, §5.
  • [9] A. Griewank and A. Walther (2000-03) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 0098-3500, Link, Document Cited by: §1, §5.
  • [10] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
  • [11] E. Haber, M. Chung, and F. J. Herrmann (2012-07) An effective method for parameter estimation with pde constraints with multiple right hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
  • [12] E. Haber and L. Ruthotto (2017-12) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
  • [13] D. Han and H. Yoo (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.01986. External Links: 1901.01986, Link Cited by: §1, §5.
  • [14] T. Hascoet, Q. Febvre, Y. Ariki, and T. Takiguchi (2019) Reversible designs for extreme memory cost reduction of cnn training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, Link Cited by: §3.
  • [16] W. R. Huang, Z. Emam, M. Goldblum, L. Fowl, J. K. Terry, F. Huang, and T. Goldstein (2020-12 Dec) Understanding Generalization Through Visualizations. In Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
  • [17] M.F. Hutchinson (1989) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link, https://doi.org/10.1080/03610918908812806 Cited by: §1, §2.
  • [18] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
  • [19] M. Innes, E. Saba, K. Fischer, D. Gandhi, M. C. Rudilosso, N. M. Joy, T. Karmali, A. Pal, and V. Shah (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.
  • [20] M. Innes (2018) Flux: Elegant Machine Learning with Julia. Journal of Open Source Software. External Links: Document, Link Cited by: §3.
  • [21] J. Jacobsen, A. W.M. Smeulders, and E. Oyallon (2018) i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
  • [22] B. Kaperick (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
  • [23] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.
  • [24] P. Martinsson and J. A. Tropp (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
  • [25] R. A. Meyer, C. Musco, C. Musco, and D. P. Woodruff (2020-10) Hutch++: Optimal Stochastic Trace Estimation. arXiv e-prints, pp. arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
  • [26] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
  • [27] A. Nøkland and L. H. Eidnes (2019-09–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
  • [28] A. Nøkland (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29, pp. . External Links: Link Cited by: §1, §5.
  • [29] D. Oktay, N. McGreivy, J. Aduol, A. Beatson, and R. P. Adams (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
  • [31] K. B. Petersen and M. S. Pedersen (2008-10) The Matrix Cookbook. Technical University of Denmark. Note: Version 20081110 External Links: Review Matrix Cookbook, Link Cited by: §2.
  • [32] F. Roosta-Khorasani and U. Ascher (2015-10) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 1615-3375, Link, Document Cited by: §1.
  • [33] A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Drémeau, S. Gigan, and F. Krzakala (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6215–6219. External Links: Document Cited by: §6.
  • [34] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2019-07) Streaming Low-Rank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link, https://doi.org/10.1137/18M1201068 Cited by: §1, §2.
  • [35] T. van Leeuwen and F. J. Herrmann (2014-10) 3D frequency-domain seismic inversion with controlled sloppiness. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1.
  • [36] S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien (2019) Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates. In Advances in Neural Information Processing Systems, Vol. 32. External Links: Link Cited by: 3rd item, Figure 10, §4.
  • [37] R. Vershynin (2018-11) High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press. Cited by: §A.2.2.
  • [38] Z. Wang, S. H. Nelaturu, and S. Amarasinghe (2019-17 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Vol. , pp. 31–35. External Links: Document, Link Cited by: §1.

  • [38] Z. Wang, S. H. Nelaturu, and S. Amarasinghe (2019-17 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Vol. , pp. 31–35. External Links: Document, Link Cited by: §1.

5 Related work

The continued demand to train larger and larger networks, for tasks such as video compression and classification in 3D, puts pressure on the memory of accelerators (GPUs, etc.), which is in short supply. This memory pressure is exacerbated when training relies on backpropagation, which in its standard form calls for storage of the state variables during the forward pass. To relieve this memory pressure, several approaches have been proposed, ranging from (optimal) checkpointing [9, 3] to invertible neural networks [12, 21, 14]. While these approaches can reduce the memory footprint during training, they introduce significant computational overhead and algorithmic complexity, and invertible network implementations may lack expressiveness.

Alternatively, prior work has relied on approximate arithmetic [10], on replacing symmetric backpropagation with direct feedback alignment [28, 13, 8], where random projections of the residual are used, or on approximations of gradients and Jacobians with techniques borrowed from randomized linear algebra [2, 25, 24]. Compared to the other approaches, the latter can produce unbiased approximations, a highly desirable property when training neural networks with stochastic optimization [26]. Instead of randomizing the forward pass in stochastic computational graphs, we propose to approximate gradients by exploiting the special structure of the gradients of convolutional layers. This structure allows us to use the relatively simple method of randomized trace estimation to approximate the gradient while significantly reducing the memory footprint. While perhaps less versatile than the recently proposed randomized automatic differentiation [29], our approach requires no intervention in the computational graph and acts as a drop-in replacement for the 2D and 3D convolutional layers in existing machine learning frameworks.
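As a concrete illustration of the estimator this approach builds on, the following minimal sketch (plain NumPy, with an explicitly formed matrix and an arbitrary probe count, unlike the paper's implicit multi-channel setting) computes a Hutchinson-type trace estimate [17] with Gaussian probing vectors.

    import numpy as np

    rng = np.random.default_rng(0)
    n, M = 256, 64                      # matrix size and number of probing vectors
    A = rng.standard_normal((n, n))     # stand-in for an implicitly defined square matrix

    # Hutchinson-type estimate: average of z^T A z over i.i.d. Gaussian probes z.
    Z = rng.standard_normal((n, M))     # one probing vector per column
    trace_estimate = np.einsum('ij,ij->', Z, A @ Z) / M

    print(trace_estimate, np.trace(A))  # randomized estimate vs. exact trace

The estimate is unbiased, and its variability shrinks as the number of probing vectors grows, which is the knob used here to trade accuracy against memory and compute.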

6 Conclusion and Future work

We introduced a novel take on convolutional layers, grounded in recent work in randomized linear algebra, that allows for unbiased estimation of the trace via randomized probing. Aside from being memory efficient (the state variables only need to be stored in a heavily compressed form), the proposed approach, in which gradients with respect to the convolution weights are approximated by traces, also has computational advantages, outperforming state-of-the-art neural network implementations. In addition, randomized trace estimation comes with convergence guarantees and error estimates, which have the potential to inform the proposed algorithm. While there is still room for improvement, networks trained with approximate gradients computed via randomized probing perform very close to networks trained with the most advanced training methods, with an error that decreases as the number of randomized probing vectors grows. The latter opens enticing perspectives given recent developments in specialized photonic hardware that drastically increases the speed of randomized probing [33]. This will allow future implementations of our approach to scale to large problems in video representation learning and other 3D applications.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? See Section 2 for the main theoretical result and its hypotheses.

    2. Did you include complete proofs of all theoretical results? The complete proof of the proposition is detailed in the appendix.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? The link to the public repository with all the scripts is provided in Section A.1 of the appendix.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? The networks and hyperparameters are described in the appendix. In addition, all results presented here are reproducible with individual scripts in the linked open repository.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? Error analysis is provided in Section 3.

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? Resources used for the main performance results are provided in the caption of the benchmark figure. Hardware details for network training are described in the appendix.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets? The license (MIT) is included in the code repository.

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

References

  • [1] A. Y. Aravkin, M. P. Friedlander, F. J. Herrmann, and T. van Leeuwen (2012-08) Robust inversion, dimensionality reduction, and randomized sampling. Mathematical Programming 134 (1), pp. 101–125. External Links: Document, Link Cited by: §1.
  • [2] H. Avron and S. Toledo (2011-04) Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. J. ACM 58 (2). External Links: ISSN 0004-5411, Link, Document Cited by: §1, §2, §5.
  • [3] O. Beaumont, L. Eyraud-Dubois, J. Herrmann, A. Joly, and A. Shilova (2019) Optimal checkpointing for heterogeneous chains: how to train deep neural networks with limited memory. CoRR abs/1911.13214. External Links: Link, 1911.13214 Cited by: §1, §5.
  • [4] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah (2017) Julia: A Fresh Approach to Numerical Computing. SIAM Review 59 (1), pp. 65–98. External Links: Link, Document Cited by: §3.
  • [5] K. Chellapilla, S. Puri, and P. Simard (2006-10) High Performance Convolutional Neural Networks for Document Processing. In Tenth International Workshop on Frontiers in Handwriting Recognition, La Baule (France). Note: http://www.suvisoft.com External Links: Link Cited by: §3.
  • [6] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer (2014) cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759. External Links: 1410.0759, Link Cited by: §3.
  • [7] A. Cortinovis and D. Kressner (2021) On randomized trace estimates for indefinite matrices with an application to determinants. arXiv preprint arXiv:2005.10009. External Links: 2005.10009, Link Cited by: §1, §2, Lemma 1.
  • [8] C. Frenkel, M. Lefebvre, and D. Bol (2021) Learning Without Feedback: Fixed Random Learning Signals Allow for Feedforward Training of Deep Neural Networks. Frontiers in Neuroscience 15, pp. 20. External Links: Link, Document, ISSN 1662-453X Cited by: §1, §5.
  • [9] A. Griewank and A. Walther (2000-03) Algorithm 799: Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Trans. Math. Softw. 26 (1), pp. 19–45. External Links: ISSN 0098-3500, Link, Document Cited by: §1, §5.
  • [10] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. CoRR abs/1502.02551. External Links: Link, 1502.02551 Cited by: §1, §5.
  • [11] E. Haber, M. Chung, and F. J. Herrmann (2012-07) An effective method for parameter estimation with pde constraints with multiple right hand sides. SIAM Journal on Optimization 22 (3). External Links: Link Cited by: §1.
  • [12] E. Haber and L. Ruthotto (2017-12) Stable Architectures for Deep Neural Networks. Inverse Problems 34 (1), pp. 014004. External Links: Document, Link Cited by: §1, §5.
  • [13] D. Han and H. Yoo (2019) Efficient Convolutional Neural Network Training with Direct Feedback Alignment. arXiv preprint arXiv:1901.01986. External Links: 1901.01986, Link Cited by: §1, §5.
  • [14] T. Hascoet, Q. Febvre, Y. Ariki, and T. Takiguchi (2019) Reversible designs for extreme memory cost reduction of cnn training. arXiv preprint arXiv:1910.11127. External Links: Link Cited by: §1, §5.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document, Link Cited by: §3.
  • [16] W. R. Huang, Z. Emam, M. Goldblum, L. Fowl, J. K. Terry, F. Huang, and T. Goldstein (2020-12 Dec) Understanding Generalization Through Visualizations. In Proceedings on "I Can’t Believe It’s Not Better!" at NeurIPS Workshops, Proceedings of Machine Learning Research, Vol. 137, pp. 87–97. External Links: Link Cited by: §3.
  • [17] M.F. Hutchinson (1989) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18 (3), pp. 1059–1076. External Links: Document, Link, https://doi.org/10.1080/03610918908812806 Cited by: §1, §2.
  • [18] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §3.
  • [19] M. Innes, E. Saba, K. Fischer, D. Gandhi, M. C. Rudilosso, N. M. Joy, T. Karmali, A. Pal, and V. Shah (2018) Fashionable Modelling with Flux. CoRR abs/1811.01457. External Links: Link, 1811.01457 Cited by: §3.
  • [20] M. Innes (2018) Flux: Elegant Machine Learning with Julia. Journal of Open Source Software. External Links: Document, Link Cited by: §3.
  • [21] J. Jacobsen, A. W.M. Smeulders, and E. Oyallon (2018) i-RevNet: Deep Invertible Networks. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5.
  • [22] B. Kaperick (2019) Diagonal estimation with probing methods. Ph.D. Thesis, Virginia Polytechnic Institute and State University. External Links: Link Cited by: §2.
  • [23] D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §4.
  • [24] P. Martinsson and J. A. Tropp (2020) Randomized Numerical Linear Algebra: Foundations & Algorithms. Acta Numerica 29, pp. 403–572. External Links: Document, Link Cited by: §1, §1, §2, §5.
  • [25] R. A. Meyer, C. Musco, C. Musco, and D. P. Woodruff (2020-10) Hutch++: Optimal Stochastic Trace Estimation. arXiv e-prints, pp. arXiv:2010.09649. External Links: 2010.09649, Link Cited by: §1, §2, §5.
  • [26] A. Neelakantan, L. Vilnis, Q. V. Le, I. Sutskever, L. Kaiser, K. Kurach, and J. Martens (2015) Adding gradient noise improves learning for very deep networks. External Links: 1511.06807 Cited by: §3, §5.
  • [27] A. Nøkland and L. H. Eidnes (2019-09–15 June) Training Neural Networks with Local Error Signals. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 4839–4850. External Links: Link Cited by: §1.
  • [28] A. Nøkland (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. In Advances in Neural Information Processing Systems, Vol. 29. External Links: Link Cited by: §1, §5.
  • [29] D. Oktay, N. McGreivy, J. Aduol, A. Beatson, and R. P. Adams (2021) Randomized Automatic Differentiation. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §1, §1, §3, §3, §5.
  • [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. External Links: Link Cited by: §3.
  • [31] K. B. Petersen and M. S. Pedersen (2008-10) The Matrix Cookbook. Technical University of Denmark. Note: Version 20081110 External Links: Review Matrix Cookbook, Link Cited by: §2.
  • [32] F. Roosta-Khorasani and U. Ascher (2015-10) Improved bounds on sample size for implicit matrix trace estimators. Found. Comput. Math. 15 (5), pp. 1187–1212. External Links: ISSN 1615-3375, Link, Document Cited by: §1.
  • [33] A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Drémeau, S. Gigan, and F. Krzakala (2016) Random projections through multiple optical scattering: approximating kernels at the speed of light. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 6215–6219. External Links: Document Cited by: §6.
  • [34] J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher (2019-07) Streaming Low-Rank Matrix Approximation with an Application to Scientific Simulation. SIAM Journal on Scientific Computing 41 (4), pp. A2430–A2463. External Links: Document, Link, https://doi.org/10.1137/18M1201068 Cited by: §1, §2.
  • [35] T. van Leeuwen and F. J. Herrmann (2014-10) 3D frequency-domain seismic inversion with controlled sloppiness. SIAM Journal on Scientific Computing 36 (5), pp. S192–S217. Note: (SISC) External Links: Document, Link Cited by: §1.
  • [36] S. Vaswani, A. Mishkin, I. Laradji, M. Schmidt, G. Gidel, and S. Lacoste-Julien (2019) Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates. In Advances in Neural Information Processing Systems, Vol. 32. External Links: Link Cited by: 3rd item, Figure 10, §4.
  • [37] R. Vershynin (2018-11) High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press. Cited by: §A.2.2.
  • [38] Z. Wang, S. H. Nelaturu, and S. Amarasinghe (2019-17 February) Accelerated CNN Training through Gradient Approximation. In 2019 2nd Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), Vol. , pp. 31–35. External Links: Document, Link Cited by: §1.

Appendix A Appendix

a.1 Implementation and code availability

For the anonymous submission, we provide the software as a .zip archive with author information removed and will replace it with the GitHub repository after review. The directory paper contains the scripts that reproduce the figures. This software is intended as a usable package that plugs seamlessly into existing frameworks, rather than merely a set of runnable examples. The code is therefore organized to be installed and used as a standard pip and Julia package.

Our probing algorithm is implemented both in Julia, using LinearAlgebra.BLAS on CPU and CUDA.CUBLAS on GPU for the linear algebra computations, and in PyTorch, using standard linear algebra utilities. The Julia interface is designed so that preexisting networks can be reused: we overload rrule (see ChainRulesCore.jl) to switch easily between the conventional exact gradient (NNlib.jl) and ours. The PyTorch implementation defines a new layer that can be swapped for the conventional convolutional layers, torch.nn.Conv2d or torch.nn.Conv3d, in any network using the convert_net utility function.
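To make the layer swap concrete, the sketch below recursively replaces torch.nn.Conv2d modules in an existing network while copying their weights and hyperparameters. ProbedConv2d and swap_conv_layers are hypothetical names used for illustration; the convert_net utility shipped with the code performs a similar transformation, but its exact interface may differ.

    import torch.nn as nn

    class ProbedConv2d(nn.Conv2d):
        """Hypothetical stand-in for a convolution layer whose weight gradient
        is approximated with multi-channel randomized trace estimation."""
        pass  # the real layer would override backward via a custom autograd Function

    def swap_conv_layers(module: nn.Module) -> nn.Module:
        """Recursively replace nn.Conv2d children with ProbedConv2d instances."""
        for name, child in module.named_children():
            if isinstance(child, nn.Conv2d) and not isinstance(child, ProbedConv2d):
                probed = ProbedConv2d(
                    child.in_channels, child.out_channels, child.kernel_size,
                    stride=child.stride, padding=child.padding,
                    dilation=child.dilation, groups=child.groups,
                    bias=child.bias is not None,
                    padding_mode=child.padding_mode)
                probed.load_state_dict(child.state_dict())  # keep the trained weights
                setattr(module, name, probed)
            else:
                swap_conv_layers(child)
        return module

    # Example usage: net = swap_conv_layers(torchvision.models.resnet18())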

a.2 Proofs of Proposition 1 and Theorem 1

For a square matrix A, let tr_M(A) be the trace estimator defined in Section 2, computed with M i.i.d. Gaussian probing vectors. We now prove the proposition and theorem stated in Section 2.
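For reference, the standard multi-probe Hutchinson estimator [17] with M Gaussian probing vectors z_i reads (the normalization by M is assumed here and should be read against the definition in Section 2):

\[
  \operatorname{tr}_M(A) \;=\; \frac{1}{M}\sum_{i=1}^{M} z_i^{\top} A\, z_i,
  \qquad z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I),
  \qquad \mathbb{E}\!\left[\operatorname{tr}_M(A)\right] = \operatorname{tr}(A).
\]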

a.2.1 Proof of Proposition 1

We restate Proposition 1 here.

Proposition 1.

Let A be a square matrix. Then, for any small number δ > 0, the error bound stated in Section 2 holds with probability 1 − δ.

The proof uses the following result on trace estimation of symmetric matrices.

Lemma 1 (Theorem 5 of [7]).

Let A be symmetric. Then

for all .

Proof of Proposition 1.

For a symmetric matrix, Lemma 1 immediately implies that, for any small number δ > 0, the corresponding bound holds with probability 1 − δ.

Now, for our asymmetric matrix A, consider its symmetric part A_s = (A + A^⊤)/2. The trace and the Gaussian quadratic forms are unchanged under this symmetrization, so the symmetric-case bound carries over and the proposition follows. ∎
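The symmetrization step rests on the following standard identities (A_s denotes the symmetric part; the notation is ours):

\[
  A_s = \tfrac{1}{2}\left(A + A^{\top}\right), \qquad
  z^{\top} A_s\, z = z^{\top} A\, z, \qquad
  \operatorname{tr}(A_s) = \operatorname{tr}(A),
\]

so both the estimator and the quantity being estimated are identical for A and A_s, and Lemma 1, which requires symmetry, can be applied to A_s.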

a.2.2 Preparation lemmas for Theorem 2

Lemma 2.

Let A be a square matrix, let z_i and \bar{z}_i, for i = 1, …, M, be random Gaussian vectors, and let all of them be mutually independent. Then, for any small δ > 0, with probability 1 − δ,

where the constant is absolute, independent of the problem dimensions.

Proof of Lemma 2.

For each summand of the sum bounded in Lemma 2, we have

\[
  z_i^{\top} A\, \bar{z}_i
  \;=\; z_i^{\top} U \Sigma V^{\top} \bar{z}_i
  \;=\; u_i^{\top} \Sigma\, v_i
  \;=\; \sum_{j} \sigma_j\, u_{ij}\, v_{ij}
  \;=:\; w_i, \qquad (9)
\]

where the first equality uses the singular value decomposition A = UΣV^⊤; in the second equality, we define u_i = U^⊤ z_i and v_i = V^⊤ \bar{z}_i, which are still Gaussian; in the third equality, we use σ_j to denote the j-th diagonal entry of Σ and u_{ij} and v_{ij} to denote the j-th entries of u_i and v_i, respectively; and in the last equality, we define w_i. Since the z_i and \bar{z}_i are i.i.d., so are the w_i. And since the products u_{ij} v_{ij} of independent sub-Gaussian random variables are sub-exponential, we have

\[
  \left\| u_{ij}\, v_{ij} \right\|_{\psi_1} \;\le\; \left\| u_{ij} \right\|_{\psi_2} \left\| v_{ij} \right\|_{\psi_2},
\]

where ‖·‖_{ψ1} denotes the sub-exponential norm and ‖·‖_{ψ2} the sub-Gaussian norm. We also use the property that there is a constant C such that a Gaussian variable with standard deviation σ has sub-Gaussian norm at most Cσ; this property is applied to u_{ij} and v_{ij}, which are both standard Gaussian variables due to the rotation invariance of Gaussian vectors.
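For reference, the version of Bernstein's inequality for sums of independent, mean-zero sub-exponential random variables X_k stated in [37] reads (c an absolute constant):

\[
  \mathbb{P}\!\left( \left| \sum_{k=1}^{N} X_k \right| \ge t \right)
  \;\le\; 2 \exp\!\left( - c\, \min\!\left(
  \frac{t^{2}}{\sum_{k=1}^{N} \left\| X_k \right\|_{\psi_1}^{2}},\;
  \frac{t}{\max_{k} \left\| X_k \right\|_{\psi_1}} \right) \right),
  \qquad t \ge 0.
\]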

Applying the Bernstein inequality [37] to this sum, we obtain

where the constant is absolute. Equating the right-hand side to the desired failure probability, the above implies