1 Introduction
Given the recent success and increasing impact of deep learning, there has been significant work on improving the fundamental technical tools underpinning its progress. One such tool is the ability to estimate the local geometry of the loss function for deep models, which often comes in the form of estimates of second-order (Hessian) information. Such information is critical in several settings, such as neural network optimization and pruning.
Directly using Hessian information in the context of deep learning is infeasible: for example, just storing the Hessian matrix for the standard ResNet50 model 2016he would occupy 2.5 Petabytes of memory. These constraints have inspired significant work on efficient numerical approximations of the Hessian for deep neural networks, for example the line of work on the K-FAC approximation 2015kingma ; 2015martens ; 2016ba ; 2019zheng , or efficient blockwise approximations yao2020adahessian ; 2020singh . One of the classic approaches, which we focus on in this paper, is the empirical Fisher approximation to the Hessian 1992hassibi ; 1998amari ; amari2016information , written as:
(1) $\widehat{F} = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell_i \, \nabla \ell_i^\top$
where $N$ is the number of samples, $\nabla \ell_i$ denotes the gradient w.r.t. the $i$-th sample at the given point, and $\nabla \ell_i \nabla \ell_i^\top$ is the outer product of individual gradients, which we view as column vectors.
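As a concrete illustration, the estimator (1) can be assembled directly from per-sample gradients. Below is a minimal NumPy sketch, with random vectors standing in for real per-sample gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 10, 32                        # toy model dimension and sample count
grads = rng.standard_normal((N, d))  # row i: gradient of the i-th sample

# Empirical Fisher: average of the per-sample gradient outer products.
F_hat = sum(np.outer(g, g) for g in grads) / N

# The same quantity in one vectorized step.
assert np.allclose(F_hat, grads.T @ grads / N)
```

The explicit outer-product sum mirrors the notation of (1); in practice one would use the vectorized form on the last line.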
The empirical Fisher approximation is fairly standard, e.g. 1992hassibi ; 1998amari ; amari2016information , and has been recognized to be useful in a variety of practical settings where exact estimation of the Hessian or of the true Fisher information matrix is not feasible. At the same time, there is still active research in the community on the conditions for its applicability 2014martens ; 2019kunstner ; 2020singh ; thomas2020interplay .
A useful property of this approximation is that it also allows us to estimate the inverse of the Hessian, which is essential in many applications. Specifically, the fact that the empirical Fisher can be written as a sum of rank-one matrices allows the use of the Sherman-Morrison-Woodbury inversion formula woodbury1950inverting to exactly compute its inverse, by recursively integrating terms corresponding to each gradient into the inverse. (Please see Equation 2 for an exact derivation.) This approach was independently proposed by 1992hassibi ; 1998amari , for pruning and optimization, respectively, where it was validated on small networks, with hundreds of weights.
The idea was adapted to deep neural networks (DNNs) by 2020singh , through approximation of the inverse in small diagonal blocks. The authors show improved approximation quality for the Hessian inverse relative to a simple diagonal approximation, and that this leads to competitive pruning results in terms of accuracy versus average sparsity level. Yet, this approach is limited by the blockwise approximation: for block size $B$ and dimension $d$, it requires $\Theta(Bdm)$ time to recursively build the blockwise Fisher approximation using $m$ gradients, and $\Theta(Bd)$ time and memory for computing the Inverse-Hessian-Vector Products (IHVPs) necessary for estimating pruning statistics. Clearly, the ideal case $B = d$ is still intractable at scale, and, prior to this work, it was unknown whether linear-time algorithms are possible for computing IHVPs in this context.
Contribution. We resolve this question by introducing two efficient algorithms for computing IHVPs under the empirical Fisher approximation, with computational and storage costs that are linear in the dimension of the model, without the need for blockwise approximations. Concretely, we provide exact matrix-free algorithms to compute products of the form $\widehat{F}^{-1} x$, where $\widehat{F}^{-1}$ is the inverse of the empirical Fisher, and $x$ is an arbitrary vector. We show that these algorithms can be implemented efficiently at scale, and that they can lead to more accurate neural network pruning and optimization.
The Static Algorithm. Our first algorithm assumes a static scenario, which is standard in neural network pruning: we are given a fully-trained model $w^\star$, for which we wish to estimate IHVPs and diagonal elements of the inverse Hessian, in order to determine the "optimal" pruning update using e.g. the Optimal Brain Surgeon (OBS) framework (1990lecun, ; 1992hassibi, ). For this, we first compute $m$ gradients at $w^\star$, which we will use to estimate IHVPs via the empirical Fisher and compute pruning statistics. The main idea is that, since we only wish to compute products between the inverse and an arbitrary vector (i.e. IHVPs), we can rewrite the Woodbury recursion such that we work exclusively with individual vectors and scalars, and never with full matrices. Given model dimension $d$ and $m$ gradients, each defining a rank-one component of the empirical Fisher, the algorithm uses $\Theta(dm^2)$ precomputation time, and has cost $\Theta(dm)$ for exactly computing an IHVP. Further, we can specialize the algorithm to directly query elements of the Hessian inverse, at a cost of $O(m)$ time per element. This provides efficient, linear-in-$d$ implementations for all operations required by the OBS pruning framework.
The Dynamic Algorithm. Next, we adapt this approach to SGD optimization, i.e. to precondition stochastic gradients by our estimate of the inverse Hessian. In this case, we use the classic idea of bootstrapping the approximation by leveraging previous gradients: the preconditioner at time $t$ is built from gradients obtained during a "sliding window" over the last $m$ optimization steps. This requires a dynamic representation, allowing addition and removal of gradients without full recomputation of second-order statistics.
Our main technical contribution is showing that this can be achieved with time and space complexity that remain linear in the dimension $d$. The key idea is that, for any set of gradients $\nabla \ell_1, \dots, \nabla \ell_m$, and any vector $x$, it is possible to represent the corresponding IHVP estimate $\widehat{F}^{-1} x$ as a linear combination of $m + 1$ terms corresponding to the individual gradients and $x$, of the form $\widehat{F}^{-1} x = c_0 x + \sum_{k=1}^{m} c_k \nabla \ell_k$.
Crucially, we ensure that the coefficients $c_k$ can be computed just via the dot products $\nabla \ell_j^\top \nabla \ell_k$, where $j \leq k$ in the ordering, and $\nabla \ell_k^\top x$. Then, to replace a given gradient in this representation, we just have to compute $m$ scalar products with the new gradient vector, and to update some intermediate information. Hence, the entire update operation has computational cost $O(dm + m^3)$ for adding or removing any gradient from the sliding window, and $O(dm + m^2)$ for computing the IHVP.
Implementation and Experiments. We provide an efficient vectorized implementation of the above algorithms, called M-FAC (for Matrix-Free Approximate Curvature), as well as pruning and optimization code, in PyTorch paszke2019pytorch . Our implementation includes several additional optimizations, in particular GPU acceleration with custom CUDA code, and minimizes the cost of memory transfer between the GPU and main memory via a non-trivial gradient paging mechanism. For pruning operations, our implementation provides order-of-magnitude improvements in terms of computation time over the blockwise approximation of 2020singh for classic pruning benchmarks such as ResNet50 and MobileNet on the ImageNet dataset. In turn, this allows us to obtain more accurate pruned models by exploiting higher numbers of gradients and larger block sizes. What is more, our preconditioned SGD can be competitive with state-of-the-art optimizers for deep learning in terms of accuracy, and has end-to-end computational overhead per step of 5%–55% relative to vanilla SGD.
2 Preliminaries and Related Work
General Definitions. We now briefly introduce the setting and notation; we refer the reader to standard texts 2014martens ; ly2017tutorial ; amari2016information for a complete introduction. We start from the standard setting in which we are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, and wish to identify a $d$-dimensional model $w$ to minimize an empirical loss $L$, defined as $L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$, where $\ell_i$ is the loss on the $i$-th sample. For a twice-differentiable loss, the Hessian is the matrix of second-order derivatives of $L$ w.r.t. the weights $w$, i.e. $H_L(w) = \nabla^2 L(w)$. In the probabilistic view, each input example $x$ has some probability $p(y | x)$ of being assigned a given label $y$. Given input examples drawn from a distribution $p(x)$ and corresponding outputs drawn from the conditional distribution $p(y | x)$, the goal is to minimize the distance between the target joint distribution $p(x, y) = p(x) \, p(y | x)$, and a learned joint distribution $p_w(x, y) = p(x) \, p_w(y | x)$, where $w$ is the model.

The Fisher Matrix. Assuming the probabilistic view, it can be shown that the Fisher information matrix of the model's joint distribution satisfies $F = \mathbb{E}_{p_w(x, y)} \left[ \nabla \log p_w(y | x) \, \nabla \log p_w(y | x)^\top \right]$. If the model's output conditional distribution matches the conditional distribution of the data, then the Fisher and Hessian matrices are in fact equivalent ly2017tutorial . Roughly, this can be interpreted as saying that, if the model has high accuracy, one can approximate the Hessian of the loss with the Fisher matrix. Further, it is sometimes useful to consider an approximation to the Fisher, where the model distribution is replaced with the empirical training distribution, leading to the empirical Fisher matrix $\widehat{F} = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell_i \, \nabla \ell_i^\top$.
The Sherman-Morrison-Woodbury Trick. The fact that this approximation is a sum of rank-one matrices has the benefit that it allows efficient exact computation of the Fisher inverse. Specifically, one can apply the classic Sherman-Morrison-Woodbury formula to compute the inverse recursively as
(2) $\widehat{F}_{i+1}^{-1} = \widehat{F}_i^{-1} - \frac{\widehat{F}_i^{-1} \, \nabla \ell_{i+1} \, \nabla \ell_{i+1}^\top \, \widehat{F}_i^{-1}}{N + \nabla \ell_{i+1}^\top \, \widehat{F}_i^{-1} \, \nabla \ell_{i+1}}$
where the recursion is over the $N$ samples, and the base step is $\widehat{F}_0^{-1} = \lambda^{-1} I_d$, with $\lambda$ being a small positive dampening constant. This approach was independently introduced by Hassibi and Stork 1992hassibi for pruning, and Amari 1998amari for natural gradient. Singh and Alistarh 2020singh showed that it can be scaled to DNNs by blockwise approximation, and that it provides better approximations of the loss than K-FAC-based 2019wang , or diagonal 2018theis ; 2017dong approximations, leading to more accurate pruning. The main shortcoming of directly applying (2) is the prohibitive computational cost of $\Theta(d^2 m)$, even when reduced to $\Theta(Bdm)$ by blockwise approximation. Our method proposes a matrix-free approach for exactly calculating IHVPs and querying individual elements of the inverse Hessian, of cost $O(dm)$ and $O(m)$, respectively, after a one-time $\Theta(dm^2)$ precomputation. Our method is numerically equivalent to the direct computation above.
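To make the recursion concrete, the following NumPy sketch (toy dimensions, random vectors standing in for gradients) verifies that recursively applying (2) reproduces the direct inverse of the dampened empirical Fisher:

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, lam = 8, 5, 0.1
grads = rng.standard_normal((N, d))

# Direct dampened empirical Fisher and its inverse, for reference.
F = lam * np.eye(d) + grads.T @ grads / N
F_inv_direct = np.linalg.inv(F)

# Sherman-Morrison recursion (Equation 2): integrate one gradient at a time.
F_inv = np.eye(d) / lam                 # base step: (lambda * I)^{-1}
for g in grads:
    v = F_inv @ g
    F_inv = F_inv - np.outer(v, v) / (N + g @ v)

assert np.allclose(F_inv, F_inv_direct)
```

Note that this direct version still manipulates $d \times d$ matrices; the matrix-free algorithms of Section 3 avoid exactly this cost.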
Diagonal Approximations. A common approximation, both for optimization, e.g. krishnan2017neumann ; kingma2014adam , but also for pruning 2018theis , is to assume that the Fisher matrix is diagonal. In our notation, this method has setup cost $\Theta(dm)$, and diagonal query cost $O(1)$. However, as evident from the experimental results of 2020singh , this provides lower approximation quality relative to both blockwise methods and K-FAC 2019wang .
K-FAC Approximations. This approach observes that the blocks of the true Fisher corresponding to pairs of layers, which can be written as the expectation of a Kronecker product of two matrices, can be approximated as the Kronecker product of the expectations of those two matrices (reversing the order between the product and the expectation). This approximation has been leveraged for both pruning and optimization, and it allows efficient computation of the inverse 2016ba ; Osawa_2019 ; 2019zheng ; 2019wang ; laurent2018an ; however, it is known that this approximation does not always hold 2015martens . Another relative drawback is that the Kronecker factorization only occurs naturally for fully-connected layers; though, there is work on extensions to other layer types grosse2016kroneckerfactored ; martens2018kroneckerfactored , via additional approximations. We compare against K-FAC-based optimizers and pruners, and find that our method yields better results.
Additional Approaches. Our approach is similar in spirit to matrix-free methods nocedal2006numerical ; liu1989limited ; martens_free , but the algorithms we present are new. Hessian-free optimization martens_free also forgoes the explicit computation of Hessians in favor of computing an IHVP with a vector $x$. However, this estimation is performed by iteratively approximating the solution of the linear system $H y = x$ for the given $x$, without ever explicitly forming $H$. One disadvantage of this method in practice is that it requires tuning, and several costly iterations to converge (for a single $x$), as the underlying Hessian can be ill-conditioned. The L-OBS pruning method 2017dong approximates second-order information by defining independent layer-wise objectives, which allows the direct approximation of the layer-wise Hessians via a carefully-crafted block structure. In contrast, our approach allows for fully global Hessian estimation, and yields better pruning results at scale.
Full-matrix adaptive regularization has similar goals, but in the context of adaptive optimizers Duchi2010AdaptiveSM . Agarwal et al. agarwal2019efficient proposed GGT, which allows the efficient computation of the inverse square root of the low-rank matrix resulting from the sum of gradient outer products over a sliding window. At its core, this procedure requires an eigendecomposition (the authors take an SVD) of an $m \times m$ matrix at every time step, which is reasonably efficient for small values of $m$.
Our dynamic algorithm solves a similar, but slightly simpler problem, as we only want to invert the matrix (rather than also computing its square root). However, we do not perform any eigendecompositions; instead, we carefully maintain intermediate information that allows an efficient explicit computation of the scalar coefficients in Equation 6. At the end of Section 3.2, we discuss how this approach is considerably more efficient in practice and allows executing at larger window sizes (with small overhead), which leads to improved model accuracy, as we show in our experiments. Our dynamic algorithm has per-step cost that is linear in $d$, versus the $\Theta(dm^2)$ matrix products performed by GGT (agarwal2019efficient, ); this can also result in lower overhead versus GGT, even when the SVD cost is negligible (e.g. if $m$ is small).
Yao et al. yao2020adahessian recently provided an alternative method for approximating the diagonal of the inverse Hessian, using Hutchinson's randomized algorithm for estimating the diagonal. To mitigate the variance, the authors additionally introduce non-trivial smoothing heuristics. Their approach, called AdaHessian, has a theoretical per-iteration cost of at least twice that of SGD, as it requires a second backward pass over the network; in practice, this cost is usually higher. Our algorithm often matches their accuracy, and (unlike AdaHessian) its cost depends exclusively on the model size, irrespective of the underlying structure; thus, as we show in the experimental section, our average practical overhead is less than 50% for dense models, and less than 10% for sparse ones.

Approximation Quality. Künstner et al. 2019kunstner performed an in-depth analysis of the empirical Fisher, making the point that, in theory, the approximation could become meaningless if all sample gradients are zero. Similarly to 2020singh , we did not find that this occurs in practice for deep neural networks, as sample gradients are never zero. The latter reference also provides detailed comparisons of diagonal, K-FAC, and other approximations, and finds that the empirical Fisher can provide competitive approximation quality for DNNs. Further, it demonstrates that better loss approximation implies better accuracy for neural network pruning. We therefore do not repeat this loss analysis, and mainly compare methods in terms of application performance.
3 Algorithm Description
3.1 The Static Algorithm
As a warm-up, we first describe the IHVP algorithm for a static set of gradients. While our notation is customized for the empirical Fisher, all our techniques are applicable to any matrix that can be written as a sum of rank-one components. Specifically, we are given $m$ vectors, which we assume to be the gradient vectors $\nabla \ell_1, \dots, \nabla \ell_m$, and must compute quantities related to the inverse of the dampened matrix resulting from the average of their outer products, which we denote by $\widehat{F} = \lambda I_d + \frac{1}{m} \sum_{i=1}^{m} \nabla \ell_i \nabla \ell_i^\top$. The main idea is to rewrite the recursive description of $\widehat{F}^{-1}$ such that, after some precomputation, the matrix-vector product $\widehat{F}^{-1} x$ can be computed efficiently for any vector $x$. Using the Sherman-Morrison formula applied to the partial empirical Fisher matrix $\widehat{F}_i$ corresponding to the first $i$ gradients (scaled by $1/m$), we obtain the following recursion:
(3) $\widehat{F}_i^{-1} = \widehat{F}_{i-1}^{-1} - \frac{\widehat{F}_{i-1}^{-1} \, \nabla \ell_i \, \nabla \ell_i^\top \, \widehat{F}_{i-1}^{-1}}{m + \nabla \ell_i^\top \, \widehat{F}_{i-1}^{-1} \, \nabla \ell_i}, \quad \widehat{F}_0^{-1} = \lambda^{-1} I_d$
For simplicity, let us define $q_i = \widehat{F}_{i-1}^{-1} \nabla \ell_i$ and unroll the Sherman-Morrison recursion, which gives:
(4) $\widehat{F}_m^{-1} x = \frac{x}{\lambda} - \sum_{i=1}^{m} \frac{q_i^\top x}{m + \nabla \ell_i^\top q_i} \, q_i$
IHVP Computation. Assuming that we have already computed all the vectors $q_i$, the above expression can be calculated in $O(dm)$ time without any intermediate matrices: first compute the $m$ scalar fractions in (4), and then evaluate the resulting linear combination of $x$ and the $q_i$. Since $q_k = \widehat{F}_{k-1}^{-1} \nabla \ell_k$, it can be computed in exactly the same way, given all $q_i$ for $i < k$. Thus, all $m$ vectors of dimension $d$ can be precomputed in increasing order of $i$ using $O(dm^2)$ total time. We can halve the memory usage by also precomputing all denominators $m + \nabla \ell_i^\top q_i$, as this allows us to overwrite each $\nabla \ell_i$ with $q_i$.
Querying the Inverse. To extract individual elements of the inverse Hessian approximation, we can write $[\widehat{F}_m^{-1}]_{jk} = e_j^\top \widehat{F}_m^{-1} e_k$ in the form of (4), where $e_j$ and $e_k$ are indicator vectors:
(5) $[\widehat{F}_m^{-1}]_{jk} = \frac{\delta_{jk}}{\lambda} - \sum_{i=1}^{m} \frac{(q_i^\top e_k)(e_j^\top q_i)}{m + \nabla \ell_i^\top q_i}$, where $\delta_{jk}$ denotes the Kronecker delta.
As $q_i^\top e_k$ and $e_j^\top q_i$ can both be realized as constant-time indexing operations, the above turns into a sum over $m$ scalars. Hence, our method admits access to any element of $\widehat{F}_m^{-1}$ in $O(m)$ time, using the same precomputed $q_i$ and denominators as for the efficient calculation of IHVPs.
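A compact NumPy sketch of the static algorithm follows: the $O(dm^2)$ precomputation of the $q_i$ and denominators, an $O(dm)$ IHVP, and an $O(m)$ element query, all checked against the explicit inverse (toy sizes; plain loops in place of the vectorized implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, lam = 12, 6, 0.5
grads = rng.standard_normal((m, d))

# Precomputation: q_i = F_{i-1}^{-1} grad_i, built in increasing order of i
# from the earlier q_j only -- O(d m^2) total, never forming a d x d matrix.
Q = np.empty((m, d))
denom = np.empty(m)
for i in range(m):
    q = grads[i] / lam
    for j in range(i):
        q -= (Q[j] @ grads[i]) / denom[j] * Q[j]
    Q[i] = q
    denom[i] = m + grads[i] @ q

def ihvp(x):
    # Equation (4): a linear combination of x and the q_i, in O(dm).
    res = x / lam
    for i in range(m):
        res -= (Q[i] @ x) / denom[i] * Q[i]
    return res

def inv_entry(j, k):
    # Equation (5): any entry of the inverse in O(m), by indexing into Q.
    val = 1.0 / lam if j == k else 0.0
    return val - sum(Q[i, j] * Q[i, k] / denom[i] for i in range(m))

# Both agree with the explicit inverse of the dampened empirical Fisher.
F = lam * np.eye(d) + grads.T @ grads / m
x = rng.standard_normal(d)
assert np.allclose(ihvp(x), np.linalg.solve(F, x))
assert np.isclose(inv_entry(3, 7), np.linalg.inv(F)[3, 7])
```

The real implementation vectorizes the inner loops and batches queries; the sketch only illustrates the data flow.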
Additional Optimizations. The algorithm admits a fast vectorized implementation, and several optimizations, which we describe in the Appendix. For example, we perform several memory-saving optimizations, as well as intelligent explicit page swapping between CPU and GPU memory to mitigate memory transfer costs. Furthermore, the static algorithm can be applied independently to each block of a blockwise approximation: for block size $B$, the total runtime is reduced to $O(dm^2)$ for precomputation and $O(dm)$ per IHVP. Then, as long as $m \leq B$ (i.e. the number of gradients is smaller than the block size), our method is roughly $B/m$ times faster than the direct implementation of the Woodbury inverse.
3.2 The Dynamic Algorithm
We now describe the dynamic algorithm, which assumes that gradients arrive in an online fashion and must be integrated into the Fisher estimate. We first present the algorithm itself, i.e. the setup/update/IHVP computations, and perform a complexity analysis. Next, we show how the algorithm can be derived and finally we conclude with notes on an efficient practical implementation.
High-Level View. The main idea of the dynamic algorithm is to write the IHVP as
(6) $\widehat{F}_m^{-1} x = \frac{x}{\lambda} + \sum_{k=1}^{m} \sigma_k \, \nabla \ell_k$
where the scalar coefficients $\sigma_k$ can be computed efficiently just from the scalar products $\nabla \ell_j^\top \nabla \ell_k$ and $\nabla \ell_k^\top x$. Then, any gradient can be replaced by updating just $m$ of the stored scalar products. In the following, we use $G \in \mathbb{R}^{m \times d}$ to denote the row-wise matrix of the gradients based on which we wish to compute products between the inverse empirical Fisher and an arbitrary vector. Further, for each $k$ from 1 to $m$, let $c_k$ be a vector with components $C_{ki}$, such that
(7) $\widehat{F}_{k-1}^{-1} \nabla \ell_k = \sum_{i=1}^{m} C_{ki} \, \nabla \ell_i$
i.e. containing the scalar coefficients of $\widehat{F}_{k-1}^{-1} \nabla \ell_k$ in the basis of the gradients. This means $C_{ki} = 1/\lambda$ when $i = k$, and $C_{ki} = 0$ when $i > k$.
Initial Precomputation & Update. The dynamic algorithm maintains three $m \times m$ matrices: $B$, $D$ and $C$. The first is the symmetric matrix $B = G G^\top$, composed of the pairwise dot products between gradients. The second, $D$, is an upper triangular matrix which stores the precomputed values $D_{jk} = \nabla \ell_k^\top q_j$ for $j \leq k$, with $q_j = \widehat{F}_{j-1}^{-1} \nabla \ell_j$ as before. The third, $C$, is the row-wise matrix of the coefficient vectors $c_k$, which makes it lower triangular with a diagonal of $1/\lambda$. We now discuss how to compute these matrices.
The initial setup of the dynamic algorithm begins by evaluating $B = G G^\top$ in a straightforward fashion. Next, $D$ can be computed for $1 \leq j \leq k \leq m$ according to the following recursion:
(8) $D_{1k} = \frac{B_{1k}}{\lambda}$
(9) $D_{jk} = \frac{B_{jk}}{\lambda} - \sum_{i=1}^{j-1} \frac{D_{ij}}{m + D_{ii}} \, D_{ik}$
Given $D$, we can then conclude the precomputation by calculating $c_k$ for $k = 1, \dots, m$ recursively as:
(10) $c_k = \frac{e_k}{\lambda} - \sum_{j=1}^{k-1} \frac{D_{jk}}{m + D_{jj}} \, c_j$, where $e_k$ denotes the $k$-th indicator vector.
After the initial setup, gradient $\nabla \ell_j$ can be replaced with a new gradient $\nabla \ell'$ by first updating row $j$ of $G$, and then replacing row and column $j$ of $B$ with $G \nabla \ell'$. Afterwards, the recomputation of columns $j, \dots, m$ of $D$ and rows $j, \dots, m$ of $C$ completes the update.
Multiplication. Once $B$, $D$ and $C$ have been precomputed, one can perform efficient IHVPs of the form $\widehat{F}_m^{-1} x$ with arbitrary vectors $x$. This is done by first evaluating $b = G x$, and then computing all values $a_k = q_k^\top x$ by the following recursion:
(11) $a_1 = \frac{b_1}{\lambda}$
(12) $a_k = \frac{b_k}{\lambda} - \sum_{j=1}^{k-1} \frac{D_{jk}}{m + D_{jj}} \, a_j$
Finally, the result of the IHVP is obtained as:
(13) $\widehat{F}_m^{-1} x = \frac{x}{\lambda} - \sum_{k=1}^{m} \frac{a_k}{m + D_{kk}} \, G^\top c_k = \frac{x}{\lambda} - G^\top \Big( C^\top \Big[ \frac{a_k}{m + D_{kk}} \Big]_{k=1}^{m} \Big)$
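The setup recursions (8)-(10) and the multiplication steps (11)-(13) can be sketched directly in NumPy as follows (toy sizes; plain loops in place of the vectorized GPU kernels of the real implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, lam = 10, 5, 0.3
G = rng.standard_normal((m, d))           # row-wise gradient matrix

B = G @ G.T                                # pairwise gradient dot products

# Upper-triangular D from the setup recursion (Equations 8-9).
D = np.zeros((m, m))
for j in range(m):
    for k in range(j, m):
        D[j, k] = B[j, k] / lam - sum(
            D[i, j] / (m + D[i, i]) * D[i, k] for i in range(j))

# Row k of C: coefficients of F_{k-1}^{-1} grad_k in the gradient basis
# (Equation 10); lower triangular with diagonal 1/lambda.
C = np.zeros((m, m))
for k in range(m):
    C[k, k] = 1.0 / lam
    for j in range(k):
        C[k] -= D[j, k] / (m + D[j, j]) * C[j]

def ihvp(x):
    b = G @ x
    a = np.zeros(m)
    for k in range(m):                     # recursion (11)-(12): a_k = q_k^T x
        a[k] = b[k] / lam - sum(
            D[j, k] / (m + D[j, j]) * a[j] for j in range(k))
    w = a / (m + np.diag(D))
    return x / lam - G.T @ (C.T @ w)       # final combination (Equation 13)

# Check against the explicit inverse of the dampened window Fisher.
F = lam * np.eye(d) + G.T @ G / m
x = rng.standard_normal(d)
assert np.allclose(ihvp(x), np.linalg.solve(F, x))
```

Note that, after the one-time setup, `ihvp` only touches $G$ twice and otherwise works with $m$-dimensional quantities.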
Complexity Analysis. The dynamic algorithm stores $m$ gradients of dimension $d$, as well as three $m \times m$ matrices (and a few $m$-dimensional vectors), and thus has an overall memory complexity of $O(dm + m^2)$.
Next, we analyze the time complexity of all important operations. Initially, $B$ must be computed once, which takes $O(dm^2)$ time. The recursion for $D$ has three indices with $i < j \leq k$, and each step takes constant time. Thus, it can be computed in $O(m^3)$ time with dynamic programming. Further, since the values for index $j$ depend only on values for smaller indices, the dynamic programming can be implemented in $O(m^2)$ space. $C$ has two indices, and every recursion step takes $O(m)$ time, meaning that it can also be fully computed in $O(m^3)$ time through dynamic programming. Hence, the overall initial setup cost is $O(dm^2 + m^3)$.
To replace one gradient with $\nabla \ell'$, we have to compute $G \nabla \ell'$, as well as (partially) recalculate $D$ and $C$, which takes at worst $O(dm + m^3)$ time. An IHVP requires two matrix-vector products involving $G$ and a recursion with two indices, and therefore has a complexity of $O(dm + m^2)$.
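The replacement step can also be checked numerically: after swapping one gradient into $G$ and updating $B$, recomputing only the trailing columns of $D$ matches a from-scratch setup. A NumPy sketch (toy sizes; the update of $C$ follows the same pattern and is omitted):

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, lam = 8, 4, 0.2
G = rng.standard_normal((m, d))

def setup_D(G):
    # Full setup of the upper-triangular D from scratch (Equations 8-9).
    B = G @ G.T
    D = np.zeros((m, m))
    for j in range(m):
        for k in range(j, m):
            D[j, k] = B[j, k] / lam - sum(
                D[i, j] / (m + D[i, i]) * D[i, k] for i in range(j))
    return B, D

B, D = setup_D(G)

# Replace gradient r: update row r of G and row/column r of B ...
r, g_new = 2, rng.standard_normal(d)
G[r] = g_new
B[r, :] = G @ g_new
B[:, r] = B[r, :]

# ... then recompute only columns r, ..., m-1 of D; earlier columns never
# touch gradient r and therefore keep their values.
for k in range(r, m):
    for j in range(k + 1):
        D[j, k] = B[j, k] / lam - sum(
            D[i, j] / (m + D[i, i]) * D[i, k] for i in range(j))

# The partially updated D matches a full recomputation.
assert np.allclose(D, setup_D(G)[1])
```

This mirrors the update rule described above; the actual implementation performs these recomputations with vectorized GPU kernels.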
Algorithmic Derivation. The dynamic algorithm can be derived directly from Theorem 1, which we now state formally and prove.
Theorem 1.
Let $\nabla \ell_1, \dots, \nabla \ell_m$ be a series of gradients and, for any index $q \leq m$, let $\widehat{F}_q = \lambda I_d + \frac{1}{m} \sum_{i=1}^{q} \nabla \ell_i \nabla \ell_i^\top$ be a dampened version of the empirical Fisher, where $\lambda > 0$ is a small constant. Then $\widehat{F}_q^{-1} x$ can be calculated as:
(14) $\widehat{F}_q^{-1} x = \frac{x}{\lambda} - \sum_{k=1}^{q} \frac{q_k^\top x}{m + \nabla \ell_k^\top q_k} \sum_{i=1}^{k} C_{ki} \, \nabla \ell_i$
where $q_k = \widehat{F}_{k-1}^{-1} \nabla \ell_k$, with the coefficients $C_{ki}$ being defined such that
(15) $\widehat{F}_{k-1}^{-1} \nabla \ell_k = \sum_{i=1}^{k} C_{ki} \, \nabla \ell_i$
Proof.
The proof makes use of the following two equalities:
(16) $\widehat{F}_q^{-1} x = \frac{x}{\lambda} + \sum_{k=1}^{q} \sigma_k \, \nabla \ell_k$
(17) $\widehat{F}_q^{-1} x = \frac{x}{\lambda} - \sum_{k=1}^{q} \frac{q_k^\top x}{m + \nabla \ell_k^\top q_k} \, q_k$
which correspond to Equation 6 with coefficients $\sigma_k$, and to Equation 3 in unrolled form, respectively. We begin by setting them equal. Next, we apply a representation of the form (16) to each $q_k = \widehat{F}_{k-1}^{-1} \nabla \ell_k$ (naming the corresponding coefficients $C_{ki}$), and reorganize the nested sums to get a new expression of the form (16). Finally, we simplify the result by using our definitions of $C_{ki}$ for $i = k$ and $i > k$ (discussed at the beginning of Section 3.2). We obtain:
(18) $\frac{x}{\lambda} + \sum_{k=1}^{q} \sigma_k \nabla \ell_k = \frac{x}{\lambda} - \sum_{k=1}^{q} \frac{q_k^\top x}{m + \nabla \ell_k^\top q_k} \, q_k$
(19) $\qquad = \frac{x}{\lambda} - \sum_{k=1}^{q} \frac{q_k^\top x}{m + \nabla \ell_k^\top q_k} \sum_{i=1}^{k} C_{ki} \, \nabla \ell_i$
(20) $\qquad = \frac{x}{\lambda} - \sum_{i=1}^{q} \Big( \sum_{k=i}^{q} \frac{q_k^\top x}{m + \nabla \ell_k^\top q_k} \, C_{ki} \Big) \nabla \ell_i$
(21) $\qquad = \frac{x}{\lambda} - \sum_{k=1}^{q} \Big( \sum_{j=k}^{q} \frac{q_j^\top x}{m + \nabla \ell_j^\top q_j} \, C_{jk} \Big) \nabla \ell_k$
(22) $\qquad = \frac{x}{\lambda} + \sum_{k=1}^{q} \sigma_k' \, \nabla \ell_k, \quad \text{with} \quad \sigma_k' = - \sum_{j=k}^{q} \frac{q_j^\top x}{m + \nabla \ell_j^\top q_j} \, C_{jk}$
We can directly match the coefficients $\sigma_k$ between (18) and (22), as these equalities hold for arbitrary gradient values. This completes the proof. ∎
Equation (14) with $q = m$ is exactly equal to the IHVP computation (13), as the fraction corresponds to $a_k / (m + D_{kk})$, and the inner sum to $G^\top c_k$. Similarly, index shifting and setting $x = \nabla \ell_k$, thus turning $\widehat{F}_q^{-1} x$ into $q_k$ as well as the fraction's numerator into $D_{jk}$, recovers the precomputation formula of $C$ given by (10). The formulas (8) and (9) for $D$ follow directly from an expansion of the Woodbury formula, followed by an appropriate recursive evaluation that avoids any matrix-vector operations except in the base case.
3.3 Efficient Implementation
Directly implementing the discussed recursive formulas in a modern machine learning framework would likely result in very slow code. Fortunately, it is possible to implement all computations required by the dynamic algorithm very efficiently on a GPU. We describe how to do this, and provide a full implementation at Code . Specifically, our implementation is able to perform the calculation of $D$ and $C$, i.e. the most expensive component of the overall update cost, in milliseconds (on an NVIDIA RTX 2080 Ti) even for large window sizes $m$ (and the code can still be optimized further). For reference, this is faster than the highly optimized SVD computation done at every step by GGT agarwal2019efficient . We emphasize that this computation being very fast in practice is crucial to reach low overheads, especially when dealing with a sparse model, where the matrix-vector products are no longer the bottleneck.

4 Experimental Validation
We now validate our algorithms empirically, on pruning and optimization applications, respectively. Implementations of the algorithms are available at Code for the optimizer and one-shot pruning. The framework of NMCode implements a variant of the pruning algorithm, along with an integration into common PyTorch training flows for gradual pruning.
4.1 Application 1: Pruning DNNs using the Static Algorithm
Background. Given a trained model $w^\star$, the goal of pruning is to find the weight $w_q$, or the set of weights, whose setting to zero would lead to a minimal increase in the training loss. Under a local quadratic approximation of the loss function, the OBD framework 1990lecun shows that the "optimal" weight to be removed is the one with the lowest value of the saliency metric $\rho_q = w_q^2 / (2 [H^{-1}]_{qq})$, and proposed to estimate this quantity via a diagonal approximation of the Hessian $H$. The OBS framework 1992hassibi observed that the remaining (unpruned) weights should also be updated via the optimal perturbation $\delta w = - \frac{w_q}{[H^{-1}]_{qq}} H^{-1} e_q$, where $e_q$ is the $q$-th basis vector. Our algorithm supports these operations in linear-in-$d$ time.
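For illustration, the OBS saliency and weight update can be written in a few lines of NumPy; here the inverse Fisher is formed explicitly for clarity, whereas the static algorithm would supply its diagonal and the required column matrix-free (toy sizes, random data):

```python
import numpy as np

rng = np.random.default_rng(4)
d, m, lam = 6, 8, 0.4
w = rng.standard_normal(d)                  # trained weights (toy stand-in)
grads = rng.standard_normal((m, d))

# Explicit inverse of the dampened empirical Fisher, for clarity only; the
# static algorithm provides diag(F^-1) and the column F^-1 e_q matrix-free.
F_inv = np.linalg.inv(lam * np.eye(d) + grads.T @ grads / m)

# OBS saliency: estimated loss increase when zeroing out weight q.
saliency = w ** 2 / (2.0 * np.diag(F_inv))
q = int(np.argmin(saliency))                # "cheapest" weight to remove

# Optimal update of the remaining weights, using the q-th column of F^-1.
w_pruned = w - w[q] / F_inv[q, q] * F_inv[:, q]
assert abs(w_pruned[q]) < 1e-10             # the pruned weight lands on zero
```

In the actual pruner, these per-weight quantities are computed for all weights at once from the precomputed static-algorithm structures.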
Wang et al. 2019wang raised the valid point that applying the OBS update from above to multiple weights removed at once may be incorrect, since it ignores possible correlations between those weights. We considered this point in detail, comparing results between OBD pruning 1990lecun , the OBS update 1992hassibi , and an augmented version of OBS which disentangles the correlations by solving a corresponding linear system (2020singh, ). The results, presented in Appendix C.1, suggest that the OBS update is quite beneficial even in its approximate form, and that the effect of correlations is small for unstructured pruning. We therefore use the augmented version of OBS, solving for correlations when feasible.
Experimental Setup. We prune CNNs (ResNet50 He_2016 and MobileNetV1 howard2017mobilenets ) on the ImageNet dataset russakovsky2015imagenet . The models are standard in the pruning literature gale2019state , and therefore several strong baselines exist. We pick sparsity levels which allow us to compare against previous work. Timing experiments are run on a machine with NVIDIA RTX 2080 Ti GPUs, a 48-core Intel CPU, and 512 GB of RAM. Following (2020singh, ), we have used batched gradients (of size 16) as single samples inside the Fisher approximation. This does not alter the nature of the results, but reduces variance.

We compare against Global Magnitude Pruning (GMP) 1994hagiwara , Layer-wise Optimal Brain Surgeon (L-OBS) 2017dong , Soft Threshold Reparametrization (STR) 2020kusupati , and WoodFisher 2020singh . The latter two methods are state-of-the-art for gradual pruning. WoodFisher is a special case of our method, using a block approximation. It is numerically equivalent at the same parameter settings, but has significantly higher computational and storage costs. Relative to it, we therefore focus on executing at higher parameter values. We compare against their public implementation. The Appendix contains full hyperparameters, as well as additional experiments and ablations with respect to the block size and the number of gradients.
Gradual pruning is performed in incremental steps, according to a polynomial schedule (2017zhu, ), followed by fine-tuning, during which the learning rate is gradually decreased. We follow the hyperparameters of (2020singh, ): pruning is performed in steps 4 epochs apart for the first 24 epochs, followed by fine-tuning. In the following pruning experiments, we adopt the standard approach of measuring test accuracy, which is the more practical metric of interest. In all cases, the results correlate perfectly with the training accuracy, so we omit this latter metric.

Pruning Results. Figure 1(a) shows the Top-5 accuracy of one-shot pruned ResNet50 models for L-OBS, GMP, WoodFisher, and M-FAC, where our method is executed with a parameter setting that is infeasible for WoodFisher due to the required computational and storage costs. (We use Top-5 accuracy for this experiment following 2017dong .) We note the improved accuracy of global methods relative to L-OBS and layer-wise magnitude pruning, and the fact that the Fisher approximation yields consistently better results relative to global magnitude. This suggests that a better approximation of the Fisher inverse also yields better pruning results, and is in line with the loss approximation study of (2020singh, ). Generally, we found that estimating the Fisher using more gradients always improves results, until saturation. However, for some models, smaller block sizes sometimes yield better results for a fixed number of gradients, presumably due to gradient scaling effects.
Figure 1(b) examines the effect of the improvement in one-shot pruning accuracy at each step of gradual pruning on the final result, for MobileNetV1-STR on ImageNet. WoodFisher uses a moderate block size and number of gradients, while M-FAC uses the same block size but many more gradients; executing WoodFisher with the same parameters would be extremely slow. Note the gap in one-shot accuracy following each pruning step: this translates into a final accuracy gap of more than 1%, even after the extensive fine-tuning.
Table 1 presents gradual pruning results for MobileNetV1-STR/ImageNet at 89% sparsity and ResNet50/ImageNet at 95% sparsity, relative to other state-of-the-art methods. M-FAC outperforms previous methods in the case of MobileNetV1-STR by more than 1.1% Top-1 accuracy. Further, despite using more gradients to estimate the empirical Fisher, the per-step pruning cost of M-FAC is lower than WoodFisher's, as can be seen in Table 2 (for the same parameter settings, M-FAC is 14-120x faster). The ResNet50 results show similar gains, but by smaller margins.
Table 1: Gradual pruning results (Top-1 accuracy) for MobileNetV1-STR (89% sparsity) and ResNet50 (95% sparsity).

Method  MBv1 Top-1 (%)  RN50 Top-1 (%)
Dense baseline  72.00  77.01
STR (2020kusupati, )  62.10  70.40
Global Magnitude  62.94  71.72
WoodFisher 2020singh  63.87  72.12
M-FAC  65.01  72.32
Table 2: Per-pruning-step runtime of WoodFisher (WF) versus M-FAC, for block size B and number of gradients m ("m" in the time columns denotes minutes; OOM = out of memory).

Model  B  m  WF  M-FAC
MBv1-STR  10k  400  60m  0.5m
MBv1-STR  10k  4k  OOM  8.7m
MBv1-STR  all  1k  OOM  1.2m
ResNet50  2k  400  35m  2.5m
ResNet50  10k  1k  OOM  14.0m
ResNet50  all  512  OOM  4.7m
Normalizer-Free Nets. We now examine the "compressibility" of the recently-proposed normalizer-free networks, which have shown competitive accuracy based on a structure similar to standard residual networks, but without the batch normalization component (brock2021high, ). We use the PyTorch reimplementation of (rw2019timm, ). In Figure 2(a), we provide a relative comparison in terms of pruned accuracy between a normalizer-free and a regular version of ResNet50. These two networks have virtually the same number of parameters (approximately 25M); however, we notice that the normalizer-free variant is significantly "easier" to prune, in terms of the relative accuracy drop versus the dense baseline. We conjecture that this is because of the elimination of the BatchNorm layers. Specifically, when performing large one-shot pruning steps, the BatchNorm statistics become invalid following a pruning step, which can lead to a significant loss of accuracy; moreover, their removal may render the Fisher approximation more accurate, as the model loss is more stable in the local neighborhood.

YOLOv3. Next, we study the effect of one-shot pruning on the YOLOv3-SPP model yolov3 for object detection on the COCO dataset lin2014microsoft . We use the state-of-the-art implementation of (ultralytics, ). We one-shot prune this model using global magnitude (the only available baseline) and M-FAC with block size 50K and 1K gradients. (This parameter setting would be infeasible for WoodFisher due to storage costs.) The results, in terms of the standard mAP metric, are provided in Figure 2(b), and show that this model is quite stable under pruning, and that M-FAC provides more accurate pruning for this model as well, relative to magnitude pruning.

K-FAC Pruning Comparison. In Figure 3(a), we compare pruning performance against a pruner which uses K-FAC to estimate second-order statistics (with and without dampening) on a fully-connected network and the MNIST dataset. Notice the improved accuracy of the M-FAC (and WoodFisher) methods compared to K-FAC, even in this setting, with a small network on a simple task.
Recomputation Effects. As noted before, our one-shot experiments stretch the OBD/OBS theory, as this approach implicitly assumes that the Hessian stays constant along the direction of pruning, which is unlikely to hold for very large pruning displacements. We can, however, examine the impact of this effect by recomputing the Hessian approximation along the direction of pruning. Figure 3(b) shows the effect of recomputation for the ResNet50/ImageNet model, with 5 recomputation steps applied uniformly along the pruning direction. Notice the significant increase in accuracy for the resulting sparse models. We note that the results presented in other experiments do not use recomputation, for relative fairness to global magnitude pruning, for which recomputation would not change the results.
4.2 Application 2: Optimization using the Dynamic Algorithm
Matrix-Free Preconditioned SGD. The dynamic algorithm can provide an efficient implementation for the following variant of SGD. Let t be the current iteration index, and m be the length of the sliding window over which the estimation of the inverse Hessian is performed. Consider the iteration θ_{t+1} = θ_t − η F̂_t^{-1} g_t, where η is the learning rate. Let g_{t−m+1}, …, g_t be the gradients obtained at steps t−m+1, …, t, respectively. Then, we define the preconditioning matrix as F̂_t = λI + (1/m) Σ_{i=t−m+1}^{t} g_i g_i^⊤, that is, a (dampened) empirical Fisher matrix generated with respect to a sliding window over the last m gradients. At each step, our dynamic algorithm first removes g_{t−m} from and adds g_t to the inverse estimate, and then computes F̂_t^{-1} g_t.
We emphasize that this optimizer does not employ momentum, and therefore relies only on the direct approximation of the curvature.
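As an illustration of the scheme described above, the sliding-window preconditioner can be maintained naively via two Sherman-Morrison updates per step. The sketch below stores the dense inverse, which is only feasible for tiny models; the function names and the dampening constant `lam` are ours, not the paper's implementation:

```python
import numpy as np

def sm_add(Finv, g, c):
    """Inverse of (F + c * g g^T), given Finv = F^{-1} (Sherman-Morrison)."""
    Fg = Finv @ g
    return Finv - np.outer(Fg, Fg) * (c / (1.0 + c * (g @ Fg)))

def sm_remove(Finv, g, c):
    """Inverse of (F - c * g g^T), given Finv = F^{-1}."""
    Fg = Finv @ g
    return Finv + np.outer(Fg, Fg) * (c / (1.0 - c * (g @ Fg)))

def precond_sgd_step(theta, Finv, window, g, m, lr, lam):
    """One step of sliding-window empirical-Fisher preconditioned SGD.

    Finv tracks the inverse of lam*I + (1/m) * sum of gg^T over the window.
    """
    if len(window) == m:                    # window full: drop oldest gradient
        Finv = sm_remove(Finv, window.pop(0), 1.0 / m)
    window.append(g)
    Finv = sm_add(Finv, g, 1.0 / m)         # integrate the newest gradient
    return theta - lr * (Finv @ g), Finv, window
```

A real implementation never forms the d × d inverse; the point of the dynamic algorithm is to obtain the same IHVPs in time and memory linear in the model dimension.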
Setup and Objectives. We now experiment with the above optimizer, implemented using the dynamic algorithm, without momentum. We compare against standard optimizers such as SGD with momentum and Adam (kingma2014adam, ), but also against approximate second-order methods such as KFAC martens2018kroneckerfactored , GGT (agarwal2019efficient, ), and AdaHessian (yao2020adahessian, ). For the former two, we use the implementations contributed to TensorFlow (abadi2016tensorflow, ) by the authors, while for the latter we use the authors’ PyTorch implementation. For fairness, we always follow the settings recommended by the authors (even though this makes overheads harder to compare); for the best accuracy on the models we considered, we grid-searched learning rates and weight decay for each method in turn.
We examine the loss and test accuracy of each method on the standard ResNet20 and ResNet32 models (He_2016, )
on the CIFAR10 dataset
(cifar100, ) (see also Figures 3(a) and 3(b)). Since the implementations have slightly different framework requirements, our main performance metric is overhead over SGD, measured in each algorithm’s environment, on an NVIDIA RTX 3090 GPU for PyTorch and a Titan RTX for TensorFlow. For all second-order methods, we compute the preconditioner at each step; reducing the frequency reduces overheads proportionally for each method, but introduces an additional hyperparameter. We provide code, as well as additional information, in the Appendix.
Our first results are presented in Table 3 (left), where we examine the best Top-1 test accuracy obtained by each method, as well as its overhead relative to SGD. For KFAC, the best accuracy is obtained with batch size 1000, while all other methods operate on batch size 128. Similarly, GGT’s best results are achieved with a smaller window size, while for MFAC larger window sizes improve results. Due to these discrepancies, we measure overheads with “best accuracy” settings and also in the same setup as MFAC; the latter is given in parentheses. We emphasize the fact that, when available, we used the hyperparameter values recommended by the authors for the given model and dataset; otherwise, for fairness, we performed exactly the same grid-search process for the tuning of each method.
In terms of accuracy, MFAC scores highest in the ResNet20 experiment, and comes third, behind tuned SGD and AdaHessian (which perform almost identically), on ResNet32. The overhead of MFAC is much lower than that of AdaHessian, on par with KFAC, but higher than GGT. However, these overheads are comparable only when GGT uses 5x fewer gradients (counterintuitively, larger window sizes sometimes worsen results for GGT). On the same settings, MFAC provides up to 5x lower overhead. We examined this performance gap in detail by breaking down performance per operation, and determined that the difference is due to: 1) the better performance of our dynamic algorithm relative to the SVD computation used by GGT; and 2) the difference in asymptotic complexity between GGT’s update and ours. Similarly, KFAC’s overhead at the same batch size as MFAC would be substantially higher.
Table 3 (right) provides a more realistic scenario, where we run SGD, Adam, and MFAC without tuning or weight decay on Wide Residual Networks (WRN) (zagoruyko2016wide, ) for CIFAR100, as well as on MobileNetV1/ImageNet, and examine test accuracy. MFAC achieves the highest accuracy on most runs, even on ImageNet. However, we emphasize that these results could probably be improved significantly by parameter tuning. Another interesting application of MFAC is fine-tuning sparse models. We fine-tune 90% sparse MobileNetV1/ResNet50 models for 30 epochs, starting from the last gradual pruning step, and find that MFAC reaches higher final accuracy than SGD in both cases. At the same time, it is only negligibly slower, since our algorithm’s complexity is linear in the model density.
Wall-clock time comparison.
In Figures 5 and 6, we show the test accuracies during the training of Wide ResNet (WRN) on the CIFAR100 dataset and MobileNetV1 on the ImageNet dataset with SGD, Adam, and MFAC, with respect to wall-clock time, when executed on a single NVIDIA RTX 3090 GPU. (For the largest network, WRN 224, we use a second GPU’s memory to store gradients, but still use a single GPU for computation.) One can see that, in most plots, MFAC reaches better accuracies than SGD after only slightly longer training time. Further, we can see that, for the WRN model, the early training accuracy increases fastest for MFAC even with respect to wall-clock time, a phenomenon often observed for methods which try to leverage approximate second-order information.
5 Discussion
We presented static and dynamic algorithms, with cost linear in the model dimension, for computing IHVPs when the Hessian matrix can be approximated by a sum of rank-1 matrices. We used the classic empirical Fisher approximation, but our results can apply more broadly. The main practical limitation of our approach is the cost of storing the additional gradients. For the static algorithm, we can leverage system memory to circumvent this limitation via an efficient paging mechanism, which we describe in the Appendix. The dynamic algorithm could parallelize gradient storage, or perform gradient compression 2017sun . We plan to investigate this in future work, alongside a larger-scale study of optimization performance.
Acknowledgments
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML).
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
 [2] Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. Efficient full-matrix adaptive regularization. In International Conference on Machine Learning, pages 102–110. PMLR, 2019.
 [3] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
 [4] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
 [5] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. 2016.
 [6] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. Highperformance largescale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
 [7] Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layerwise optimal brain surgeon, 2017.
 [8] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2010.
 [9] Glenn Jocher et al. YOLOv5 model implementations, AWS, Supervise.ly and YouTube integrations, April 2021.
 [10] Elias Frantar and Eldar Kurtic. MFAC implementation: https://github.com/IST-DASLab/M-FAC, 2021.
 [11] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019.
 [12] Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016.
 [13] Masafumi Hagiwara. A simple and effective method for removal of hidden units and weights. Neurocomputing, 6(2):207 – 218, 1994. Backpropagation, Part IV.
 [14] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, [NIPS Conference], pages 164–171, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
 [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.

 [17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications, 2017.
 [18] Neural Magic Inc. Sparseml framework: https://github.com/neuralmagic/sparseml, 2021.
 [19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
 [20] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2575–2583. Curran Associates, Inc., 2015.
 [21] Shankar Krishnan, Ying Xiao, and Rif A Saurous. Neumann optimizer: A practical optimization algorithm for deep neural networks. arXiv preprint arXiv:1712.03298, 2017.
 [22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
 [23] Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems, pages 4156–4167, 2019.
 [24] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity, 2020.
 [25] César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of fisher approximations beyond kronecker factorization, 2018.
 [26] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal Brain Damage, page 598–605. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
 [27] TsungYi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
 [28] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503–528, 1989.
 [29] Alexander Ly, Maarten Marsman, Josine Verhagen, Raoul Grasman, and EricJan Wagenmakers. A tutorial on fisher information, 2017.
 [30] James Martens. Deep learning via hessianfree optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 735–742, Madison, WI, USA, 2010. Omnipress.
 [31] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

 [32] James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018.
 [33] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature, 2015.
 [34] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
 [35] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using Kronecker-factored approximate curvature for deep convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
 [36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, highperformance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
 [37] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
 [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [39] Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient secondorder approximation for neural network compression, 2020.
 [40] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the ThirtyFourth International Conference on Machine Learning, 2017.
 [41] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and fisher pruning, 2018.

 [42] Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, and Nicolas Le Roux. On the interplay between noise and curvature and its effect on optimization and generalization. In International Conference on Artificial Intelligence and Statistics, pages 3503–3513. PMLR, 2020.
 [43] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis, 2019.
 [44] Ross Wightman. PyTorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
 [45] Max A Woodbury. Inverting modified matrices. Statistical Research Group, 1950.
 [46] Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020.
 [47] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [48] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multilayer pruning for automated neural network compression, 2019.
 [49] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017.
Appendix A Efficiently Implementing the Dynamic Algorithm
As mentioned in Section 3.2, directly implementing the various recursive formulas of the dynamic algorithm will most likely be quite slow in practice. Thus, we now discuss how to develop an efficient practical implementation. The reader can also find the full code for this implementation of the dynamic algorithm (and the corresponding optimizer) in PyTorch, available at [10].
Setup.
We begin by vectorizing the calculations of the two coefficient matrices. Algorithm 1 describes the full setup procedure in PyTorch-like pseudocode.
Initially, the first matrix holds the stored scalar products of the gradients, appropriately scaled. Row 1 already has the correct final values. The algorithm then iterates over the remaining rows, each iteration completing one row while also updating all rows of higher index. More concretely, in iteration k, we subtract, from the square submatrix whose upper-left corner is the diagonal element (k, k), the outer product with itself of the portion of row k − 1 starting one element after the diagonal, scaled by the inverse of the previous diagonal element plus a fixed constant. This process is also visualized in Figure 7, where the already-calculated elements are shaded in light blue, the ones currently being updated in darker blue, and the irrelevant ones in grey. Since this calculation also produces some unnecessary (and incorrect) values below the diagonal, we delete those after the loop by extracting just the upper-triangular part of the matrix.
The second matrix starts off as a scaled identity matrix (and row 1 is again already complete). It is then calculated row by row, where row k is a negative linear combination of the previous rows (up to index k − 1, as the matrix is lower triangular), with coefficients given by the first k − 1 elements of column k of the first matrix, each divided by the corresponding diagonal element plus the same fixed constant. This process is also shown in Figure 7.
In theory, the algorithm presented so far should be well suited to utilizing the massive parallel computation capabilities of a GPU. However, a straightforward implementation in, e.g., PyTorch will not be very fast for larger window sizes, as it results in thousands of small kernel launches whose overhead adds up to a very significant overall slowdown. This means that, to achieve good performance, the setup phase of the dynamic algorithm needs to be implemented with custom CUDA kernels that merge computations in order to eliminate the excessive kernel-launch overhead.
Carefully studying the calculation process of the first matrix reveals that it is very similar to a Cholesky decomposition (concretely, the only difference is that it divides by the diagonal element itself, plus the constant, rather than by its square root). In fact, it is exactly equivalent to taking the upper-triangular matrix in the LU decomposition (with no pivoting) of the shifted scalar-product matrix, and then subtracting the shift from the diagonal again. This means that it is possible to reuse the highly optimized LU-decomposition kernel provided by PyTorch to calculate this matrix quite efficiently (a custom kernel implementing the modified Cholesky decomposition would certainly be faster, but we already observed fairly low overheads with the LU version). For the efficient computation of the second matrix, a custom kernel is needed, which we provide at [10] and describe next.
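This equivalence is easy to check numerically. The sketch below uses an assumed shift constant `q` (the exact scaling depends on the dampening and on the number of gradients) and compares the row-wise elimination against the U factor of an unpivoted LU decomposition:

```python
import numpy as np

def setup_rowwise(S, q):
    """Row-wise elimination as described in the text.

    In iteration k, the outer product of the finalized row k-1 (past the
    diagonal) with itself, divided by (q + previous diagonal element), is
    subtracted from the trailing submatrix; q is an assumed shift constant.
    """
    D = S.astype(float).copy()
    m = D.shape[0]
    for k in range(1, m):
        D[k:, k:] -= np.outer(D[k - 1, k:], D[k - 1, k:]) / (q + D[k - 1, k - 1])
    return np.triu(D)  # discard the unneeded values below the diagonal

def lu_no_pivot_U(M):
    """U factor of an LU decomposition with no pivoting (Doolittle scheme)."""
    U = M.astype(float).copy()
    m = U.shape[0]
    for k in range(1, m):
        U[k:, :] -= np.outer(U[k:, k - 1] / U[k - 1, k - 1], U[k - 1, :])
    return np.triu(U)
```

Running the elimination on a scalar-product matrix S and comparing with `lu_no_pivot_U(S + q*I) - q*I` reproduces the "LU of the shifted matrix, minus the shift" relation stated above.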
The main idea is that we can dramatically reduce the number of separate kernel launches by computing multiple rows of the matrix in a single kernel. In general, we split the matrix into blocks of 32 × 32 (this matches the fact that modern NVIDIA GPUs have 1024 threads per block). Then, we begin by fully computing all diagonal blocks in parallel; due to the lower-triangular structure, those are fully independent. Similarly, each off-diagonal block depends exclusively on the blocks above it (up to the diagonal). Next, we perform an iterative process which updates all remaining blocks with respect to the new values of the most recently computed diagonal, before completing the calculation of the blocks adjacent to it. This new diagonal is then the reference for the next iteration. Figure 8 visualizes this process. The calculation within the individual blocks is easy to parallelize. For more details, please refer to our code for this CUDA kernel. Overall, this way of calculating the matrix is quite fast; it takes only milliseconds to execute on an NVIDIA RTX 3090 GPU, even for large window sizes.
IHVPs.
After discussing an efficient implementation of the precomputation steps, we now focus on actually computing IHVPs. Algorithm 2 presents a vectorized implementation.
We note that if the query vector equals the newly added gradient, then the required coefficient vector is simply the last column of the setup matrix, divided element-wise by the diagonal values. This can be utilized for a more efficient combined update-and-multiply operation, which is essential for the MFAC optimizer, where we first update our Hessian estimate with a gradient and then return the IHVP with respect to this same gradient. Further, reusing the same intermediate products for both the update and the IHVP calculation saves one redundant matrix-vector product.
Appendix B Efficiently Implementing the Static Algorithm
Although the formulas discussed in the main body of the paper have low theoretical time complexity, a direct implementation does not make the best use of modern machine-learning frameworks and GPUs. Therefore, we now present more efficient matrix-based versions. We note that a block-wise version would simply apply the techniques discussed here independently to each block.
Let a matrix collect, row-wise, the relevant precomputed vectors, and let a vector collect the corresponding scalar coefficients. Then, the IHVP can be written as follows, where ⊘ denotes element-wise division:
(23) 
It is also possible to extract one element from each matrix row simultaneously (in terms of vectorized operations) at low cost, for example the entire diagonal. Let a mapping specify, for each row, the column of the element to select (e.g., the identity mapping to extract the diagonal), together with an indicator vector marking the selected positions. Using these definitions, the calculation of the desired result is described below, where ⊙ denotes the element-wise product.
(24) 
Our implementation contains several additional optimizations; for instance, the required quantities can be precomputed efficiently by repeatedly applying (23). In the next section, we additionally discuss several memory-saving optimizations, in particular explicit page swapping between CPU and GPU memory for situations where the full set of gradients does not fit in GPU memory.
Finally, Algorithm 3 demonstrates, using Python-like matrix-indexing syntax, how to efficiently precompute the required quantities by repeatedly applying (23). It should be noted that the row-wise matrix of gradients is repurposed to store the results, a simple way to halve peak memory consumption in practice.
b.1 Additional Optimizations
As we have seen, obtaining a good Fisher approximation of the Hessian requires a sizable number of gradients, which means that, for larger networks, the collection of all gradients used for the Fisher estimation may not fully fit into GPU memory. We now discuss how to efficiently handle such situations with an implementation that performs explicit swapping of gradient blocks between GPU and CPU memory (RAM). The most important steps of this method are also visualized in Figure 9.
In general, a simple trick to halve the peak memory consumption of the static algorithm is repurposing the gradient matrix to store intermediate results, which is possible since the raw gradients are no longer needed once the precomputation has finished. For the explicit swapping implementation, we first split the collection of gradients into blocks / pages containing at most a fixed number of gradients each, which are then processed in increasing order. Additionally, we maintain a single buffer block of the same size for accumulating intermediate results. All blocks reside in CPU memory and are only loaded to the GPU as needed, while the (small) precomputed coefficient matrices are kept in GPU memory at all times. To compute a given page of results, we first load one block fully into GPU memory. Then we load the first gradient of the current page, compute the corresponding partial result with respect to the loaded block, and save it in the first index of the buffer. After repeating this for all gradients of the page, the loaded block is swapped with the next one, and the whole process starts again, accumulating the results into the buffer. Eventually, after handling the last block, we can finish the calculation on the GPU and store the results back into the page (reusing its memory). It should be noted that loading the next block can be parallelized with the current computation, and thus costs almost no extra time. Finally, one wants to choose the number of blocks to be as small as possible, to minimize the overhead caused by the page swaps.
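The paging idea itself is independent of the specific quantities being computed. The toy sketch below uses numpy arrays standing in for CPU memory, with page-sized slices standing in for what is resident on the GPU, and shows block-paged computation of a gradient scalar-product (Gram-style) matrix with bounded "device" memory:

```python
import numpy as np

def paged_scalar_products(G, page_size):
    """Compute the m x m matrix of gradient scalar products G @ G.T,
    holding at most two pages of gradients "on device" at a time.

    G has one gradient per row; page_size is the number of gradients
    per page (an illustrative parameter, not the paper's setting).
    """
    m = G.shape[0]
    out = np.empty((m, m))
    for i in range(0, m, page_size):
        gi = G[i:i + page_size]                 # page i "loaded to the GPU"
        for j in range(0, m, page_size):
            gj = G[j:j + page_size]             # page j "swapped in"
            out[i:i + page_size, j:j + page_size] = gi @ gj.T
    return out
```

In the real implementation, the analogous inner loop overlaps the transfer of the next page with the computation on the current one, so the swaps cost almost no extra time.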
Overall, the implementation described above is effective in practice, and allows scaling our static algorithm, in its full non-blockwise version, to a large number of gradients with only modest overhead over a pure GPU implementation, even for relatively large networks such as ResNet50.
Lastly, we note that for a block-wise implementation, where a single block and all of its corresponding gradient parts easily fit into GPU memory (i.e., the opposite of the situation described thus far in this section), it can be beneficial to handle multiple blocks simultaneously via batch matrix multiplies. We provide an implementation that can do this, leading to very significant speedups for lower block sizes, where otherwise most of the time is spent constantly loading new memory from the CPU.
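As an illustration of the batched approach, the per-block scalar products for many independent blocks can be computed in a single batched matmul instead of a Python-level loop over blocks (a numpy sketch; the shapes are illustrative):

```python
import numpy as np

def batched_block_gram(G_blocks):
    """Scalar products for many independent blocks at once.

    G_blocks has shape (num_blocks, m, block_dim): for each block, the
    m gradient slices restricted to that block. A single batched matmul
    replaces a loop of num_blocks separate (m x m) Gram computations.
    """
    return G_blocks @ np.transpose(G_blocks, (0, 2, 1))
```

On a GPU, the same pattern maps to one `torch.bmm` call, which is where the speedup over looping block by block comes from.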
Appendix C Additional Experimental Results
c.1 Pruning Update Approximation
As discussed in the main paper, we use the OBS pruning framework but prune multiple weights at the same time, which, as pointed out by [43], is in extreme cases not guaranteed to be better than OBD, as it does not take potential correlations into account. Thus, we now explore how much of a problem this is in practice. To do so, we compare three different MFAC variations: not updating the non-pruned weights at all (i.e., OBD, but with a better Hessian approximation), jointly applying the OBS update for all pruned weights (i.e., the OBS approximation), and applying the correct simultaneous update by solving a large linear system (see [39] for details). We prune a ResNet20/CIFAR10 model in steps of 10%, down to 90% sparsity, using the three variations of full-block-size MFAC, while recomputing the inverse Hessian approximation after every pruning step. The recomputations ensure that we have a reasonably accurate loss approximation at every pruning step, and that the number of weights pruned at each step is small enough to solve the corresponding linear system. Table 4 provides the results.
Method  10%  20%  30%  40%  50%  60%  70%  80%  90% 

No update (OBD)  91.4  91.3  91.0  90.2  88.4  84.8  77.2  50.4  11.5 
Simultaneous OBS  91.4  91.4  91.5  91.2  90.8  89.7  86.8  75.6  28.2 
Linear solving  91.5  91.4  91.4  91.2  90.7  89.7  87.5  75.7  28.1 
The results show a very clear gap in accuracy between the no-update and the approximate-update versions. At the same time, there appears to be almost no difference between the simultaneous OBS update and the true OBS update, which has to solve a (potentially very large) linear system. This suggests that the simultaneous-update approximation, made for computational tractability, is also a reasonable choice in terms of achieved accuracy.
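For reference, the classic single-weight OBS step that these variants approximate can be written in a few lines. The sketch below uses a synthetic quadratic loss and is only a sanity check of the standard formula, not the paper's pruner:

```python
import numpy as np

def obs_prune_one(w, Hinv, q):
    """Classic OBS step: zero out weight q and optimally adjust the rest.

    delta_w = -(w_q / [H^{-1}]_{qq}) * H^{-1} e_q, with predicted loss
    increase w_q^2 / (2 [H^{-1}]_{qq}).
    """
    dw = -(w[q] / Hinv[q, q]) * Hinv[:, q]
    w_new = w + dw
    w_new[q] = 0.0   # exact up to floating point; enforce explicitly
    return w_new, dw
```

The "simultaneous OBS" variant in Table 4 applies such compensations jointly for the whole pruned set, while the "linear solving" variant instead solves the corresponding constrained system exactly.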
c.2 Ablation Studies for OneShot Pruning
To further examine the properties of the local loss approximation, we now present ablation studies on pretrained ResNet50/ImageNet and MobileNetV1-STR/ImageNet models. The experiments perform one-shot pruning according to the OBS metric estimated using MFAC, with varying block size and number of gradients. The dampening factor is kept fixed across these experiments. Following [39], we use batched gradients (of batch size 16) as single samples inside the Fisher approximation. (This does not alter the results, but reduces variance.)
The goal of these experiments is to examine two questions: 1) does a larger block size always imply a better approximation, and 2) does a higher number of gradients always imply a better approximation? We will see that neither of these questions has an obvious answer, and we will discuss a meta-algorithm for choosing the block size and the number of gradients.
The numbers presented for our one-shot experiments are averages over two runs. As the variance is very small, we omit error bars. We sometimes break convention and “zoom into” the y-axis for visibility.
c.2.1 ResNet50/ImageNet
The first set of experiments examines the dependency on the number of gradients and the block size for the ResNet50/ImageNet model. The left subfigure in Figure 10 shows results for block sizes ranging from small blocks up to the full model size, for a fixed number of gradients, while the right subfigure shows the same for a larger gradient count. The first graph presents a fairly unintuitive finding: lower block sizes appear to provide better pruning accuracy. We analyze this finding in detail and explain it in Section C.3: roughly, it is due to the fact that gradient entries are scaled by the gradient norm over the block. As predicted by our analysis, this effect is mitigated by increasing the number of gradients used for estimation, in which case a larger block size yields the best results, and the performance of the full block size also improves. Please see Section C.3 for the full analysis.
Figure 11 examines the complementary effect, that of the number of gradients for a fixed block size (10K and 100K, respectively). The results suggest that more gradients help improve the estimation of the Fisher matrix, although we observe a saturation effect. We also note that, at higher one-shot sparsities (e.g., 60% in one shot, not shown for visibility), this effect does not always hold. However, for such large pruning “perturbations”, the assumption that the Hessian is constant along the direction of pruning is unlikely to hold, which affects the stability of the results.
c.2.2 MobileNetV1STR/ImageNet
The second set of experiments shows the dependency on the number of gradients and the block size for the MobileNetV1-STR/ImageNet model. We use the implementation and pretrained weights from [24]. Figure 12 shows results for block sizes ranging from small blocks up to the full model size, for a fixed number of gradients (a smaller count in the left subfigure and a larger one in the right). Again, in both cases, lower block sizes appear to provide an improved approximation for this model. This is due to the gradient-scaling effects which we discuss in Section C.3; the results show that, for this compact model, these effects are more prevalent than for the ResNet50 model.
Figure 13 examines the opposite effect, that of the number of gradients for a fixed block size (one block size per subfigure). The results clearly suggest that more gradients help improve accuracy, although the improvement appears to saturate at the higher gradient counts.
c.3 Discussion
Normalization Effects.
We first discuss our finding that, for some models, e.g. MobileNetV1, smaller blocks appear to yield better accuracy. This can be explained by examining the recursive (Sherman-Morrison) form of the diagonal elements of the inverse Fisher approximation: each rank-one update involving a gradient contributes a term in which an individual gradient entry is divided by a quantity involving the squared norm of that gradient. Specifically, notice that, in the case of “full” block size, a gradient entry is divided by the full gradient norm, which may cause its contribution to become negligible if the norm is large. In turn, this leads to essentially the magnitude pruning ranking and update. By contrast, in the case of smaller blocks, the entry is only divided by the norm of the gradient over the block, which mitigates this normalization issue. Similarly, using more gradients also helps, since there are more summands in the expression, which allows the metric to deviate from the magnitude pruning baseline.
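The norm dependence is visible already for a single gradient: by Sherman-Morrison, for F = λI + gg^⊤ the diagonal of the inverse is 1/λ − g_j² / (λ(λ + ‖g‖²)), so each entry's deviation from the uniform 1/λ baseline is suppressed by the squared (block-)norm ‖g‖². A quick numerical check, with illustrative values:

```python
import numpy as np

def inv_diag_one_grad(g, lam):
    """Diagonal of (lam*I + g g^T)^{-1} via the Sherman-Morrison closed form.

    Each entry is 1/lam minus g_j^2 / (lam * (lam + ||g||^2)): the larger
    the norm over the (block of the) gradient, the closer the diagonal is
    to the uniform baseline 1/lam.
    """
    return 1.0 / lam - g**2 / (lam * (lam + g @ g))
```

Restricting the same gradient to a smaller block shrinks the norm in the denominator, so the blocked diagonal deviates more from the baseline, which is exactly the effect discussed above.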
Choosing the block size value.
Given this, our procedure for choosing the block-size hyperparameter is to simply perform a one-shot pruning ablation study over this parameter, usually at the maximum gradient count that one can reasonably execute, and at a moderate sparsity (e.g., 40% from a pretrained model). We then choose the best-performing block size in terms of the model’s evaluation metric. We note that these single pruning steps are quite fast when using our implementation.
c.4 Gradual Pruning Experiment Hyperparameters
We begin by stating hyperparameters for our gradual pruning runs. For both MobileNetV1-STR and ResNet50-STR, we employ identical hyperparameter values to WoodFisher [39], so that our results are fully comparable. The only parameters we change are the ones which pertain to our algorithm, in particular the block size and the number of gradients.
For the MobileNetV1-STR gradual experiments, pruning starts from the fully-trained model used by STR/WoodFisher, whose Top-1 validation accuracy is 72%. Gradual pruning is then performed over a total of 24 epochs, pruning additional weights at regular epoch intervals, followed by fine-tuning until the next pruning step. Pruning targets are set following the standard polynomial schedule of [49, 11]. Unless otherwise stated, SGD is used for fine-tuning, with the learning-rate, momentum, and weight-decay values inherited from WoodFisher. Following the last pruning step, fine-tuning is applied for the remaining epochs, with the initial learning rate decayed multiplicatively at regular epoch intervals.
For the ResNet50-STR gradual experiments, pruning starts from the fully-trained model used by STR/WoodFisher, whose Top-1 validation accuracy is 77.01%. Gradual pruning is again performed over a fixed number of epochs, pruning additional weights at regular epoch intervals, followed by fine-tuning until the next pruning step. Pruning targets are set following the standard polynomial schedule of [49, 11]. Unless otherwise stated, SGD is used for fine-tuning, with the same optimizer hyperparameters as WoodFisher. Following the last pruning step, fine-tuning is applied for the remaining epochs, with the initial learning rate decayed multiplicatively at regular epoch intervals until the end of training.
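The polynomial schedule of [49] interpolates sparsity from an initial to a final value with a cubic ramp; a minimal sketch (the function name and default values are ours):

```python
def poly_sparsity(step, start, end, s_init=0.0, s_final=0.9, power=3):
    """Zhu & Gupta [49] polynomial sparsity schedule.

    Sparsity ramps from s_init at `start` to s_final at `end`, with the
    bulk of the pruning happening early (cubic ramp by default).
    """
    frac = min(max((step - start) / float(end - start), 0.0), 1.0)
    return s_final + (s_init - s_final) * (1.0 - frac) ** power
```

In the gradual runs above, this schedule determines the target sparsity at each pruning step, with fine-tuning in between.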
c.5 Optimization Experiment Hyperparameters
ResNet20/32.
We now discuss the hyperparameters used for the ResNet20/32 comparison between MFAC and various first- and second-order optimizers. Both models are trained with batch size 128 for 164 epochs, dropping the learning rate by a fixed factor at two points during training; this is exactly the training setup used by [16]. The best initial learning-rate and weight-decay values were determined by starting from the values suggested by the authors of the respective methods; in addition, we performed a grid search around these values to examine whether better results could be obtained. Table 5 summarizes the final hyperparameter settings for SGD, Adam, AdaHessian, and MFAC.
Method  learning rate  momentum  weight decay 

SGD  0.1  0.9  0.0001 
Adam  0.001  (0.9, 0.999)  – 
AdaHessian  0.015  (0.9, 0.999)  0.003 (RN20), 0.0005 (RN32) 
MFAC  0.001  –  0.003 
All experiments were repeated 5 times with different random seeds, and we report the median of the best test accuracy; the standard deviations were generally quite low.
Hyperparameters for GGT and KFAC. Unfortunately, GG