Efficient Matrix-Free Approximations of Second-Order Information, with Applications to Pruning and Optimization

Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension d, if the Hessian is given as a sum of m rank-one matrices, using O(dm^2) precomputation, O(dm) cost for computing the IHVP, and query cost O(m) for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost O(dm + m^2) for computing the IHVP and O(dm + m^3) for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [10] and [18].



1 Introduction

Given the recent success and increasing impact of deep learning, there has been significant work on improving the fundamental technical tools underpinning its progress. One such tool is the ability to estimate the local geometry of the loss function for deep models, which often comes in the form of estimates for second-order (Hessian) information. Such information is critical in several settings, such as neural network optimization and pruning.

Directly using Hessian information in the context of deep learning is infeasible: for example, just storing the Hessian matrix for the standard ResNet50 model 2016-he would occupy 2.5 Petabytes of memory. These constraints have inspired significant work on efficient numerical approximations of the Hessian for deep neural networks, for example the line of work on the K-FAC approximation 2015-kingma ; 2015-martens ; 2016-ba ; 2019-zheng , or efficient block-wise approximations yao2020adahessian ; 2020-singh . One of the classic approaches, which we focus on in this paper, is the empirical Fisher (1992-hassibi, ; 1998-amari, ; amari2016information, ) approximation to the Hessian, written as:

(1)   $\widehat{F} = \frac{1}{N}\sum_{i=1}^{N}\nabla\ell_i\,\nabla\ell_i^{\top},$

where $N$ is the number of samples, $\nabla\ell_i$ denotes the gradient w.r.t. the $i$-th sample at the given point, and $\nabla\ell_i\,\nabla\ell_i^{\top}$ is the outer product of individual gradients, which we view as column vectors.
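To make the rank-one structure concrete, the following minimal PyTorch sketch (not part of the original implementation; the function name and arguments are illustrative) collects per-sample gradients as the rows of a matrix $G$, so that the dampened empirical Fisher is $\lambda I + G^{\top}G/N$; none of the algorithms in this paper ever form the $d \times d$ matrix explicitly.

    import torch

    def empirical_fisher_factors(model, loss_fn, samples, targets):
        # Rows of the returned matrix G are the per-sample gradients; the
        # (dampened) empirical Fisher is lam * I + G.T @ G / N, never formed explicitly.
        grads = []
        for x, y in zip(samples, targets):
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            g = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
            grads.append(g)
        return torch.stack(grads)  # shape (N, d)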

The empirical Fisher approximation is fairly standard, e.g. 1992-hassibi ; 1998-amari ; amari2016information , and has been recognized to be useful in a variety of practical settings where exact estimation of the Hessian or of the true Fisher information matrix is not feasible. At the same time, there is still active research in the community on the conditions for its applicability 2014-martens ; 2019-kunstner ; 2020-singh ; thomas2020interplay .

A useful property of this approximation is that it also allows estimating the inverse of the Hessian, which is essential in many applications. Specifically, the fact that the empirical Fisher can be written as a sum of rank-one matrices allows the use of the Woodbury-Sherman-Morrison inversion formula woodbury1950inverting to exactly compute its inverse, by recursively integrating terms corresponding to each gradient into the inverse. (Please see Equation 2 for an exact derivation.) This approach was independently proposed by 1992-hassibi ; 1998-amari , for pruning and optimization, respectively, where it was validated on small networks, with hundreds of weights.

The idea was adapted to deep neural networks (DNNs) by 2020-singh , through approximation of the inverse in small diagonal blocks. The authors show improved approximation quality for the Hessian inverse relative to a simple diagonal approximation, and that this leads to competitive pruning results in terms of accuracy versus average sparsity level. Yet, this approach is limited by the block-wise approximation: for block size $B$ and dimension $d$, it requires $O(dBm)$ time to recursively build the block-wise Fisher approximation using $m$ gradients, and $O(dB)$ time and memory for computing the Inverse-Hessian-Vector-Products (IHVPs) necessary for estimating pruning statistics. Clearly, the ideal case $B = d$ is still intractable at scale, and generally it is still unknown whether linear-time algorithms are possible for computing IHVPs in this context.

Contribution.   We resolve this question by introducing two efficient algorithms for computing IHVPs under the empirical Fisher approximation, with computational and storage costs that are linear in the dimension of the model, without the need for block-wise approximations. Concretely, we provide exact matrix-free algorithms to compute products of the form $\widehat{F}^{-1}x$, where $\widehat{F}^{-1}$ is the inverse empirical Fisher, and $x$ is an arbitrary vector. We show that these algorithms can be implemented efficiently at scale, and that they can lead to more accurate neural network pruning and optimization.

The Static Algorithm.   Our first algorithm assumes a static scenario, which is standard in neural network pruning: we are given a fully-trained model $\theta^{\star}$, for which we wish to estimate IHVPs and diagonal elements of the inverse Hessian, in order to determine the “optimal” pruning update using e.g. the Optimal Brain Surgeon (OBS) framework (1990-lecun, ; 1992-hassibi, ). For this, we first compute $m$ gradients at $\theta^{\star}$, which we will use to estimate IHVPs via the empirical Fisher and compute pruning statistics. The main idea is that, since we only wish to compute products between the inverse and an arbitrary vector (i.e. IHVPs), we can rewrite the Woodbury recursion such that we work exclusively with individual vectors and scalars, and never with full matrices. Given model dimension $d$ and $m$ gradients, each defining a rank-one component of the empirical Fisher, the algorithm uses $O(dm^2)$ pre-computation time, and will have $O(dm)$ cost for exactly computing the IHVP. Further, we can specialize the algorithm to directly query elements of the Hessian inverse, at a cost of $O(m)$ time per element. This provides efficient, linear-in-$d$ implementations for all operations required by the OBS pruning framework.

The Dynamic Algorithm.   Next, we adapt this approach to SGD optimization, i.e. to precondition stochastic gradients by our estimate of the inverse Hessian. In this case, we use the classic idea of bootstrapping the approximation by leveraging previous gradients: the preconditioner at time $t$ is built from gradients obtained during a “sliding window” over the last $m$ optimization steps. This requires a dynamic representation, allowing addition and removal of gradients without full recomputation of second-order statistics.

Our main technical contribution is showing that this can be achieved, with approximately $O(dm)$ time and space complexity. The key idea is that, for any set of $m$ gradients, and any vector $x$, it is possible to represent the corresponding IHVP estimate as a linear combination of terms corresponding to the individual gradients $\nabla\ell_i$ and $x$, of the form

$\widehat{F}^{-1}x = \frac{x}{\lambda} + \sum_{i=1}^{m}\gamma_i\,\nabla\ell_i.$

Crucially, we ensure that the coefficients $\gamma_i$ can be computed just via dot products $\nabla\ell_i^{\top}\nabla\ell_j$ and $\nabla\ell_i^{\top}x$. Then, to replace a given gradient from this representation, we just have to compute $m$ scalar products with the new gradient vector, and to update some intermediate information. Hence, the entire update operation has computational cost $O(dm + m^3)$ for adding or removing any gradient from the sliding window, and $O(dm + m^2)$ for computing the IHVP.

Implementation and Experiments.   We provide an efficient vectorized implementation of the above algorithms, called M-FAC (for Matrix-Free Approximate Curvature), as well as pruning and optimization code, in PyTorch paszke2019pytorch . Our implementation includes several additional optimizations, in particular GPU acceleration with custom CUDA code, and minimizes the cost of memory transfer between the GPU and main memory via a non-trivial gradient paging mechanism. For pruning operations, our implementation provides order-of-magnitude improvements in terms of computation time over the block-wise approximation of 2020-singh for classic pruning benchmarks such as ResNet50 and MobileNet on the ImageNet dataset. In turn, this allows us to obtain more accurate pruned models by exploiting higher numbers of gradients and larger block sizes. What is more, our preconditioned SGD can be competitive with state-of-the-art optimizers for deep learning in terms of accuracy, and has an end-to-end computational overhead per step of 5%–55% relative to vanilla SGD.

2 Preliminaries and Related Work

General Definitions.   We now briefly introduce the setting and notation; we refer the reader to standard texts 2014-martens ; ly2017tutorial ; amari2016information for a complete introduction. We start from the standard setting in which we are given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, and wish to identify a $d$-dimensional model $\theta$ which minimizes an empirical loss $L$, defined as $L(\theta) = \frac{1}{N}\sum_{i=1}^{N}\ell(\theta; x_i, y_i)$. For a twice-differentiable loss, the Hessian is the matrix of second-order derivatives of $L$ w.r.t. $\theta$, i.e. $H_L(\theta) = \nabla^2_{\theta}L(\theta)$. In the probabilistic view, each input example $x$ has some probability $p(y \mid x)$ of being assigned a given label $y$. Given input examples drawn from a distribution $p(x)$ and corresponding outputs drawn from a conditional distribution $p(y \mid x)$, the goal is to minimize the distance between the target joint distribution $p(x, y) = p(x)\,p(y \mid x)$, and a learned joint distribution $q_{\theta}(x, y) = p(x)\,q_{\theta}(y \mid x)$, where $\theta$ is the model.

The Fisher Matrix.   Assuming the probabilistic view, it can be shown that the Fisher information matrix of the model’s joint distribution satisfies $F(\theta) = \mathbb{E}_{x \sim p(x)}\,\mathbb{E}_{y \sim q_{\theta}(y \mid x)}\!\left[\nabla_{\theta}\log q_{\theta}(y \mid x)\,\nabla_{\theta}\log q_{\theta}(y \mid x)^{\top}\right]$. If the model’s output conditional distribution $q_{\theta}(y \mid x)$ matches the conditional distribution of the data $p(y \mid x)$, then the Fisher and Hessian matrices are in fact equivalent ly2017tutorial . Roughly, this can be interpreted as saying that, if the model has high accuracy, one can approximate the Hessian of $L$ with the Fisher matrix. Further, it is sometimes useful to consider an approximation to the Fisher, where the model’s distribution over labels is replaced with the empirical training distribution, leading to the empirical Fisher matrix $\widehat{F}$ of Equation (1), i.e. $\widehat{F}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\nabla\ell_i\,\nabla\ell_i^{\top}$.

The Woodbury-Sherman-Morrison Trick.   The fact that this approximation is a sum of rank-one matrices has the benefit that it allows efficient exact computation of the Fisher inverse. Specifically, one can apply the classic Woodbury-Sherman-Morrison formula to compute the inverse recursively as

(2)   $\widehat{F}_{i+1}^{-1} = \Big(\widehat{F}_i + \tfrac{1}{N}\nabla\ell_{i+1}\nabla\ell_{i+1}^{\top}\Big)^{-1} = \widehat{F}_i^{-1} - \frac{\widehat{F}_i^{-1}\nabla\ell_{i+1}\,\nabla\ell_{i+1}^{\top}\widehat{F}_i^{-1}}{N + \nabla\ell_{i+1}^{\top}\widehat{F}_i^{-1}\nabla\ell_{i+1}},$

where the recursion is over the $N$ samples, and the base step is $\widehat{F}_0 = \lambda I_d$, with $\lambda$ being a small positive constant. This approach was independently introduced by Hassibi and Stork 1992-hassibi for pruning, and Amari 1998-amari for natural gradient. Singh and Alistarh 2020-singh showed that it can be scaled to DNNs by block-wise approximation, and that it provides better approximations of the loss than K-FAC-based 2019-wang , or diagonal 2018-theis ; 2017-dong approximations, leading to more accurate pruning. The main shortcoming of directly applying (2) is the prohibitive computational cost of $O(d^2 N)$ time and $O(d^2)$ memory, even when reduced to $O(dBN)$ time and $O(dB)$ memory by $B$-block-wise approximation. Our method proposes a matrix-free approach for exactly calculating IHVPs and querying individual elements of the inverse Hessian, of cost $O(dm)$ and $O(m)$, respectively, after an $O(dm^2)$ precomputation over the $m$ gradients used. Our method is numerically equivalent to the direct computation above.
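For reference, a direct implementation of recursion (2) materializes the $d \times d$ inverse, which is exactly what the matrix-free algorithms below avoid. A minimal sketch (the helper name is hypothetical) makes the cost explicit:

    import torch

    def direct_woodbury_inverse(G, lam):
        # G: (N, d) per-sample gradients. Returns the d x d inverse of
        # lam * I + G.T @ G / N via recursion (2). Costs O(d^2 N) time and
        # O(d^2) memory -- prohibitive for deep networks.
        N, d = G.shape
        F_inv = torch.eye(d, dtype=G.dtype) / lam
        for i in range(N):
            v = F_inv @ G[i]
            F_inv -= torch.outer(v, v) / (N + G[i] @ v)
        return F_inv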

Diagonal Approximations.   A common approximation, both for optimization, e.g. krishnan2017neumann ; kingma2014adam , but also for pruning 2018-theis , is to assume that the Fisher matrix is diagonal. In our notation, this method has setup cost $O(dm)$, and diagonal query cost $O(1)$ per element. However, as evident from the experimental results of 2020-singh , this provides lower approximation quality relative to both block-wise methods and K-FAC 2019-wang .

K-FAC Approximations.   This approach observes that the entries of the true Fisher corresponding to blocks “between” two layers, which can be written as the expectation of a Kronecker product between two matrices, can be approximated as the Kronecker product of the expectations of those two matrices (reversing the order between the product and the expectation). This approximation has been leveraged for both pruning and for optimization, and it allows efficient computation of the inverse 2016-ba ; Osawa_2019 ; 2019-zheng ; 2019-wang ; laurent2018an ; however, it is known that this approximation does not always hold 2015-martens . Another relative drawback is that the Kronecker factorization only occurs naturally for fully-connected layers; though, there is work on extensions to other layer types grosse2016kroneckerfactored ; martens2018kroneckerfactored , via additional approximations. We compare against K-FAC-based optimizers and pruners, and find that our method yields better results.

Additional Approaches.   Our approach is similar in spirit to matrix-free methods nocedal2006numerical ; liu1989limited ; martens_free , but the algorithms we present are new. Hessian-free optimization martens_free also forgoes the explicit computation of Hessians in favor of computing an IHVP with a vector $g$. However, this estimation is performed by iteratively approximating the solution to the linear system $Hx = g$ for some given $g$, without ever explicitly forming $H$. One disadvantage of this method in practice is that it requires tuning and several costly iterations to converge (for a single $g$), as the underlying Hessian can be ill-conditioned. The L-OBS pruning method 2017-dong approximates second-order information by defining independent layer-wise objectives, which allows the direct approximation of the layer-wise Hessians via a carefully-crafted block structure. In contrast, our approach allows for fully-global Hessian estimation, and yields better pruning results at scale.

Full-matrix adaptive regularization has similar goals, but in the context of adaptive optimizers Duchi2010AdaptiveSM . Agarwal et al. agarwal2019efficient proposed GGT, which allows the efficient computation of the inverse square root of the low-rank matrix resulting from the sum of gradient outer products over a sliding window. At its core, this procedure requires an eigen-decomposition (the authors take an SVD) of an $m \times m$ matrix at every time step, which is reasonably efficient for small values of the window size $m$.

Our dynamic algorithm solves a similar, but slightly simpler problem, as we only want to invert the matrix (rather than also computing its square root). However, we do not perform any eigen-decompositions; instead, we carefully maintain intermediate information that allows an efficient explicit computation of the scalar coefficients in Equation 6. At the end of Section 3.2, we discuss how this approach is considerably more efficient in practice and allows executing at larger window sizes (with small overhead), which leads to improved model accuracy, as we show in our experiments. Our dynamic algorithm also has lower per-step cost than GGT; this can result in lower overhead versus GGT even when the SVD cost is negligible (e.g. if $m$ is small).

Yao et al. yao2020adahessian recently provided an alternative method for approximating the diagonal of the inverse Hessian, using Hutchinson’s randomized algorithm for estimating the diagonal. To mitigate the variance, the authors additionally introduce non-trivial smoothing heuristics. Their approach, called AdaHessian, has a theoretical per-iteration cost of at least $2\times$ versus SGD, as it requires a second backward pass over the network; in practice, this cost is usually higher. Our algorithm often matches their accuracy, and (unlike AdaHessian) its cost depends exclusively on the model size, irrespective of the underlying structure; thus, as we show in the experimental section, our average practical overhead is less than 50% for dense models, and less than 10% for sparse ones.

Approximation Quality.   Künstner et al. 2019-kunstner performed an in-depth analysis of the empirical Fisher, making the point that, in theory, the approximation could become meaningless if all sample gradients are zero. Similarly to 2020-singh , we did not find that this occurs in practice for deep neural networks, as sample gradients are never zero. The latter reference also provides detailed comparisons of diagonal, K-FAC, and other approximations, and finds that the empirical Fisher can provide competitive approximation quality for DNNs. Further, they demonstrate that better loss approximation implies better accuracy for neural network pruning. We therefore do not repeat their loss analysis, and mainly compare methods in terms of application performance.

3 Algorithm Description

3.1 The Static Algorithm

As a warm-up, we first describe the IHVP algorithm for a static set of gradients. While our notation is customized for the empirical Fisher, all our techniques are applicable to any matrix that can be written as the sum of rank-one components. Specifically, we are given $m$ vectors, which we assume to be the gradient vectors $\nabla\ell_1, \ldots, \nabla\ell_m$, and must compute quantities related to the inverse of the matrix resulting from the sum (more precisely, the average) of their outer products, which we denote by $\widehat{F}_m$. The main idea is to rewrite the recursive description of $\widehat{F}_m^{-1}$ such that, after some precomputation, the matrix-vector product $\widehat{F}_m^{-1}x$ can be computed efficiently for any vector $x$. Using the Sherman-Morrison formula applied to the partial empirical Fisher matrix $\widehat{F}_k$ corresponding to the first $k$ gradients (and scaled by $1/m$), we obtain the following recursion:

(3)   $\widehat{F}_k^{-1} = \Big(\widehat{F}_{k-1} + \tfrac{1}{m}\nabla\ell_k\nabla\ell_k^{\top}\Big)^{-1} = \widehat{F}_{k-1}^{-1} - \frac{\widehat{F}_{k-1}^{-1}\nabla\ell_k\,\nabla\ell_k^{\top}\widehat{F}_{k-1}^{-1}}{m + \nabla\ell_k^{\top}\widehat{F}_{k-1}^{-1}\nabla\ell_k}, \qquad \widehat{F}_0 = \lambda I_d.$

For simplicity, let us set $v_k = \widehat{F}_{k-1}^{-1}\nabla\ell_k$ and unroll the Sherman-Morrison recursion, which gives:

(4)   $\widehat{F}_m^{-1}x = \frac{x}{\lambda} - \sum_{k=1}^{m}\frac{v_k^{\top}x}{m + \nabla\ell_k^{\top}v_k}\,v_k.$

IHVP Computation.   Assuming that we have already computed all the vectors $v_k$, the above expression can be calculated in $O(dm)$ time without any intermediate matrices: first compute the $m$ scalar fractions in (4), and then evaluate the resulting linear combination of $x$ and the $v_k$. Since $v_k = \widehat{F}_{k-1}^{-1}\nabla\ell_k$, it can be computed in exactly the same way given all $v_j$ for $j < k$. Thus, all $m$ vectors of dimension $d$ can be precomputed in increasing order of $k$ using $O(dm^2)$ total time. We can halve the memory usage by also precomputing all denominators $m + \nabla\ell_k^{\top}v_k$, as this allows us to overwrite $\nabla\ell_k$ with $v_k$.
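For illustration, the following PyTorch sketch (hypothetical function names; no paging or GPU-specific optimizations) implements the precomputation and the IHVP of Equation (4) directly.

    import torch

    def static_setup(G, lam):
        # G: (m, d) row-wise matrix of gradients. Returns V, whose rows are
        # v_k = F_{k-1}^{-1} grad_k, and the denominators m + grad_k^T v_k,
        # in O(d m^2) total time.
        m, d = G.shape
        V = torch.zeros_like(G)
        denom = torch.zeros(m, dtype=G.dtype, device=G.device)
        for k in range(m):
            g = G[k]
            v = g / lam - ((V[:k] @ g) / denom[:k]) @ V[:k]
            V[k] = v
            denom[k] = m + g @ v
        return V, denom

    def static_ihvp(V, denom, lam, x):
        # Evaluates Equation (4) in O(dm): x / lam minus a linear combination of the v_k.
        return x / lam - ((V @ x) / denom) @ V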

Querying the Inverse.   To extract individual elements of the inverse Hessian approximation, we can write $\big[\widehat{F}_m^{-1}\big]_{ij} = e_i^{\top}\widehat{F}_m^{-1}e_j$ in the form of (4), where $e_i$ and $e_j$ are indicator vectors:

(5)   $\big[\widehat{F}_m^{-1}\big]_{ij} = \frac{e_i^{\top}e_j}{\lambda} - \sum_{k=1}^{m}\frac{(v_k^{\top}e_j)\,(e_i^{\top}v_k)}{m + \nabla\ell_k^{\top}v_k}.$

As $e_i^{\top}v_k$ and $v_k^{\top}e_j$ can both be realized as constant-time indexing operations, the above turns into a sum over $m$ scalars. Hence, our method admits $O(m)$ access to any element of $\widehat{F}_m^{-1}$, using the same precomputed $v_k$ and denominators as for the efficient calculation of IHVPs.
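Continuing the sketch above (same hypothetical helpers and precomputed V, denom), Equation (5) specializes to the full diagonal, or to a single entry, as follows:

    def inverse_diagonal(V, denom, lam):
        # [F^{-1}]_{ii} = 1/lam - sum_k V[k, i]^2 / denom[k], for all i at once.
        return 1.0 / lam - (V * V / denom[:, None]).sum(dim=0)

    def inverse_entry(V, denom, lam, i, j):
        # Single element in O(m): indexing with e_i and e_j replaces the products.
        delta = 1.0 / lam if i == j else 0.0
        return delta - (V[:, i] * V[:, j] / denom).sum()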

Additional Optimizations.   The algorithm admits a fast vectorized implementation, and several optimizations, which we describe in the Appendix. For example, we perform several memory-saving optimizations, as well as intelligent explicit page swapping between CPU and GPU memory to mitigate memory transfer costs. Furthermore, the static algorithm can be applied independently to each block of a block-wise approximation: since all costs are linear in the dimension, the asymptotics are unchanged, and, as long as $m < B$ (i.e. the number of gradients is smaller than the block size $B$), our method is $B/m$ times faster than the direct implementation of the Woodbury inverse, which costs $O(dBm)$.

3.2 The Dynamic Algorithm

We now describe the dynamic algorithm, which assumes that gradients arrive in an online fashion and must be integrated into the Fisher estimate. We first present the algorithm itself, i.e. the setup/update/IHVP computations, and perform a complexity analysis. Next, we show how the algorithm can be derived and finally we conclude with notes on an efficient practical implementation.

High-Level View.   The main idea of the dynamic algorithm is to write the IHVP as

(6)   $\widehat{F}_m^{-1}x = \frac{x}{\lambda} + \sum_{k=1}^{m}\gamma_k\,\nabla\ell_k,$

where the scalar coefficients $\gamma_k$ can be computed efficiently just from the scalar products $\nabla\ell_i^{\top}\nabla\ell_j$ and $\nabla\ell_i^{\top}x$. Then, any gradient can be replaced by updating just $m$ of the stored scalar product values. In the following, we use $G \in \mathbb{R}^{m \times d}$ to denote the row-wise matrix of the gradients based on which we wish to compute products between the inverse empirical Fisher and an arbitrary vector. Further, for each $k$ from 1 to $m$, let $c_k$ be a vector with components $c_{kj}$, such that

(7)   $\widehat{F}_{k-1}^{-1}\nabla\ell_k = \sum_{j=1}^{m} c_{kj}\,\nabla\ell_j,$

i.e. containing the scalar coefficients of $\widehat{F}_{k-1}^{-1}\nabla\ell_k$ in the basis of the gradients. This means that $c_{kj} = 0$ when $j > k$, and that $c_{kk} = 1/\lambda$.

Initial Precomputation & Update.   The dynamic algorithm maintains three matrices: $GG^{\top}$, $B$, and $A$. The first is the symmetric $m \times m$ matrix $GG^{\top}$, composed of the dot products between gradients. The second, $B$, is an upper triangular matrix which stores the precomputed values $B_{ij} = \nabla\ell_i^{\top}\widehat{F}_{i-1}^{-1}\nabla\ell_j$ for $i \le j$. The third, $A$, is the row-wise matrix of the coefficient vectors $c_k$, which makes it lower triangular with a diagonal of $1/\lambda$. We now discuss how to compute those matrices.

The initial setup of the dynamic algorithm begins by evaluating $GG^{\top}$ in a straightforward fashion. Next, $B$ can be computed for $i \le j$ according to the following recursion:

(8)   $B_{1j} = \frac{\nabla\ell_1^{\top}\nabla\ell_j}{\lambda},$
(9)   $B_{ij} = \frac{\nabla\ell_i^{\top}\nabla\ell_j}{\lambda} - \sum_{p=1}^{i-1}\frac{B_{pi}\,B_{pj}}{m + B_{pp}}.$

Given $B$, we can then conclude the precomputation by calculating $c_k$ for $k = 2, \ldots, m$ recursively as:

(10)   $c_k = \frac{e_k}{\lambda} - \sum_{i=1}^{k-1}\frac{B_{ik}}{m + B_{ii}}\,c_i, \qquad c_1 = \frac{e_1}{\lambda},$

where $e_k$ denotes the $k$-th standard basis vector in $\mathbb{R}^m$.

After the initial setup, gradient $\nabla\ell_i$ can be replaced with a new gradient $\nabla\ell'$ by first updating row $i$ of $G$ and then replacing row and column $i$ in $GG^{\top}$ with $G\nabla\ell'$. Afterwards, the recomputation of columns $i, \ldots, m$ of $B$ and of rows $i, \ldots, m$ of $A$ completes the update.
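A sketch of the cheap part of this replacement step (hypothetical names; $B$ and $A$ are subsequently recomputed from index $i$ onward with the same routines as in the initial setup, cf. Appendix A):

    def replace_gradient(G, GGT, i, g_new):
        # Swap gradient i for g_new: O(dm) to refresh row/column i of the
        # dot-product matrix; rows/columns >= i of B and A must then be recomputed.
        G[i] = g_new
        dots = G @ g_new              # m new scalar products, including g_new . g_new
        GGT[i, :] = dots
        GGT[:, i] = dots
        return G, GGT

The subsequent partial recomputation of $B$ and $A$ is what gives the $O(m^3)$ worst-case term in the update cost below.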

Multiplication.   Once $GG^{\top}$, $B$ and $A$ have been precomputed, one can perform efficient IHVPs of the form $\widehat{F}_m^{-1}x$ with arbitrary vectors $x$. This is done by first evaluating $Gx$ and then computing all values $p_k = \nabla\ell_k^{\top}\widehat{F}_{k-1}^{-1}x$ by the following recursion:

(11)   $p_1 = \frac{\nabla\ell_1^{\top}x}{\lambda},$
(12)   $p_k = \frac{\nabla\ell_k^{\top}x}{\lambda} - \sum_{i=1}^{k-1}\frac{B_{ik}}{m + B_{ii}}\,p_i.$

Eventually, the final result of the IHVP is obtained by:

(13)   $\widehat{F}_m^{-1}x = \frac{x}{\lambda} - G^{\top}A^{\top}q, \qquad q_k = \frac{p_k}{m + B_{kk}}.$
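Since $p = A\,(Gx)$ by definition of the coefficient rows, the whole IHVP reduces to a handful of matrix-vector products once $B$ and $A$ are available. A short sketch (equivalent to recursion (11)–(12); names are illustrative):

    import torch

    def dynamic_ihvp(G, B, A, lam, x):
        # p_k = (F_{k-1}^{-1} grad_k)^T x, computed as A @ (G @ x); then Eq. (13).
        m = G.shape[0]
        p = A @ (G @ x)
        q = p / (m + torch.diagonal(B))
        return x / lam - G.t() @ (A.t() @ q)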

Complexity Analysis.   The dynamic algorithm stores $m$ gradients of dimension $d$ as well as three $m \times m$ matrices (and a few $m$-dimensional vectors), and thus has an overall memory complexity of $O(dm + m^2)$.

Next, we analyze the time complexity of all important operations. Initially, $GG^{\top}$ must be computed once, which takes $O(dm^2)$ time. The recursion for $B$ has three indices with $p < i \le j$, and each individual term takes constant time. Thus, it can be computed in $O(m^3)$ time with dynamic programming. Further, since the values needed at each step depend only on values from the previous step, the dynamic programming can be implemented in $O(m^2)$ space. $A$ has two indices and every recursion step takes $O(m)$ time, meaning that it can also be fully computed in $O(m^3)$ through dynamic programming. Hence, the overall initial setup cost is $O(dm^2 + m^3)$.

To replace one gradient $\nabla\ell_i$ with $\nabla\ell'$, we have to compute $G\nabla\ell'$ as well as (partially) recalculate $B$ and $A$, which takes at worst $O(dm + m^3)$ time. An IHVP requires two matrix-vector products involving $G$ and a recursion with two indices, and therefore has a complexity of $O(dm + m^2)$.

Algorithmic Derivation.   The dynamic algorithm can be derived directly from Theorem 1, which we now state formally and prove.

Theorem 1.

Let $\nabla\ell_1, \ldots, \nabla\ell_m$ be a series of gradients, and, for any index $q \le m$, let $\widehat{F}_q = \lambda I_d + \frac{1}{m}\sum_{i=1}^{q}\nabla\ell_i\nabla\ell_i^{\top}$ be a dampened version of the empirical Fisher, where $\lambda > 0$ is a small constant. Then $\widehat{F}_q^{-1}x$ can be calculated as:

(14)

where the values $B_{ij}$ are as defined above, and the coefficients $c_{ki}$ are defined such that

(15)
Proof.

The proof makes use of the following two equalities:

(16)   $\widehat{F}_q^{-1}x = \frac{x}{\lambda} + \sum_{i=1}^{q}\gamma_i\,\nabla\ell_i,$
(17)   $\widehat{F}_q^{-1}x = \frac{x}{\lambda} - \sum_{k=1}^{q}\frac{\nabla\ell_k^{\top}\widehat{F}_{k-1}^{-1}x}{m + \nabla\ell_k^{\top}\widehat{F}_{k-1}^{-1}\nabla\ell_k}\,\widehat{F}_{k-1}^{-1}\nabla\ell_k,$

which correspond to Equation 6 and to Equation 3 in an unrolled form, respectively. We begin by setting them equal. Next, we apply a representation of the form (16) to the terms $\widehat{F}_{k-1}^{-1}\nabla\ell_k$ (naming the corresponding coefficients accordingly) and reorganize the nested sums to get a new expression of the form (16). Finally, we simplify the result by using our definition of the coefficients $c_{kj}$ for $j = k$ and $j > k$ (discussed at the beginning of Section 3.2). We obtain:

(18)
(19)
(20)
(21)
(22)

We can directly match coefficients between (18) and (22), as they hold for any gradient values. This completes the proof. ∎

Equation (14) with $q = m$ is exactly equal to the IHVP computation (13), as the fraction in the innermost sum corresponds to the coefficient values computed there. Similarly, index shifting and setting $x = \nabla\ell_k$, which turns the relevant dot products into entries of $GG^{\top}$ and the fraction's numerator into entries of $B$, recovers the precomputation formula of $c_k$ given by (10) for $k \ge 2$. The formulas (8) and (9) for $B$ follow directly from an expansion of the Woodbury formula, followed by an appropriate recursive evaluation that avoids any matrix/vector operations except in the base case (we indicate the corresponding recursive calls by brackets).

3.3 Efficient Implementation

Directly implementing the discussed recursive formulas in a modern machine learning framework would likely result in very slow code. Fortunately, it is possible to implement all computations required for the dynamic algorithm very efficiently on a GPU. We describe how to do this and provide a full implementation in [10]. Specifically, our implementation is able to perform the calculation of $B$ and $A$, i.e. the $O(m^3)$ component of the overall update cost, in a few milliseconds (on an NVIDIA RTX 2080 Ti), even for large values of $m$ (and the code can still be optimized further). For reference, this is faster than the highly optimized SVD computation done at every step by GGT agarwal2019efficient . We emphasize that this computation being very fast in practice is crucial to reach low overheads, especially when dealing with a sparse model, where the matrix-vector products are no longer the bottleneck.

4 Experimental Validation

We now validate our algorithms empirically, on pruning and optimization applications, respectively. Implementations of the algorithms are available at [10] for the optimizer and one-shot pruning. The framework of [18] implements a variant of the pruning algorithm, along with an integration into common PyTorch training flows for gradual pruning.

4.1 Application 1: Pruning DNNs using the Static Algorithm

Background.   Given a trained model $\theta^{\star}$, the goal of pruning is to find the weight $\theta_i$, or the set of weights, whose setting to zero would lead to a minimal increase in the training loss. Under a local quadratic approximation of the loss function, the OBD framework 1990-lecun shows that the “optimal” weight to be removed is the one with the lowest value of the saliency metric $\rho_i = \theta_i^2 / \big(2\,[H^{-1}]_{ii}\big)$, where $H$ is the Hessian at $\theta^{\star}$, and proposed to estimate $H$ by a diagonal approximation. The OBS framework 1992-hassibi observed that the remaining (unpruned) weights should also be updated via the optimal perturbation $\delta\theta = -\frac{\theta_i}{[H^{-1}]_{ii}}\,H^{-1}e_i$, where $e_i$ is the $i$-th basis vector. Our algorithm supports these operations in linear-in-$d$ time.
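For illustration, reusing the hypothetical helper sketches from Section 3.1 (with the empirical Fisher standing in for the Hessian), one pruning decision of this kind takes roughly the following form; the actual pruner additionally handles masking, blocks, and weight correlations.

    import torch

    def obs_prune_one(w, V, denom, lam):
        # Saliency rho_i = w_i^2 / (2 [F^{-1}]_{ii}); remove the weight with the
        # smallest saliency and update the rest by -w_i * F^{-1} e_i / [F^{-1}]_{ii}.
        diag = inverse_diagonal(V, denom, lam)
        i = torch.argmin(w * w / (2.0 * diag))
        e_i = torch.zeros_like(w)
        e_i[i] = 1.0
        delta = -(w[i] / diag[i]) * static_ihvp(V, denom, lam, e_i)
        return w + delta, i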

Wang et al. 2019-wang raised the valid point that applying the OBS update from above to multiple weights being removed at once may be incorrect, since it ignores possible correlations between those weights. We considered this point in detail, comparing results between OBD pruning 1990-lecun , the OBS update 1992-hassibi , and an augmented version of OBS which disentangles the correlations by solving a corresponding linear system (2020-singh, ). The results, presented in Appendix C.1, suggest that the OBS update is quite beneficial even in its approximate form, and that the effect of correlations is small for unstructured pruning. We use the augmented version of OBS, solving for correlations when feasible.

Experimental Setup.   We prune CNNs (ResNet50 He_2016 and MobileNet-V1 howard2017mobilenets ) on the ImageNet dataset russakovsky2015imagenet . The models are standard in the pruning literature gale2019state , and therefore several strong baselines exist. We pick sparsity levels which allow us to compare against previous work. Timing experiments are run on a machine with NVIDIA RTX 2080 Ti GPUs, a 48-core Intel CPU, and 512 GB of RAM. Following (2020-singh, ), we have used batched gradients (of size 16) as single samples inside the Fisher approximation. This does not alter the nature of the results, but reduces variance.

We compare against Global Magnitude Pruning (GMP) 1994-hagiwara , Layer-wise Optimal Brain Surgeon (L-OBS) 2017-dong , Soft Threshold Reparametrization (STR) 2020-kusupati , and WoodFisher 2020-singh . The latter two methods are state-of-the-art for gradual pruning. WoodFisher is a special case of our method, using a block approximation. It is numerically equivalent at the same parameter settings, but has significantly higher computational and storage costs. Relative to it, we therefore focus on executing at higher parameter values. We compare against their public implementation. The Appendix contains full hyper-parameters, additional experiments and ablations with respect to the block size and number of gradients.

Gradual pruning is performed in incremental steps, according to a polynomial schedule (2017-zhu, ), followed by fine-tuning, during which the learning rate is gradually decreased. We follow the hyper-parameters of (2020-singh, ): pruning is performed in steps 4 epochs apart for the first 24 epochs, followed by fine-tuning. In the following pruning experiments, we adopt the standard approach of measuring test accuracy, which is the more practical metric of interest. In all cases, the results correlate perfectly with training accuracy, so we omit the latter metric.

Pruning Results.   Figure 1(a) shows the Top-5 accuracy of one-shot ResNet50 pruned models for L-OBS, GMP, WoodFisher, and M-FAC, where our method is executed with a parameter setting that is infeasible for WoodFisher due to the required computational and storage costs. (We use Top-5 accuracy for this experiment following 2017-dong .) We note the improved accuracy of global methods relative to L-OBS and layer-wise magnitude pruning, and the fact that the Fisher approximation yields consistently better results relative to global magnitude pruning. This suggests that a better approximation of the Fisher inverse also yields better pruning results, and is in line with the loss approximation study of (2020-singh, ). Generally, we found that estimating the inverse Fisher using more gradients always improves results, until saturation. However, for some models, smaller block sizes sometimes yield better results for a fixed number of gradients, presumably due to gradient scaling effects.

Figure 1(b) examines the effect of the improvement in one-shot pruning accuracy at each step of pruning on the final result, when pruning MobileNetV1-STR gradually to 89% sparsity. WoodFisher uses block size 10K and 400 gradients, while M-FAC uses the same block size but 4K gradients. Executing WoodFisher with the same parameters would be extremely slow. Note the gap in one-shot accuracy following each pruning step: this translates into a final accuracy gap of more than 1%, even after the extensive fine-tuning.

(a) One-shot pruning for ResNet50.
(b) Gradual pruning for MobileNetV1-STR.
Figure 1: Left: Accuracy vs. sparsity for one-shot pruning on ResNet50/ImageNet. Right: Accuracy during gradual pruning for the MobileNetV1-STR/ImageNet experiment. The notation M-FAC($B$, $m$) means that we base our approximation on block size $B$ and $m$ gradients.

Table 1 presents gradual pruning results for MobileNetV1-STR/ImageNet at 89% sparsity and ResNet50/ImageNet at 95% sparsity, relative to other state-of-the-art methods. M-FAC outperforms previous methods in the case of MobileNetV1-STR by more than 1.1% Top-1 accuracy. Further, despite using many more gradients to estimate the empirical Fisher, the per-step pruning cost of M-FAC is lower than WoodFisher's, as can be seen in Table 2 (for the same parameter settings, M-FAC is more than an order of magnitude faster). The ResNet50 results show similar gains, but by smaller margins.

Method MBv1 Top-1 (%) RN50 Top-1 (%)
Dense baseline 72.00 77.01
STR (2020-kusupati, ) 62.10 70.40
Global Magnitude 62.94 71.72
WoodFisher 2020-singh 63.87 72.12
M-FAC 65.01 72.32
Table 1: Gradual pruning results (Top-1 accuracy) for MobileNetV1-STR on ImageNet at 89% sparsity and ResNet50 on ImageNet at 95% sparsity.
Model B m WF M-FAC
MBv1-STR 10k 400 60m 0.5m
MBv1-STR 10k 4k OOM 8.7m
MBv1-STR all 1k OOM 1.2m
ResNet50 2k 400 35m 2.5m
ResNet50 10k 1k OOM 14.0m
ResNet50 all 512 OOM 4.7m
Table 2: Minutes per pruning step compared with WoodFisher (WF), for block size B and number of gradients m. OOM denotes out of memory on a system with 512GB RAM.

Normalizer-Free Nets.

We now examine the “compressibility” of the recently-proposed normalizer-free nets, which have shown competitive accuracy based on a similar structure to standard residual networks, but without the batch normalization component (brock2021high, ). We use the PyTorch re-implementation of (rw2019timm, ). In Figure 2(a), we provide a relative comparison in terms of pruned accuracy between a normalizer-free and a regular version of ResNet50. These two networks have virtually the same number of parameters (approximately 25M); however, we notice that the normalizer-free variant is significantly “easier” to prune, in terms of the relative accuracy drop versus the dense baseline. We conjecture that this is because of the elimination of the BatchNorm layers. Specifically, when performing large one-shot pruning steps, the BatchNorm statistics become invalid following a pruning step, which can lead to a significant loss of accuracy; moreover, their removal may render the Fisher approximation more accurate, as the model loss is more stable in the local neighborhood.

YOLOv3.   Next, we study the effect of one-shot pruning on the YOLOv3-SPP model yolov3 for object detection on the COCO dataset lin2014microsoft . We use the state-of-the-art implementation of (ultralytics, ). We one-shot prune this model using global magnitude pruning (the only available baseline) and M-FAC with block size 50K and 1K gradients. (This parameter value would be infeasible for WoodFisher due to storage costs.) The results in terms of the standard mAP@0.5 metric are provided in Figure 2(b), and show that this model is quite stable under pruning, and that M-FAC provides more accurate pruning for this model as well, relative to magnitude pruning.

(a) Comparison between GMP and M-FAC pruning on the standard ResNet50 and a Normalizer-Free variant (NF-ResNet50). Observe the significantly better accuracy given by our method on the normalizer-free variant.
(b) Comparison between GMP and M-FAC in terms of the mAP@0.5 metric for a one-shot pruned complex model YOLOv3-SPP/COCO. Notice the higher accuracy of the M-FAC method.
Figure 2: Left: The effectiveness of M-FAC pruning on Normalizer-Free ResNet50 versus the regular variant. Right: One-shot pruning comparison for the YOLOv3-SPP model on the large-scale object detection COCO dataset.

K-FAC Pruning Comparison.   In Figure 3(a) we compare pruning performance against a pruner which uses K-FAC to estimate second-order statistics (with and without dampening) on a Fully-Connected network and the MNIST dataset. Notice the improved accuracy of the M-FAC (and WoodFisher) methods compared to K-FAC, even in this setting with a small network on a simple task.

Recomputation Effects.   As noted before, our one-shot experiments stretch the OBD/OBS theory, as this approach implicitly assumes that the Hessian stays constant across the direction of pruning, which is unlikely to hold for very large pruning displacements. We can, however, examine the impact of this effect by recomputing the Hessian along the direction of pruning. Figure 3(b) shows the effect of recomputation for the ResNet50/ImageNet model, for 5 recomputation steps, spaced uniformly across the pruning direction. Notice the significant increase in accuracy for the resulting sparse models. We note that the results presented in other experiments do not use recomputation, for relative fairness to global magnitude pruning, for which recomputation would not change the results.

(a) Comparison between K-FAC, WoodFisher and M-FAC in terms of the Top-1 test accuracy for a one-shot pruned Fully-Connected network.
(b) The effect of recomputation on the accuracy of pruned ResNet50/ImageNet model. Observe the significantly better accuracy provided by recomputation, especially at high number of gradients.
Figure 3: Left: One-shot pruning comparison for the Fully-Connected network on the MNIST dataset. Right: The effect of recomputation along the direction of pruning for ResNet50/ImageNet.

4.2 Application 2: Optimization using the Dynamic Algorithm

Matrix-Free Preconditioned SGD.   The dynamic algorithm can provide an efficient implementation for the following variant of SGD. Let $t$ be the current iteration index, and $m$ be the length of the sliding window during which the estimation of the inverse Hessian is performed. Consider the iteration $\theta_{t+1} = \theta_t - \eta_t\,\widehat{F}_t^{-1}\nabla L(\theta_t)$, where $\eta_t$ is the learning rate. Let $g_{t-m+1}, \ldots, g_t$ be the gradients obtained at steps $t-m+1, \ldots, t$, respectively. Then, we define the preconditioning matrix as $\widehat{F}_t = \lambda I_d + \frac{1}{m}\sum_{i=t-m+1}^{t} g_i g_i^{\top}$, that is, an empirical Fisher matrix generated with respect to a sliding window over the last $m$ gradients. At each step, our dynamic algorithm first removes $g_{t-m}$ from, and adds $g_t$ to, the inverse estimate, and then computes $\widehat{F}_t^{-1}\nabla L(\theta_t)$.

We emphasize the fact that this optimizer does not employ momentum, and therefore only relies on the direct approximation of the curvature.
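Putting the pieces together, one step of this preconditioned SGD variant looks roughly as follows; the DynamicFisher-style object and its method names are hypothetical stand-ins for the dynamic algorithm's update and IHVP operations, not the released API.

    import torch

    def mfac_sgd_step(params, grad, fisher, lr):
        # fisher maintains the sliding window: drop the oldest gradient, add the
        # new one, then precondition the current gradient with the inverse estimate.
        fisher.replace_oldest(grad)          # O(dm + m^3) update
        update = fisher.ihvp(grad)           # O(dm + m^2) product F^{-1} grad
        with torch.no_grad():
            params -= lr * update
        return params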

Setup and Objectives.   We now experiment with the above optimizer, implemented using the dynamic algorithm, without momentum. We compare against standard optimizers such as SGD with momentum and Adam (kingma2014adam, ), but also against approximate second-order methods such as K-FAC martens2018kroneckerfactored , GGT (agarwal2019efficient, ), and AdaHessian (yao2020adahessian, ). For the former two, we use the implementations contributed to TensorFlow (abadi2016tensorflow, ) by the authors, while for the latter we use the authors' PyTorch implementation. For fairness, we always follow the settings recommended by the authors (even though this makes overheads harder to compare); for the best accuracy on the models we considered, we grid-searched learning rates and weight decay for each method in turn.

We examine the loss and test accuracy for each method, on the standard ResNet20 and ResNet32 models (He_2016, ) on the CIFAR-10 dataset (cifar100, ) (see also Figures 4(a) and 4(b)). Since implementations have slightly different framework requirements, our main performance metric is overhead over SGD, measured in each algorithm's environment, on an NVIDIA RTX 3090 GPU for PyTorch and a Titan RTX for TensorFlow. For all second-order methods, we compute the preconditioner at each step; reducing the frequency reduces overheads proportionally for each method, but introduces an additional hyper-parameter. We provide code, as well as additional information, in the Appendix.

(a) Training loss on ResNet20/CIFAR-10.
(b) Top-1 test accuracy on ResNet20/CIFAR-10.
Figure 4: Comparison of different optimizers on the ResNet20/CIFAR-10 example.

Our first results are presented in Table 3 (left), where we examine the best Top-1 test accuracy obtained by each method, as well as its overhead relative to SGD. For K-FAC, the best accuracy is obtained with batch size 1000, while all other methods operate on batch size 128. Similarly, GGT's best results are achieved with a comparatively small window size, while for M-FAC larger window sizes improve results, so we use a correspondingly larger window. Due to these discrepancies, we measure overheads with “best accuracy” settings and also in the same setup as M-FAC; the latter is given in parentheses. We emphasize the fact that, when available, we used the hyper-parameter values recommended by the authors for the given model and dataset; otherwise, for fairness, we perform exactly the same grid search process for the tuning of each method.

In terms of accuracy, M-FAC scores highest in the ResNet20 experiment, and comes third, behind tuned SGD and AdaHessian (which perform almost the same), on ResNet32. The overhead of M-FAC is much lower than AdaHessian's, on par with K-FAC's, but higher than GGT's. However, these overheads are comparable only when GGT uses 5× fewer gradients (counter-intuitively, larger window sizes sometimes make results worse for GGT). On the same settings, M-FAC provides up to 5× lower overhead. We examined this performance gap in detail by breaking down performance per operation, and determined that the difference is due to: 1) the better performance of our dynamic algorithm relative to the SVD computation used by GGT; 2) the difference in asymptotic cost between the corresponding per-step gradient-matrix operations in GGT and in our method. Similarly, K-FAC's overhead for the same batch size as M-FAC would be substantially higher.

Table 3 (right) provides a more realistic scenario, where we run SGD, Adam, and M-FAC without tuning or weight decay on Wide Residual Networks (WRN) (zagoruyko2016wide, ) for CIFAR-100, as well as on MobileNetV1/ImageNet, and examine test accuracy. M-FAC achieves the highest accuracy on most runs, even on ImageNet. However, we emphasize that these results can probably be improved significantly by parameter tuning. Another interesting application of M-FAC is fine-tuning sparse models. We fine-tune 90% sparse MobileNetV1/ResNet50 models for 30 epochs starting from the last gradual pruning step and find that M-FAC reaches higher final accuracy than SGD in both cases. At the same time, it is only negligibly slower (less than 10% overhead), since our algorithm's complexity is linear in the model density.

Method | RN20 Acc. | RN20 Overhead | RN32 Acc. | RN32 Overhead
SGD | 91.78 | | 92.80 |
Adam | 89.67 | | 90.56 |
KFAC | 91.65 | | 90.09 |
GGT | 88.38 | | 89.14 |
AdaHessian | 92.17 | | 92.81 |
M-FAC | 92.34 | | 92.65 |

Model | SGD | Adam | M-FAC (Overh.)
WRN 22-2 | 69.93 | 66.90 | 69.76
WRN 40-2 | 71.75 | 70.14 | 72.42
WRN 22-4 | 73.13 | 72.52 | 74.06
MBv1 Dense | 68.06 | 67.92 | 68.96
MBv1 Sparse | 64.11 | | 64.78
RN50 Sparse | 74.78 | | 75.10
Table 3: Left: Comparison of second-order optimizers with individual tuning and weight decay. Right: Additional experiments with no tuning and no weight decay. Sparse results are from fine-tuning after the last step of gradual pruning to 90% sparsity. The dense and sparse models use different sliding window sizes $m$ for M-FAC. For KFAC/GGT we indicate the overhead in a comparable setting to M-FAC (i.e. same batch size/number of gradients) in parentheses.
Wall-clock time comparison.

In Figures 5 and 6 we show the test accuracies during the training of Wide ResNet (WRN) on the CIFAR-100 dataset and MobileNetV1 on the ImageNet dataset with SGD, Adam and M-FAC, plotted against wall-clock time, when executed on a single NVIDIA RTX 3090 GPU. (In the case of the largest network, WRN 22-4, we use a second GPU's memory to store gradients, but still use a single GPU for computation.) One can see that, in most plots, M-FAC reaches better accuracy than SGD at only slightly longer training time. Further, we can see that, for the WRN models, the early training accuracy increases fastest for M-FAC even with respect to wall-clock time, a phenomenon often observed with methods that try to leverage approximate second-order information.

(a) Accuracy versus time for WRN 40-2/CIFAR-100.
(b) Accuracy versus time for MobileNetV1/ImageNet.
Figure 5: Optimization results in accuracy-versus-time format for WRN40-2 and MobileNetV1.
(a) Accuracy versus time for WRN 22-2/CIFAR-100.
(b) Accuracy versus time for WRN 22-4/CIFAR-100.
Figure 6: Optimization results in accuracy-versus-time format for Wide Residual Networks (WRN).

5 Discussion

We presented static and dynamic linear-in-$d$ algorithms for computing IHVPs when the Hessian matrix can be approximated by a sum of rank-1 matrices. We used the classic empirical Fisher approximation, but our results can apply more broadly. The main practical limitation of our approach is the cost of storing the additional gradients. For the static algorithm, we can leverage system memory to circumvent this limitation, via an efficient paging mechanism, which we describe in the Appendix. The dynamic algorithm could parallelize gradient storage, or perform gradient compression 2017-sun . We plan to investigate this in future work, as well as a larger-scale study of optimization performance.

Acknowledgments

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 805223 ScaleML).

References

  • [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pages 265–283, 2016.
  • [2] Naman Agarwal, Brian Bullins, Xinyi Chen, Elad Hazan, Karan Singh, Cyril Zhang, and Yi Zhang. Efficient full-matrix adaptive regularization. In International Conference on Machine Learning, pages 102–110. PMLR, 2019.
  • [3] Shun-ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276, 1998.
  • [4] Shun-ichi Amari. Information geometry and its applications, volume 194. Springer, 2016.
  • [5] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. 2016.
  • [6] Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. arXiv preprint arXiv:2102.06171, 2021.
  • [7] Xin Dong, Shangyu Chen, and Sinno Jialin Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon, 2017.
  • [8] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, 2010.
  • [9] Glenn Jocher et al. YOLOv5 model implementations, AWS, Supervise.ly and YouTube integrations, April 2021.
  • [10] Elias Frantar and Eldar Kurtic. M-FAC implementation: {https://github.com/IST-DASLab/M-FAC}, 2021.
  • [11] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks, 2019.
  • [12] Roger Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers, 2016.
  • [13] Masafumi Hagiwara. A simple and effective method for removal of hidden units and weights. Neurocomputing, 6(2):207 – 218, 1994. Backpropagation, Part IV.
  • [14] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems 5, [NIPS Conference], page 164–171, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016.
  • [17] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications, 2017.

  • [18] Neural Magic Inc. Sparseml framework: https://github.com/neuralmagic/sparseml, 2021.
  • [19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2014.
  • [20] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2575–2583. Curran Associates, Inc., 2015.
  • [21] Shankar Krishnan, Ying Xiao, and Rif A Saurous. Neumann optimizer: A practical optimization algorithm for deep neural networks. arXiv preprint arXiv:1712.03298, 2017.
  • [22] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [23] Frederik Kunstner, Philipp Hennig, and Lukas Balles. Limitations of the empirical fisher approximation for natural gradient descent. In Advances in Neural Information Processing Systems, pages 4156–4167, 2019.
  • [24] Aditya Kusupati, Vivek Ramanujan, Raghav Somani, Mitchell Wortsman, Prateek Jain, Sham Kakade, and Ali Farhadi. Soft threshold weight reparameterization for learnable sparsity, 2020.
  • [25] César Laurent, Thomas George, Xavier Bouthillier, Nicolas Ballas, and Pascal Vincent. An evaluation of fisher approximations beyond kronecker factorization, 2018.
  • [26] Yann Le Cun, John S. Denker, and Sara A. Solla. Optimal Brain Damage, page 598–605. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
  • [27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [28] Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1):503–528, 1989.
  • [29] Alexander Ly, Maarten Marsman, Josine Verhagen, Raoul Grasman, and Eric-Jan Wagenmakers. A tutorial on fisher information, 2017.
  • [30] James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, page 735–742, Madison, WI, USA, 2010. Omnipress.
  • [31] James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.
  • [32] James Martens, Jimmy Ba, and Matt Johnson. Kronecker-factored curvature approximations for recurrent neural networks. In International Conference on Learning Representations, 2018.
  • [33] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature, 2015.
  • [34] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • [35] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
  • [36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • [37] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv, 2018.
  • [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [39] Sidak Pal Singh and Dan Alistarh. Woodfisher: Efficient second-order approximation for neural network compression, 2020.
  • [40] Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. meProp: Sparsified back propagation for accelerated deep learning with reduced overfitting. In Proceedings of the Thirty-Fourth International Conference on Machine Learning, 2017.
  • [41] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. Faster gaze prediction with dense networks and fisher pruning, 2018.
  • [42] Valentin Thomas, Fabian Pedregosa, Bart van Merriënboer, Pierre-Antoine Manzagol, Yoshua Bengio, and Nicolas Le Roux. On the interplay between noise and curvature and its effect on optimization and generalization. In International Conference on Artificial Intelligence and Statistics, pages 3503–3513. PMLR, 2020.
  • [43] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. Eigendamage: Structured pruning in the kronecker-factored eigenbasis, 2019.
  • [44] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [45] Max A Woodbury. Inverting modified matrices. Statistical Research Group, 1950.
  • [46] Zhewei Yao, Amir Gholami, Sheng Shen, Kurt Keutzer, and Michael W Mahoney. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020.
  • [47] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • [48] Wenyuan Zeng and Raquel Urtasun. MLPrune: Multi-layer pruning for automated neural network compression, 2019.
  • [49] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression, 2017.

Appendix A Efficiently Implementing the Dynamic Algorithm

As mentioned in Section 3.2, directly implementing the various recursive formulas of the dynamic algorithm will most likely be quite slow in practice. Thus, we now discuss how to develop an efficient practical implementation. The reader can also find the full code for this implementation of the dynamic algorithm (and the corresponding optimizer) in PyTorch, available at [10].

Setup.

We begin by vectorizing the calculations of $B$ and $A$. Algorithm 1 describes the full setup procedure in PyTorch-like pseudocode.

  B = GGT / lam                      # scalar-product matrix scaled by 1/lambda; row 1 is already final
  for i in range(1, m):
      B[i:, i:] -= torch.outer(B[i - 1, i:], B[i - 1, i:]) / (m + B[i - 1, i - 1])
  B = torch.triu(B)
  A = torch.eye(m, device=B.device, dtype=B.dtype) / lam   # row 1 is already final
  for k in range(1, m):
      A[k, :k] = -(B[:k, k] / (m + torch.diagonal(B)[:k])) @ A[:k, :k]
Algorithm 1: Calculating $B$ and $A$ in $O(m^3)$ time and $O(m^2)$ memory, assuming that the scalar-product matrix $GG^\top$ (here GGT) is already precomputed.

Initially, $B$ is the stored scalar-product matrix scaled by $1/\lambda$. Row 1 already has the correct final values. The algorithm then goes through $m - 1$ iterations, each completing one row but also updating all the rows of higher index. More concretely, in iteration $i$, we subtract the outer product of the previous row, starting at the element one after the diagonal, with itself, scaled by the inverse of $m$ plus the previous diagonal element $B_{i-1,i-1}$, from the square sub-matrix with $B_{ii}$ as the upper left corner, i.e. $B_{i:,\,i:}$. This process is also visualized in Figure 7, where the already calculated elements are shaded in light blue, the ones that are currently being updated in darker blue, and the irrelevant ones in grey. Since the calculation of $B$, as given here, will also produce some unnecessary (and incorrect) values below the diagonal, we delete those after the loop by extracting just the upper triangular part of $B$.

$A$ starts off as the identity matrix times $1/\lambda$ (and row 1 is again already done). Next, $A$ is calculated row by row, where row $k$ is a negative linear combination of the previous rows (up to index $k - 1$, as $A$ is lower triangular), with the coefficients given by the first $k - 1$ elements of column $k$ of $B$, i.e. $B_{1:k-1,\,k}$, each divided by $m$ plus the corresponding diagonal element of $B$, i.e. $B_{ii}$. This process is also shown in Figure 7.

Figure 7: Visualized calculation of $B$ and $A$.

In theory, the algorithm presented so far should be well suited to utilizing the massive parallel computation capabilities of a GPU. However, a straightforward implementation in e.g. PyTorch will not be very fast for larger values of $m$, as it results in thousands of small kernel launches whose overhead adds up to a very significant overall slow-down (concretely, we measured a substantial slow-down in this regime). This means that, to achieve good performance, the setup for the coefficient computation of the dynamic algorithm needs to be implemented with (custom) CUDA kernels that merge computations to get rid of the excessive kernel-launch overhead.

Carefully studying the calculation process of $B$ reveals that it is actually very similar to a Cholesky decomposition (concretely, the only difference is that it does not use the square root of the diagonal element, but the element itself plus $m$). In fact, it is exactly equivalent to taking the upper triangular matrix in the LU-decomposition (with no pivoting) of $GG^{\top}/\lambda + mI_m$ and then subtracting $mI_m$ again. This means that it is possible to reuse the highly optimized LU-decomposition kernel provided by PyTorch to calculate $B$ quite efficiently (a custom kernel implementing the modified Cholesky decomposition would certainly be faster, but we already observed pretty low overheads with the LU version). For the efficient computation of $A$, a custom kernel is needed, which we provide at [10], and describe next.
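This equivalence can be checked directly; the sketch below (not the released kernel) assumes a PyTorch version where torch.linalg.lu supports pivot=False, which at the time of writing is available on CUDA only.

    import torch

    def B_via_lu(GGT, lam, m):
        # Upper factor of the unpivoted LU of GG^T / lam + m*I, minus m*I,
        # matches the upper triangular matrix B computed by Algorithm 1.
        I = torch.eye(m, device=GGT.device, dtype=GGT.dtype)
        C = GGT / lam + m * I
        P, L, U = torch.linalg.lu(C, pivot=False)   # pivot=False requires CUDA
        return torch.triu(U - m * I)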

The main idea is that we can dramatically reduce the number of separate kernel launches by computing multiple rows of $B$ in a single kernel. In general, we split $B$ into blocks of $32 \times 32$ elements (matching the 1024 threads per block available on modern NVIDIA GPUs). We begin by fully computing all diagonal blocks in parallel; due to the lower triangular structure of $B$, these are fully independent. Similarly, each remaining block depends exclusively on the blocks above it (up to the diagonal). Next, we perform an iterative process which updates all remaining blocks with respect to the new values of the most recently completed diagonal, and then completes the calculation of the blocks adjacent to it. This new diagonal then serves as the reference for the next iteration. Figure 8 visualizes this process. The calculation within the individual blocks is easy to parallelize; for more details, please refer to our code for this CUDA kernel. Overall, this way of calculating $B$ is quite fast, taking only milliseconds to execute on an NVIDIA RTX 3090 GPU even for large values of $m$.

Figure 8: Visualization of the CUDA kernel for calculating . In each iteration, all blocks below the previous diagonal (here denoted as 1) are updated in parallel and those adjacent to the diagonal (here denoted as 2) are completed to form the reference diagonal for the next iteration.
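To illustrate the block schedule (though not the CUDA kernel itself), the following is a block-level PyTorch sketch. It relies on the observation that the row recursion for $B$ amounts to inverting a unit lower triangular system; this reformulation, the dampening constant, and the omitted global scaling of $B$ are our assumptions, not part of the reference implementation at [10].

```python
import torch

def compute_B_blocked(D, damp, bs=32):
    """Block-level sketch of the B computation. The row recursion for B is equivalent
    to inverting the unit lower triangular matrix L = I + C, where
    C[k, j] = D[j, k] / (damp + D[j, j]) for j < k; any global rescaling of B is omitted."""
    m = D.shape[0]
    L = torch.eye(m, dtype=D.dtype, device=D.device) \
        + torch.tril(D.t() / (damp + D.diagonal()), diagonal=-1)
    B = torch.zeros_like(L)
    bounds = [(s, min(s + bs, m)) for s in range(0, m, bs)]
    nb = len(bounds)
    # Step 1: all diagonal blocks are independent and can be computed in parallel.
    for s, e in bounds:
        eye = torch.eye(e - s, dtype=D.dtype, device=D.device)
        B[s:e, s:e] = torch.linalg.solve_triangular(L[s:e, s:e], eye,
                                                    upper=False, unitriangular=True)
    # Step 2: fill the block sub-diagonals one by one; a block at distance t from the
    # diagonal only requires blocks at smaller distances in the same block-column.
    for t in range(1, nb):
        for br in range(t, nb):
            bc = br - t
            rs, re = bounds[br]
            cs, ce = bounds[bc]
            acc = torch.zeros(re - rs, ce - cs, dtype=D.dtype, device=D.device)
            for bk in range(bc, br):
                ks, ke = bounds[bk]
                acc += L[rs:re, ks:ke] @ B[ks:ke, cs:ce]
            B[rs:re, cs:ce] = -B[rs:re, rs:re] @ acc
    return B
```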
IHVPs

After discussing an efficient implementation of the precomputation steps, we now focus on actually computing IHVPs; Algorithm 2 presents a vectorized implementation.

Algorithm 2: Computing the IHVP in $O(dm + m^2)$ time, assuming that $D$ and $B$ have already been precomputed.

We note that if the vector being multiplied is the most recently added gradient, then the required intermediate products are already available as the last column of $D$, divided element-wise by the diagonal values. This can be utilized for a more efficient combined update-and-multiply operation, which is essential for the M-FAC optimizer, where we first update our Hessian estimate with a gradient and then return the IHVP for this same gradient. Further, reusing the same scalar products for both the update and the IHVP computation saves one redundant matrix-vector product.

Appendix B Efficiently Implementing the Static Algorithm

Although the formulas discussed in the main body of the paper have low theoretical time complexity, a direct implementation does not make the best use of modern machine learning frameworks and GPUs. Therefore, we now present more efficient matrix-based versions. We note that a block-wise version would simply apply the techniques discussed here independently to each block.

Let $U$ denote the $m \times d$ matrix whose $k$th row is the vector $u_k = \hat{F}_{k-1}^{-1} g_k$ arising in the rank-one update recursion, and let $q$ be the $m$-vector of the corresponding update denominators $q_k = m + g_k^{\top} u_k$. Then the IHVP can be written as follows, where $\oslash$ denotes element-wise division:

$$\hat{F}_m^{-1} x \;=\; \frac{x}{\lambda} \;-\; U^{\top}\big((Ux) \oslash q\big). \qquad (23)$$

It is also possible to extract one element from each row of the inverse simultaneously (in terms of vectorized operations) at $O(dm)$ total cost, for example the entire diagonal. Let $r$ be a mapping such that $r(i)$ is the column of the element to select in row $i$ (e.g. $r(i) = i$ to get the diagonal), let $U_{:,r}$ denote the $m \times d$ matrix whose $i$th column is column $r(i)$ of $U$, and let $\varepsilon$ be the $d$-vector with $\varepsilon_i = 1/\lambda$ if $r(i) = i$ and $\varepsilon_i = 0$ otherwise. Using these definitions, the calculation of the desired result is described below, where $\odot$ denotes the element-wise product:

$$\Big[\hat{F}_m^{-1}\Big]_{i,\,r(i)} \;=\; \varepsilon_i \;-\; \Big[\big(U \odot U_{:,r}\big)^{\top}\big(\mathbf{1} \oslash q\big)\Big]_i, \qquad i = 1,\dots,d. \qquad (24)$$
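Both formulas translate directly into a few lines of PyTorch. The sketch below is a direct transcription of (23) and (24) as stated here, with `damp` playing the role of $\lambda$ and with $U$ and $q$ assumed to be precomputed as described; it is illustrative rather than the reference implementation.

```python
import torch

def ihvp(U, q, x, damp):
    # Eq. (23): F^{-1} x = x / damp - U^T ((U x) / q)
    return x / damp - U.t().mv(U.mv(x) / q)

def select_elements(U, q, r, damp):
    # Eq. (24): the vector of elements [F^{-1}]_{i, r(i)};
    # r is a LongTensor of column indices, e.g. torch.arange(d) for the diagonal.
    d = U.shape[1]
    eps = (r == torch.arange(d, device=U.device)).to(U.dtype) / damp
    return eps - (U * U[:, r]).t().mv(1.0 / q)
```

For example, `select_elements(U, q, torch.arange(d), damp)` returns the full diagonal of the inverse.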

Our implementation contains several additional optimizations. For instance, we efficiently precompute $U$ and $q$ by repeatedly applying (23) to produce the next row $u_k$. In the next section, we additionally discuss several memory-saving optimizations, in particular explicit page swapping between CPU and GPU memory for situations where $U$ does not fully fit into GPU memory.

Finally, Algorithm 3 demonstrates, using a Python-like matrix indexing syntax, how to efficiently precompute $U$ and $q$ by repeatedly applying (23) to produce the next row $u_k$. It should be noted that the row-wise matrix of gradients $G$ is re-purposed as $U$, a simple way to halve the peak memory consumption in practice.

Algorithm 3: Precomputation of $U$ and $q$ given the gradient matrix $G$.
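A hedged PyTorch sketch of this precomputation is given below. It assumes the Fisher parameterization $\hat{F} = \lambda I + \frac{1}{m}\sum_k g_k g_k^{\top}$ (so that $u_1 = g_1/\lambda$ and $q_k = m + g_k^{\top} u_k$), with `damp` standing in for $\lambda$, and it reuses the storage of the gradient matrix $G$ for $U$ as described above.

```python
import torch

def precompute_static(G, damp):
    """Sketch of Algorithm 3: overwrite the m x d gradient matrix G with the rows
    u_k = F_{k-1}^{-1} g_k and return the m-vector q of denominators, by applying
    formula (23) restricted to the already-processed rows."""
    m = G.shape[0]
    q = torch.zeros(m, dtype=G.dtype, device=G.device)
    for k in range(m):
        g = G[k].clone()          # keep the original gradient for the denominator
        u = g / damp              # contribution of F_0^{-1} = I / damp
        if k > 0:
            # subtract the rank-one corrections of the previous k gradients
            u -= G[:k].t().mv(G[:k].mv(g) / q[:k])
        q[k] = m + g.dot(u)       # assumed denominator for the 1/m-scaled Fisher
        G[k] = u                  # reuse G's storage for U
    return G, q
```

The IHVP and element queries from the previous section can then be computed directly from the returned `G` (now holding $U$) and `q`.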

B.1 Additional Optimizations

As we have seen, obtaining a good Fisher approximation of the Hessian requires a sizable number of gradients, which means that, for bigger networks, it can easily happen that the collection of all gradients used for Fisher estimation does not fully fit into GPU memory. We now discuss how to efficiently handle such situations, with an implementation that performs explicit swapping of gradient blocks between GPU and CPU memory (RAM). The most important steps of the method to be discussed are also visualized in Figure 9.

In general, a simple trick to halve the peak memory consumption of the static algorithm is to repurpose the gradient matrix $G$ as $U$, which is possible since $G$ is no longer needed once $U$ and $q$ have been calculated. For the explicit swapping implementation, we first split the collection of gradients into blocks / pages, each containing at most a fixed number of gradients. These are then turned into the corresponding blocks of $U$, in increasing order of the gradient index. Additionally, we maintain a single buffer block of the same size for accumulating intermediate results. All blocks reside in CPU memory and are only loaded to the GPU as needed; meanwhile, $q$ is small enough to be kept in GPU memory at all times. To compute a block of $U$, we first load the first of the already-completed blocks of $U$ fully into GPU memory. Then we load the first gradient of the current block, compute its partial result with respect to the loaded block, and save it in the first index of the buffer. After repeating this for all gradients of the current block, the loaded block of $U$ is swapped with the next one and the whole process starts again, accumulating the results into the buffer. Eventually, after handling the last completed block of $U$, we can load the gradients of the current block into GPU memory, finish the calculation of the corresponding rows of $U$, and store them in place (reusing the memory of the gradient block). It should be noted that the loading of the next block can be overlapped with the current computation and thus costs almost no extra time. Finally, one wants to choose the number of blocks to be as small as possible, to minimize the overhead caused by the page swaps.

Figure 9: Visualization of the swapping implementation of the static algorithm; grey blocks have already been fully calculated. Left: a gradient is loaded from the current block, its partial result is computed against the block of $U$ currently resident on the GPU, and the result is accumulated into the buffer. Right: the calculation of the current block is finished using the completed buffer.
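The following is a structural PyTorch sketch of this paging scheme, assuming the same Fisher parameterization as in the precomputation sketch above. For simplicity, it keeps the current page of gradients on the GPU for the whole pass, which is less memory-frugal than the per-gradient loading described above; it is meant to convey the accumulation pattern, not to reproduce the memory management of the implementation at [10].

```python
import torch

def precompute_paged(pages, damp, device="cuda"):
    """Hedged sketch of the paging scheme: `pages` is a list of CPU tensors, each
    holding a block of gradients (rows); they are overwritten in place with the
    corresponding blocks of U. Returns the vector q, kept on the GPU throughout."""
    m = sum(p.shape[0] for p in pages)
    q = torch.zeros(m, dtype=pages[0].dtype, device=device)
    done = 0  # number of rows already converted to U
    for l, page in enumerate(pages):
        G_l = page.to(device)                 # gradients of the current page
        buf = torch.zeros_like(G_l)           # accumulates cross-page corrections
        offset = 0
        for p in range(l):                    # previously completed U pages
            U_p = pages[p].to(device)         # page swap: CPU -> GPU
            n_p = U_p.shape[0]
            # for every gradient g of this page, accumulate sum_k u_k (u_k^T g) / q_k
            buf += ((G_l @ U_p.t()) / q[offset:offset + n_p]) @ U_p
            offset += n_p
        # finish the page on the GPU: sequential recursion within the page
        U_l = torch.empty_like(G_l)
        for i in range(G_l.shape[0]):
            g = G_l[i]
            u = g / damp - buf[i]
            if i > 0:
                u -= U_l[:i].t().mv(U_l[:i].mv(g) / q[done:done + i])
            q[done + i] = m + g.dot(u)
            U_l[i] = u
        pages[l] = U_l.cpu()                  # store back, reusing the CPU slot
        done += G_l.shape[0]
    return q
```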

Overall, the implementation described above is effective in practice and allows scaling up our static algorithm (with only modest overhead over a pure GPU implementation), in its full non-blockwise version, to a large number of gradients, even for relatively large networks such as ResNet50.

Lastly, we note that for a block-wise implementation where a single block and all its corresponding gradient parts easily fit into GPU memory (i.e. the opposite of the situation described thus far in this section), it can be beneficial to handle multiple blocks simultaneously via batched matrix multiplies. We provide an implementation that can do this, leading to very significant speed-ups for smaller block sizes, where otherwise most of the time is spent loading new data from CPU memory.

Appendix C Additional Experimental Results

C.1 Pruning Update Approximation

As discussed in the main paper, we use the OBS pruning framework but prune multiple weights at the same time, which, as pointed out by [43], is in extreme cases not guaranteed to be better than OBD, since it does not take potential correlations between the pruned weights into account. Thus, we now explore how much of a problem this is in practice. To do that, we compare three different M-FAC variations: not updating the non-pruned weights at all (i.e. OBD, but with a better Hessian approximation), jointly applying the OBS update for all pruned weights (i.e. the OBS approximation), and applying the exact simultaneous update by solving a large linear system (see [39] for details). We prune a ResNet20/CIFAR-10 model in steps of 10% down to 90% sparsity, applying each of the three variants with full block size and recomputing the inverse Hessian approximation after every pruning step. The recomputations ensure that we have a reasonably accurate loss approximation at every pruning step and that the number of weights pruned at each step is small enough to solve the corresponding linear system. Table 4 provides the results.

Method 10% 20% 30% 40% 50% 60% 70% 80% 90%
No update (OBD) 91.4 91.3 91.0 90.2 88.4 84.8 77.2 50.4 11.5
Simultaneous OBS 91.4 91.4 91.5 91.2 90.8 89.7 86.8 75.6 28.2
Linear solving 91.5 91.4 91.4 91.2 90.7 89.7 87.5 75.7 28.1
Table 4: Comparing ResNet20/CIFAR-10 accuracies of different M-FAC variations at varying levels of sparsity.

The results show a very clear gap in accuracy between the no-update and the approximate update version. At the same time, there appears to be almost no difference between the simultaneous OBS and the true OBS update, which has to solve a (potentially very large) linear system. This suggests that the simultaneous update approximation done for computational tractability is also a reasonable choice in terms of achieved accuracy.

C.2 Ablation Studies for One-Shot Pruning

To further examine the properties of the local loss approximation, we now present ablation studies on pretrained ResNet50/ImageNet and MobileNetV1-STR/ImageNet models. The experiments perform one-shot pruning according to the OBS metric estimated using M-FAC, with varying block size and number of gradients. The dampening factor is kept fixed across these experiments. Following [39], we use batched gradients (of size 16) as single samples inside the Fisher approximation; this does not alter the results, but reduces variance.

The goal of these experiments is to examine two questions: 1) does a larger block size always imply a better approximation, and 2) does a higher number of gradients always imply a better approximation? We will see that neither of these questions has an obvious answer, and we will discuss a meta-algorithm for choosing the block size and the number of gradients.

The numbers presented for our one-shot experiments are the averages over two runs. As the variance is very small, we omit error bars. We sometimes break the convention and “zoom into” the y axis for visibility.

C.2.1 ResNet50/ImageNet

The first set of experiments examines the dependency on the number of gradients and the block size for the ResNet50/ImageNet model. The left subfigure in Figure 10 shows results for block sizes ranging from small blocks up to all weights (i.e. no blocking) for a fixed, smaller number of gradients, while the right subfigure shows the same for a larger number of gradients. The first graph presents a fairly unintuitive finding: lower block sizes appear to provide better pruning accuracy. We analyze this finding in detail and explain it in Section C.3; roughly, it is due to the fact that gradient entries are scaled by the gradient norm over the block. As predicted by our analysis, this effect is mitigated by increasing the number of gradients used for estimation, in which case larger block sizes become the best-performing and the performance of the full block size also improves. Please see Section C.3 for the full analysis.

Figure 11 examines the complementary effect, that of the number of gradients for a fixed block size (10K and 100K, respectively). The results suggest that more gradients help improve the estimation of the Fisher matrix, although we observe a saturation effect. We also note that, at higher one-shot sparsities (e.g., 60% in one shot, not shown for visibility), this effect does not always hold. However, for such large pruning “perturbations”, the assumption that the Hessian is constant along the pruning direction is unlikely to hold, which affects the stability of the results.

Figure 10: An ablation study on the ResNet50/ImageNet model showing the effect of the block size for a fixed number of gradients.

C.2.2 MobileNetV1-STR/ImageNet

The second set of experiments shows the dependency on the number of gradients and the block size for the MobileNetV1-STR/ImageNet model. We use the implementation and pre-trained weights from [24]. Figure 12 shows results for block sizes ranging from small blocks up to all weights (i.e. no blocking), for a fixed number of gradients (a smaller count in the left subfigure and a larger one in the right subfigure). Again, in both cases, lower block sizes appear to provide an improved approximation for this model. This is due to the gradient scaling effects which we discuss in Section C.3; the results show that for this compact model these effects are more pronounced than for the ResNet50 model.

Figure 11: An ablation study on the ResNet50/ImageNet model showing the effect of the number of gradients for a fixed block size.
Figure 12: An ablation study on the MobileNetV1-STR/ImageNet model showing the effect of the block size for a fixed number of gradients.

Figure 13 examines the opposite effect, that of the number of gradients for a fixed block size (a smaller block size in the left subfigure and a larger one in the right subfigure). We show results for a wide range of gradient counts. The results clearly suggest that more gradients improve accuracy, although the improvement saturates at the higher gradient counts.

Figure 13: An ablation study on the MobileNetV1-STR/ImageNet model showing the effect of the number of gradients for a fixed block size.

C.3 Discussion

Normalization Effects.

We first discuss our finding that, for some models, e.g. MobileNetV1, smaller blocks appear to yield better accuracy. This can be explained by examining the recursive form of the diagonal elements of the inverse. Specifically, without blocking, the $j$th diagonal element satisfies

$$\Big[\hat{F}_k^{-1}\Big]_{jj} \;=\; \Big[\hat{F}_{k-1}^{-1}\Big]_{jj} \;-\; \frac{\big(\big[\hat{F}_{k-1}^{-1} g_k\big]_j\big)^2}{m + g_k^{\top} \hat{F}_{k-1}^{-1} g_k},$$

where $g_k$ denotes the $k$th gradient and $[\,\cdot\,]_j$ the $j$th element of a vector. Notice that, in the case of “full” block size (“all”, i.e. block size $d$), the $j$th gradient entry is effectively divided by the norm of the entire gradient, which may cause the correction to become negligible if this norm is large. In turn, this leads to a nearly constant diagonal and hence to essentially the magnitude pruning ranking and update. By contrast, in the case of smaller blocks, the entry is only divided by the norm of the gradient restricted to its block, which mitigates this normalization issue. Similarly, using more gradients also helps, since unrolling the recursion yields more summands, which allows the estimate to deviate from the magnitude pruning baseline.
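As a toy illustration of this effect (not one of the paper's experiments), the sketch below compares the OBS saliency ranking obtained from the full empirical Fisher with that of a block-diagonal version on random data. The dimensions, the dampening value, and the $1/m$ scaling are assumptions; with many parameters and few gradients, the full-Fisher diagonal is expected to be nearly constant, so its ranking is close to plain magnitude pruning, while the blocked ranking deviates more.

```python
import torch

torch.manual_seed(0)
d, m, bs, damp = 2048, 16, 32, 1e-5
w = torch.randn(d)                       # toy weights
G = torch.randn(m, d)                    # toy gradients

# Full empirical Fisher F = damp*I + (1/m) sum_k g_k g_k^T and its inverse diagonal.
F = damp * torch.eye(d) + G.t() @ G / m
diag_full = torch.linalg.inv(F).diagonal()

# Block-diagonal version: invert each bs x bs diagonal block independently.
diag_block = torch.cat([torch.linalg.inv(F[i:i + bs, i:i + bs]).diagonal()
                        for i in range(0, d, bs)])

def bottom_overlap(a, b, k=d // 10):
    # fraction of shared indices among the k lowest-saliency weights
    return len(set(torch.argsort(a)[:k].tolist())
               & set(torch.argsort(b)[:k].tolist())) / k

magnitude = w ** 2                       # magnitude-pruning saliency
obs_full = w ** 2 / diag_full            # OBS saliency, full Fisher
obs_block = w ** 2 / diag_block          # OBS saliency, blocked Fisher

print(bottom_overlap(magnitude, obs_full))   # expected close to 1: nearly magnitude pruning
print(bottom_overlap(magnitude, obs_block))  # typically lower: blocking changes the ranking
```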

Choosing the block size value.

Given this, our procedure for choosing the block size hyperparameter is to simply perform a one-shot pruning ablation for this parameter, usually at the largest gradient count that can reasonably be executed and at a moderate sparsity (e.g., 40% from a pretrained model). We then choose the best-performing block size in terms of the model's evaluation metric. We note that these single pruning steps are quite fast when using our implementation.

C.4 Gradual Pruning Experiment Hyperparameters

We begin by stating the hyperparameters for our gradual pruning runs. For both MobileNetV1-STR and ResNet50-STR, we employ hyperparameter values identical to WoodFisher [39], so that our results are directly comparable. The only parameters we change are the ones that pertain to our algorithm, in particular the block size and the number of gradients $m$.

For the MobileNetV1-STR gradual experiments, pruning starts from the fully-trained model used by STR/WoodFisher, whose Top-1 validation accuracy is 72%. Gradual pruning is then performed over a total of 24 epochs, with additional weights pruned at regular epoch intervals, followed by fine-tuning until the next pruning step. Pruning targets are set following the standard polynomial schedule of [49, 11]. Unless otherwise stated, SGD with momentum and weight decay is used for fine-tuning, with a fixed learning rate during the pruning phase. Following the last pruning step, fine-tuning is applied for the remaining epochs, during which the initial learning rate is decayed multiplicatively at regular epoch intervals.

For the ResNet50-STR gradual experiments, pruning starts from the fully-trained model used by STR/WoodFisher, whose Top-1 validation accuracy is 77.01%. Gradual pruning is again performed over a fixed number of epochs, with additional weights pruned at regular epoch intervals, followed by fine-tuning until the next pruning step. Pruning targets are set following the standard polynomial schedule of [49, 11]. Unless otherwise stated, SGD with momentum and weight decay is used for fine-tuning, with a fixed learning rate during the pruning phase. Following the last pruning step, fine-tuning is applied for the remaining epochs, during which the initial learning rate is decayed multiplicatively at regular epoch intervals.

C.5 Optimization Experiment Hyperparameters

ResNet20/32

We now discuss the hyperparameters used for the ResNet20/32 comparison between M-FAC and various first- and second-order optimizers. Both models are trained with batch size 128 for 164 epochs, dropping the learning rate twice during training; this is exactly the training setup used by [16]. The best initial learning rate and weight decay values were determined by starting from the values suggested by the authors of the respective methods; in addition, we performed a grid search around these values to examine whether better results could be obtained. Table 5 summarizes the final hyperparameter settings for SGD, Adam, AdaHessian and M-FAC.

Method learning rate momentum weight decay
SGD 0.1 0.9 0.0001
Adam 0.001 (0.9, 0.999)
AdaHessian 0.015 (0.9, 0.999) 0.003 (RN20), 0.0005 (RN32)
M-FAC 0.001 0.003
Table 5: Hyperparameter settings used for the ResNet20 (RN20) and ResNet32 (RN32) experiments.

All experiments were repeated 5 times with different random seeds, and we report the median of the best test accuracy; the standard deviations were generally quite low.

Hyperparameters for GGT and K-FAC.

Unfortunately, GG