
Generalizing and Improving Jacobian and Hessian Regularization

12/01/2022
by Chenwei Cui, et al.

Jacobian and Hessian regularization aim to reduce the magnitude of the first and second-order partial derivatives with respect to neural network inputs, and they are predominantly used to ensure the adversarial robustness of image classifiers. In this work, we generalize previous efforts by extending the target matrix from zero to any matrix that admits efficient matrix-vector products. The proposed paradigm allows us to construct novel regularization terms that enforce symmetry or diagonality on square Jacobian and Hessian matrices. On the other hand, the major challenge for Jacobian and Hessian regularization has been high computational complexity. We introduce Lanczos-based spectral norm minimization to tackle this difficulty. This technique uses a parallelized implementation of the Lanczos algorithm and is capable of effective and stable regularization of large Jacobian and Hessian matrices. Theoretical justifications and empirical evidence are provided for the proposed paradigm and technique. We carry out exploratory experiments to validate the effectiveness of our novel regularization terms. We also conduct comparative experiments to evaluate Lanczos-based spectral norm minimization against prior methods. Results show that the proposed methodologies are advantageous for a wide range of tasks.



1 Introduction

Regularizing the Jacobian and Hessian matrices of neural networks with respect to inputs has long been of interest due to their connection with the generalizability and adversarial robustness of neural networks (Drucker and Le Cun, 1992; Varga et al., 2018; Mustafa et al., 2020). However, exact construction of the Jacobian and Hessian matrices is expensive: the computational cost scales linearly with the dimensionality of network inputs and outputs (Chen and Duvenaud, 2019; Mustafa et al., 2020). Early attempts to alleviate this difficulty either are not able to scale up to neural networks with high-dimensional inputs and outputs, or rely on cumbersome designs while still having large estimation variances (Drucker and Le Cun, 1992; Gu and Rigazio, 2014; Martens et al., 2012).

With the emergence of the vector-Jacobian product (VJP) (Paszke et al., 2017), Jacobian-vector product (JVP) (Hirsch, 1974), and Hessian-vector product (HVP) (Pearlmutter, 1994), recent efforts have turned to well-established matrix-free methods. Such methodologies are more principled and elegant in that they reuse existing theories and are backed by mature implementations.
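As a concrete illustration of these three primitives, the sketch below is our own (not from the paper) and uses torch.autograd.functional; the toy networks f and energy and their dimensions are placeholder assumptions.

import torch
from torch.autograd.functional import vjp, jvp, hvp

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.Softplus(), torch.nn.Linear(16, 8))

def f(x):       # vector-valued network, R^8 -> R^8
    return net(x)

def energy(x):  # scalar-valued network, used for the Hessian example
    return net(x).pow(2).sum()

x = torch.randn(8)
u = torch.randn(8)  # probe vector for the VJP (lives in output space)
v = torch.randn(8)  # probe vector for the JVP and HVP (lives in input space)

_, vjp_val = vjp(f, x, u)       # u^T J_f(x): one reverse-mode pass
_, jvp_val = jvp(f, x, v)       # J_f(x) v:   forward mode via double backward
_, hvp_val = hvp(energy, x, v)  # H(x) v:     Hessian-vector product
print(vjp_val.shape, jvp_val.shape, hvp_val.shape)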

One class of these works uses Hutchinson's trace estimator (Hutchinson, 1990) to construct unbiased estimates for quantities such as the Frobenius norm of Jacobians or Hessians (Varga et al., 2018; Hoffman et al., 2019; Song et al., 2020). However, due to the stochastic nature of Hutchinson's estimator, such methods suffer from large variances that hinder training and generalization.

A recently emerging line of research instead focuses on minimizing the spectral norms of Jacobians and Hessians (Johansson et al., 2022; Mustafa et al., 2020). The rationale is two-fold. For one thing, due to the equivalence of norms, reducing spectral norms minimizes Frobenius norms. For another, spectral norms can be accurately obtained at a constant computational cost (Golub and van der Vorst, 2000). However, current efforts still rely on rudimentary algorithms such as the Power Method or gradient ascent to calculate the spectral norms, overlooking the mature line of research on eigenvalue problems. In fact, existing research strongly indicates that the Lanczos algorithm is ideal for this task (Paige, 1972; Golub and van der Vorst, 2000).

Another unsatisfactory phenomenon we observe is the lack of flexibility: most existing works focus on training Jacobians and Hessians into zero (Drucker and Le Cun, 1992; Varga et al., 2018; Mustafa et al., 2020), and few have explored the possibility of training them into arbitrary matrices, much less matrices with certain properties, such as symmetry and diagonality. As we later discuss in Sec. 4.1, enforcing symmetry or diagonality upon square Jacobian and Hessian matrices has important implications for Energy-based Models (EBMs) (Salimans and Ho, 2021) and generative models (Peebles et al., 2020).

In this work, we first generalize the regularization of Jacobians and Hessians, allowing for the conformation to arbitrary target matrices or matrices with certain properties. Next, we propose Lanczos-based spectral norm minimization, an improved methodology to optimize the regularization terms.

We start by deriving conditions under which a target matrix can be conformed to. Following the conditions, we propose novel regularization terms that match a Jacobian or Hessian matrix with a function of itself. We show that the proposed regularizer can enforce symmetry and diagonality upon square Jacobian and Hessian matrices of neural networks.

To reliably optimize the proposed regularization terms, we implement a parallelized version of the Lanczos algorithm (Paige, 1972). We provide the details of the algorithm and explain how to perform the subsequent spectral norm minimization.

To validate the effectiveness of our proposed regularizers, we construct exploratory high-dimensional tasks that are detailed in Sec. 4.1. We observe strong results that adhere to our theoretical analyses.

To rigorously compare our Lanczos-based spectral norm minimization with previous methodologies, we present extensive controlled experiments in the context of adversarial robustness. We implement all methodologies ourselves to ensure a rigorous and fair comparison. We use ResNet-18 (He et al., 2016) and the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, 2009). A strong and standard adversary, namely PGD(20), is used to evaluate the performance. A running time analysis is also conducted for our technique. The experiments show that our Lanczos-based spectral norm minimization is not only efficient to compute but also surpasses prior methods in performance by a large margin.

To summarize our main contributions:

  • We generalize the task of regularizing Jacobians and Hessians of neural networks with respect to inputs, permitting arbitrary target matrices.

  • We explore novel training objectives that enforce symmetry or diagonality for square matrices, which we validate both theoretically and empirically, opening up new possibilities for applications.

  • We propose Lanczos-based spectral norm minimization, an effective technique for Jacobian and Hessian training. It is not only theoretically sound; experiments also show evident improvements over prior methods.

Notation. We summarize the notation used throughout this paper. By convention, we use regular letters for scalars and bold letters for both vectors and matrices. Neural networks are denoted by a function $f$, which can have single or multiple outputs depending on the specific situation. The Jacobian matrix of $f$ at point $\mathbf{x}$ is denoted by $\mathbf{J}(\mathbf{x})$, and the Hessian matrix of $f$ at point $\mathbf{x}$ is denoted by $\mathbf{H}(\mathbf{x})$. When we talk about Jacobian or Hessian matrices in general, we use $\mathbf{A}$ to denote $\mathbf{J}(\mathbf{x})$ or $\mathbf{H}(\mathbf{x})$. In some circumstances, for simplicity, we omit the inputs and simply write $\mathbf{J}$ and $\mathbf{H}$. For any vector $\mathbf{v}$, we denote its L2 norm by $\|\mathbf{v}\|_2$. For any matrix $\mathbf{A}$, $\mathbf{A}^\top$ means its transpose, its Frobenius norm is denoted by $\|\mathbf{A}\|_F$, and its spectral norm is denoted by $\|\mathbf{A}\|_2$. $\operatorname{tr}(\mathbf{A})$ denotes the trace of $\mathbf{A}$. $\mathbf{I}$ denotes the identity matrix. $A_{ij}$ means the entries of matrix $\mathbf{A}$. $\sigma_{\max}(\mathbf{A})$ denotes its largest singular value. $\lambda_{\max}(\mathbf{A})$ denotes its largest eigenvalue in magnitude, and the corresponding unit eigenvector is denoted by $\mathbf{v}_{\max}$.

2 Related Work

Early efforts on training the Jacobians of neural networks with respect to inputs trace back to Drucker and Le Cun (1992). The authors propose double backpropagation to regularize the Jacobians of loss functions. However, this algorithm only applies to computational graphs with single outputs. For the more general case, Gu and Rigazio (2014) utilize layer-wise approximations to regularize the Jacobians of neural networks with multiple inputs and outputs. For Hessians, Kingma and Cun (2010) reduce the cost of backpropagation by limiting it to differentiating the diagonal of a Hessian matrix. Martens et al. (2012) later introduce curvature propagation, an algorithm that produces stochastic estimates of Hessian matrices.

These earlier attempts either are not able to scale up to neural networks with high-dimensional inputs and outputs, or rely on cumbersome designs while still having large estimation variances.

Recent attempts have turned to well-established matrix-free techniques and are either based on Hutchinson’s estimators or spectral norm minimization.

Hutchinson's estimator (Hutchinson, 1990) takes the form $\mathbb{E}_{\mathbf{v}}[\mathbf{v}^\top \mathbf{A} \mathbf{v}] = \operatorname{tr}(\mathbf{A})$, where $\mathbf{A}$ is an arbitrary square matrix and $\mathbf{v}$ is a random vector such that $\mathbb{E}[\mathbf{v}\mathbf{v}^\top] = \mathbf{I}$. For a symmetric $\mathbf{A}$, when $\mathbf{v} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$,

$$\operatorname{Var}_{\mathbf{v}}[\mathbf{v}^\top \mathbf{A} \mathbf{v}] = 2\,\|\mathbf{A}\|_F^2. \quad (1)$$

Instead, when $\mathbf{v}$ is drawn from a multivariate Rademacher distribution,

$$\operatorname{Var}_{\mathbf{v}}[\mathbf{v}^\top \mathbf{A} \mathbf{v}] = 2 \sum_{i \neq j} A_{ij}^2. \quad (2)$$

Varga et al. (2018) first propose to use random projections to regularize the Jacobian of a neural network. The same technique is later revisited by Hoffman et al. (2019). Specifically, given a random vector $\mathbf{v}$, $\|\mathbf{v}^\top \mathbf{J}\|_2^2$ is minimized. This is an instance of Hutchinson's estimators in that $\mathbb{E}_{\mathbf{v}}[\|\mathbf{v}^\top \mathbf{J}\|_2^2] = \mathbb{E}_{\mathbf{v}}[\mathbf{v}^\top \mathbf{J}\mathbf{J}^\top \mathbf{v}] = \|\mathbf{J}\|_F^2$. Consequently, the single-sample estimate suffers from significant variance.
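For concreteness, the following sketch is ours (not the authors' code; jacobian_frobenius_sq_estimate is a hypothetical helper name) and forms the Hutchinson-style estimate of the squared Jacobian Frobenius norm; averaging over more probe vectors reduces the variance discussed above.

import torch

def jacobian_frobenius_sq_estimate(f, x, num_probes=1):
    # Unbiased estimate of ||J_f(x)||_F^2 via random projections:
    # E_v[||v^T J||^2] = ||J||_F^2 when E[v v^T] = I; a single sample has large variance.
    x = x.detach().requires_grad_(True)
    y = f(x)
    total = 0.0
    for _ in range(num_probes):
        v = torch.randn_like(y)  # Gaussian probe with E[v v^T] = I
        (vjp_val,) = torch.autograd.grad(y, x, grad_outputs=v, create_graph=True, retain_graph=True)
        total = total + vjp_val.pow(2).sum()
    return total / num_probes    # differentiable, so it can be added to the training loss

# usage sketch with a toy network
net = torch.nn.Sequential(torch.nn.Linear(8, 4), torch.nn.Softplus(), torch.nn.Linear(4, 3))
reg = jacobian_frobenius_sq_estimate(net, torch.randn(8))
reg.backward()                   # gradients flow to the network parameters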

For Hessians, Song et al. (2020) propose sliced score matching, in which Hutchinson’s estimators are used to maximize the trace of Hessians. However, sliced score matching is often observed to be too stochastic and less performant, compared with its Hessian-free counterparts (Vincent, 2011).

Another significant adoption of Hutchinson's estimator is the Hessian penalty (Peebles et al., 2020). The authors propose unbiased estimators to regularize the off-diagonal elements of Hessians. Essentially, this technique is built upon Eq. (2). Nonetheless, the estimator endures high variance since, in practice, the authors compute an empirical variance from only two samples.

Being unbiased estimators, Hutchinson-based methods are theoretically sound. However, in practice, the variance of these estimators is significant and reduces performance, as we validate in our experiments.

Spectral norm minimization is a recently emerging line of research that instead focuses on minimizing the spectral norms of Jacobians and Hessians. Spectral norms can be accurately obtained at a constant cost (Golub and van der Vorst, 2000), and the norm equivalence $\|\mathbf{A}\|_2 \le \|\mathbf{A}\|_F \le \sqrt{r}\,\|\mathbf{A}\|_2$ holds for any matrix $\mathbf{A}$ of rank $r$.

Input Hessian regularization (Mustafa et al., 2020) considers the term $\mathbf{v}^\top \mathbf{H} \mathbf{v}$ and uses gradient ascent to solve for $\mathbf{v}$, in an attempt to find the spectral norm of $\mathbf{H}$. We however show in Appendix A that this method is closely related to power iteration; in certain cases, they are outright equivalent. However, power iteration generally converges much more slowly than the Lanczos algorithm. Depending on the matrix, it may even cease to converge (Golub and van der Vorst, 2000). Since both methods have computational costs dominated by matrix-vector products and therefore take similar running times, it is hard to justify using power iteration instead of the Lanczos algorithm.

For Jacobians, a concurrent work (Johansson et al., 2022) recently proposes to use power iteration to find spectral norms. However, as mentioned above, the convergence of power iteration is slow and not guaranteed.

Research regarding spectral norm minimization is still at an early stage. Lanczos-based spectral norm minimization not only is theoretically sound but also empirically surpasses existing methods by a large margin (see Sec. 4.4).

3 Methodology

3.1 Spectral Norm Minimization

We start our exposition by formulating the problem of training Jacobian and Hessian matrices into zero, using spectral norm minimization. Subsequently, we outline conditions under which spectral norm minimization can be performed.

We consider a matrix $\mathbf{A}$. It can be either a Jacobian or a Hessian matrix resulting from a neural network. Our exposition does not depend on the particular width and height of $\mathbf{A}$. Specifically, we make the trivial assumption that the neural networks are smooth and, in turn, that their Hessians are symmetric. In our experiments, we use the Softplus activation function (Nair and Hinton, 2010) to ensure smoothness.

Given $\mathbf{A}$, to train it into a zero matrix, the common idea is to minimize its Frobenius norm $\|\mathbf{A}\|_F$. However, direct minimization of $\|\mathbf{A}\|_F$ requires an exact construction of $\mathbf{A}$. This is usually impractical for the Jacobians and Hessians of neural networks with high-dimensional inputs or outputs.

Fortunately, $\|\mathbf{A}\|_F$ can be properly minimized by minimizing the spectral norm $\|\mathbf{A}\|_2$. Consider the following norm equivalence:

$$\|\mathbf{A}\|_2 \;\le\; \|\mathbf{A}\|_F \;\le\; \sqrt{r}\,\|\mathbf{A}\|_2,$$

where $r$ is the rank of matrix $\mathbf{A}$. It shows that as $\|\mathbf{A}\|_2 \to 0$, $\sqrt{r}\,\|\mathbf{A}\|_2$ becomes an increasingly tight bound on $\|\mathbf{A}\|_F$. Also, $\|\mathbf{A}\|_2 = 0$ if and only if $\|\mathbf{A}\|_F = 0$. Therefore, we minimize $\|\mathbf{A}\|_2$ instead.
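The equivalence follows directly from the singular values $\sigma_1 \ge \dots \ge \sigma_r > 0$ of $\mathbf{A}$:

$$\|\mathbf{A}\|_2^2 = \sigma_1^2 \;\le\; \sum_{i=1}^{r} \sigma_i^2 = \|\mathbf{A}\|_F^2 \;\le\; r\,\sigma_1^2 = r\,\|\mathbf{A}\|_2^2.$$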

We take notice that the spectral norm of $\mathbf{A}$ is the maximum singular value $\sigma_{\max}(\mathbf{A})$, which is by definition equal to $\sqrt{\lambda_{\max}(\mathbf{A}\mathbf{A}^\top)}$, where $\lambda_{\max}(\mathbf{A}\mathbf{A}^\top)$ is the maximum eigenvalue of $\mathbf{A}\mathbf{A}^\top$ in terms of magnitude. It is convenient to note that $\mathbf{A}\mathbf{A}^\top$ is symmetric and positive semi-definite for any matrix $\mathbf{A}$. Also, since $\mathbf{A}\mathbf{A}^\top$ is symmetric, we have

$$\lambda_{\max}(\mathbf{A}\mathbf{A}^\top) \;=\; \mathbf{v}_{\max}^\top \mathbf{A}\mathbf{A}^\top \mathbf{v}_{\max} \;=\; \|\mathbf{v}_{\max}^\top \mathbf{A}\|_2^2,$$

where $\mathbf{v}_{\max}$ is the normalized eigenvector corresponding to $\lambda_{\max}(\mathbf{A}\mathbf{A}^\top)$.

So far we have transformed spectral norm minimization into minimizing $\|\mathbf{v}_{\max}^\top \mathbf{A}\|_2^2$. For this purpose, we take two steps: 1) we obtain $\mathbf{v}_{\max}$ by solving the extremal eigenvalue problem for $\mathbf{A}\mathbf{A}^\top$; 2) given $\mathbf{v}_{\max}$, we minimize $\|\mathbf{v}_{\max}^\top \mathbf{A}\|_2^2$.

For 1), we use the Lanczos algorithm to solve for $\mathbf{v}_{\max}$. For now, we focus on the conditions that $\mathbf{A}$ should satisfy, and elaborate on other details in Sec. 3.4. The Lanczos algorithm operates on Hermitian matrices and requires the matrices to admit efficient matrix-vector products. The first condition is met since $\mathbf{A}\mathbf{A}^\top$ is always symmetric. The second condition requires the existence of an efficient $\mathbf{A}\mathbf{A}^\top \mathbf{v}$ operator.

For 2), we minimize $\|\mathbf{v}_{\max}^\top \mathbf{A}\|_2^2$, given $\mathbf{v}_{\max}$. For Jacobians and Hessians, the minimization is made possible by VJP, JVP, and HVP. It follows that $\mathbf{A}$ should permit an efficient vector-matrix product operator.

The following proposition summarizes the conditions under which $\mathbf{A}$ can be optimized by spectral norm minimization.

Proposition 1.

Matrix $\mathbf{A}$ can be optimized by spectral norm minimization if both of the following are satisfied:
1) $\mathbf{A}\mathbf{A}^\top \mathbf{v}$ can be efficiently computed for any vector $\mathbf{v}$.
2) $\mathbf{v}^\top \mathbf{A}$ can be efficiently computed for any vector $\mathbf{v}$.

We conclude this section by quickly validating that these conditions are satisfiable for the Jacobian and Hessian matrices of neural networks.

For a Jacobian $\mathbf{J}$, we have $\mathbf{J}\mathbf{J}^\top \mathbf{v} = \mathbf{J}(\mathbf{J}^\top \mathbf{v})$, which can be efficiently computed using the VJP and JVP operators. Also, by the definition of VJP, we can efficiently compute $\mathbf{v}^\top \mathbf{J}$.

For a Hessian $\mathbf{H}$, we note that it is symmetric since we assume smooth neural networks. Therefore, we have $\mathbf{H}\mathbf{H}^\top \mathbf{v} = \mathbf{H}(\mathbf{H}\mathbf{v})$ and $\mathbf{v}^\top \mathbf{H} = (\mathbf{H}\mathbf{v})^\top$. Both can be efficiently obtained given the HVP operator.
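To make these two conditions concrete, here is a minimal PyTorch sketch of our own (not the authors' implementation; jjt_matvec, hh_matvec, and jacobian_spectral_loss are hypothetical helper names), assuming the $\mathbf{A}\mathbf{A}^\top$ formulation above.

import torch
from torch.autograd.functional import vjp, jvp, hvp

def jjt_matvec(f, x, v):
    # v -> J J^T v, composed from one VJP followed by one JVP
    _, jt_v = vjp(f, x, v)      # J^T v, shaped like the input
    _, jjt_v = jvp(f, x, jt_v)  # J (J^T v), shaped like the output
    return jjt_v

def hh_matvec(scalar_f, x, v):
    # v -> H H v, composed from two Hessian-vector products (H is symmetric)
    _, hv = hvp(scalar_f, x, v)
    _, hhv = hvp(scalar_f, x, hv)
    return hhv

def jacobian_spectral_loss(f, x, v_max):
    # given the leading eigenvector v_max of J J^T, ||v_max^T J||_2^2 equals sigma_max(J)^2
    x = x.detach().requires_grad_(True)
    y = f(x)
    (vT_J,) = torch.autograd.grad(y, x, grad_outputs=v_max.detach(), create_graph=True)
    return vT_J.pow(2).sum()    # differentiable with respect to the parameters of f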

3.2 Generalized Jacobian and Hessian Regularization

In this section, we generalize the idea of $\|\mathbf{A}\|_F$ minimization. The new paradigm allows training a matrix $\mathbf{A}$ into any arbitrary target matrix $\mathbf{T}$, hence the name generalized Jacobian and Hessian regularization. Further, we derive conditions that $\mathbf{T}$ should satisfy in order to be a valid target for spectral norm minimization.

To conform $\mathbf{A}$ to $\mathbf{T}$, it is straightforward to minimize $\|\mathbf{A} - \mathbf{T}\|_F$. We can efficiently minimize it using spectral norm minimization, as long as $\mathbf{A} - \mathbf{T}$ follows Proposition 1.

To begin with, we consider $(\mathbf{A} - \mathbf{T})(\mathbf{A} - \mathbf{T})^\top \mathbf{v}$. Expanding it gives us:

$$(\mathbf{A} - \mathbf{T})(\mathbf{A} - \mathbf{T})^\top \mathbf{v} \;=\; \mathbf{A}\mathbf{A}^\top \mathbf{v} - \mathbf{A}\mathbf{T}^\top \mathbf{v} - \mathbf{T}\mathbf{A}^\top \mathbf{v} + \mathbf{T}\mathbf{T}^\top \mathbf{v}.$$

Together with $\mathbf{v}^\top (\mathbf{A} - \mathbf{T}) = \mathbf{v}^\top \mathbf{A} - \mathbf{v}^\top \mathbf{T}$, we can therefore conclude the following proposition.

Proposition 2.

Matrix $\mathbf{T}$ is a valid target matrix for spectral norm minimization if both of the following are satisfied:
1) $\mathbf{T}\mathbf{v}$ can be efficiently computed for any vector $\mathbf{v}$.
2) $\mathbf{v}^\top \mathbf{T}$ can be efficiently computed for any vector $\mathbf{v}$.

Proposition 2 implies that any matrix that permits efficient left and right vector products is a valid target matrix. This ensures flexibility when choosing $\mathbf{T}$. For example, $\mathbf{T}$ can be an explicit constant matrix, the Jacobian or Hessian resulting from another neural network, or any transformation of $\mathbf{A}$ that preserves efficient vector products (see Sec. 3.3).

3.3 Enforcing Symmetric or Diagonal Matrices

In this section, we make the novel observation that certain properties can be enforced upon Jacobian and Hessian matrices using spectral norm minimization. Specifically, we propose formulas that enforce symmetry or diagonality for Jacobian and Hessian matrices of neural networks, with respect to network inputs.

Symmetry. For symmetry, we consider Jacobians of neural networks whose number of inputs equals the number of outputs. In this case, Jacobians are square matrices but are generally non-symmetric (Salimans and Ho, 2021).

By the definition of symmetry, we expect $\mathbf{J} = \mathbf{J}^\top$. An accurate depiction of this objective is the minimization of $\|\mathbf{J} - \mathbf{J}^\top\|_F$. We soon notice that by making $\mathbf{J}^\top$ the target matrix, it is possible to enforce symmetry for square Jacobians. It is easy to validate that $\mathbf{J}^\top$ satisfies Proposition 2, given the VJP and JVP operators. Therefore, we can indeed optimize $\|\mathbf{J} - \mathbf{J}^\top\|_2$ efficiently using spectral norm minimization.

In practice, to find the spectral norm, we provide the operator $\mathbf{v} \mapsto (\mathbf{J} - \mathbf{J}^\top)(\mathbf{J} - \mathbf{J}^\top)^\top \mathbf{v}$ to our parallelized Lanczos algorithm. For optimization, we simply calculate $\|\mathbf{v}_{\max}^\top (\mathbf{J} - \mathbf{J}^\top)\|_2^2$ given the eigenvector $\mathbf{v}_{\max}$.
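A minimal sketch (ours; asym_matvec, asym_op, and symmetry_loss are hypothetical names) of this operator and loss for a square Jacobian, following the formulation above; it uses the fact that $(\mathbf{J} - \mathbf{J}^\top)^\top = -(\mathbf{J} - \mathbf{J}^\top)$.

import torch
from torch.autograd.functional import vjp, jvp

def asym_matvec(f, x, v):
    # v -> (J - J^T) v, using one JVP and one VJP (input and output dimensions must match)
    _, jv = jvp(f, x, v)
    _, jtv = vjp(f, x, v)
    return jv - jtv

def asym_op(f, x, v):
    # v -> (J - J^T)(J - J^T)^T v = -(J - J^T)(J - J^T) v, the operator handed to the eigensolver
    return asym_matvec(f, x, -asym_matvec(f, x, v))

def symmetry_loss(f, x, v_max):
    # ||v_max^T (J - J^T)||_2^2 = ||J^T v_max - J v_max||_2^2, differentiable w.r.t. f's parameters
    v = v_max.detach()
    _, jt_v = vjp(f, x, v, create_graph=True)
    _, j_v = jvp(f, x, v, create_graph=True)
    return (jt_v - j_v).pow(2).sum()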

Sec. 4.3 presents empirical evidence that this technique is both feasible and efficient. Potential applications of this objective include ensuring conservative vector fields for Energy-based Models (EBMs); we elaborate in Sec. 4.1.

Diagonality. For diagonality, we consider a matrix $\mathbf{A}$ that can be either the Jacobian or the Hessian of a neural network. The only restriction we make is that $\mathbf{A}$ should be a square matrix.

By the definition of diagonality, we should train all off-diagonal elements of $\mathbf{A}$ into zero. This objective can be described as training $\sum_{i \neq j} A_{ij}^2$ to zero. At first sight, spectral norm minimization is not applicable to this problem. However, we propose the following theorem.

Theorem 1.

For any $\mathbf{A} \in \mathbb{R}^{n \times n}$, the following holds:

$$\sum_{i \neq j} A_{ij}^2 \;\le\; \big\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\big\|_F^2 \;\le\; n \sum_{i \neq j} A_{ij}^2,$$

where $\mathbf{1}$ is an all-one vector, and $\operatorname{diag}(\cdot)$ is a function that transforms a vector into a diagonal matrix.

The proof of the above theorem is provided in Appendix B.
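To see why this choice of target isolates the off-diagonal entries, note that, entrywise,

$$\big(\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\big)_{ij} \;=\; \begin{cases} A_{ij}, & i \neq j,\\ -\sum_{k \neq i} A_{ik}, & i = j,\end{cases}$$

so every entry of the residual is built purely from off-diagonal entries of $\mathbf{A}$.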

Theorem 1 shows that as $\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\|_F^2 \to 0$, it becomes an increasingly tight bound on $\sum_{i \neq j} A_{ij}^2$. Also, $\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\|_F = 0$ if and only if all off-diagonal entries of $\mathbf{A}$ are zero. We therefore minimize the spectral norm $\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\|_2$ instead.

Next, we validate that $\operatorname{diag}(\mathbf{A}\mathbf{1})$ satisfies Proposition 2. We first consider the following property of diagonal matrices.

Property 1.

For any vector $\mathbf{v}$, $\operatorname{diag}(\mathbf{A}\mathbf{1})\,\mathbf{v} = (\mathbf{A}\mathbf{1}) \odot \mathbf{v}$ and $\mathbf{v}^\top \operatorname{diag}(\mathbf{A}\mathbf{1}) = \big((\mathbf{A}\mathbf{1}) \odot \mathbf{v}\big)^\top$, where $\odot$ denotes the elementwise product.

It follows that both the left and right vector products of $\operatorname{diag}(\mathbf{A}\mathbf{1})$ can be efficiently computed, since $\mathbf{A}\mathbf{1}$ requires only a single matrix-vector product. Therefore, we can optimize $\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\|_2$ efficiently using spectral norm minimization.
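A short sketch (ours; hessian_row_sums and diag_target_matvec are hypothetical names) of the two products required by Proposition 2 for this target, where the row sums come from a single Hessian-vector product with the all-one vector.

import torch
from torch.autograd.functional import hvp

def hessian_row_sums(scalar_f, x):
    # A 1 for A = H(x): one Hessian-vector product with the all-one vector
    _, h_ones = hvp(scalar_f, x, torch.ones_like(x))
    return h_ones

def diag_target_matvec(row_sums, v):
    # diag(A 1) v = (A 1) ⊙ v; the left product v^T diag(A 1) is the same elementwise product
    return row_sums * v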

Sec. 4.3 presents empirical evidence that we can efficiently enforce diagonality for Hessians. Potential applications of this objective include disentanglement for deep generative models; we elaborate on the background in Sec. 4.1.

3.4 Lanczos-Based Spectral Norm Minimization

In this section, we focus on the extremal eigenvalue problem. Specifically, given a symmetric matrix, we want to find the largest eigenvalue $\lambda_{\max}$ in terms of magnitude and its corresponding normalized eigenvector $\mathbf{v}_{\max}$. For this purpose, we introduce our implementation of the parallelized Lanczos algorithm.

Given a batch of square matrices, where $B$ is the batch size and $D$ is the dimensionality of the square matrices, we construct a batched matrix-vector product function. For a batch of vectors, it computes the matrix-vector products in a parallel manner.

We propose Algorithm 1, the parallelized Lanczos algorithm. After computation, Algorithm 1 returns a batch of tridiagonal matrices and a tensor consisting of the Lanczos vectors. To obtain the normalized eigenvectors corresponding to the largest eigenvalues, we first compute the eigenvalues and eigenvectors of the tridiagonal matrices using traditional batched eigensolvers (Paszke et al., 2017). Since the width of each tridiagonal matrix is exactly the iteration number, this computation is negligible. Moreover, the extremal eigenvalues of the tridiagonal matrices closely match the true extremal eigenvalues. Afterwards, the Lanczos vectors can be used to map the eigenvectors of the tridiagonal matrices back to the actual eigenvectors. Through this procedure, accurate extremal eigenvalues and eigenvectors can be obtained at the cost of only a few iterations.

A running time analysis of this algorithm is performed in Sec. 4.5.

Input:
mvp, the batched matrix-vector product function.
B, the batch size.
K, the iteration number.
D, the dimensionality.
Output:
T, a batch of K-by-K tridiagonal matrices, one per matrix in the batch.
Q, a batch of K-by-D matrices whose rows are the Lanczos vectors.

1: Initialize the diagonals alpha (B-by-K) and off-diagonals beta (B-by-K) as zeros.
2: Initialize q_prev (B-by-D) as zero vectors.
3: Set the rows of q (B-by-D) as random unit vectors.
4: for k = 1, ..., K do
5:     Store q as the k-th block of Lanczos vectors.
6:     w <- mvp(q)      // batched matrix-vector product
7:     alpha[:, k] <- rowsum(w ⊙ q)      // batched dot product
8:     w <- w - alpha[:, k] * q - beta[:, k-1] * q_prev      // three-term recurrence
9:     beta[:, k] <- row-wise norm of w      // batched L2 norm
10:    q_prev <- q;  q <- w / beta[:, k]
11:    Set NaN rows in q as random unit vectors.      // restart on breakdown
12: end for
13: Assemble T from alpha (main diagonals) and beta (first K-1 off-diagonals).
14: Permute the first two axes of Q so that the batch axis comes first.
15: return T, Q
Algorithm 1 Parallelized Lanczos Algorithm
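For readers who prefer code, the following is a compact batched Lanczos sketch of our own (not the authors' implementation; batched_lanczos is a hypothetical name, and at least two iterations are assumed). It mirrors the structure of Algorithm 1: run the three-term recurrence in parallel over a batch of matrix-vector product operators, diagonalize the small tridiagonal matrices with a standard batched eigensolver, and map the resulting eigenvectors back through the Lanczos vectors.

import torch

def batched_lanczos(matvec, q0, num_iters):
    # matvec maps a (B, D) batch of vectors to the batched products A_i v_i.
    # q0 is a (B, D) batch of random unit starting vectors; num_iters >= 2.
    B, D = q0.shape
    Q, alphas, betas = [], [], []
    q_prev = torch.zeros_like(q0)
    beta_prev = torch.zeros(B, device=q0.device)
    q = q0
    for _ in range(num_iters):
        Q.append(q)
        w = matvec(q)                                   # batched matrix-vector product
        a = (w * q).sum(dim=1)                          # batched dot products
        w = w - a.unsqueeze(1) * q - beta_prev.unsqueeze(1) * q_prev
        b = w.norm(dim=1)                               # batched L2 norms
        q_next = w / b.clamp_min(1e-12).unsqueeze(1)
        bad = (~torch.isfinite(q_next).all(dim=1)) | (b < 1e-10)   # recurrence breakdown
        if bad.any():                                   # restart the affected rows
            q_next[bad] = torch.nn.functional.normalize(
                torch.randn(int(bad.sum()), D, device=q0.device), dim=1)
            b = torch.where(bad, torch.zeros_like(b), b)
        alphas.append(a)
        betas.append(b)
        q_prev, q, beta_prev = q, q_next, b
    alphas = torch.stack(alphas, dim=1)                 # (B, K)
    offdiag = torch.stack(betas[:-1], dim=1)            # (B, K-1)
    Q = torch.stack(Q, dim=1)                           # (B, K, D)
    T = (torch.diag_embed(alphas)
         + torch.diag_embed(offdiag, offset=1)
         + torch.diag_embed(offdiag, offset=-1))
    evals, evecs = torch.linalg.eigh(T)                 # small K x K problems, negligible cost
    top = evals.abs().argmax(dim=1)                     # extremal eigenvalue in magnitude
    lam = evals.gather(1, top.unsqueeze(1)).squeeze(1)
    y = evecs.gather(2, top.view(B, 1, 1).expand(B, num_iters, 1)).squeeze(2)
    v = torch.nn.functional.normalize(torch.einsum("bk,bkd->bd", y, Q), dim=1)
    return lam, v                                       # extremal eigenvalues and eigenvectors

For Hessian regularization, matvec would wrap the batched HVP of Sec. 3.1, and the returned eigenvectors are detached before being plugged into the spectral norm loss.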

4 Experiments

4.1 Tasks

Overview. We experiment on four tasks that validate different aspects of our generalized Jacobian and Hessian regularization and the Lanczos-based spectral norm minimization technique.

Conservative Vector Field. Recently, Energy-based Models (EBMs) have demonstrated superior performance on tasks such as image generation (LeCun et al., 2006; Salimans and Ho, 2021; Song and Ermon, 2019). EBMs are traditionally scalar-valued functions that predict unnormalized probability distributions (Salimans and Ho, 2021). In contrast, recent efforts significantly improve performance by directly predicting the gradient vectors of the distributions (Song and Ermon, 2019). This is, however, a paradoxical situation: vector-valued neural networks are not guaranteed to output a conservative vector field, which contradicts the assumptions that EBMs make (Salimans and Ho, 2021).

We approach this problem by first noting that a continuously differentiable vector field is conservative if and only if its Jacobian is symmetric. We consequently propose to minimize $\|\mathbf{J} - \mathbf{J}^\top\|_2$ via our Lanczos-based spectral norm minimization to enforce symmetric Jacobians.
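In symbols: if $\mathbf{F} = \nabla E$ for a twice continuously differentiable potential $E$, Schwarz's theorem gives

$$\big(\mathbf{J}_{\mathbf{F}}\big)_{ij} \;=\; \frac{\partial F_i}{\partial x_j} \;=\; \frac{\partial^2 E}{\partial x_j\,\partial x_i} \;=\; \frac{\partial^2 E}{\partial x_i\,\partial x_j} \;=\; \big(\mathbf{J}_{\mathbf{F}}\big)_{ji},$$

and conversely, on a simply connected domain, a continuously differentiable field with a symmetric Jacobian admits such a potential.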

To validate this idea, we consider high-dimensional conservative vector fields constructed as the gradient field of a scalar potential built from a differentiable unary function. A feed-forward neural network is used to learn this gradient field. The data points are sampled from a fixed distribution over the input space. We report the test-time mean squared error and the symmetry measure $\|\mathbf{J} - \mathbf{J}^\top\|_F$ to demonstrate the effectiveness of our technique.

Disentanglement. Disentanglement of high-dimensional functions has wide applications in the field of deep generative models (Peebles et al., 2020). Peebles et al. (2020) propose a notion of disentanglement that is achieved by enforcing diagonal Hessian matrices of a scalar function. For this purpose, the authors propose a stochastic estimator to penalize the off-diagonal elements of Hessians.

Due to Theorem 1, we propose to minimize $\|\mathbf{H} - \operatorname{diag}(\mathbf{H}\mathbf{1})\|_2$ for disentanglement. To validate this technique, we construct high-dimensional scalar functions that decompose coordinate-wise and therefore naturally have diagonal Hessians. We use a feed-forward neural network to learn the value of the function. The data points are sampled from a fixed distribution over the input space. We report the test-time mean squared error and the magnitude of the off-diagonal Hessian entries to demonstrate the effectiveness of Theorem 1.

Jacobian Regularization. To rigorously validate the effectiveness of our Lanczos-based spectral norm minimization technique, we conduct controlled experiments to compare it with representative methods. Specifically, we implement and compare with normal training, Hutchinson's estimator, and the Power Method. A standard adversary, namely PGD(20), is used to evaluate the performance. We perform Jacobian regularization on the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, 2009) using ResNet-18 (He et al., 2016).

Hessian Regularization. Hessian regularization concerns matrices whose size is determined by the input dimensionality of the neural network. In our case, the associated Hessian matrix is 3072 by 3072 in size, which is orders of magnitude larger than the matrices in Jacobian regularization. Therefore, this task validates the performance of Lanczos-based spectral norm minimization in situations where the matrices involved are large. In particular, we implement and compare with normal training, Hutchinson's estimator, and the Power Method. We use PGD(20) to evaluate the performance, and the experiments are conducted on both the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, 2009) using ResNet-18 (He et al., 2016).

Figure 1: Enforcing symmetry and diagonality using the proposed regularization terms. (a) and (b) show the results for enforcing symmetry; (c) and (d) show the results for enforcing diagonality. Symmetry is measured by $\|\mathbf{J} - \mathbf{J}^\top\|_F$, and diagonality is measured by the magnitude of the off-diagonal Hessian entries.

4.2 Implementation Details

Model Design. We make specific design choices to ensure a simple implementation. For activation functions, we use Softplus (Nair and Hinton, 2010) with a $\beta$ value of 8 to ensure a tight and smooth approximation to ReLU (Glorot et al., 2011). Following Dosovitskiy et al. (2021), for our ResNet-18 models (He et al., 2016), we replace Batch Normalization (Ioffe and Szegedy, 2015) with Group Normalization (Wu and He, 2018) to avoid running statistics that may complicate our iteration-based Lanczos algorithm. Also following Dosovitskiy et al. (2021), standardized convolutions (Qiao et al., 2019) are used to accompany Group Normalization (Wu and He, 2018).

Hyperparameters. Hyperparameters are chosen according to a set of heuristics that do not favor any particular algorithm. The iteration number for the Power Method and the Lanczos algorithm starts at 2 and doubles each time the learning rate decays. The power of the regularizer starts at 25% and increases by 25 percentage points each time the learning rate decays, up to a maximum of 95%. For Hutchinson's estimator only, we report an additional variant in which the regularization power is further decreased by a factor of 10, because Hutchinson's estimator has a much greater magnitude than the other methods and is unstable without this adjustment.

Training. For all experiments, the batch size is 512. We use Adam with default parameters and a starting learning rate of 0.001 for optimization. For adversarial robustness, all experiments run for 100 epochs, and the learning rate is decayed by a factor of 10 after epochs 50, 70, and 90. For CIFAR-10 and CIFAR-100 (Krizhevsky and Hinton, 2009), random cropping and random horizontal flipping are adopted as data augmentation techniques.

4.3 Enforcing Diagonality or Symmetry

Conservative Vector Field. In this experiment, we instantiate the task described in Sec. 4.1 and set the dimensionality to 1024. In Fig. 1 (a) we observe that the regularizer does not degrade the regression performance, and in Fig. 1 (b) we notice that the regularization term has a significant impact on the symmetry of the Jacobian matrix, strongly suppressing $\|\mathbf{J} - \mathbf{J}^\top\|_F$.

Disentanglement. In this experiment, we likewise set the dimensionality to 1024. In Fig. 1 (c) we observe that the proposed regularization term does not degrade the regression performance. In Fig. 1 (d), we notice that the regularizer suppresses the off-diagonal elements effectively. This result validates the effectiveness of Theorem 1.

Method Clean PGD(20)
Normal
Hutchinson
Hutchinson-0.1
Power Method
Lanczos (Ours)
Table 1: Experiment results for Jacobian regularization on CIFAR-10. Hutchinson-0.1 means the regularization power is reduced by a factor of 10. The results are averaged over three runs and the standard deviations are reported in parentheses.

Method Clean PGD(20)
Normal
Hutchinson
Hutchinson-0.1
Power Method
Lanczos (Ours)
Table 2: Experiment results for Jacobian regularization on CIFAR-100. Hutchinson-0.1 means the regularization power is reduced by a factor of 10. The results are averaged over three runs and the standard deviations are reported in parentheses.

4.4 Comparison with Prior Works

Jacobian Regularization. In Table 1 and Table 2, we compare our Lanczos-based spectral norm minimization with normal training, Hutchinson’s estimator, and Power Method. The results show that our technique performs consistently better on all datasets.

Normal training by itself provides the best clean accuracy; however, it does not provide any adversarial robustness. Hutchinson's estimator provides 30.1% robust accuracy at the cost of a low 60.1% clean accuracy, a weak trade-off between clean and robust accuracy compared with our Lanczos-based methodology. Notably, in the case of CIFAR-100, Hutchinson's estimator is too unstable to provide a meaningful clean or robust accuracy. On both CIFAR-10 and CIFAR-100, the Power Method achieves performance on par with the Lanczos algorithm. We believe this is primarily because the matrices involved in these experiments are small (10 by 10 for CIFAR-10 and 100 by 100 for CIFAR-100). A significant discrepancy is observed in Hessian regularization, where the Hessian matrices have a constant size of 3072 by 3072. Our Lanczos-based method achieves performance gains of 0.1% and 0.9% on the two datasets, respectively.

Hessian Regularization. In Table 3 and Table 4, we compare our Lanczos-based spectral norm minimization with normal training, Hutchinson’s estimator, and Power Method. The results show that our technique surpasses other methods by a large margin.

Similar to Jacobian regularization, Hutchinson’s estimator provides subpar performance compared with Power Method and Lanczos-based spectral norm minimization. Although Hutchinson-0.1 provides a higher clean accuracy, its robust accuracy is significantly lower than that of spectral norm-based methods.

Although the Power Method provides performance similar to the Lanczos algorithm in Jacobian regularization, its performance is significantly lower in the context of Hessian regularization. We believe this is primarily because, under Hessian regularization, the matrix is orders of magnitude larger than in Jacobian regularization. In this case, the Power Method does not converge as quickly or as accurately as the Lanczos algorithm. Our Lanczos-based method achieves performance gains of 3.7% and 2.3% on the two datasets, respectively.

Method Clean PGD(20)
Normal
Hutchinson
Hutchinson-0.1
Power Method
Lanczos (Ours)
Table 3: Experiment results for Hessian regularization on CIFAR-10. Hutchinson-0.1 means the regularization power is reduced by a factor of 10. The results are averaged over three runs and the standard deviations are reported in parentheses.
Method Clean PGD(20)
Normal
Hutchinson
Hutchinson-0.1
Power Method
Lanczos (Ours)
Table 4: Experiment results for Hessian regularization on CIFAR-100. Hutchinson-0.1 means the regularization power is reduced by a factor of 10. The results are averaged over three runs and the standard deviations are reported in parentheses.
Method Stage 1 Stage 2 Stage 3 Stage 4
Hutchinson 60.2 60.2 60.2 60.2
Power Method 89.0 117.9 175.6 290.8
Lanczos 89.1 117.9 175.6 291.1
Table 5: The running time in seconds per epoch for each method. The task is Hessian regularization on CIFAR-10. Stage 1 is from epoch 1 to 50, Stage 2 is from epoch 51 to 70, Stage 3 is from epoch 71 to 90, and Stage 4 is from epoch 91 to 100.

4.5 Running Time Analysis

In Table 5 we document the running time of each method in our experiments. The running time is recorded on a single NVIDIA A100 GPU. The task is Hessian regularization on the CIFAR-10 dataset.

We use Hutchinson's method as a baseline because it uses a random vector and does not spend extra time on finding a suitable vector for the HVP calculation (Pearlmutter, 1994). We also note that, as mentioned in Sec. 4.2, both the Power Method and the Lanczos algorithm iterate 2, 4, 8, and 16 times in Stages 1, 2, 3, and 4, respectively.

From Table 5, we draw the following conclusions. First, the Power Method and the Lanczos algorithm have practically identical time costs. Second, for each epoch, the additional time cost introduced by the Lanczos algorithm is approximately $14.4\,k$ seconds, where $k$ is the iteration number. Third, depending on the iteration number, using the Lanczos algorithm introduces an overhead ranging from 48% to 385%. In total, there is a 120% overhead. Considering the performance gain provided by the Lanczos algorithm, this is an acceptable cost.
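The 120% figure can be recovered from Table 5 together with the stage lengths (50, 20, 20, and 10 epochs):

$$\frac{50 \cdot 89.1 + 20 \cdot 117.9 + 20 \cdot 175.6 + 10 \cdot 291.1}{100 \cdot 60.2} - 1 \;\approx\; \frac{13236}{6020} - 1 \;\approx\; 1.20.$$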

5 Conclusion

In this work we generalize the task of regularizing the Jacobian and Hessian matrices of neural networks. Our new paradigm not only permits arbitrary target matrices but also allows us to explore novel regularizers that enforce symmetry or diagonality for square matrices. Further, we propose Lanczos-based spectral norm minimization, an effective technique for Jacobian and Hessian regularization. We use extensive experiments to validate the effectiveness of our novel regularization terms and the proposed algorithm. Future work includes applying the novel regularization terms to Energy-based Models that directly predict gradient vector fields, thereby ensuring their theoretical integrity.

References

  • R. T. Q. Chen and D. K. Duvenaud (2019) Neural networks with cheap differential operators. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §1.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, Cited by: §4.2.
  • H. Drucker and Y. Le Cun (1992) Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks 3 (6), pp. 991–997. Cited by: §1, §1, §2.
  • X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 315–323. Cited by: §4.2.
  • G. H. Golub and H. A. van der Vorst (2000) Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics 123 (1), pp. 35–65. Cited by: §1, §2, §2.
  • S. Gu and L. Rigazio (2014) Towards deep neural network architectures robust to adversarial examples. arXiv preprint arXiv:1412.5068. Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §1, §4.1, §4.1, §4.2.
  • M. W. Hirsch (1974) Differential equations, dynamical systems, and linear algebra. Pure and applied mathematics (Academic Press), 60. Cited by: §1.
  • J. Hoffman, D. A. Roberts, and S. Yaida (2019) Robust learning with jacobian regularization. arXiv preprint arXiv:1908.02729. Cited by: §1, §2.
  • M.F. Hutchinson (1990) A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics - Simulation and Computation 19 (2), pp. 433–450. Cited by: §1, §2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Vol. 37, pp. 448–456. Cited by: §4.2.
  • A. Johansson, C. Strannegård, N. Engsner, and P. Mostad (2022) Exact spectral norm regularization for neural networks. arXiv preprint arXiv:2206.13581. Cited by: §1, §2.
  • D. P. Kingma and Y. Cun (2010) Regularized estimation of image statistics by score matching. In Advances in Neural Information Processing Systems, Vol. 23. Cited by: §2.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §1, §4.1, §4.1, §4.2.
  • Y. LeCun, S. Chopra, R. Hadsell, F. J. Huang, and et al. (2006) A tutorial on energy-based learning. In PREDICTING STRUCTURED DATA, Cited by: §4.1.
  • J. Martens, I. Sutskever, and K. Swersky (2012) Estimating the hessian by back-propagating curvature. In Proceedings of the International Conference on Machine Learning, pp. 963–970. Cited by: §1, §2.
  • W. Mustafa, R. A. Vandermeulen, and M. Kloft (2020) Input hessian regularization of neural networks. arXiv preprint arXiv:2009.06571. Cited by: Appendix A, §1, §1, §1, §2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the International Conference on Machine Learning, pp. 807–814. Cited by: §3.1, §4.2.
  • C. C. Paige (1972) Computational Variants of the Lanczos Method for the Eigenproblem. IMA Journal of Applied Mathematics 10 (3), pp. 373–381. Cited by: §1, §1.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. Cited by: §1, §3.4.
  • B. A. Pearlmutter (1994) Fast Exact Multiplication by the Hessian. Neural Computation 6 (1), pp. 147–160. Cited by: §1, §4.5.
  • W. Peebles, J. Peebles, J. Zhu, A. A. Efros, and A. Torralba (2020) The hessian penalty: a weak prior for unsupervised disentanglement. In Proceedings of European Conference on Computer Vision, Cited by: §1, §2, §4.1.
  • S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille (2019) Micro-batch training with batch-channel normalization and weight standardization. arXiv preprint arXiv:1903.10520. Cited by: §4.2.
  • T. Salimans and J. Ho (2021) Should EBMs model the energy or the score?. In Energy Based Models Workshop - ICLR 2021, Cited by: §1, §3.3, §4.1.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, Vol. 32. Cited by: §4.1.
  • Y. Song, S. Garg, J. Shi, and S. Ermon (2020) Sliced score matching: a scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pp. 574–584. Cited by: §1, §2.
  • D. Varga, A. Csiszárik, and Z. Zombori (2018) Gradient regularization improves accuracy of discriminative models. Schedae Informaticae 27. Cited by: §1, §1, §1, §2.
  • P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674. Cited by: §2.
  • Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision, Cited by: §4.2.


Appendix A Gradient Ascent and the Power Method for Finding the Spectral Norm

For a symmetric matrix $\mathbf{H}$, to find the unit vector $\mathbf{v}$ that corresponds to the spectral norm of $\mathbf{H}$, one may use the Power Method, defined by the recurrence

$$\mathbf{v}_{t+1} = \frac{\mathbf{H}\mathbf{v}_t}{\|\mathbf{H}\mathbf{v}_t\|_2}.$$

Mustafa et al. (2020), however, propose to use gradient ascent to find the $\mathbf{v}$ that maximizes $\mathbf{v}^\top \mathbf{H} \mathbf{v}$. Given a step size $\eta$, this method is defined by the recurrence

$$\mathbf{v}_{t+1} = \frac{\mathbf{v}_t + \eta\,\nabla_{\mathbf{v}}(\mathbf{v}_t^\top \mathbf{H} \mathbf{v}_t)}{\big\|\mathbf{v}_t + \eta\,\nabla_{\mathbf{v}}(\mathbf{v}_t^\top \mathbf{H} \mathbf{v}_t)\big\|_2}.$$

In this section, we show that this method is closely related to the Power Method, and that they can be practically equivalent.

We first notice that

$$\nabla_{\mathbf{v}}(\mathbf{v}^\top \mathbf{H} \mathbf{v}) \;=\; (\mathbf{H} + \mathbf{H}^\top)\,\mathbf{v} \;=\; 2\,\mathbf{H}\mathbf{v}. \quad (3)$$

Eq. (3) has two implications. First, to find the $\mathbf{v}$ that maximizes $\mathbf{v}^\top \mathbf{H} \mathbf{v}$, there is no need to differentiate with respect to $\mathbf{v}$; it suffices to instead perform matrix-vector products. Second, the gradient is proportional to $\mathbf{H}\mathbf{v}$, which strongly relates it to a single step of the Power Method.

To elaborate more on the second implication, we formulate the gradient ascent update as

$$\mathbf{v}_{t+1} = \frac{\mathbf{v}_t + 2\eta\,\mathbf{H}\mathbf{v}_t}{\|\mathbf{v}_t + 2\eta\,\mathbf{H}\mathbf{v}_t\|_2}.$$

Next, we note that $\|\mathbf{v}_t + 2\eta\,\mathbf{H}\mathbf{v}_t\|_2$ is a constant value once $\mathbf{v}_t$ is fixed. Therefore, suppose that we choose the step size $\eta$ to be sufficiently large. Even in the extreme case where $\mathbf{H}\mathbf{v}_t$ is orthogonal to $\mathbf{v}_t$, after normalization, $\mathbf{v}_{t+1}$ still has a significant cosine similarity with the Power Method iterate $\mathbf{H}\mathbf{v}_t / \|\mathbf{H}\mathbf{v}_t\|_2$. To conclude, from a theoretical perspective, we believe there is no obvious reason to prefer gradient ascent over the Power Method.

Appendix B Proof of Theorem 1

In this section, we give the proof of Theorem 1.

Theorem 1.

For any $\mathbf{A} \in \mathbb{R}^{n \times n}$, the following holds:

$$\sum_{i \neq j} A_{ij}^2 \;\le\; \big\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\big\|_F^2 \;\le\; n \sum_{i \neq j} A_{ij}^2,$$

where $\mathbf{1}$ is an all-one vector, and $\operatorname{diag}(\cdot)$ is a function that transforms a vector into a diagonal matrix.

Proof.

Suppose that $\mathbf{A} \in \mathbb{R}^{n \times n}$, and let $\mathbf{a}_i$ denote the $i$-th row of $\mathbf{A}$, viewed as a column vector. The $i$-th row of $\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})$ is $\mathbf{a}_i - (\mathbf{a}_i^\top \mathbf{1})\,\mathbf{e}_i$.

It suffices to show, for every row $i$,

$$\sum_{j \neq i} A_{ij}^2 \;\le\; \big\|\mathbf{a}_i - (\mathbf{a}_i^\top \mathbf{1})\,\mathbf{e}_i\big\|_2^2 \;\le\; n \sum_{j \neq i} A_{ij}^2,$$

since summing over $i$ yields the theorem.

We observe $\mathbf{a}_i = A_{ii}\,\mathbf{e}_i + \mathbf{r}_i$ and $\mathbf{a}_i^\top \mathbf{1} = A_{ii} + \mathbf{r}_i^\top \mathbf{1}$, where $\mathbf{e}_i$ has value 1 at entry $i$ and value 0 at other entries, and $\mathbf{r}_i = \mathbf{a}_i \odot (\mathbf{1} - \mathbf{e}_i)$ collects the off-diagonal entries of row $i$. Consequently, $\mathbf{a}_i - (\mathbf{a}_i^\top \mathbf{1})\,\mathbf{e}_i = \mathbf{r}_i - (\mathbf{r}_i^\top \mathbf{1})\,\mathbf{e}_i$. We note that $\mathbf{1} - \mathbf{e}_i$ has value 0 at entry $i$ and value 1 at other entries, therefore $\mathbf{r}_i^\top \mathbf{e}_i = 0$, and that $\mathbf{r}_i^\top \mathbf{1} = \mathbf{r}_i^\top (\mathbf{1} - \mathbf{e}_i)$.

Since $\mathbf{r}_i$ and $\mathbf{e}_i$ are orthogonal, by the Pythagorean theorem and the Cauchy–Schwarz inequality,

$$\big\|\mathbf{r}_i - (\mathbf{r}_i^\top \mathbf{1})\,\mathbf{e}_i\big\|_2^2 \;=\; \|\mathbf{r}_i\|_2^2 + \big(\mathbf{r}_i^\top(\mathbf{1} - \mathbf{e}_i)\big)^2 \;\le\; \|\mathbf{r}_i\|_2^2 + (n-1)\,\|\mathbf{r}_i\|_2^2 \;=\; n\,\|\mathbf{r}_i\|_2^2.$$

It is trivial that

$$\big\|\mathbf{r}_i - (\mathbf{r}_i^\top \mathbf{1})\,\mathbf{e}_i\big\|_2^2 \;\ge\; \|\mathbf{r}_i\|_2^2.$$

Finally, we have $\|\mathbf{r}_i\|_2^2 = \sum_{j \neq i} A_{ij}^2$, so summing the two bounds over all rows $i$ gives

$$\sum_{i \neq j} A_{ij}^2 \;\le\; \big\|\mathbf{A} - \operatorname{diag}(\mathbf{A}\mathbf{1})\big\|_F^2 \;\le\; n \sum_{i \neq j} A_{ij}^2.$$

Therefore, Theorem 1 holds. ∎