
Regularizing Deep Neural Networks with Stochastic Estimators of Hessian Trace

by Yucong Liu, et al.

In this paper, we develop a novel regularization method for deep neural networks by penalizing the trace of the Hessian. This regularizer is motivated by a recent guaranteed bound on the generalization error. The Hutchinson method is a classical unbiased estimator of the trace of a matrix, but it is very time-consuming on deep learning models; hence we propose a dropout scheme to implement the Hutchinson method efficiently. We then discuss a connection to the linear stability of a nonlinear dynamical system and to flat/sharp minima. Experiments demonstrate that our method outperforms existing regularizers and data augmentation methods, such as Jacobian regularization, confidence penalty, label smoothing, cutout, and mixup.





1 Introduction

Deep neural networks (DNNs) are developing rapidly and are widely used in many fields, such as reflection removal (removal), dust pollution (dust), building defect detection (building), and cities and urban development (city). As more and more models are proposed in the literature, deep neural networks have shown remarkable improvements in performance. However, across many learning problems, over-fitting on the training data remains a major problem that hurts test accuracy, so some form of regularization is often needed during training.

In linear models, Ridge Regression (ridge) and Lasso (lasso) are usually used to avoid over-fitting. They are also called ℓ2 and ℓ1 regularization. ℓ2 regularization has a shrinkage effect, while ℓ1 regularization is conducive to both shrinkage and sparsity. From a Bayesian perspective, ℓ2 and ℓ1 regularization can be interpreted as normal and Laplace prior distributions, respectively.

Apart from ℓ1 and ℓ2 regularization, there are many other forms of regularizers for DNNs. The most widely used one is Weight-Decay (weightdecay); l2wd also showed that ℓ2 regularization and Weight-Decay are not identical. Dropout (dropout) is another method to avoid over-fitting by reducing co-adaptation between units in neural networks. Dropout has inspired a large body of work studying its effects (wager2013dropout; helmbold2015inductive; wei2020implicit). After dropout, various regularization schemes can be applied additionally.

In this paper, we propose a new regularization method that penalizes the trace of the second derivative (Hessian) of the loss function. We refer to our method as Stochastic Estimators of Hessian Trace (SEHT). On one hand, our Hessian regularization helps guarantee good generalization. On the other hand, from the perspective of dynamical systems, it influences the flatness/sharpness of the equilibrium point of the system in which the parameters move through parameter space on the basis of the training data. In our experiments, Hessian regularization shows competitive test performance and even outperforms some of the SOTA methods mentioned above. SEHT also demonstrates low time consumption with our stochastic algorithm.

2 Related Work

There are many regularization methods in previous work. Label Smoothing (label) estimates the marginalized effect of label-dropout and reduces over-fitting by preventing a network from assigning full probability to each training example. Confidence Penalty (confidence) prevents peaked distributions, leading to better generalization; a network appears overconfident when it places all probability on a single class in the training set, which is often a symptom of over-fitting. DropBlock (dropblock) is a structured form of dropout that drops contiguous regions from a layer's feature map instead of dropping independent random units.

Data augmentation methods are also used in practice to improve a model's accuracy and robustness when training neural networks. Cutout (cutout) is a data augmentation method in which parts of the input examples are zeroed out, encouraging the network to focus more on less prominent features and generalize to situations like occlusion. Mixup (mixup) extends the training distribution by incorporating the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets.

jacobian first proposed Jacobian regularization, a method focusing on the norm of the Jacobian matrix with respect to the input data. It was proved that the generalization error can be bounded by the norm of the Jacobian matrix. Moreover, by a Taylor expansion, a small Jacobian norm improves the stability of model predictions against input perturbations. hoffman2019robust showed that Jacobian regularization enlarges the size of decision cells and is practically effective in improving the generalization error and robustness of models. To simplify the calculation, a stochastic algorithm for Jacobian regularization was also proposed.

Motivated by Jacobian regularization, we relate the generalization error and the stability of the model to the Hessian matrix. We then combine this with Linear Stability Analysis and propose Hessian regularization with corresponding stochastic algorithms. We compare our Hessian regularization with other methods and demonstrate promising performance in experiments. The main tool for estimating the trace of the Hessian matrix is the Hutchinson method (trace); the algorithm was also discussed by pyhessian. We improve on it by designing a new probability distribution that drops out parameters, which noticeably decreases time consumption without losing generalization.

Hessian information is a powerful tool for analyzing the properties of neural networks. adahessian designed AdaHessian, a second-order stochastic optimization algorithm. yushixing used the Hessian trace to measure sensitivity, developing a Hessian Aware Pruning method to find insensitive parameters in a neural network and a Neural Implant technique to alleviate accuracy degradation. However, these methods are static in essence, whereas we focus on the dynamical motion of parameters in parameter space. yiyangdenage also proposed a Hessian regularization; they focused on the layerwise loss landscape via the eigenspectrum of the Hessian at each layer. We start from different perspectives: generalization error and the dynamical system of the parameters. Our experiments also show better results than yiyangdenage's method.

The literature (1997flat; large; sharp) observed that flatness of the minima of the loss function found by stochastic-gradient-based methods results in good generalization, while sharp minima lead to poorer generalization. Sharp minimizers are characterized by a significant number of large positive eigenvalues of the Hessian matrix and tend to generalize less well. It was found that small-batch methods, compared with large-batch methods, converge to flat minimizers characterized by numerous small Hessian eigenvalues. Our newly designed regularization method enables the neural network to find flatter minima and thereby achieve better performance.

3 Hessian Regularization

Here we introduce our Hessian regularization method, based on the generalization error, together with corresponding stochastic algorithms in detail. Then we use linear stability analysis to explain why Hessian regularization can prevent neural networks from over-fitting.

3.1 Trace of Hessian Matrix

In this study, we consider a multi-class classification problem. The input x is an N-dimensional vector, x ∈ X, where X ⊂ R^N is the input space. Y = {1, …, M} is the label space, which means that we have M classes. Each input x has a label y ∈ Y. The sample space is defined as Z = X × Y, and an element of Z is denoted by z = (x, y). We assume that samples from Z are drawn according to a probability distribution P. A training set of n samples drawn from P is denoted by S = {z_1, …, z_n}.

Our goal is to find a classification function f, which takes x as input and outputs f(x). f(x) is an M-dimensional score vector, where each element f_j(x) is the score for category j. The highest score indicates the most probable label, so the estimated label is given as ŷ = argmax_j f_j(x).

A loss function ℓ(f(x), y) is used to measure the discrepancy between the true label y and the estimated label ŷ. In this paper, we use the cross-entropy loss,

    ℓ(f(x), y) = −log( exp(f_y(x)) / Σ_{j=1}^{M} exp(f_j(x)) ).

The empirical loss of the classifier f associated with the training set S is defined as

    L_emp(f) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i),

and the expected loss of the classifier is defined as

    L_exp(f) = E_{(x,y)∼P} [ ℓ(f(x), y) ];

the difference between L_exp(f) and L_emp(f) is called the generalization error:

    GE(f) = L_exp(f) − L_emp(f).
wei2020implicit showed a generalization bound for linear models with the cross-entropy loss over M classes. Let W be the weight matrix, so that f(x) = Wx. For a linear model, the Jacobian is the vector J = ∂ℓ/∂f(x) and the Hessian is the matrix H = ∂²ℓ/∂f(x)². With high probability over the training examples, for all weight matrices W satisfying a fixed norm bound, the generalization error is bounded by a quantity that grows with the trace of the Hessian and the norm of the Jacobian and shrinks as the number of samples n grows; we refer the reader to wei2020implicit for the exact statement.
So one can guarantee good generalization when the trace of the Hessian matrix and the norm of the Jacobian matrix are small. On one hand, when learning with gradient descent, we seek a local or global minimum of the loss function. At a minimum the gradient is zero, so the norm of the Jacobian is small near minima; gradient descent thus already keeps the Jacobian norm small.

On the other hand, the trace of the Hessian matrix is hard to constrain with gradient descent. From this perspective, we propose Hessian regularization for linear models as

    R = tr( ∂²ℓ / ∂f(x)² ),

the trace of the second derivative of the loss with respect to the output of the linear model f(x) = Wx, which is also the end nodes of the linear model. A DNN consists of many layers, with each layer viewed as a linear model (apart from the nonlinear activation functions). Thus, we generalize the Hessian regularization to every node in a DNN and define it as

    R(θ) = tr( ∂²L / ∂θ² ) = tr(H),

the trace of the second derivative of the loss with respect to the parameters θ.

Here we define a new loss with our Hessian regularization as

    L_reg(θ) = L(θ) + α · tr(H),

where α controls the strength of our Hessian regularization.
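To see how such a penalty shifts the preferred minimum, consider a one-dimensional toy loss of our own devising (not the paper's setup): L(w) = (w² − 1)² has Hessian L''(w) = 12w² − 4, which varies with w, so adding α·L''(w) to the objective pulls the minimizer toward lower-curvature regions.

```python
import numpy as np

# Toy 1-D loss whose curvature depends on w, so the Hessian-trace
# penalty actually moves the minimizer (illustrative example only).
def L(w):
    return (w**2 - 1)**2

def H(w):
    return 12 * w**2 - 4  # scalar "Hessian trace" of L

alpha = 0.05
ws = np.linspace(0.0, 2.0, 200001)

w_plain = ws[np.argmin(L(ws))]              # unregularized minimizer, near 1.0
w_reg = ws[np.argmin(L(ws) + alpha * H(ws))]  # regularized minimizer, smaller
```

The regularized minimizer sits where the curvature H(w) is lower, at the cost of a slightly higher raw loss, mirroring the flat-minima argument developed later in the paper.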

3.2 Hutchinson Method

In a typical DNN, there are millions of parameters or more, so computing the Hessian matrix explicitly is difficult. The Hutchinson method (trace) gives an unbiased estimator of the trace of a matrix. Let A be an n × n symmetric matrix, and let v be a random vector whose entries are i.i.d. Rademacher random variables (P(v_i = ±1) = 1/2). Then vᵀAv is an unbiased estimator of tr(A), based on the following equation:

    E[vᵀAv] = E[ Σ_{i,j} v_i v_j A_ij ] = Σ_i A_ii = tr(A),

since E[v_i v_j] = δ_ij.
In this paper, we consider the trace of the Hessian matrix H, the second-derivative matrix of the loss. Since the Rademacher random vector v does not depend on the network parameters, we can expand the Hutchinson estimator as follows:

    tr(H) = E[vᵀHv] = E[ vᵀ ∂/∂θ ( vᵀ ∂L/∂θ ) ].

Based on this expansion, to obtain tr(H) for the Hessian regularizer, we only need the gradient of the loss with respect to the weights θ and the derivative of vᵀ(∂L/∂θ) with respect to the weights; we do not need the prohibitive computation of the whole Hessian matrix. This fast second-order estimation involves only two inner products and two differentiations. We refer to the Hutchinson stochastic estimator of Hessian trace as SEHT-H, presented in Alg. 1.
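The Hutchinson identity itself can be checked numerically on a small symmetric matrix. This is our own illustrative sketch (the function name `hutchinson_trace`, the matrix size, and the sample count are our choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(A, num_samples=5000, rng=rng):
    """Average of v^T A v over Rademacher probes v: unbiased for tr(A)."""
    n = A.shape[0]
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=n)  # i.i.d. Rademacher entries
        total += v @ (A @ v)                 # one probe of the trace
    return total / num_samples

B = rng.standard_normal((50, 50))
A = (B + B.T) / 2           # symmetric test matrix
est = hutchinson_trace(A)   # approaches np.trace(A) as samples grow
```

In practice, far fewer probes are used per training step; the estimator is unbiased, so the noise averages out over iterations.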

Input: n-dimensional gradient g = ∂L/∂θ, number of iterations maxIter
Output: estimate of tr(H)
for i = 1 to maxIter do
    draw a Rademacher vector v ∈ {−1, +1}ⁿ
    compute h = ∂(vᵀg)/∂θ   (a Hessian-vector product Hv)
    T_i = vᵀh
end for
return (1/maxIter) Σ_i T_i
Algorithm 1 SEHT-H
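A minimal sketch of the SEHT-H loop, under assumptions of our own: the loss is a toy quadratic L(w) = ½ wᵀAw (so the gradient is Aw and the Hessian is A), and a finite difference of the gradient stands in for the second automatic-differentiation pass that a deep learning framework would provide.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy quadratic loss L(w) = 0.5 * w^T A w, so grad(w) = A w and Hessian = A.
B = rng.standard_normal((30, 30))
A = (B + B.T) / 2

def grad(w):
    return A @ w

def seht_h(w, max_iter=3000, eps=1e-5, rng=rng):
    """SEHT-H sketch: Hutchinson trace of the Hessian from gradients only.
    The Hessian-vector product H v is taken as a finite difference of the
    gradient, standing in for differentiating v^T grad a second time."""
    n = w.shape[0]
    total = 0.0
    for _ in range(max_iter):
        v = rng.choice([-1.0, 1.0], size=n)        # Rademacher probe
        hv = (grad(w + eps * v) - grad(w)) / eps   # ~ H v, no full Hessian
        total += v @ hv                            # v^T H v
    return total / max_iter

w = rng.standard_normal(30)
trace_est = seht_h(w)   # close to np.trace(A) for this quadratic
```

In a real network, `hv` would come from a second backward pass through vᵀ(∂L/∂θ) rather than a finite difference; the structure of the loop is otherwise the same.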

Even though SEHT-H is a stochastic algorithm that removes a considerable amount of computational overhead, it is still not efficient enough, owing to the great number of parameters in a neural network. Hence, we propose to modify the SEHT-H pipeline based on the basic idea of Dropout, which boosts its computation speed and makes it more efficient and practical for neural network training.

3.3 Dropout Method

Inspired by Dropout (dropout), we propose a stochastic parameter method to accelerate SEHT-H. In Dropout, every node in the neural network has some probability of being ignored during training to reduce co-adaptation; thus, in each training iteration, only a random sub-network of the original network is used. Intuitively, in the Hessian trace calculation, not all parameters need to be considered; it would be much faster if we only used a small subset of the weights in each calculation. Thus, we ignore some parameters when constraining the Hessian trace tr(H). The sum of the selected subset of diagonal elements is denoted tr_S(H). Considering the layer structure of neural networks, the randomized parameter selection is divided into two steps: randomly select layers of the neural network with some probability, then randomly select parameters within the selected layers with some probability. In other words, when carrying out Hessian regularization, we ignore whole layers with one probability and ignore parameters within the selected layers with another. In our experiments, we simply use a single probability p for both steps.

Following the basic idea of the Hutchinson algorithm, we want to obtain the Hessian trace without the heavy calculation of the Hessian matrix. To extend the algorithm from the full-parameter domain to a partial-parameter domain, we define a new probability distribution and let v be a random vector whose entries are i.i.d. random variables with

    P(v_i = +1) = P(v_i = −1) = p/2,   P(v_i = 0) = 1 − p.

Here v_i² ∈ {0, 1}, so vvᵀ has a diagonal D whose elements are 0 or 1, marking the selected parameters.

Similar to the Hutchinson identity above, if we condition on the positions of the zeros in v (i.e., on the selection mask D), we have an unbiased estimator of the partial sum of diagonal elements:

    E[ vᵀAv | D ] = Σ_{i : D_ii = 1} A_ii = tr_S(A).
We can expand the expression in the same way as for SEHT-H and transform the calculation into two inner products and two differentiations. This efficient method, which computes the Hessian trace over a randomly selected subset of parameters, is presented in Alg. 2. We name it SEHT-D.

Input: probability p, parameters in selected layers and the corresponding n-dimensional gradient g = ∂L/∂θ, number of iterations maxIter
Output: estimate of tr_S(H)
for i = 1 to maxIter do
    draw v with i.i.d. entries: +1 or −1 with probability p/2 each, 0 with probability 1 − p
    compute h = ∂(vᵀg)/∂θ
    T_i = vᵀh
end for
return (1/maxIter) Σ_i T_i
Algorithm 2 SEHT-D
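A sketch of the sparse probes behind SEHT-D (our own illustration on a plain symmetric matrix rather than a network Hessian). Since E[v_i²] = p, each probe satisfies E[vᵀAv] = p·tr(A); dividing by p therefore gives an unbiased estimate of the full trace, while conditioning on the mask gives the partial sum tr_S(A) used in the text. Each probe touches only about p·n coordinates.

```python
import numpy as np

rng = np.random.default_rng(2)

def seht_d_trace(A, p=0.1, num_samples=20000, rng=rng):
    """Sparse-probe trace estimate: v_i is +1 or -1 with prob. p/2 each,
    0 with prob. 1 - p. E[v^T A v] = p * tr(A), so dividing by p makes
    the estimator unbiased for the full trace while each probe only
    involves roughly p*n of the n parameters."""
    n = A.shape[0]
    total = 0.0
    for _ in range(num_samples):
        v = rng.choice([1.0, -1.0, 0.0], size=n, p=[p / 2, p / 2, 1 - p])
        total += (v @ (A @ v)) / p
    return total / num_samples

B = rng.standard_normal((40, 40))
A = (B + B.T) / 2
est = seht_d_trace(A)   # close to np.trace(A) for enough samples
```

The trade-off is variance: smaller p means cheaper probes but noisier estimates, which matches the paper's observation that raising the selection probability improves accuracy at higher cost.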

SEHT-D makes Hessian regularization realistic and applicable to neural network training; thus, in practice, we mainly report results for SEHT-D. Compared with other regularization methods, our Hessian regularization shows improved test performance with fast training speed.

3.4 Linear Stability Analysis and Flat Minima

Optimization of the weights during training can be regarded as a motion in parameter space, from the initialization to a local or global minimum. Each parameter value corresponds to a point in parameter space, and each gradient-descent update moves that point. This view provides another perspective on gradient descent. Gradient descent is originally defined as a series of discrete updates:

    θ_{t+1} = θ_t − η ∇L(θ_t).

Here θ_t denotes the parameters at step t, η is the learning rate, and ∇L(θ_t) is the gradient.

If we regard the learning rate as a discrete time interval Δt for the motion in parameter space, η = Δt, then

    (θ_{t+1} − θ_t) / Δt = −∇L(θ_t).

Assuming the time interval (learning rate) is small enough, we approximately obtain the continuous form

    dθ/dt = −∇L(θ).

Thereafter, given an initial condition, we have the complete trajectory of the parameter point by ordinary differential equation (ODE) theory. The process of gradient descent is thus transformed into a nonlinear dynamical system, so we introduce linear stability analysis of nonlinear dynamical systems below.
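The correspondence between gradient descent and the gradient-flow ODE can be checked on a one-dimensional quadratic (a toy of our own choosing), where the ODE has a closed-form solution:

```python
import numpy as np

# Loss L(theta) = 0.5 * a * theta^2, gradient a * theta. The gradient-flow
# ODE d(theta)/dt = -a * theta has exact solution theta(t) = theta0 * exp(-a*t);
# gradient descent with step eta is its Euler discretization.
a, theta0, eta, steps = 2.0, 1.0, 1e-3, 1000

theta = theta0
for _ in range(steps):
    theta -= eta * a * theta              # one gradient-descent update

exact = theta0 * np.exp(-a * eta * steps)  # ODE solution at t = eta * steps
```

For small η the discrete iterates track the continuous trajectory closely, which is what licenses treating training as a dynamical system in the discussion that follows.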

A nonlinear dynamical system (mathmodel) is defined by a differential equation dx/dt = f(x) with an initial condition x(0) = x₀. Here x is the vector of state variables describing the properties of interest in the system, evolving over time t from the given initial state x₀. The rate functions for the state variables are collected in the vector f, where each component can potentially depend on all of the state variables. Since f does not have to be a linear function, the system is called a nonlinear dynamical system. A classical example of a dynamical system from mechanics is the equation of motion of a particle. A system is called non-autonomous if the rate functions have an explicit dependence on time. In this paper, we only focus on autonomous systems,

    dx/dt = f(x),   x(0) = x₀.
The equilibrium points x* are defined as positions where the rate functions vanish, f(x*) = 0. If some solution starting near an equilibrium point x* leaves the neighbourhood of x* as t → ∞, then x* is called asymptotically unstable, while if all solutions starting within the neighbourhood approach x* as t → ∞, the equilibrium is called asymptotically stable. lyapunovstablity gave a more rigorous definition and discussion, known as Lyapunov stability theory. The idea of Lyapunov stability can be extended to infinite-dimensional manifolds, where it is known as structural stability (structural), which concerns the behavior of different but "nearby" solutions to differential equations.

For a basic nonlinear dynamical system, to determine whether an equilibrium point x* is stable, we construct the Jacobian matrix J = ∂f/∂x. Evaluated at the equilibrium point x*, J is a constant matrix. Using the standard result from linear state-space models, if all eigenvalues of J have real parts less than zero, then x* is Lyapunov stable, meaning the system converges to it easily.

Returning to gradient descent, it can be regarded as a nonlinear dynamical system according to the continuous-time formulation above. The local or global minimum, which is the goal of gradient descent, is an equilibrium point of this system, since a minimum satisfies ∇L(θ*) = 0. The Jacobian matrix of this dynamical system is the negative of the Hessian matrix in our Hessian regularization, J = −H, with f(θ) = −∇L(θ) and H = ∇²L(θ). Accordingly, the trace and eigenvalues of J are the negatives of the trace and eigenvalues of H.

It is easy to see that Hessian regularization in SEHT lowers the trace of the Hessian matrix H, thus increasing the trace of the Jacobian matrix J. For a real matrix, complex eigenvalues always come in conjugate pairs and the trace is always a real number, so when we increase the trace we increase the real parts of the eigenvalues of J to some extent. In other words, the goal is to weaken the Lyapunov stability of the equilibrium point through our Hessian regularization.
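A small numerical illustration of this relationship, with made-up eigenvalues of our own: at a minimum, H is positive definite, so J = −H has all-negative eigenvalues (a Lyapunov-stable equilibrium), and a larger Hessian trace makes those eigenvalues more negative, i.e., the equilibrium more strongly attracting (sharper).

```python
import numpy as np

rng = np.random.default_rng(3)

# Build two positive-definite Hessians sharing eigenvectors but with
# different traces: a "sharp" minimum (large curvature) and a "flat" one.
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
sharp_H = Q @ np.diag([10.0, 8.0, 6.0, 4.0, 2.0]) @ Q.T   # tr = 30
flat_H = Q @ np.diag([1.0, 0.8, 0.6, 0.4, 0.2]) @ Q.T     # tr = 3

# Eigenvalues of the gradient-flow Jacobian J = -H at each minimum.
sharp_eigs = np.linalg.eigvalsh(-sharp_H)
flat_eigs = np.linalg.eigvalsh(-flat_H)
```

Both spectra lie in the left half-plane, so both minima are Lyapunov stable, but the sharp minimum's eigenvalues are far more negative; lowering tr(H) via the regularizer pushes the spectrum of J toward zero, weakening the attraction as argued above.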

We need to be clear that, though it is called "stability", Lyapunov stability is only a local property of an equilibrium point in a dynamical system. It says nothing about the stability of our algorithm: when we impose a penalty on the Hessian trace, we do not inject instability into the training process or the structure of the network.

Why, then, do we impose a penalty on the Hessian trace and seek this Lyapunov instability in the dynamical system of gradient descent? It is crucial to understand how this Hessian regularization works.

First of all, Lyapunov instability helps us find flat minima. In Lyapunov stability theory, stability means that solutions starting "close enough" to the equilibrium remain "close enough" forever or eventually converge to the equilibrium; furthermore, the degree of stability estimates how quickly solutions converge in the dynamical system. Thus, in general, stability of a minimum means it is easily converged to, which makes it more likely to be a sharp minimum. With numerical experiments, large showed that large-batch methods, which produce large positive eigenvalues of the second derivative of the loss function (the H in our discussion), tend to generalize less well. They also observed that the loss landscape of deep neural networks is such that large-batch methods are attracted to regions with sharp minimizers and, unlike small-batch methods with smaller eigenvalues, are unable to escape the basins of attraction of these minimizers. This observation is consistent with Lyapunov stability theory. Therefore, when we choose a more unstable equilibrium, we are more likely to get a flat minimum. sharp discussed that the Hessian trace is exactly a measure of flatness. In a word, regularizing the Hessian trace helps us find flat minima.

Secondly, our Hessian regularization helps us rethink the minima we can easily reach. Dynamical systems theory is concerned with whether the system converges to some solution and how fast the convergence is; generally, it does not compare different solutions. In our task, however, we need to find a "solution" that generalizes better. Similar to the ideas of Label Smoothing and Confidence Penalty, the most easily reached equilibrium point may not be the best one, and the more "stable" equilibrium point may not be trustworthy; we need to be less confident in order to reach a better equilibrium point. Our Hessian regularization helps us escape from sharp minima, which have larger positive eigenvalues and worse generalization, and continue searching in parameter space. We can then move from one local minimum to another, so it is easier for gradient descent to arrive at a better minimum and achieve better generalization.

Thirdly, the stability of a local or global minimum reflects its dependence on the training data. The whole dynamical system of gradient descent is a motion driven by the training set: the trajectory consists of two parts, the initial condition, which is randomized in a DNN, and the differential equation dθ/dt = −∇L(θ), which is directly determined by the training data. In other words, with different training data the parameters trace different trajectories in parameter space; the dynamics use only information about the training data, and the equilibrium point depends on the training data. Therefore, strong Lyapunov stability can be a symptom of over-fitting, and reducing the stability to some extent can mitigate over-fitting to the training data.

4 Experiments

We evaluate the proposed Hessian regularization method (abbreviated as SEHT) on both computer vision and natural language processing. Specifically, we implement SEHT on image classification and language modeling, respectively, two tasks commonly used to measure the effectiveness of regularization.

For comparison across a variety of datasets, we adopt several strong regularization and data augmentation methods, e.g., Confidence Penalty (confidence), Label Smoothing (label), Cutout (cutout), and MixUp (mixup). To implement past methods on our target backbone for a fair comparison on each task, we performed a thorough hyperparameter search to extract the best performance from each regularization method.

4.1 Image Classification

4.1.1 Cifar-10

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. In the CIFAR-10 experiment, we use ResNet-18 (resnet) as the backbone neural network.

For all models, we use weight decay. We set the learning rate to 0.01, batch size to 32, and momentum to 0.9, and all models were trained for 200 epochs with Cosine Annealing (loshchilov2016sgdr). For Jacobian regularization, we set the number of projections and the weight parameter; for DropBlock, the block size and drop probability. For Confidence Penalty and Label Smoothing, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1}; we select 0.01 for Label Smoothing and 0.001 for Confidence Penalty.

For our Hessian regularization SEHT-D, we perform a grid search over weight values {0.0001, 0.001, 0.01, 0.1} and select 0.001, testing with probability values 0.01 and 0.05. We test different values of maxIter in {1, 5, 10}; maxIter is the number of loop iterations defined in Algorithms 1 and 2. We also test SEHT-H (maxIter=5) with weight values in {0.0001, 0.001, 0.01, 0.1} and select 0.001 for SEHT-H. The Cutout size is set based on the validation results reported by cutout. For Mixup, we follow mixup's setting, which results in interpolations uniformly distributed between 0 and 1. The mean and standard error of the top-1 accuracy over 5 random initializations are reported.

Model Test Accuracy
Baseline with Weight-Decay
Confidence Penalty
Label Smoothing
Vanilla (bregman)
Middle Network Method (yiyangdenage)
MM + FRL (fine)
Lookahead (look)
SEHT-D(maxIter=1, prob=0.01)
SEHT-D(maxIter=10, prob=0.01)
SEHT-D(maxIter=10, prob=0.05)

Table 1: Results of ResNet-18 on CIFAR-10 over 5 runs
(a) Convergence on CIFAR-10
(b) Test Accuracy of SEHT-D on CIFAR-10
Figure 1: Performance of SEHT-D on CIFAR-10

Our first observation concerns the computational cost of training with SEHT-D (maxIter=1, prob=0.01), which requires training time close to that of the baseline. Although increasing the probability of selecting parameters can improve test accuracy, the time consumption increases considerably; in our experiment, SEHT-D (maxIter=1, prob=0.05) costs noticeably more than the baseline. To extract the maximum ability of Hessian trace regularization, we also implement the full-parameter version SEHT-H, which is quite slow but more accurate in its Hessian estimation. We additionally tried to compute the exact Hessian trace: for the Hessian trace of the parameters of just one layer of ResNet-18, it costs almost 2 minutes to train one batch (32 training examples). This prohibitive cost of computing the exact Hessian trace justifies our use of stochastic estimators, which are much more efficient and barely lose accuracy.

Secondly, it is surprising to find that Jacobian regularization and the DropBlock method perform even worse than the baseline with Weight-Decay.

Confidence Penalty and Label Smoothing obtain similar results, with competitive top-1 accuracy improvements and small variance across experiments. SEHT-D gives the best result regardless of the maxIter or parameter probability selected, significantly outperforming all other past methods except Mixup, with the setting (maxIter=10, prob=0.05) obtaining the best accuracy.

In yiyangdenage's experiment, they obtained a test accuracy of 88.13 on CIFAR-10 with ResNet-18, which is much worse than our results: 95.37 with SEHT-D (maxIter=1, prob=0.01) and 94.49 with SEHT-D (maxIter=10, prob=0.05). Moreover, the improvement from their methods is only 0.02 for the full-network method and 0.10 for the middle-network method, whereas our Hessian regularization improves test accuracy by 1.37 with SEHT-D (maxIter=1, prob=0.01) and by 1.49 with SEHT-D (maxIter=10, prob=0.05), a much larger gain.

In Figure 1, we compare the convergence speed and test accuracy of SEHT-D and SEHT-H with different parameters. Both SEHT-D and SEHT-H converge faster than the baseline, and SEHT-D performs better as maxIter and prob increase.

4.1.2 Cifar-100

The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes containing 600 images each; there are 500 training images and 100 testing images per class.

We use Wide Residual Networks (WRN) as the backbone neural network, specifically WRN-28-10, with depth 28 and a fixed widening factor of 10. For all models, we use weight decay. We set the batch size to 32 and momentum to 0.9, and all models were trained for 200 epochs. The learning rate is initially set to 0.1 and is scheduled to decrease by a factor of 5 after each of the 60th, 120th, and 160th epochs. We test Dropout with the drop probability determined by WRN's cross-validation. A Cutout size of 8 × 8 pixels is used according to cutout's validation results, and for Mixup we keep the same setting as before. For Confidence Penalty, Label Smoothing, and our SEHT, we perform a grid search over weight values {0.0001, 0.001, 0.01, 0.1}, and we also combine these three methods with Dropout. For Confidence Penalty, 0.0001 works best with Dropout and 0.001 without; for Label Smoothing, 0.001 works best with Dropout and 0.0001 without; for SEHT, 0.01 works best with Dropout and 0.001 without. We report the mean and standard error of the mean over 5 random initializations.

In this experiment, our SEHT-D (maxIter=1, prob=0.01) method shows better results on both top-1 and top-5 accuracy, improving them by 4.92 and 2.52 respectively; SEHT-D (maxIter=1, prob=0.05) improves them by 5.70 and 2.48 respectively, outperforming all other methods tested. When combined with Dropout, SEHT-D has lower accuracy, suggesting it may not combine well with Dropout.

Model Top-1 Acc Top-5 Acc
Confidence Penalty
Label Smoothing
SEHT-D(maxIter=1, prob=0.01)
SEHT-D(maxIter=1, prob=0.05)

Confidence Penalty + Dropout
Label Smoothing + Dropout
SEHT-D(maxIter=1, prob=0.01) + Dropout
Table 2: Results of WRN-28-10 on CIFAR-100 over 5 runs

4.2 Language Modeling

4.2.1 Wiki-text2

The Wiki-Text language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.

We use a 2-layer LSTM (LSTM). The size of the word embeddings is 512 and the number of hidden units per layer is 512. We run every algorithm for 40 epochs, with batch size 20, gradient clipping 0.25, and Dropout ratio 0.5. We perform a grid search over Dropout ratios {0, 0.1, 0.2, 0.3, 0.4, 0.5} and find 0.5 to work best. We tune the initial learning rate from {0.001, 0.01, 0.1, 0.5, 1, 10, 20, 40} and decrease the learning rate by a factor of 4 when the validation error saturates; an initial learning rate of 20 works best. Parameters are initialized from a uniform distribution. For Label Smoothing, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1} and find 0.01 to work best. For the Confidence Penalty, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1} and find 0.01 to work best. For our Hessian regularization, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1} and probability values {0.01, 0.05}; weight 0.01 and probability 0.05 work best. We report the mean and standard error of the mean over 5 random initializations.

In these experiments with the LSTM, our SEHT-D has the best test perplexity, and Label Smoothing shows the best validation perplexity. SEHT-D improves the model by 2.61 in test perplexity, while Confidence Penalty performs only slightly better than the baseline.

Model Valid ppl Test ppl
Confidence Penalty
Label Smoothing
Table 3: Results of LSTM on Wiki-Text2 over 5 runs (lower is better)

We also test a 2-layer GRU (GRU). The size of the word embeddings is 512 and the number of hidden units per layer is 512. We run every algorithm for 40 epochs, with batch size 20, gradient clipping 0.25, and Dropout ratio 0.3. We perform a grid search over Dropout ratios {0, 0.1, 0.2, 0.3, 0.4, 0.5} and find 0.3 to work best. We tune the initial learning rate from {0.001, 0.01, 0.1, 0.5, 1, 10, 20, 40} and decrease the learning rate by a factor of 4 when the validation error saturates; an initial learning rate of 20 works best, the same as for the LSTM. Parameters are initialized from a uniform distribution. For Label Smoothing, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1} and find 0.05 to work best. For the Confidence Penalty, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1} and find 0.005 to work best. For our Hessian regularization, we perform a grid search over weight values {0.001, 0.005, 0.01, 0.05, 0.1}; weight 0.001 works best, and we set the probability value to 0.01. We report the mean and standard error of the mean over 5 random initializations.

Our Hessian regularization method achieves both the best validation perplexity and the best test perplexity, improving them by 2.83 and 2.61 respectively compared with the baseline. With the GRU model, Confidence Penalty surpasses Label Smoothing, in contrast to the LSTM; Label Smoothing also shows better results than the baseline.

Model Valid ppl Test ppl
Confidence Penalty
Label Smoothing
Table 4: Results of GRU on Wiki-Text2 over 5 runs (lower is better)

Our experiments on language modeling demonstrate that all three regularization methods improve the models, with our SEHT-D performing best.

5 Conclusion

We propose a new regularization method, named Stochastic Estimators of Hessian Trace (SEHT). Our method is motivated by a guaranteed bound whereby a lower Hessian trace can yield a lower generalization error. To simplify computation, we implement SEHT as SEHT-H and SEHT-D. We also explain our method through dynamical systems theory and flat/sharp minima. Our experiments show that SEHT-D and SEHT-H yield promising test performance with fast training speed; in particular, our SEHT method achieves 95.37 accuracy on CIFAR-10 with ResNet-18, outperforming classical methods like Label Smoothing and Confidence Penalty.