k-decay: A New Method For Learning Rate Schedule

04/13/2020 · by Tao Zhang, et al.

It is well known that the learning rate is the most important hyper-parameter in deep learning, and a learning rate schedule is usually used to train neural networks. This paper puts forward a new method for constructing learning rate schedules, named k-decay, which can be applied to any differentiable schedule function to derive a new one. In the new function, the degree of decay is controlled by the new hyper-parameter k, with the original function recovered as the special case k = 1. We apply k-decay to the polynomial, cosine and exponential functions to obtain their new forms. We evaluate the k-decay method with the new polynomial function on the CIFAR-10 and CIFAR-100 datasets with different neural networks (ResNet, Wide ResNet and DenseNet), and the results improve over the state-of-the-art results on most of them. Our experiments show that the performance of the model improves as k increases from 1.


1 Introduction

Deep learning [lecun2015deeplearning] is widely used in image recognition, speech recognition and many other fields. Deep neural networks now include convolutional neural networks for images [lin2013network, hu2017squeezeandexcitation, journals/corr/DongLHT15, goodfellow2014generative], recurrent neural networks for speech [devlin2018pretraining] and graph neural networks for graphs [wu2019comprehensive, kipf2016semisupervised]. In these fields deep neural networks have achieved unprecedented success, but that is far from enough: as the technology develops rapidly, our models become more and more complex and consume more computing resources. Obtaining better models with fewer computing resources is therefore the purpose of our research. On the one hand, one can develop more efficient neural networks; for example, as the architecture of convolutional neural networks evolved from VGG [simonyan2014convolutional] to ResNet [he2016deep], and from ResNet to DenseNet [inproceedings], performance improved while computation decreased. On the other hand, one can find more efficient optimization methods for training neural networks; for example, replacing SGD with Momentum trains a higher-performance model in the same time.

Figure 1: Test errors on CIFAR-10 with ResNet-20 for different exponents N of the polynomial decay function, zoomed in on epochs 180-200.

In deep neural networks the parameters are updated by gradient descent algorithms, formulated as

    \theta_{t+1} = \theta_t - \eta \nabla_{\theta} J(\theta_t)    (1)

where \theta denotes the parameters, \eta is the learning rate and J is the loss function. The learning rate \eta controls the speed of the parameter updates. When \eta is large, the model converges very quickly but may skip over local minima. When \eta is small, it can find local minima, but the model converges slowly. A learning rate schedule that decays \eta from a large value to a small one can resolve this conflict. A monotonically decreasing function is usually used for the learning rate schedule, for instance a polynomial decay function, a cosine decay function or an exponential decay function.
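As an illustration of equation 1 with a decaying learning rate, here is a minimal Python sketch (the quadratic toy loss and the linear schedule are illustrative choices, not the schedules studied in this paper):

def grad(theta):
    # Gradient of the toy loss J(theta) = (theta - 3)^2.
    return 2.0 * (theta - 3.0)

T = 100              # total number of update steps
L0, Le = 0.1, 0.001  # maximum and minimum learning rates

theta = 0.0
for t in range(T):
    eta = (L0 - Le) * (1 - t / T) + Le  # learning rate decays from L0 to Le
    theta -= eta * grad(theta)          # the update of equation (1)
print(theta)  # converges toward the minimum at theta = 3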

In Figure 1, the test error curves are obtained by training ResNet-20 on the CIFAR-10 dataset with polynomial decay functions of different exponents N as the learning rate schedule. As N increases, the change rate of the error decreases in the late stage (epochs 180-200). The reason is that as N increases, the change rate of the learning rate decreases with time in the late stage. When N is 2, 3 or 4, the change rate of the learning rate is close to 0, which leads to the error barely changing in the late stage. Where the change rate of the learning rate is slow, the change rate of the error is slow as well. Accordingly, there is a positive correlation between the change rate of the error and the change rate of the learning rate. For these schedules the change rate of the learning rate is slow in the late stage; if we increase the change rate of the learning rate schedule in the late stage, the change rate of the error will increase in the late stage, improving the performance. Likewise, where the change rate of the learning rate is slow in the early stage, decreasing the change rate of the schedule in the early stage will make the change rate of the error increase there and improve the performance. How to change the change rate of the learning rate within a schedule is the problem this paper sets out to solve.

This paper proposes the k-decay method, which changes the change rate of the learning rate by changing the k-th order derivative of the schedule function. Let the function of the learning rate schedule be L(t), with derivative function \frac{d^k L(t)}{dt^k}. The new derivative function is

    \frac{d^k \tilde{L}(t)}{dt^k} = \frac{d^k L(t)}{dt^k} + \Delta(t)    (2)

which recovers the original function at \Delta(t) = 0. Here \Delta(t) is the increment of \frac{d^k L(t)}{dt^k}; through it, the change rate of L(t) can be controlled and changed at the k-th order. Solving for the primitive function of the new derivative gives the new function \tilde{L}(t) for the learning rate schedule. Note that there are infinitely many such new functions, one for each choice of \Delta(t) and of the integration constants.

Figure 2: Test errors on CIFAR-10 with ResNet-20 for different values of k on the new polynomial decay function derived by k-decay, zoomed in on epochs 180-200.

Figure 2 shows the test errors on CIFAR-10 with ResNet-20 trained with the new polynomial function based on k-decay; the error decreases as k increases. On the error curves, the change rate of the error in the late stage increases with k, because the change rate of the learning rate of the new function in the late stage increases with k. The original function is the special case k = 1. This shows that changing the learning rate schedule with the k-decay method works as intended.
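The late-stage effect can be checked numerically. A small NumPy sketch (using the polynomial k-decay function of equation 10 below, with illustrative values T = 200 and N = 2) compares the change rate of the learning rate over epochs 180-200 for k = 1 and k = 3:

import numpy as np

def k_decay_poly(t, T, L0=0.1, Le=0.001, N=2, k=1):
    # Polynomial k-decay schedule, equation (10); k = 1 is the original function.
    return (L0 - Le) * (1 - t**k / T**k)**N + Le

T = 200
t = np.arange(T + 1, dtype=float)
for k in (1, 3):
    lr = k_decay_poly(t, T, k=k)
    late_rate = np.abs(np.diff(lr[180:]))  # |dL/dt| over epochs 180-200
    print(f"k={k}: mean late-stage |dL/dt| = {late_rate.mean():.2e}")
# The k = 3 schedule changes several times faster than k = 1 near the end.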

2 Related Work

In recent years there have been many studies of optimization based on the learning rate. On the one hand, better functions for the learning rate schedule have been used to control how the global learning rate changes with time in the optimization algorithm: for example, the commonly used piecewise and polynomial functions, stochastic gradient descent with restarts (SGDR) [loshchilov2016sgdr], Cyclical Learning Rates (CLR) [conf/wacv/Smith17] and Hyperbolic-Tangent Decay (HTD) [conf/wacv/HsuehLW19]. SGDR uses the cosine function combined with periodic restarts to improve the rate of convergence in accelerated gradient schemes. CLR is also periodic, but simpler: it just oscillates between a maximum and a minimum, which are determined by an "LR range test". HTD uses the tanh function to construct a new learning rate schedule. These schemes can deliver good performance. The difference with the k-decay method is that k-decay is a mathematical method: it can be applied to any differentiable function and so has wide applicability, and the new function remains continuous, which means no other parameters need to be set.

On the other hand, some methods change the parameter updates based on the history of the gradients, for example the adaptive learning rate algorithms RMSprop [ruder2016overview], AdaDelta [journals/corr/abs-1212-5701] and Adam [kingma2014adam]. These methods improve stochastic gradient descent based on momentum; they can accelerate convergence and reduce the tuning of the learning rate. They are not in contradiction with a learning rate schedule, and the two can be used together.
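As a sketch of that combination (assuming PyTorch, whose LambdaLR scheduler scales the optimizer's base learning rate by the factor returned from lr_lambda; the toy model and the values of T, N and k are illustrative):

import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 2)  # toy model; any nn.Module would do
T, N, k = 100, 2, 3
L0, Le = 0.1, 0.001

optimizer = torch.optim.Adam(model.parameters(), lr=L0)

def k_decay_factor(epoch):
    # Polynomial k-decay value, expressed as a multiple of the base lr L0.
    lr = (L0 - Le) * (1 - epoch**k / T**k)**N + Le
    return lr / L0

scheduler = LambdaLR(optimizer, lr_lambda=k_decay_factor)

for epoch in range(T):
    # ... one epoch of training with optimizer.step() ...
    scheduler.step()  # apply the k-decay schedule on top of Adam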

3 k-decay For Learning Rate Schedule

Since the k-decay method is a mathematical method, we take the polynomial function as an example and derive the new polynomial function based on k-decay.

The polynomial decay function is

    L(t) = (L_0 - L_e)\left(1 - \frac{t}{T}\right)^N + L_e    (3)

where t is the current step, T is the total number of steps, L_0 is the maximum learning rate, L_e is the minimum learning rate and N is the exponent.
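Equation 3 translates directly into code (a plain-Python sketch; L0 and Le default to the values used later in the experiments, and N = 2 is illustrative):

def poly_decay(t, T, L0=0.1, Le=0.001, N=2):
    # Standard polynomial decay, equation (3).
    # t: current step, T: total steps, L0/Le: maximum/minimum learning rate.
    return (L0 - Le) * (1 - t / T)**N + Le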

In equation 2, a \Delta(t) that changes in time is too complex, so we consider a \Delta that does not change in time, i.e. a constant:

    \frac{d^k \tilde{L}(t)}{dt^k} = \frac{d^k L(t)}{dt^k} + \Delta    (4)

Now, beginning from equation 4, the new polynomial function is

    \tilde{L}(t) = L(t) + \gamma(t)    (5)

which reduces to L(t) at \gamma(t) = 0; \gamma(t) is the additional term introduced by k-decay.

We use a series expansion for \gamma(t). For convenience, and considering the boundary conditions, we let \gamma(t) be

    \gamma(t) = a\frac{t^k}{T^k} + b\frac{t}{T} + c    (6)

Now we only need to find the three parameters a, b and c.

Taking the k-th order derivative of equation 6:

    \frac{d^k \gamma(t)}{dt^k} = \frac{a\,k!}{T^k}    (7)

First, the new function must satisfy the decay conditions \tilde{L}(0) = L_0 and \tilde{L}(T) = L_e; since L(t) already satisfies them, this requires \gamma(0) = 0 and \gamma(T) = 0. Second, equating equation 7 with the constant increment \Delta of equation 4 gives

    a = \frac{\Delta T^k}{k!}

which is constant in t but changes with k. This provides a special solution of equation 6.

Substituting a into equation 6 and applying the boundary conditions \gamma(0) = 0 and \gamma(T) = 0, we solve for b and c: c = 0 and b = -a. Putting a, b and c into equation 6, and fixing the scale of the free constant a to L_0 - L_e so that the correction matches the decay range, gives

    \gamma(t) = \pm(L_0 - L_e)\left(\frac{t^k}{T^k} - \frac{t}{T}\right)    (8)

where the sign follows the sign of \Delta.

When k > 1 we need to increase the change rate of the learning rate in the late stage, and the minus sign is taken. When k < 1 we need to decrease the change rate of the learning rate in the early stage, and the plus sign is taken. Simplifying the function (shown here for N = 1):

    \tilde{L}(t) = (L_0 - L_e)\left(1 - \frac{t^k}{T^k}\right) + L_e    (9)

Because L(t) is a monotonically decreasing function, this amounts to using \frac{t^k}{T^k} to replace \frac{t}{T}. Applying the same replacement for a general exponent gives

    L(t) = (L_0 - L_e)\left(1 - \frac{t^k}{T^k}\right)^N + L_e    (10)

for any exponent N.
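As a quick sanity check (a worked verification, not part of the original derivation), equation 10 satisfies the decay conditions and reduces to equation 3 at k = 1:

    L(0) = (L_0 - L_e)\left(1 - \frac{0^k}{T^k}\right)^N + L_e = (L_0 - L_e) + L_e = L_0

    L(T) = (L_0 - L_e)\left(1 - \frac{T^k}{T^k}\right)^N + L_e = L_e

    L(t)\big|_{k=1} = (L_0 - L_e)\left(1 - \frac{t}{T}\right)^N + L_e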

This function is named the polynomial function of k-decay. It is only one special solution of equation 2. The k of the k-decay method refers to the derivative of order k of the function L(t): as k increases, the change takes effect on higher and higher order derivatives of L(t). We call \frac{t^k}{T^k} the decay factor of order k based on k-decay. For any decay function, it can be used to construct a new function by replacing \frac{t}{T} with \frac{t^k}{T^k}, for example for the cosine function and the exponential function. The new cosine function is

    L(t) = L_e + \frac{1}{2}(L_0 - L_e)\left(1 + \cos\left(\pi\frac{t^k}{T^k}\right)\right)    (11)

The new exponential function is

    L(t) = L_0\,e^{-\lambda\frac{t^k}{T^k}}    (12)
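Both new schedules are sketched below in Python (the exponential decay rate lam is an assumed, illustrative parameter; only the replacement of t/T by t^k/T^k comes from the method):

import math

def k_decay_cosine(t, T, L0=0.1, Le=0.001, k=1):
    # Cosine schedule with the k-decay factor t^k / T^k, equation (11).
    return Le + 0.5 * (L0 - Le) * (1 + math.cos(math.pi * t**k / T**k))

def k_decay_exponential(t, T, L0=0.1, lam=5.0, k=1):
    # Exponential schedule with the k-decay factor, equation (12).
    # lam is an assumed decay rate, not a value from the paper.
    return L0 * math.exp(-lam * t**k / T**k)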

Figure 3: The polynomial function of k-decay for different values of k.

Figure 3 shows the difference between the original polynomial function and the new polynomial function based on k-decay. The original function is the special case k = 1. As k increases, the change rate of the learning rate increases in the late stage but decreases in the early stage.
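Curves in the style of Figure 3 can be reproduced with a short matplotlib sketch (the values of N and k are illustrative):

import numpy as np
import matplotlib.pyplot as plt

def k_decay_poly(t, T, L0=0.1, Le=0.001, N=2, k=1):
    return (L0 - Le) * (1 - t**k / T**k)**N + Le

T = 200
t = np.linspace(0.0, T, 500)
for k in (1, 2, 3, 5):
    plt.plot(t, k_decay_poly(t, T, k=k), label=f"k = {k}")
plt.xlabel("epoch")
plt.ylabel("learning rate")
plt.legend()
plt.show()
# Larger k keeps the learning rate high for longer and decays it
# more steeply near the end of training.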

4 Experiments

Model CIFAR-10 CIFAR-100
ResNet-110 6.37 25.21*
ResNet-110 (ours) 5.26 24.48
ResNet-164 5.46 24.33
ResNet-164 (ours) 5.03 23.54
Wide ResNet-16-8 4.81 22.07
Wide ResNet-16-8 (ours) 3.73 20.18
Wide ResNet-28-10 4.17 20.50
Wide ResNet-28-10 (ours) 3.59 18.43
DenseNet-BC-100-12 4.51 22.27
DenseNet-BC-100-12 (ours) 4.19 22.08
DenseNet-BC-250-24 3.62 17.60
DenseNet-BC-250-24 (ours) 3.58 17.07
Table 1: Test error rates (%) on the CIFAR-10 and CIFAR-100 datasets. The overall best results are in bold. * indicates results run by ourselves. Our results are denoted with a subscript, e.g. 5.00_{3.5} means an error rate of 5.00% obtained with k = 3.5 on the new polynomial function.

The experiments evaluate our method with the new polynomial function of equation 10 on the CIFAR-10 and CIFAR-100 datasets with different deep neural network architectures (ResNet [he2016deep] [he2016identity], Wide ResNet [zagoruyko2016wide] and DenseNet [inproceedings]). The implementation of each model is the same as in the original paper, except for the learning rate schedule, which uses the new polynomial function based on k-decay. The values of k start from 1 at intervals of 0.5; L_0 is 0.1 and L_e is 0.001.
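A sketch of the sweep over k (only the start at 1, the 0.5 step, L0 = 0.1 and Le = 0.001 come from the paper; the grid endpoint, T and N are illustrative):

def k_decay_poly(t, T, L0=0.1, Le=0.001, N=2, k=1):
    return (L0 - Le) * (1 - t**k / T**k)**N + Le

T = 200                                 # training epochs, illustrative
ks = [1 + 0.5 * i for i in range(13)]   # k = 1, 1.5, ..., 7
for k in ks:
    lrs = [k_decay_poly(t, T, k=k) for t in range(T)]
    # train the model with this per-epoch schedule and record the test error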

The main results on CIFAR-10 and CIFAR-100 are shown in Table 1. The new function based on the k-decay method produces all of the overall best results (shown in bold); our method obtains the best result on every dataset. On the CIFAR-10 dataset, the ResNet-164 network improves the accuracy by 0.43%, Wide ResNet-16-8 and Wide ResNet-28-10 reduce the error rate by 0.72% and 0.58%, and DenseNet-BC-100 reduces the error rate by 0.32%. On the CIFAR-100 dataset, our method improves the accuracy by 0.79% on ResNet-164 and by 1.69% on Wide ResNet-28-10. As mentioned, the DenseNet-BC-250 network increases the accuracy on CIFAR-100 by 0.53%.

5 Discussion

The impact on performance. Figure 4 shows the relation between the error rate and the value of k for residual networks of different depths (ResNet-47, ResNet-101). We find that as k increases from 1, the accuracy of the model increases. However, when k is greater than a threshold value k_v, the performance decreases. In ResNet-101, when k is greater than 4 the accuracy is lower than before, but still better than at k = 1, so k_v should be 4; in ResNet-47, the accuracy is very low beyond k = 7, lower than at k = 1, so k_v should be 7.

In conclusion, the threshold value k_v decreases as the model's depth increases. This reflects that large networks are more sensitive to the learning rate decay than small networks. It also shows that in the early stage of training, faster learning rate decay is not necessarily better. The model is more sensitive to the change rate of the learning rate in the late stage of training than in the early stage, so increasing the late-stage change rate of the learning rate matters more than the early stage.

On the new polynomial decay function of k-decay (equation 10), there is a certain range of k, from 1 up to k_v, in which the performance of the model improves; once k is greater than k_v, this improvement decreases or even disappears. The threshold k_v differs between models and datasets, which requires some experience in experiments; accordingly, this paper recommends choosing k between 1 and the threshold k_v of the model at hand.

Figure 4: Test error rate (%) on the CIFAR-10 dataset as a function of k. At k = 1 it is the original function.
Figure 5: The loss functions of ResNet, Wide ResNet and DenseNet during training. The k of the blue line is larger than that of the red line. The larger the k, the larger the change range of the loss function in the late stage. The blue line is always above the red line.

The impact on training. From Figure 5 we observe that the change rate of the loss function in the late stage increases with k, because the change rate of the learning rate of the new function in the late stage increases with k. This is the reason the error rate of the model becomes lower than at k = 1. It also shows that the value of k can be read off the loss curve from the change rate of the loss function.

A large k means that the change rate of the learning rate in the early and middle periods is smaller, so the loss decreases more slowly and in the medium term stays above that of a small k. But the change rate of the learning rate in the late stage is larger, causing the loss to converge rapidly and catch up with the medium-term loss of smaller k. When k is too big, however, the medium-term loss lags too far behind, and the rapid late convergence cannot catch it up. This is the reason for the performance degradation when the value of k is greater than the threshold value k_v.

The above discussion also explains why the change rate of the error increases with k and why the error rate decreases with k in Figure 2.

6 Conclusion

This paper proposed a new method, k-decay, that adjusts the learning rate schedule of any differentiable function by modifying its derivative of order k to produce a new function. As a case study, the k-decay method was applied to the polynomial function to derive a new function, giving the decay factor based on k-decay and introducing the new hyper-parameter k to control the change rate of the learning rate in the new function. In the k-decay method, k refers to the derivative of order k: it changes the change rate of the learning rate at the k-th order, and the effect of the change grows as k increases. The experiments show that the performance of the model also increases with k, demonstrating that the k-decay method is effective. The discussion examined how k influences the performance and the optimization process, and in a way verified that the performance of the model (loss, accuracy) follows the change rate of the learning rate; this is the basic reason for the success of the k-decay method. There are many solutions of equation 2, and we have only found a special solution for the polynomial function. Is there a better solution? There should be more, and we hope this research inspires further work in this direction.

References

The polynomial function of k-decay for TensorFlow and Python

def learning_rate_schedule(
        t,          # current step (or epoch)
        T,          # total number of steps (or epochs)
        L0=0.1,     # maximum learning rate
        Le=0.001,   # minimum learning rate
        N=2,        # polynomial exponent (illustrative default)
        k=1,        # k-decay hyper-parameter; k = 1 recovers the original schedule
):
    # Polynomial function of k-decay, equation (10).
    return (L0 - Le) * (1 - t**k / T**k)**N + Le
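For example, with illustrative values T = 200, N = 2 and k = 3:

for t in (0, 100, 180, 200):
    print(t, learning_rate_schedule(t, T=200, N=2, k=3))
# The rate stays near L0 early on and drops steeply toward Le at the end.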