Deep learning [lecun2015deeplearning] is widely used in image recognition, speech recognition and many other fields. Current deep neural networks include convolutional neural networks for images [lin2013network, hu2017squeezeandexcitation, journals/corr/DongLHT15, goodfellow2014generative], recurrent neural networks for speech [devlin2018pretraining] and graph neural networks for graphs [wu2019comprehensive, kipf2016semisupervised]. In these fields deep neural networks have achieved unprecedented success, but this is far from enough. As the technology develops, models become more and more complex and consume more computing resources. Our research goal is therefore to obtain better models with fewer computing resources. One approach is to design more efficient network architectures. For example, as convolutional architectures evolved from VGG [simonyan2014convolutional] to ResNet [he2016deep], and from ResNet to DenseNet [inproceedings], performance improved while the amount of computation fell. The other approach is to find more efficient optimization methods for training neural networks. For example, replacing plain SGD with SGD with momentum trains a higher-performing model in the same amount of time.
In deep neural networks the parameters are updated by gradient descent algorithms. The update rule is

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

where $\theta$ denotes the parameters, $\eta$ is the learning rate and $J$ is the loss function. The learning rate $\eta$ controls how fast the parameters are updated. When $\eta$ is large, the model converges quickly but may skip over local minima. When $\eta$ is small, the model can find a local minimum but converges slowly. A learning rate schedule that decays from a large value to a small one resolves this conflict. The schedule is usually a monotonically decreasing function, for instance a polynomial, cosine or exponential decay function.
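The three decay families named above can be sketched as follows. This is a minimal illustration, not the authors' code: the names `L0` (maximum learning rate), `Le` (minimum learning rate), `T` (total steps) and `N` (exponent) follow the notation used later in the paper, and the particular cosine and exponential forms are common textbook choices assumed here.

```python
import math


def polynomial_decay(t, T, L0=0.1, Le=0.001, N=2.0):
    """Polynomial decay from L0 at t = 0 to Le at t = T."""
    return (L0 - Le) * (1 - t / T) ** N + Le


def cosine_decay(t, T, L0=0.1, Le=0.001):
    """Half-cosine decay from L0 at t = 0 to Le at t = T (assumed form)."""
    return 0.5 * (L0 - Le) * (1 + math.cos(math.pi * t / T)) + Le


def exponential_decay(t, T, L0=0.1, Le=0.001):
    """Exponential decay scaled so that the value at t = T is Le (assumed form)."""
    return L0 * (Le / L0) ** (t / T)


if __name__ == "__main__":
    T = 100
    for t in (0, 50, 100):
        print(t, polynomial_decay(t, T), cosine_decay(t, T), exponential_decay(t, T))
```

All three start at `L0`, end at `Le` and decrease monotonically in between; they differ only in how the decrease is distributed over training.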
Figure 1 shows the test error curves obtained when training ResNet-20 on the CIFAR-10 dataset with polynomial decay schedules of different exponents $N$. As $N$ increases, the change rate of the error late in training (epochs 180 to 200) decreases. The reason is that, as $N$ increases, the change rate of the learning rate late in training also decreases. When $N$ is 2, 3 or 4, the change rate of the learning rate is close to 0 late in training, so the error stops changing. Where the learning rate changes slowly, the error rate changes slowly as well. This indicates a positive correlation between the change rate of the error and the change rate of the learning rate. For exponents at which the learning rate changes slowly late in training, increasing the late change rate of the schedule should increase the late change rate of the error and thereby improve performance. For exponents at which the learning rate changes slowly early in training, decreasing the early change rate of the schedule should increase the early change rate of the error and thereby improve performance. How to change the change rate of the learning rate within a schedule is the problem this paper addresses.
This paper proposes the k-decay method, which changes the change rate of the learning rate by modifying the derivative of order $k$ of the schedule function. Let the learning rate schedule be $L(t)$. The new derivative function (Equation 2) is obtained by adding an increment to the k-th order derivative of $L(t)$; this increment controls and changes the change rate of $L(t)$ at order $k$. Solving for the primitive function of the new derivative function gives the new learning rate schedule. Equation 2 admits many solutions, so more than one new function can be derived in this way.
Figure 2 shows the test errors on CIFAR-10 with ResNet-20 when the new polynomial function based on k-decay is used: the error decreases as $k$ increases. On the error curves, the change rate of the error late in training increases with $k$, because the change rate of the learning rate late in training increases with $k$ in the new function. The original function is the special case $k = 1$. This supports the claim that changing the learning rate schedule with the k-decay method is sound.
2 Related Work
In recent years, many studies have addressed optimization through the learning rate. One line of work uses a better learning rate schedule function to control how the global learning rate changes with time in the optimization algorithm. Commonly used examples include piecewise functions, polynomial functions, stochastic gradient descent with warm restarts (SGDR) [loshchilov2016sgdr], Cyclical Learning Rates (CLR) [conf/wacv/Smith17] and Hyperbolic-Tangent Decay (HTD) [conf/wacv/HsuehLW19]. SGDR combines the cosine function with periodic restarts to improve the rate of convergence in accelerated gradient schemes. CLR is also periodic, but it simply cycles between a maximum and a minimum learning rate, which are determined by an "LR range test". HTD uses the tanh function to construct a new learning rate schedule. These schemes can all improve performance. k-decay differs from them in that it is a mathematical method that can be applied to any differentiable function, so it has wide applicability. The new function is also still continuous, so no other parameters need to be set.
Another line of work adapts the learning rate based on the history of the gradients. Examples are the adaptive learning rate algorithms RMSprop [ruder2016overview], AdaDelta [journals/corr/abs-1212-5701] and Adam [kingma2014adam]. These methods improve on momentum-based stochastic gradient descent: they can accelerate convergence and reduce the step size. They do not conflict with learning rate schedules, however, so the two can be used together.
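To illustrate that a schedule and a gradient-history-based optimizer can be combined, the sketch below feeds a scheduled global learning rate into a momentum update. For simplicity it uses plain momentum SGD as a stand-in for the adaptive methods named above, and the toy quadratic objective, the function names and all hyper-parameter values are assumptions made for this example only.

```python
def polynomial_decay(t, T, L0=0.1, Le=0.001, N=2.0):
    """Global learning rate at step t (polynomial decay from L0 to Le)."""
    return (L0 - Le) * (1 - t / T) ** N + Le


def sgd_momentum_with_schedule(grad_fn, x0, T, momentum=0.9):
    """Momentum SGD whose step size is taken from the schedule at every step."""
    x, velocity = x0, 0.0
    for t in range(T):
        lr = polynomial_decay(t, T)  # the schedule sets the global learning rate
        velocity = momentum * velocity - lr * grad_fn(x)  # the optimizer keeps gradient history
        x = x + velocity
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_final = sgd_momentum_with_schedule(lambda x: 2.0 * (x - 3.0), x0=0.0, T=200)
print(x_final)
```

The schedule and the optimizer state are independent: the schedule only decides `lr`, while the optimizer decides how the gradient history is turned into an update direction.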
3 k-decay For Learning Rate Schedule
Because k-decay is a mathematical method, we take the polynomial function as an example and use the k-decay method to derive a new polynomial function.
The polynomial decay function is

$$L(t) = (L_0 - L_e)\left(1 - \frac{t}{T}\right)^N + L_e$$

where $t$ is the current step, $T$ is the total number of steps, $L_0$ is the maximum learning rate, $L_e$ is the minimum learning rate and $N$ is the exponent.
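The "change rate of the learning rate" discussed in the introduction can be made explicit by differentiating the polynomial decay function. The short derivation below assumes the form $L(t) = (L_0 - L_e)(1 - t/T)^N + L_e$, which is what the appendix code reduces to at $k = 1$; it is a worked illustration and not part of the original derivation.

```latex
\begin{align*}
\frac{dL}{dt} &= -\frac{N\,(L_0 - L_e)}{T}\left(1 - \frac{t}{T}\right)^{N-1}, \\[4pt]
\left.\frac{dL}{dt}\right|_{t=0} &= -\frac{N\,(L_0 - L_e)}{T}, \\[4pt]
\lim_{t \to T}\frac{dL}{dt} &=
\begin{cases}
0, & N > 1, \\[2pt]
-\dfrac{L_0 - L_e}{T}, & N = 1, \\[2pt]
-\infty, & 0 < N < 1.
\end{cases}
\end{align*}
```

For $N > 1$ the change rate therefore vanishes at the end of training, which matches the flat late-training error curves reported for $N = 2, 3, 4$.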
In Equation 2 the increment changes with time, which is too complex to solve directly. We therefore consider the case where it does not change with time and is constant :
Starting from , the new polynomial function is
where, at , is the additional term introduced by k-decay.
Applying a series expansion to :
For convenience, and taking the boundary conditions into account, let be :
Only three parameters now need to be found: , and .
To solve for the order of Equation 6:
First, the function
satisfies the decay conditions and . Second, letting gives
This is constant but changes with . The function can therefore serve as a special solution of Equation 6. From to
When , the change rate of the learning rate late in training needs to increase, so the function takes the minus sign. When , the change rate of the learning rate early in training needs to decrease, so the function takes the plus sign. Simplifying the function:
Because is a monotonically decreasing function, is used to replace .
The resulting function, $L(t) = (L_0 - L_e)\left(1 - t^k/T^k\right)^N + L_e$, is named the polynomial function of k-decay. It is only one special solution of Equation 2. The $k$ in the k-decay method denotes the order of the derivative of the function $L(t)$. As $k$ increases, the change applied to $L(t)$ becomes stronger and acts at a higher order. The term $t^k/T^k$ is named the k-order decay factor of k-decay, and it can be used to construct a new function from any decay function. For example, in the cosine and exponential decay functions, $t/T$ is replaced with $t^k/T^k$ to obtain the new cosine function and the new exponential function.
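The exact cosine and exponential forms are not shown here, so the following is only a sketch under stated assumptions: it applies the substitution of $t/T$ by $t^k/T^k$ (the same substitution the appendix code uses for the polynomial case) to two common textbook decay forms. The function names and the specific base forms are hypothetical choices for illustration.

```python
import math


def k_cosine(t, T, k, L0=0.1, Le=0.001):
    """Half-cosine decay with t/T replaced by t**k / T**k (assumed base form)."""
    factor = t**k / T**k  # k-order decay factor
    return 0.5 * (L0 - Le) * (1 + math.cos(math.pi * factor)) + Le


def k_exponential(t, T, k, L0=0.1, Le=0.001):
    """Exponential decay from L0 to Le with t/T replaced by t**k / T**k (assumed base form)."""
    factor = t**k / T**k  # k-order decay factor
    return L0 * (Le / L0) ** factor


if __name__ == "__main__":
    T = 100
    for k in (1, 2, 3):
        print(k, k_cosine(50, T, k), k_exponential(50, T, k))
```

With `k = 1` both functions reduce to their unmodified base forms; a larger `k` keeps the learning rate higher in the middle of training and moves more of the decay to the end.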
Figure 3 shows the difference between the original polynomial function and the new polynomial function based on k-decay. The original function is the special case $k = 1$. As $k$ increases, the change rate of the learning rate increases late in training but decreases early in training.
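The early and late behaviour described for Figure 3 can be checked numerically. The sketch below evaluates the k-decay polynomial function from the appendix and compares its average change rate over the first and last 20 of 200 epochs; the choice `N = 1`, the 200-epoch horizon and the helper name `mean_change_rate` are assumptions made for this illustration.

```python
def k_polynomial(t, T, k, N=1.0, L0=0.1, Le=0.001):
    """k-decay polynomial schedule; k = 1 gives the original polynomial decay."""
    return (L0 - Le) * (1 - t**k / T**k) ** N + Le


def mean_change_rate(k, start, stop, T=200):
    """Average absolute change of the learning rate per epoch between start and stop."""
    return abs(k_polynomial(stop, T, k) - k_polynomial(start, T, k)) / (stop - start)


ks = (1, 2, 3, 4)
early_rates = [mean_change_rate(k, 0, 20) for k in ks]     # first 20 epochs
late_rates = [mean_change_rate(k, 180, 200) for k in ks]   # last 20 epochs

for k, early, late in zip(ks, early_rates, late_rates):
    print(f"k={k}: early={early:.2e}, late={late:.2e}")
```

Under these assumptions the early change rate shrinks and the late change rate grows as `k` increases, which is the pattern the text attributes to Figure 3.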
Table 1 (excerpt): test error (%) on CIFAR-10 and CIFAR-100.
|Model|CIFAR-10|CIFAR-100|
|Wide ResNet-16-8 (ours)|3.73|20.18|
|Wide ResNet-28-10 (ours)|3.59|18.43|
The experiments evaluate our method, using the new polynomial function (Equation 9), on the CIFAR-10 and CIFAR-100 datasets with different deep neural network architectures (ResNet [he2016deep] [he2016identity], Wide ResNet [zagoruyko2016wide] and DenseNet [inproceedings]). Each model is implemented as in its original paper, except that the learning rate schedule uses the new polynomial function based on k-decay. The value of $k$ is increased from 1 in steps of 0.5. $L_0$ is 0.1 and $L_e$ is 0.001.
The main results on CIFAR-10 and CIFAR-100 are shown in Table 1, where the best results are in bold. The new function based on the k-decay method achieves the best result in every case on both datasets. On CIFAR-10, accuracy improves by 0.43% for ResNet-164, the error rate falls by 0.72% and 0.58% for Wide ResNet-16-8 and Wide ResNet-28-10, and by 0.32% for DenseNet-BC-100. On CIFAR-100, our method improves accuracy by 0.79% for ResNet-164, by 1.69% for Wide ResNet-28-10 and by 0.53% for DenseNet-BC-250.
The impact on performance. Figure 4 shows the relation between the error rate and the value of $k$ for residual networks of different depths (ResNet-47 and ResNet-101). Starting from $k = 1$, the accuracy of the model increases with $k$. However, when $k$ is greater than a threshold value $k_v$, performance decreases. In ResNet-101, when $k$ is greater than 4 the accuracy is lower than before, although it is still better than at , so $k_v$ should be 4; in ResNet-47, the accuracy is very low when , lower than at , so $k_v$ should be 7.
In conclusion, the threshold value $k_v$ decreases as the depth of the model increases. This suggests that a large network is more sensitive to learning rate decay than a small one. It also shows that, early in training, faster learning rate decay is not always better. The model is more sensitive to the change rate of the learning rate late in training than early in training, so increasing the change rate late in training is more important than increasing it early.
For the new polynomial decay function of k-decay, there is a certain range of $k$ values, bounded above by the threshold $k_v$, in which model performance improves. When $k$ is greater than $k_v$, the improvement decreases or even disappears. The threshold differs across models and datasets, so choosing it requires some experimental experience. In this paper, the recommended value range was .
The impact on training. Figure 5 shows that the change rate of the loss late in training increases with $k$, because the change rate of the learning rate late in training increases with $k$ in the new function. This is why the error rate of the model is lower than at $k = 1$. It also shows that the value of $k$ can be distinguished from the loss curve by the change rate of the loss.
A large $k$ means that the change rate of the learning rate in the early and middle periods is smaller, so that in the middle period the loss falls behind that obtained with a small $k$. The change rate of the learning rate late in training is larger, however, so the loss then converges rapidly and catches up. When $k$ is too large, the loss falls too far behind in the middle period and the rapid late convergence cannot catch up. This is the reason for the performance degradation when $k$ is greater than the threshold value $k_v$.
The discussion above also explains why, in Figure 2, the change rate of the error rate increases with $k$ and why the error rate decreases with $k$.
This paper proposed k-decay, a new method for adjusting the learning rate schedule: for any differentiable function, it produces a new function based on the function's derivative of order $k$. As a case study, the k-decay method was applied to the polynomial function to derive a new function, which also yields the decay factor of k-decay. The new function introduces a hyper-parameter $k$ that controls the change rate of the learning rate. In the k-decay method, $k$ denotes the order of the derivative: it changes the change rate of the learning rate at order $k$, and the effect grows as $k$ increases. The experiments show that model performance also improves as $k$ increases, up to the threshold $k_v$, which indicates that the k-decay method is effective. The discussion examined how $k$ influences performance and the optimization process. To some extent it also verified the relation between model performance, such as loss and accuracy, and the change rate of the learning rate, which is the basic reason for the success of the k-decay method. Equation 2 has many solutions, and we found only one special solution for the polynomial function. Is there a better one? There should be more. We hope this research will inspire further work in this direction.
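Implementation of the k-decay polynomial function in Python (usable with TensorFlow)
def learning_rate_schedule(t, L0=0.1, Le=0.001, *, T, N, k):
    # t:  current training step (batch index)
    # L0: maximum (initial) learning rate; Le: minimum (final) learning rate
    # T:  total number of batches over all epochs (required keyword argument)
    # N:  polynomial exponent (required keyword argument)
    # k:  k-decay order; k = 1 gives the original polynomial decay (required keyword argument)
    return (L0 - Le) * (1 - t**k / T**k) ** N + Le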
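A possible way to use the function above is sketched below, with the step `t` counted in batches as the appendix suggests. The training length (200 epochs), the number of batches per epoch (391) and the values `N = 4`, `k = 3` are hypothetical and chosen only for illustration; the function is repeated under a different name so the example is self-contained.

```python
def k_decay_polynomial(t, T, N, k, L0=0.1, Le=0.001):
    """k-decay polynomial schedule, with t and T counted in batches."""
    return (L0 - Le) * (1 - t**k / T**k) ** N + Le


epochs = 200              # hypothetical training length
batches_per_epoch = 391   # hypothetical number of batches per epoch
T = epochs * batches_per_epoch

# Learning rate at the start of every epoch, plus the final value at t = T.
schedule = [
    k_decay_polynomial(step, T, N=4, k=3)
    for step in range(0, T + 1, batches_per_epoch)
]

print(schedule[0], schedule[len(schedule) // 2], schedule[-1])
```

In an actual training loop the function would be evaluated once per batch with the current global step, and the result assigned to the optimizer's learning rate.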