## I Introduction

The Convolutional Neural Network (CNN) is a popular machine learning model for image classification because it achieves state-of-the-art results on visual data. In this paper, we focus on an online learning scenario where the data used for training the CNN comes in batches over time [2016arXiv161001030T, hong2015online]. A CNN model is a neural network structure with a set of weights which are iteratively learned from training data using methods such as Stochastic Gradient Descent (SGD). The SGD algorithm is parametrized with a learning rate. A large learning rate helps the model to converge faster but increases the risk of diverging [Bengio2012]. A small learning rate slows the convergence and may leave the model in a local minimum.

There are two main learning rate evolution strategies: time-based and adaptive. In most time-based strategies, the learning rate decreases following a predefined decay function [W8305126]. Cyclical strategies have also been developed, where two boundaries are defined and the learning rate varies cyclically between them. The disadvantage of these algorithms is that the learning rate path is fixed before training and cannot be adjusted when necessary.

Adaptive learning rate algorithms such as Adam [adam2014arXiv1412.6980K], Nadam (Adam with Nesterov momentum) [dozat2016incorporating] and AMSGrad [amsgradj.2018on] are recent state-of-the-art algorithms which mainly focus on convergence speed. Unlike SGD, which uses only the current value of the gradient to update the weights, these algorithms use the squared gradient to scale the learning rate and take advantage of momentum by using a moving average of the gradients. Nevertheless, Wilson et al. [WilsonRSSR17nips] suggested that adaptive gradient methods do not generalize as well as SGD. These methods tend to perform well in the initial portion of training but are outperformed by SGD at later stages [DBLP:journals/corr/abs-1712-07628]. To address this issue, AdaBound [adabound] employs dynamic bounds on learning rates to achieve a gradual and smooth transition from adaptive methods to SGD.

To our knowledge, E (Exponential)/PD (Proportional Derivative) control [zhao:hal-02115916] is the first adaptive learning rate algorithm which uses control theory to dynamically adapt the learning rate during the learning process. It uses only the current gradient, as in SGD, but its learning rate is dynamically calculated based on the loss value. During the E phase, which corresponds to the beginning of the training when the loss value is continuously decreasing, the learning rate is increased at each time step by a factor of two. Once the loss stops decreasing, the PD phase takes over and, considering the CNN as a dynamic system, computes the control input (i.e. the learning rate) based on the CNN's output (i.e. the loss value).

The above-mentioned algorithms are time-based, in the sense that the control law is computed periodically regardless of its utility. In this paper, we propose two event-based control strategies to reduce the time the CNN spends learning "inefficiently" from data, as well as an extensive evaluation. Moreover, event-based mechanisms can be expected to reduce the use of resources [Astrom:2008, Durand2009c] without degrading performance [lunze2010sfa], and with stability and robustness guarantees [Marchand2013]. Numerous event-based control strategies in the literature focus on stability and performance guarantees. Most event-based PID controllers are based on level-crossing triggering of some measured error (see for instance [arzen1999seb, Durand2009c]) or, more generally, rely on an event function based on Lyapunov functions (see for instance [velasco2009ols, Marchand2013]).

The two Event-Based control algorithms introduced are: (i) Event-Based Learning Rate control, which prevents a sudden drop of the learning rate when the model is approaching the optimum; (ii) Event-Based Learning Epochs control, which decides, based on the learning speed, when to switch to the next data batch.

Our algorithm is evaluated on two classical image datasets, CIFAR-10 and CIFAR-100 [Krizhevsky09]. The results are compared with four state-of-the-art algorithms: Adam, Nadam, AMSGrad and AdaBound. Our results show that E/PD combined with the two introduced Event-Based controls not only outperforms the original E/PD but also converges faster than these four counterparts.

The article is organised as follows: after a brief introduction of the problem in Section I, we detail the scenario and the system to be controlled (i.e. a CNN) with its input and output metrics in Section II. The contribution, i.e. the two event-based mechanisms, is described in Section III. Section IV contains the experimental setup, results and analysis. The article ends with a conclusion and perspectives for further work in Section V.

## II Background

### II-A Classical Online Learning Scenario

We consider a dataset with a total of $T$ training instances, each one belonging to one of $C$ classes. The whole dataset is composed of subsets (i.e. batches) that arrive one after another. Each batch contains the same number of data instances and is used to train the model for a fixed number of epochs. On reception of a new batch, the learning rate algorithm is reset to its initial values. The classical online learning scenario is illustrated in Fig. 1.
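The scenario above can be sketched as a simple loop. This is only an illustration under stated assumptions: `train_one_epoch` and `next_lr` are hypothetical callables standing in for the actual CNN training step and for whichever learning rate strategy is used; neither name comes from the paper.

```python
from typing import Callable, List, Sequence


def classical_online_training(
    batches: Sequence,
    epochs_per_batch: int,
    initial_lr: float,
    train_one_epoch: Callable[[object, float], float],
    next_lr: Callable[[float, List[float]], float],
) -> List[float]:
    """Train on each batch for a fixed number of epochs and return all losses.

    train_one_epoch(batch, lr) -> loss   (hypothetical training step)
    next_lr(lr, batch_losses)  -> lr     (hypothetical learning rate strategy)
    """
    history: List[float] = []
    for batch in batches:
        lr = initial_lr  # the learning rate is reset for every new batch
        batch_losses: List[float] = []
        for _ in range(epochs_per_batch):
            loss = train_one_epoch(batch, lr)
            batch_losses.append(loss)
            history.append(loss)
            lr = next_lr(lr, batch_losses)
    return history
```

Each batch is seen exactly once and always for the same number of epochs, which is the property the event-based epoch control of Section III later relaxes.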

### II-B Convolutional Neural Network and Gradient

The Convolutional Neural Network (CNN) is the state-of-the-art learning mechanism for image classification [NIPS2012_krizhevsky]. The functions of CNN neurons are parameterized with weights and, possibly, biases. The objective of the learning phase is to make iterative adjustments to these weights and biases to better fit the data. The weights of the CNN are usually updated using Stochastic Gradient Descent (SGD) techniques:

$$w(k+1) = w(k) - \alpha(k)\,\nabla L\big(w(k)\big)$$

where the vector $w(k)$ represents the weights computed at discrete time instant $k$, $\alpha(k)$ is positive and denotes the learning rate, and $L$ is the loss function. As we are always trying to minimize the loss function, we suppose that there exists an optimal solution of parameters $w^{*}$.

### II-C Performance Metrics

There exist many metrics to evaluate the performance of a CNN model [li2016performance]; we use two of the most classical ones: classification accuracy and loss value.

For evaluation, machine learning researchers typically prepare a testing dataset which is not used during the training process. At the end of each training phase (called an epoch from now on), the testing dataset is used to evaluate the model by measuring the classification accuracy and the loss value. Accuracy is defined as:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \tag{1}$$

The loss measures the discrepancy between the value predicted by the model and the true value. The most common definition of the loss $L$ used for classification problems is cross-entropy [rubinstein1999]:

$$L = -\frac{1}{V}\sum_{i=1}^{V}\sum_{c=1}^{C} y_{i,c}\,\log p_{i,c} \tag{2}$$

where $V$ is the size of the testing dataset and $C$ is the total number of classes and also the length of the prediction vector, which is a probability vector. $p_{i,c}$ denotes the $c$-th component of the prediction vector for data sample $i$, while $y_{i,c}$ is the ground truth, indicating whether sample $i$ belongs to class $c$ ($y_{i,c}=1$) or not ($y_{i,c}=0$).

## III Event-Based Control Laws

In [zhao:hal-02115916], an E/PD control of the learning rate is proposed, consisting of an increasing phase followed by a PD phase. However, even when the performance is still improving on both the loss and the accuracy, the learning rate is progressively decreased by the E/PD control in the PD phase, although a larger learning rate would be more efficient in terms of performance. Since event-based PID controllers have been shown to be more efficient in terms of convergence [Durand2009c], we propose here an event-based E/PD controller to control the learning rate. [zhao:hal-02115916] also shows that significant improvements in accuracy and loss only occur during the first epochs of training on each data batch, so after this stage there is limited interest in continuing the learning over further epochs. Therefore, we propose a second event-based control to adapt the data batch loading process.

### III-A Event-Based Learning Rate

The E/PD control algorithm from [zhao:hal-02115916] is recalled schematically in Fig. 2. We propose to view CNN training as a dynamical system with the learning rate as the controlled input and the loss as the measurable output. The initial weights of the CNN are chosen randomly and the initial learning rate is fixed. The E/PD learning rate strategy is defined as:

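$$\alpha(k+1) = 2\,\alpha(k) \tag{3}$$

as long as the loss keeps decreasing (E phase), where $\alpha(k)$ denotes the learning rate at epoch $k$, and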

(4) |

from the first instant when the loss stops decreasing to the end of the learning process for the data batch (i.e. the PD phase). For the sake of simplicity, the loss values are normalized with respect to the loss value of the initial epoch. The proportional and derivative gains are detailed in [zhao:hal-02115916].

On top of the PD phase, we consider the following event-based mechanism: instead of letting the PD control recompute the learning rate at every epoch (which might lower it unnecessarily), we update the learning rate only if the loss value increases during the PD phase.

Let us define the event function by:

(5) |

The proposed PD event-triggered control output at time is then:

(6) |

where is the calculated learning rate for epoch , is the corresponding loss for epoch .

Note that the stability of the CNN is ensured by E/PD, whose stability analysis is provided in [zhao:hal-02115916]. The proposed event-based control does not introduce any instability: if no event is triggered, the loss is decreasing and the model is converging, and if an event is triggered, the learning rate strategy returns to E/PD.
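As a minimal sketch of this event-triggered rule, the function below holds the learning rate while the loss decreases and falls back to the PD law when the loss increases. It makes two assumptions that are not taken from the paper: `pd_law` is a hypothetical callable standing in for the PD update of Eq. (4), and a strict increase of the loss is used as the trigger (the handling of equal losses is a guess).

```python
from typing import Callable


def event_based_lr(
    prev_lr: float,
    loss: float,
    prev_loss: float,
    pd_law: Callable[[float, float], float],
) -> float:
    """Return the learning rate for the next epoch during the PD phase.

    pd_law(loss, prev_loss) is a placeholder for the PD control law;
    its real form is defined in the cited E/PD paper, not here.
    """
    if loss > prev_loss:
        # Event: the loss went up, so the PD controller recomputes the rate.
        return pd_law(loss, prev_loss)
    # No event: the loss is still decreasing, so the rate is kept unchanged.
    return prev_lr
```

With a toy law such as `lambda loss, prev: 0.1 * loss`, a decreasing loss leaves the rate untouched, while an increasing loss yields the value returned by the toy law.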

### III-B Event-Based Learning Epochs

#### III-B1 Controller Design

As observed in [zhao:hal-02115916], significant improvement in the learning only occurs at the beginning, right after a new batch is loaded; the accuracy and loss value evolve slowly afterwards. This motivates the use of an event-based strategy on the record of loss values.

Consider a maximum number of training epochs within each batch. Let one vector contain the latest epoch numbers and a second vector contain the corresponding latest normalized loss values:

where

. One can use least squares estimation to fit a regression line with

and :(7) |

The rationale is that if the training process goes well, the loss value should always decrease, and therefore the slope of the regression line should always be negative. Even in the presence of loss variations during the training, as long as the decreasing trend does not change, the slope should still be negative. Nevertheless, as soon as the loss trend becomes flat or even increases, the slope becomes zero or positive.

We define the event mechanism by the event function:

(8) |

which triggers a switch to a new data batch when the learning speed is too low, i.e. when the training is no longer efficient.

The threshold can be adjusted in order to control the efficiency of learning. It should never be positive, as an increasing loss curve is not desirable. With enough computing resources and no time constraints, the threshold can be set close to zero, and the training will continue even though it brings very small improvements. Nevertheless, in online learning the time interval between two data batches can be short compared to the training time, and the next data batch may already be available before the current training epochs are finished. In this case, cutting off useless training can be very valuable. The threshold should therefore also be chosen depending on the frequency of batch arrival. The choice of the number of latest epochs used in the regression is based on the constraints imposed by the CNN (or by the application using the CNN). A large value implies a long period of inactivity, as the controller reacts only after that many epochs (consecutive tests). A small value makes the algorithm very sensitive to each epoch; in the extreme case, the event-based algorithm becomes a time-based one.
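The regression-based test can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `window` and `threshold`, and the choice of a strict comparison between slope and threshold, are assumptions made here for concreteness.

```python
from typing import Sequence


def loss_slope(epochs: Sequence[float], losses: Sequence[float]) -> float:
    """Slope of the least-squares regression line through (epoch, loss) points."""
    n = len(epochs)
    if n < 2 or n != len(losses):
        raise ValueError("need at least two points and equal-length inputs")
    mean_x = sum(epochs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in epochs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(epochs, losses))
    return sxy / sxx


def should_switch_batch(losses: Sequence[float], window: int, threshold: float) -> bool:
    """True when the loss trend over the last `window` epochs is too flat.

    `threshold` is expected to be non-positive; a slope above it means the
    loss is no longer decreasing fast enough (assumed strict comparison).
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(losses) < window:
        return False  # not enough history yet to fit a line
    first_epoch = len(losses) - window + 1
    epochs = list(range(first_epoch, len(losses) + 1))
    return loss_slope(epochs, list(losses[-window:])) > threshold
```

For example, with a window of three epochs and a threshold of -0.05, the losses 1.0, 0.8, 0.6 give a slope of -0.2 and training continues, whereas three identical losses give a slope of 0 and trigger the switch.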

#### III-B2 Online Learning Scenario

Recall the online learning scenario defined in Sec. II-A and Fig. 1. With Event-Based Learning Epochs, the number of training epochs for each batch may vary but never exceeds the maximum number of epochs per batch, while the total number of training epochs is the same in both scenarios for all the experiments on a given dataset. The data batches can therefore be learned cyclically until the total epoch limit is reached. The online learning arrangement for Event-Based Learning Epochs is illustrated in Fig. 3.
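A possible sketch of this cyclic arrangement is given below. It assumes a fixed total epoch budget and two hypothetical callables, `run_epoch` (one training epoch on a batch, returning the loss) and `should_switch` (the event test on the losses recorded for the current batch); both names are illustrative and not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def cyclic_online_training(
    batches: Sequence,
    max_epochs_per_batch: int,
    total_epochs: int,
    run_epoch: Callable[[object], float],
    should_switch: Callable[[List[float]], bool],
) -> List[Tuple[int, float]]:
    """Cycle over the batches until the total epoch budget is spent.

    Returns a log of (batch index, loss) pairs, one entry per epoch.
    """
    if not batches or max_epochs_per_batch < 1:
        raise ValueError("need at least one batch and one epoch per batch")
    log: List[Tuple[int, float]] = []
    visit = 0
    while len(log) < total_epochs:
        index = visit % len(batches)  # wrap around after the last batch
        losses: List[float] = []
        for _ in range(max_epochs_per_batch):
            if len(log) >= total_epochs:
                break  # global budget exhausted
            loss = run_epoch(batches[index])
            losses.append(loss)
            log.append((index, loss))
            if should_switch(losses):
                break  # event: move on to the next batch early
        visit += 1
    return log
```

The epochs saved by leaving a batch early are spent on later batches, or on a second pass over earlier ones, rather than being dropped from the budget.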

## IV Experimental Evaluation

### IV-A Experimental Setup

The experiments are run on two standard machine learning datasets: 1) CIFAR-10 (a natural image dataset with 10 categories) and 2) CIFAR-100 (a natural image dataset with 100 categories) [Krizhevsky09], with 3 different initial learning rates. The characteristics of the two datasets are given in Table I. As the CIFAR-100 dataset has more classes, we use a deeper CNN for it (ResNet [He2016DeepRL]) than the one used for CIFAR-10 (VGG [simonyan2014very]). Due to limited computational resources, for ResNet with CIFAR-100 we train for 30 epochs per data batch instead of the 60 used for CIFAR-10.

Use case | CIFAR-10 | CIFAR-100 |
---|---|---|
# data instances to train (T) | 50,000 | 50,000 |
# data instances to test (V) | 10,000 | 10,000 |
# classes (C) | 10 | 100 |
data batch size | 10,000 | 10,000 |
total batches | 5 | 5 |
# training epochs per batch | 60 | 30 |

All the experiments are implemented with Keras [chollet2015keras] and are carried out on Google Cloud Compute Engine using 8 virtual CPUs with 30 GB of memory and one P100 GPU. Each experiment is repeated 5 times.

The two parameters of the Event-Based Learning Epochs control, i.e. the number of latest epochs used in the regression and the threshold, are selected through cross-validation on a subset of CIFAR-10. As a small number of epochs leads to high sensitivity and a large one slows down the detection, we predefined a reasonable list of candidate values. For similar sensitivity considerations, we also predefined a list of candidate values for the threshold. Each possible pair from these two lists is tested, and the pair offering a good compromise between reactivity and noise sensitivity is retained.

### IV-B Evaluation Metrics

The final loss and the final validation accuracy (hereinafter referred to as FVA) reveal the performance of the final model. Nevertheless, stability metrics are also important: if the accuracy curve shows a large variance near the end of the training process, a good final result cannot be guaranteed to be reproducible. Thus, our evaluation includes the standard deviation of the accuracy over the last 10% of training epochs [minaeemetrics] (hereinafter referred to as FASD, Final Accuracy Standard Deviation). The convergence speed of the accuracy is another performance metric: in an online learning scenario, the interval between two data batches can be short, and with limited time a faster accuracy convergence can lead to a better model than other algorithms achieve. Therefore, we report the first epoch at which an experiment reaches 95% of the best final accuracy among all the experiments.

### IV-C Evaluation of Event-Based E/PD

Event-Based E/PD (hereinafter referred to as EB E/PD) refers to the E/PD control combined with Event-Based Learning Rate control (Sec. III-A). We run the online training experiments with E/PD and EB E/PD on CIFAR-10. Fig. 5 first shows the comparison between EB E/PD and the original E/PD (consider only the yellow and dotted blue lines for now). For the first 60 epochs, EB E/PD is more stable than E/PD; afterwards their curves largely overlap. The averaged comparison results are shown in Table II. EB E/PD performs better than E/PD in almost all metrics for all initial learning rates. Even though EB E/PD has a higher FASD for initial learning rates 0.01 and 0.05, the minimum of the FVA(FASD) range of EB E/PD is higher than the maximum of the range of E/PD.

Table II: Comparison of E/PD and EB E/PD on CIFAR-10.

Algorithm | Initial learning rate | Final loss | FVA (FASD) (%) | 1st epoch to 81.66% |
---|---|---|---|---|
E/PD | 0.002 | 0.58 | 83.17(0.08) | 124/300 |
EB E/PD | 0.002 | 0.56 | 83.81(0.03) | 93/300 |
E/PD | 0.01 | 0.55 | 84.35(0.07) | 88/300 |
EB E/PD | 0.01 | 0.54 | 84.91(0.10) | 75/300 |
E/PD | 0.05 | 0.56 | 85.06(0.12) | 73/300 |
EB E/PD | 0.05 | 0.50 | 85.96(0.26) | 63/300 |

1. FVA: Final Validation Accuracy

2. FASD: Final Accuracy Standard Deviation

3. 81.66%: 85.96% (best final accuracy among all the experiments) × 95%
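The two stability and speed metrics of Sec. IV-B can be computed from a per-epoch accuracy curve as sketched below. The sketch rests on assumptions not stated in the paper: the population standard deviation is used for FASD, and the number of tail epochs is obtained by rounding 10% of the curve length.

```python
from statistics import pstdev
from typing import Optional, Sequence


def fasd(accuracies: Sequence[float], tail_fraction: float = 0.10) -> float:
    """Standard deviation of the accuracy over the last fraction of epochs."""
    tail = max(1, round(len(accuracies) * tail_fraction))
    return pstdev(accuracies[-tail:])


def first_epoch_reaching(
    accuracies: Sequence[float],
    best_final_accuracy: float,
    ratio: float = 0.95,
) -> Optional[int]:
    """First epoch (1-indexed) whose accuracy reaches ratio * best accuracy.

    Returns None when the target is never reached, which corresponds to the
    "-" entries in the result tables.
    """
    target = ratio * best_final_accuracy
    for epoch, accuracy in enumerate(accuracies, start=1):
        if accuracy >= target:
            return epoch
    return None
```

With a best final accuracy of 85.96%, the target is 0.95 × 85.96 ≈ 81.66%, which matches footnote 3 of the table above.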

For the sake of visibility, we zoom into the 60th to 90th training epochs of our two experiment runs and show the evolution of the loss value and learning rate in Fig. 4. According to the learning rate curve, the E phase ends at the 62nd epoch for E/PD and at the 64th epoch for EB E/PD. The E/PD curve clearly shows the problem mentioned above: from the 62nd epoch, the loss of E/PD decreases continuously until the 70th epoch, and its learning rate also decreases during this period. If the learning rate had stayed constant during these 9 epochs, the loss would have decreased more sharply, which would have improved the convergence speed. In contrast, EB E/PD keeps the learning rate unchanged while the loss continuously decreases, which helps to accelerate the convergence. We can also notice that, as the loss drops, each update of the EB E/PD learning rate follows a decreasing trend, which guarantees the stability of EB E/PD near the optimum.

### IV-D Evaluation of Double-Event-Based E/PD

Double-Event-Based E/PD control (hereinafter referred to as D-EB E/PD) refers to the E/PD control combined with Event-Based Learning Rate control (Sec. III-A) and Event-Based Learning Epochs control (Sec. III-B). To verify the need for the Event-Based Learning Rate control, we also ran E/PD with only the Event-Based Learning Epochs control; the results showed that Double-Event-Based E/PD always performs better in final loss and FVA. Due to the page limitation, these results are excluded from the main manuscript; they are available online as appendices.

D-EB E/PD has been tested on CIFAR-10 and CIFAR-100 and compared with four state-of-the-art adaptive optimization algorithms: Adam, Nadam, AMSGrad and AdaBound. For these four learning rate strategies, apart from the varying initial learning rate, all other parameters are kept at the defaults given in their respective papers or coded in Keras. Since D-EB E/PD includes the Event-Based Learning Epochs control, the number of training epochs for each data batch is not fixed, and each data batch may be iterated over several times. Therefore, we report not only the results at the end of the whole training process, but also the results after the first round of training (i.e. once the training process has iterated, for the first time, over all the data batches; see Fig. 3).

Experimental results on CIFAR-10 are shown in Fig. 5; all the curves are generated with the same initial learning rate of 0.01. Between the 25th and 60th epochs, D-EB E/PD largely outperforms all its counterparts. The vertical line with an arrow at the 104th epoch indicates that D-EB E/PD has finished its first round of learning over all 5 batches after this epoch. There are two reasons for this performance: (i) EB E/PD converges very fast; (ii) during these epochs, D-EB E/PD has already trained on later data batches, while the other four algorithms are still working on the first batch. Diversity of training data helps to reach better performance.

More detailed results on CIFAR-10 are reported in Table III. D-EB E/PD reaches a higher final accuracy and a lower final loss regardless of the initial learning rate. Even though D-EB E/PD has a higher FASD than AdaBound with initial learning rates 0.01 and 0.05, the FVA(FASD) range of D-EB E/PD is always above the range of AdaBound. Additionally, it takes only about 32 to 38 epochs to reach 95% of the best accuracy in any group. All the indicators are very stable across the different groups for D-EB E/PD. Note also that the four state-of-the-art algorithms all perform poorly with an initial learning rate of 0.05: they do not even reach 95% of the best accuracy. We also ran the same experiments with one more initial learning rate setting. Except for our algorithm, none reaches a reasonable accuracy value, which can be explained by the fact that during the PD phase of E/PD control our learning rate can decrease to a low level while the counterparts cannot. Those results are available as appendices.

CIFAR-100 results are reported in Table IV. According to the FVA, none of the algorithms has fully converged by the end of the training process, but this does not affect the conclusions of our analysis. D-EB E/PD outperforms the other algorithms in almost all the metrics; in the groups where its FASD is higher than that of the others, its FVA(FASD) range is still above theirs. As the algorithms have not fully converged, the accuracy curve is still increasing; therefore, the higher the initial learning rate, the earlier the first epoch reaching 95% of the best accuracy.

Table V shows the results of D-EB E/PD at the end of the first round of learning. Every final loss after the first round in this table is lower than the final loss of all the state-of-the-art algorithms at the end of their whole training process, within the same group. Except for CIFAR-100 with an initial learning rate of 0.002, all the FVA values after the first round exceed 95% of the best accuracy in Table III and Table IV, respectively. As the learning process on CIFAR-100 has not fully converged, the first round ends close to the end of the whole training process: the event-based control did not cut off many epochs. For CIFAR-10, however, the event-based control cuts off around 62% to 67% of the training epochs while still guaranteeing a very good result.

### IV-E Trade-offs and limitations

The addition of event-based mechanisms improves the performance in terms of final accuracy and loss, but at the cost of two sacrifices: (i) Event-Based Learning Epochs accelerates the learning of each data batch. However, if no data batch may be cached locally, i.e. each data batch can be learned only once, the performance of Double-Event-Based E/PD after the first round is slightly worse than its performance after all the training epochs. (ii) Double-Event-Based E/PD learns all data batches cyclically, so it needs to load and unload data batches more often than the classical online learning setting. Loading data into memory and unloading it takes time. These are extra costs for Double-Event-Based E/PD, although they are negligible compared to the computing intensity of CNNs.

Regarding the limitations of the presented D-EB E/PD, we identified one potential case in which our algorithm will fail: training data that contains mislabeled samples. Such data lead the model to converge to a wrong optimum, and as the algorithm minimizes the loss function faster, it will over-fit to the noisy data faster than other algorithms. However, this failure is caused by poor data selection and is not specific to our algorithm.

Table III: Results on CIFAR-10.

Algorithm | Initial learning rate | Final loss | FVA (FASD) (%) | 1st epoch to 80.94% |
---|---|---|---|---|
D-EB E/PD | 0.002 | 0.58 | 84.50(0.59) | 38/300 |
Adam | 0.002 | 0.73 | 84.14(1.34) | 64/300 |
Nadam | 0.002 | 0.71 | 83.29(1.11) | 66/300 |
AMSGrad | 0.002 | 0.67 | 84.21(1.65) | 65/300 |
AdaBound | 0.002 | 0.81 | 84.31(0.96) | 75/300 |
D-EB E/PD | 0.01 | 0.61 | 84.83(1.29) | 37/300 |
Adam | 0.01 | 0.79 | 83.98(1.58) | 64/300 |
Nadam | 0.01 | 0.75 | 84.15(1.29) | 65/300 |
AMSGrad | 0.01 | 0.65 | 84.21(1.50) | 72/300 |
AdaBound | 0.01 | 0.84 | 79.22(1.21) | - |
D-EB E/PD | 0.05 | 0.60 | 85.20(3.14) | 32/300 |
Adam | 0.05 | 5.98 | 48.93(14.06) | - |
Nadam | 0.05 | 7.74 | 42.27(13.95) | - |
AMSGrad | 0.05 | 2.69 | 59.74(12.43) | - |
AdaBound | 0.05 | 1.03 | 71.49(1.65) | - |

1. 80.94%: 85.20% (best final accuracy among all the experiments) × 95%

Table IV: Results on CIFAR-100.

Algorithm | Initial learning rate | Final loss | FVA (FASD) (%) | 1st epoch to 46.56% |
---|---|---|---|---|
D-EB E/PD | 0.002 | 2.59 | 45.69(1.94) | - |
Adam | 0.002 | 3.40 | 31.29(3.23) | - |
Nadam | 0.002 | 3.18 | 35.66(3.35) | - |
AMSGrad | 0.002 | 3.13 | 35.38(4.02) | - |
AdaBound | 0.002 | 3.29 | 39.87(4.42) | - |
D-EB E/PD | 0.01 | 2.41 | 48.14(3.34) | 111/150 |
Adam | 0.01 | 4.94 | 8.11(2.04) | - |
Nadam | 0.01 | 4.55 | 9.70(2.32) | - |
AMSGrad | 0.01 | 4.79 | 8.16(0.50) | - |
AdaBound | 0.01 | 3.51 | 30.98(3.08) | - |
D-EB E/PD | 0.05 | 2.38 | 49.01(10.52) | 100/150 |
Adam | 0.05 | 4.72 | 2.64(0.58) | - |
Nadam | 0.05 | 4.74 | 1.88(0.79) | - |
AMSGrad | 0.05 | 4.68 | 1.98(0.56) | - |
AdaBound | 0.05 | 3.69 | 19.03(2.42) | - |

1. 46.56%: 49.01% (best final accuracy among all the experiments) × 95%

Table V: Results of D-EB E/PD at the end of the first round of learning.

Dataset | Initial learning rate | EE of FR | FL after FR | FVA after FR (%) |
---|---|---|---|---|
CIFAR-10 | 0.002 | 99/300 | 0.60 | 82.47 |
CIFAR-10 | 0.01 | 104/300 | 0.62 | 82.36 |
CIFAR-10 | 0.05 | 113/300 | 0.62 | 82.75 |
CIFAR-100 | 0.002 | 148/150 | 2.61 | 44.98 |
CIFAR-100 | 0.01 | 148/150 | 2.44 | 48.04 |
CIFAR-100 | 0.05 | 146/150 | 2.41 | 48.95 |

1. EE of FR: End Epoch of First Round

2. FL after FR: Final Loss after First Round

3. FVA after FR: Final Validation Accuracy after First Round

## V Conclusion and future work

Because of limited computing resources or a short time interval between two data batches, the convergence speed of the loss and accuracy becomes especially important for online learning. E/PD control is a powerful learning rate algorithm for training neural networks in an online learning scenario. Based on E/PD, this paper proposes two algorithms: (i) the Event-Based Learning Rate algorithm and (ii) the Event-Based Learning Epochs algorithm.

The first Event-Based control acts on the PD phase of E/PD: while the loss continuously decreases, the learning rate is prevented from decreasing. The second Event-Based control inspects the record of loss values: if the loss trend is flat or increasing, indicating little learning efficiency, the remaining learning epochs for the current data batch are dropped.

Results show that Double-Event-Based E/PD can cut off a large share of the training epochs and even results in a lower loss value. For instance, with the CIFAR-10 dataset, it saves up to 67% of the training epochs.

As the Event-Based Learning Epochs control is independent of the learning rate algorithm and of the dataset, this work could be extended by applying this control to language, image and numeric datasets with time-based decay SGD, Adam, Nadam, AMSGrad and AdaBound, in order to test whether simply adding this event-based control improves the performance of these learning rate algorithms in an online learning scenario.
