Auto-Ensemble: An Adaptive Learning Rate Scheduling based Deep Learning Model Ensembling

03/25/2020 · Jun Yang, et al. · Huazhong University of Science & Technology

Ensembling deep learning models is a shortcut to deploying them in new scenarios, since it avoids tuning neural networks, losses and training algorithms from scratch. However, it is difficult to collect sufficiently accurate and diverse models in a single training run. This paper proposes Auto-Ensemble (AE), which collects checkpoints of a deep learning model and ensembles them automatically using an adaptive learning rate scheduling algorithm. The advantage of this method is that it makes the model converge to various local optima by scheduling the learning rate within one training run. When the number of local optima found tends to saturate, all the collected checkpoints are used for the ensemble. Our method is universal and can be applied in various scenarios. Experimental results on multiple datasets and neural networks demonstrate that it is effective and competitive, especially on few-shot learning. In addition, we propose a method to measure the distance among models, which lets us ensure both the accuracy and the diversity of the collected models.


1 Introduction

Optimizing the structure of a network and its loss function is an NP-hard problem Blum and Rivest (1992). To enhance the generalization capability of models, different network structures have been designed for different scenarios; manually designed network structures are therefore often highly task-specific. For a new task, the network structure usually has to be redesigned or heavily re-optimized to maintain generalization performance, which costs a large amount of manpower and computing resources. Neural Architecture Search (NAS) has therefore been proposed as a new way to construct powerful models: it searches the network structure automatically and frees up expert time. However, NAS must consume a large training budget to find the "best" network structure, and it cannot guarantee the performance and generalization of the model with respect to the loss function and training algorithm, so it is mainly applied to routine supervised learning with a large amount of labeled data Elsken et al. (2019); Fang et al. (2019); Jin et al. (2019); Zoph and Le (2017). Hence, NAS is rarely involved in other machine learning fields such as semi-supervised learning and few-shot learning. Fine-tuning can also improve generalization, but it requires significant expertise.

The above methods are not universal enough. For the same problem, ensemble learning is widely used to improve accuracy and generalization in machine learning applications Zhou (2012). However, traditional ensemble methods such as Random Forest and AdaBoost can hardly extract features for a specific task, and their feature engineering relies heavily on manual selection. To avoid a huge training budget and complicated feature engineering, this paper proposes a simple, automatic, deep-learning based ensemble method, called Auto-Ensemble (AE), to improve performance and generalization. The key idea is to schedule the learning rate so that checkpoints of the model are collected and ensembled automatically within a single training run.

Adaptive ensembles such as AdaNet make use of NAS models and automatically search over a space of candidate ensembles Weill et al. (2019). Auto-Ensemble (AE) differs from AdaNet but is also an automatic search process: by scheduling the learning rate, AE searches the loss surface to collect checkpoints of the model for the ensemble. Compared to AdaNet, it uses fewer computing resources, brings a considerable improvement, and is easier to implement.

AE uses an adaptive cyclic learning rate strategy to realize Auto-Ensemble. Cyclic learning rate strategies exploit SGD's ability to avoid or even escape saddle points and poor local minima: by simply scheduling the learning rate, the model can converge to a better local optimum within a shorter training period Bengio (2012). However, cyclic learning rates require many manually set hyperparameters and cannot guarantee the diversity of the collected checkpoints. We therefore use an adaptive learning rate strategy with fewer hyperparameters, which automatically collects as many accurate and diverse checkpoints as possible in a single training run. To ensure the diversity of the checkpoints collected each time, we propose a method to measure the distance between checkpoints, which ensures that the model converges to different local optima during continuous training.

The main contributions of this paper are:

• An easy-to-implement methodology for ensembling models automatically within a single training run. In addition to traditional supervised learning, our experiment on few-shot learning demonstrates that Auto-Ensemble applies to other deep learning scenarios.

• A method to measure the diversity among models, with which we can ensure the accuracy and diversity of the models collected during training.

• Experiments demonstrating the efficiency of Auto-Ensemble: the accuracy of the classifier is significantly improved and greatly exceeds that of a single model, which greatly reduces the workload of manually designing and optimizing network models.

We organize the paper as follows. We first describe the significance and give an overview of the Auto-Ensemble method. Section 2 briefly introduces related work, Section 3 explains each part of the Auto-Ensemble method in detail, Section 4 presents our experimental setup and results, and Section 5 concludes with the advantages of our method and future work.

2 Related Work

Classification ensemble learning techniques have demonstrated powerful capacities to improve upon the classification accuracy of a base learning algorithm Zhou (2012). A common feature of these approaches is to obtain multiple classifiers by repeatedly applying a base learning algorithm to the training data. To classify a new sample, the outputs of the individual classifiers are aggregated, for instance by voting, to obtain the final classification, which typically achieves significantly better performance than an individual learner Webb and Zheng (2004). In some scenarios, an ensemble of simple models can achieve results comparable to a complex model at a greatly reduced computational cost.

Training a deep neural network is the process of minimizing a loss function defined over a corpus of feature vectors and accompanying labels. Li et al. (2018) proposed a method for visualizing the loss surface of deep neural networks and found that the more complex the network, the more chaotic the loss surface. It has also been found that a large loss surface contains many local optima Goodfellow and Vinyals (2015); Li et al. (2018). Ensemble learning makes full use of these different local minima Huang et al. (2017).

By scheduling the learning rate, the model can converge to different optima: once the model encounters a saddle point during training, it can quickly jump out of it by increasing the learning rate. Work on cyclic learning rates (CLR) shows that a CLR schedule can make the training of convolutional neural networks more efficient and eliminates the need for numerous experiments to find the best learning rate values and schedule Smith (2015).

Research shows that gathering the outputs of neural networks from different epochs at the end of training can stabilize the final predictions Xie et al. (2013). Checkpoint ensembles provide a way to collect abundant models within a single training process Chen et al. (2017); they greatly shorten the training time that ensembling requires and achieve better improvements than traditional ensembles.

Snapshot Ensemble combines the cyclic learning rate with checkpoint ensembling: it adopts a warm restart scheme Huang et al. (2017); Loshchilov and Hutter (2017); Wen et al. (2019), where at each restart the learning rate is re-initialized to some value and then scheduled to decrease following a cosine function. At the end of each cycle, a snapshot of the model is saved. Multiple snapshot models can thus be collected in one training run, which greatly reduces the training budget. Wen et al. (2019) proposed a new Snapshot Ensemble method together with a log-linear learning rate test; combining Snapshot Ensemble with an appropriate learning rate range outperforms the original methods Huang et al. (2017); Smith (2015). FGE Garipov et al. (2018) proposed a method for quickly collecting models that finds paths between two local optima such that the training loss and test error remain low along these paths; a smaller cycle length and a simpler learning rate curve can then be used to collect accurate and diverse models along the learning curve.

Since the ensemble accuracy depends on the number, diversity and accuracy of the individual models, adaptive ensembling aims to find an optimal condition for ensemble learning. Inoue (2019) proposed a confidence-level based early-exit condition for ensembles, which automatically selects the number of ensembled models and thereby reduces computation cost. This inspired us to collect models automatically during training.

According to Bengio (2012), by scheduling the learning rate, the model can find as many different local optima as possible while exploring the loss surface. The Auto-Ensemble method builds on cyclic learning rate schedules: every time a checkpoint of the model is collected, the learning rate rises so that the model escapes the current local optimum. Finally, all collected checkpoints are ensembled to improve the generalization performance of classification. The number of training epochs, the range of the learning rate and the number of collected models are adaptive rather than fixed in advance.

Ju et al. (2018) compared the relative performance of various ensemble methods and found that a particular method, the Super Learner, achieved the best performance among them. It is a cross-validation based ensemble method that uses the validation set of the neural networks to compute the Super Learner weights. Our AE method borrows this idea and proposes a weighted averaging method, which helps improve the prediction.

3 Auto-Ensemble

The Auto-Ensemble (AE) proposed in this paper schedules the learning rate to control how the model explores the loss surface. After a checkpoint of the model has been collected, the learning rate rises so that the model escapes the corresponding local optimum and starts a new search. Finally, all collected checkpoints are used for the ensemble. Our method can explore as many models as possible with sufficiently high accuracy and diversity.

3.1 Description of Model Diversity

Figure 1: The simulation of the loss surface

The main problem is to ensure model diversity. Huang et al. (2017) discussed the correlation of the collected snapshots and chose the snapshot models for combination accordingly. Auto-Ensemble ensures the diversity of the collected checkpoints: the adaptive learning rate schedule automatically finds a local optimum during training, and after a checkpoint has been collected, steeply increasing the learning rate makes the model jump out of that local optimum and continue searching for other optima. Figure 1(a) shows how Snapshot Ensemble (SSE) Huang et al. (2017) collects snapshots. In Figure 1(b), the blue arrowed line is the convergence process of Auto-Ensemble: the model escapes sharply from a local optimum and then converges to a different local minimum. The short arrowed line is the convergence process of traditional SGD, which finds a local optimum slowly and inefficiently.

3.2 Metric of Model Diversity

Figure 2: Auto adaptive learning rate schedule

We found that the weights and biases of the last dense layer (the penultimate dense layer of the Siamese network) can be extracted to measure the distance between models. We record two Euclidean distances, denoted here d1 and d2: d1 is the distance between the weights when the model converges to a local optimum and the weights when the learning rate rose to its highest value in the previous cycle; d2 is the distance between the weights of the checkpoint at the local optimum and the weights when the learning rate rises to its maximum in the current cycle (Figure 2). The arrowed lines show the two distances measured during the search of the loss surface. To ensure sufficient distance among the collected checkpoints, d2 should be much greater than d1.
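As an illustration, both distances can be computed directly from the flattened kernel and bias of the last dense layer. This is only a minimal sketch under the notation above (d1, d2); the helper names are hypothetical and it assumes a Keras model whose last layer is a Dense layer with a bias.

import numpy as np

def dense_layer_vector(model):
    """Flatten the kernel and bias of the last dense layer into one vector."""
    kernel, bias = model.layers[-1].get_weights()  # assumes the last layer is Dense
    return np.concatenate([kernel.ravel(), bias.ravel()])

def diversity_distances(w_prev_peak, w_converged, w_current_peak):
    """Return (d1, d2) as Euclidean distances between weight vectors.

    d1: converged checkpoint vs. weights at the previous cycle's LR peak.
    d2: converged checkpoint vs. weights at the current cycle's LR peak.
    """
    d1 = np.linalg.norm(w_converged - w_prev_peak)
    d2 = np.linalg.norm(w_current_peak - w_converged)
    return d1, d2

In our notation, the learning rate keeps rising until d2/d1 exceeds the ratio threshold used in Algorithm 1, so that the next checkpoint is collected far from the previous one.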

Figure 3: Comparison of the normalized distance between model weights (a) and the correlation coefficients among models (b)

We compared the Euclidean distance between the weights of models with the traditional correlation coefficient method on ResNet models in Figure 3. Figure 3(a) shows the distance among the models' last dense layers: to make the comparison more intuitive, we normalize the distances, mapping distance 0 to 1 and the maximum distance to 0.9 (corresponding to the maximum and minimum value of the coordinate axis). The normalized distance value y is obtained from the actual distance x by the linear mapping

$$y = y_{max} - \frac{x - x_{min}}{x_{max} - x_{min}}\,(y_{max} - y_{min}) \qquad (1)$$

where $y_{max}$ and $y_{min}$ are the coordinate-axis boundaries of the correlation plot (Figure 3(b)), x is the actual distance between the weights, and $x_{max}$, $x_{min}$ are its maximum and minimum, respectively. Figure 3(b) shows the correlation coefficients among the models. It can be observed that the farther apart two models are, the greater the distance (the smaller the normalized distance value) and the smaller the correlation, which preliminarily shows that the distance between weights can be used to measure the diversity of models.
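For concreteness, a small sketch of the linear rescaling in Eq. (1); the function name and the default axis bounds (0.9 and 1.0, as in the mapping described above) are the only assumptions.

def normalize_distance(x, x_min, x_max, y_min=0.9, y_max=1.0):
    """Linearly map a raw weight distance x in [x_min, x_max] onto [y_max, y_min],
    so that the smallest distance (x_min, i.e. 0 in the paper) maps to 1.0 and
    the largest distance maps to 0.9."""
    return y_max - (x - x_min) / (x_max - x_min) * (y_max - y_min)

# e.g. normalize_distance(3.2, 0.0, 8.0) == 0.96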

3.3 Learning Rate Schedule

The learning rate schedule is a piecewise linear cyclic schedule following Garipov et al. (2018). We set a learning rate boundary (lr_min, lr_max). The values of lr_max and lr_min differ considerably (generally by two orders of magnitude): lr_max speeds up the gradient descent process, while lr_min lets the model converge to a wide local optimum. Within the cycle that collects one checkpoint, the learning rate first decreases linearly with change rate k1 and is then kept at the minimum until the model converges. Formally, for the i-th iteration of the current cycle the learning rate lr has the form

$$lr(i) = \max\bigl(lr_{max} - k_1\, i,\ lr_{min}\bigr) \qquad (2)$$

where n is the number of training iterations of the decline phase and the change rate is $k_1 = (lr_{max} - lr_{min})/n$. The learning rate then increases linearly. The rise is divided into two parts: a rapid rise phase and a loss surface exploring phase. For global iteration i the learning rate lr has the form

$$lr(i) = \begin{cases} lr_{min} + k_2\,(i - M), & M < i \le M + m \\ lr_m + k_3\,(i - M - m), & i > M + m \end{cases} \qquad (3)$$

where M is the total number of training iterations from the beginning of training until the checkpoint is collected, m is the length of the rapid rise phase, and $lr_m$ is the learning rate at the end of the rapid rise phase. The change rate k2 of the rapid rise phase is the largest, which aims to make the model jump out of the current local optimum quickly; the change rate k3 of the subsequent phase is smaller, so that the loss surface is explored more carefully.
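The piecewise schedule of Eqs. (2)-(3) can be written as two small functions of the iteration index. This is a sketch under the notation introduced above (k1, k2, k3, n, M, m), not the authors' exact code.

def lr_decline(i, lr_max, lr_min, n):
    """Eq. (2): linear decline at rate k1 = (lr_max - lr_min)/n, then hold at lr_min.
    i counts iterations since the current cycle started."""
    k1 = (lr_max - lr_min) / n
    return max(lr_max - k1 * i, lr_min)

def lr_rise(i, M, m, lr_min, k2, k3):
    """Eq. (3): rapid rise for m iterations after the checkpoint collected at
    global iteration M, then a slower exploring rise at rate k3."""
    if i <= M + m:                      # rapid rise phase
        return lr_min + k2 * (i - M)
    lr_m = lr_min + k2 * m              # learning rate at the end of the rapid rise
    return lr_m + k3 * (i - M - m)      # loss-surface exploring phase

The rise is not capped by a fixed upper bound but by the diversity criterion of Section 3.2.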

3.4 Auto-Ensemble

Require: LR bounds lr_max, lr_min; LR change rates k1, k2, k3; number of iterations n; length of rapid rise phase m; ratio threshold λ of d2 to d1
Pretrain phase: run a standard learning rate schedule for 75% of the epochs
  repeat
     repeat
        update the learning rate according to Eq. (2)
        if lr = lr_min and the model has not converged then
           keep lr = lr_min
        end if
     until the model has converged; collect the checkpoint and record the number of iterations M
     repeat
        for iteration i from M to M+m (rapid rise phase) do
           increase lr with change rate k2
        end for; record the current learning rate lr_m
        increase lr with change rate k3 and measure d1 and d2
     until d2/d1 ≥ λ; the current cycle is over
  until the stopping condition for training is satisfied
  Ensemble phase: for each collected model h_t, t = 1, ..., T, get the predicted softmax output h_t(x), where T is the total number of collected checkpoints and x is the input data. Define a fully connected network H(x) to learn the combination weights; the weighted averaging result is H(x) = Σ_t w_t h_t(x), where w_t is the weight of the individual learner h_t.
Algorithm 1 Auto-Ensemble Algorithm

The procedure is summarized in Algorithm 1. Before starting to schedule the learning rate, we adopt a pretrain phase. A warm start plays a key role in machine learning: if the model is pretrained for a period before the adaptive schedule starts, the experimental results tend to improve significantly. After several model checkpoints have been collected, the ensemble prediction at test time is the average of every model's softmax outputs. In addition, to improve the efficiency of the ensemble, we designed a weighted averaging method in which each model receives its own weight.
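A minimal sketch of how the adaptive cycle of Algorithm 1 could be wired into Keras as a callback, assuming tf.keras and a model whose last layer is Dense. The class name, the convergence test (loss patience), the first-cycle handling of d1 and all default hyperparameter values are illustrative assumptions, not the authors' exact implementation.

import numpy as np
import tensorflow as tf

class AutoEnsembleScheduler(tf.keras.callbacks.Callback):
    """Illustrative controller for the adaptive cycle of Algorithm 1.

    Decline phase: the LR falls linearly from lr_max to lr_min (Eq. 2) and is
    held at lr_min until the training loss stops improving ("converged"); the
    weights are then stored as a checkpoint. Rise phase: the LR grows at rate
    k2 for m epochs, then at the smaller rate k3 (Eq. 3), until the distance
    ratio d2/d1 of Section 3.2 reaches lam, which starts a new cycle.
    """

    def __init__(self, lr_max=0.4, lr_min=0.01, n=30, m=5,
                 k2=0.05, k3=0.01, lam=1.5, patience=3):
        super().__init__()
        self.lr_max, self.lr_min, self.n, self.m = lr_max, lr_min, n, m
        self.k2, self.k3, self.lam, self.patience = k2, k3, lam, patience
        self.phase, self.t = "decline", 0          # t: epochs in current phase
        self.best_loss, self.wait = np.inf, 0
        self.prev_peak = None                      # weights at previous LR peak
        self.conv_w = None                         # weights of last checkpoint
        self.checkpoints = []                      # collected model weights

    def _last_dense(self):
        # assumes the model ends with a Dense layer
        return self.model.layers[-1].get_weights()[0].ravel()

    def on_epoch_begin(self, epoch, logs=None):
        if self.phase == "decline":
            k1 = (self.lr_max - self.lr_min) / self.n
            lr = max(self.lr_max - k1 * self.t, self.lr_min)       # Eq. (2)
        elif self.t < self.m:                                      # rapid rise
            lr = self.lr_min + self.k2 * (self.t + 1)              # Eq. (3), part 1
        else:                                                      # exploring rise
            lr = self.lr_min + self.k2 * self.m + self.k3 * (self.t + 1 - self.m)
        tf.keras.backend.set_value(self.model.optimizer.learning_rate, lr)

    def on_epoch_end(self, epoch, logs=None):
        self.t += 1
        if self.phase == "decline":
            loss = logs.get("loss", np.inf)
            if loss < self.best_loss - 1e-4:
                self.best_loss, self.wait = loss, 0
            else:
                self.wait += 1
            if self.wait >= self.patience:          # treat as "converged"
                self.checkpoints.append(self.model.get_weights())
                self.conv_w = self._last_dense()
                self.phase, self.t = "rise", 0
                self.best_loss, self.wait = np.inf, 0
        elif self.t >= self.m:                      # past the rapid rise phase
            cur = self._last_dense()
            d2 = np.linalg.norm(cur - self.conv_w)
            # no previous LR peak exists in the first cycle: accept immediately
            d1 = np.linalg.norm(self.conv_w - self.prev_peak) if self.prev_peak is not None else 0.0
            if d2 >= self.lam * d1:                 # diverse enough: start a new cycle
                self.prev_peak = cur
                self.phase, self.t = "decline", 0

After the pretrain phase, such a callback could be passed to model.fit(..., callbacks=[AutoEnsembleScheduler()]); the collected checkpoints then feed the ensemble phase described below.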

Simple averaging averages the softmax outputs of the models, while weighted averaging gives each model's output its own weight. The weights are learned from a validation set, which is generated from the training set through data augmentation, etc.; the test set is then used to measure accuracy. Our algorithm automatically builds a small fully connected network with one layer to learn the weights. It is worth mentioning that the bias of this fully connected layer is fixed to zero. The weight is initialized as a one-dimensional vector of length T, where T is the number of collected checkpoints. When training the fully connected network, the training data are the models' softmax outputs arranged as a one-dimensional array per sample and the labels are the original labels. After training, the layer weights are used to combine the checkpoints.
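One possible realization of this weighted averaging in Keras is sketched below; the reduction to one scalar weight per member, the Adam optimizer and all function names are illustrative choices, and the paper's exact network may differ.

import numpy as np
import tensorflow as tf

def learn_ensemble_weights(val_probs, val_labels, epochs=200):
    """Fit one scalar weight per ensemble member with a bias-free Dense(1) layer.

    val_probs:  list of T arrays of shape (N_val, C), the softmax outputs of the
                T collected checkpoints on the validation set.
    val_labels: (N_val,) integer class labels.
    Returns the learned weight vector of length T.
    """
    T, C = len(val_probs), val_probs[0].shape[1]
    x = np.stack(val_probs, axis=1)                          # (N_val, T, C)

    inp = tf.keras.Input(shape=(T, C))
    perm = tf.keras.layers.Permute((2, 1))(inp)              # (N_val, C, T)
    mixed = tf.keras.layers.Dense(1, use_bias=False)(perm)   # weighted sum over the T members
    out = tf.keras.layers.Flatten()(mixed)                   # (N_val, C)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    model.fit(x, val_labels, epochs=epochs, verbose=0)
    return model.layers[2].get_weights()[0].ravel()          # (T,) member weights

def weighted_average_predict(test_probs, w):
    """H(x) = sum_t w_t * h_t(x): combine member softmax outputs with weights w."""
    return np.tensordot(np.stack(test_probs, axis=1), w, axes=([1], [0]))

Fixing the bias of the mixing layer to zero, as in the paper, corresponds to use_bias=False in the sketch, so the combined prediction is a pure weighted sum of the members' softmax outputs.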

The above method noticeably improves the ensemble accuracy and smooths out the uneven distribution of model accuracies.

4 Experiments

We demonstrate the effectiveness of Auto-Ensemble on different datasets and networks and compare our method with the related state of the art. All experiments are run with Keras.

4.1 Dataset

CIFAR

The CIFAR-10 and CIFAR-100 datasets are labeled subsets of the 80 Million Tiny Images dataset Krizhevsky and Hinton (2009). CIFAR-10 consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. We used the standard data augmentation scheme from the Keras documentation. The augmentation scheme used for VGG differs from the one used for ResNet and Wide ResNet.

Omniglot

The Omniglot dataset was collected by Brenden Lake and his collaborators at MIT via Amazon's Mechanical Turk to produce a standard benchmark for learning from few examples in the handwritten character recognition domain Lake et al. (2011). It consists of 1,623 handwritten characters with only 20 samples per class, divided into a training set of 964 classes and a test set of 659 classes.

4.2 Training Setting

4.2.1 Architecture

We test several classic neural networks, including residual networks (ResNet) He et al. (2016), Wide ResNet Zagoruyko and Komodakis (2016) and VGG16 Simonyan and Zisserman (2014). For ResNet we use the original 110-layer network; for Wide ResNet we use a 28-layer network with widening factor 10. We use the same standard data augmentation scheme on CIFAR-10 and CIFAR-100.

4.2.2 Hyperparameters

Our experiments used the following hyperparameters. For Wide ResNet, ResNet and VGG we set network-specific learning rate boundaries (lr_min, lr_max); Section 4.5.1 describes how the boundary is chosen. For few-shot learning the learning rate boundary is 0.005-0.03. For the distance measurement, the ratio threshold λ was set to 1.5 for VGG, Wide ResNet and ResNet, and to 2 for few-shot learning. ResNet, Wide ResNet and VGG use 110, 45 and 100 pretraining epochs, respectively; in few-shot learning, the pretraining phase contains 45,000 tasks. We discuss the effects of these parameters on the experimental results in the following sections.

4.3 Baseline and Comparison

4.3.1 Comparison in Model Collection

Auto-Ensemble has a unique learning rate scheduling method for collecting models. To illustrate its advantage, different learning rate schedulers are implemented. We collected checkpoint ensemble models (CE) and random-initialization ensemble models (RIE) with traditional learning rate schedulers Chen et al. (2017). Besides, the Cosine Cyclical Learning Rate scheduler (SSE or CCLR) Huang et al. (2017), Max-Min Cosine CLR (MMCCLR) Wen et al. (2019) and Triangular CLR (FGE) Garipov et al. (2018) are also used for comparison. The baseline model is a single independently trained network (Ind) using stepwise-decay SGD.

It is worth mentioning that, for a fair comparison with the state of the art, we reimplement the above methods using the same architectures and data augmentation, and all their parameters follow the original papers. For CE and RIE, we reimplement the methods on our datasets and models. For SSE, we fully reproduce the experiment according to its paper. We did not reproduce MMCCLR because its learning rate scheduler is similar to SSE's. For FGE, we only use its triangular CLR as described in its paper (its curve-finding experiments change the way the model's gradients are updated to some extent).

4.3.2 Comparison in Ensemble Method

After the ensemble models have been collected, there are several rules for combining them to make predictions. We reimplemented the Adaptive Ensemble (CIs) Inoue (2019), which automatically selects the number of models for the ensemble: we used the confidence-level based early-exit condition with a 95% confidence level for all datasets.

Our weighted averaging method is compared with the Super Learner (SL). We refer to Ju et al. (2018) and directly quote their experimental result. The results show that our weighted averaging method makes the ensemble more robust.

4.4 Auto-Ensemble on Supervised Learning

4.4.1 Ensemble Result

Method   | ResNet110 C10 | ResNet110 C100 | WRN-28-10 C10 | WRN-28-10 C100 | VGG16 C10     | VGG16 C100
Ind      | 94.18         | 73.21          | 94.28         | 74.74          | 92.92         | 69.35
RIE      | 94.51 (+0.33) | 74.36 (+1.15)  | 95.61 (+1.33) | 77.47 (+2.73)  | 93.70 (+0.78) | 70.29 (+0.94)
SSE      | 94.03 (+0)    | 71.59 (+0)     | 95.55 (+1.27) | 76.38 (+1.64)  | 93.19 (+0.27) | 70.28 (+0.93)
FGE      | 93.74 (+0)    | 72.43 (+0)     | 94.21 (+0)    | 76.12 (+1.38)  | 93.26 (+0.34) | 69.53 (+0.18)
CE       | 94.11 (+0)    | 73.92 (+0.71)  | 95.05 (+0.77) | 77.17 (+2.43)  | 92.80 (+0)    | 71.15 (+1.8)
CIs      | 94.37 (+0.19) | 74.46 (+1.25)  | 95.22 (+0.94) | 77.41 (+2.67)  | 93.63 (+0.71) | 70.56 (+1.21)
AE       | 94.87 (+0.69) | 76.55 (+3.34)  | 95.91 (+1.63) | 77.71 (+2.97)  | 93.61 (+0.69) | 71.29 (+1.94)
AE(full) | 95.05 (+0.87) | 77.18 (+3.97)  | 96.09 (+1.81) | 79.26 (+4.52)  | 93.93 (+1.01) | 72.16 (+2.81)
Table 1: Ensemble accuracies and accuracy improvements over the single model (%) on the CIFAR-10 and CIFAR-100 datasets

All results are summarized in Table 1. For each method, we use simple averaging to ensemble the models, and we list the accuracy improvement over the single model (Ind). For AE we show the weighted averaging result, obtained with a fully connected network trained on the validation set; we set its learning rate within the range 0.01-0.001 and select the value that maximizes ensemble accuracy.

Our Auto-Ensemble (AE) results are compared with SSE, FGE, CE, RIE, the Adaptive Ensemble based on Confidence Intervals (CIs) and independently trained networks (Ind). The best ensemble result in each setting of Table 1 is achieved by AE (AE(full)). The experiments show that in most cases the ensemble accuracies of Auto-Ensemble are better than those of the other methods, and the improvement over the single model is considerable. SSE, FGE and CE sometimes show no improvement because of poor diversity.

Ensemble method      | Ensemble Result (%) | Distribution of Collected Models
Unweighted Average   | 93.54               |
Super Learner        | 88.0                |
Table 2: ResNet110 Checkpoint Ensemble (CE) performance on CIFAR-10 for varying ensemble methods.

We also include several SL results to assess the effectiveness of our ensemble method; these numbers are directly taken from the original paper (Table 2). Clearly, our weighted averaging method builds on the idea of SL and achieves the best combination.

4.4.2 Diversity of Models

Figure 4: Correlation coefficients among models collected by different methods

To illustrate the higher diversity of our models, we calculate the correlation between the softmax outputs of each pair of models. Figure 4 shows the correlation coefficients among models collected by different methods: Figure 4(a) corresponds to Auto-Ensemble (AE) with the adaptive learning rate cycle, Figure 4(b) to Snapshot with cosine annealing cycles, and Figure 4(c) to RIE with independently trained networks. The correlations for Snapshot are higher than for the other two methods, indicating less diversity between models. Although the diversity of AE's models is not as high as RIE's, AE reduces the training budget compared with RIE and still collects models with sufficient diversity.

4.4.3 Training Budget

Our Auto-Ensemble has unfixed time and storage complexity: the storage complexity depends on the diversity of the collected models. As stated above, the hyperparameter λ adjusts the distance among models, so the number of collected models cannot be specified in advance because it is related to the value of λ. The training budget of AE is likewise adaptive: the distance among models is adjustable, and the farther apart the models are required to be, the more epochs are needed to collect each model.

These quantities are fixed for SSE and FGE: the number of epochs can be specified to collect the target models. In our experiments, the ensemble size of SSE is 5: we ensemble the last 5 snapshots of one training run. For CE the storage complexity is large: at every epoch we save a checkpoint for later ensembling. RIE requires running a single model separately several times; following Chen, Lundberg, and Lee Chen et al. (2017), we ensemble 5 models, so its time complexity is the largest. For the adaptive ensemble (CIs) we select among the RIE models, so its time and storage complexity are at most those of RIE.

Ensemble Method | Average Epochs per Ensemble Model | Ensemble Size
AE              | 62                                | 12
SSE             | 40-80                             | 5
FGE             | 23                                | 6
CE              | 1-3                               | 68-150
RIE             | 200                               | 5
CIs             | 200                               | 5
Ind             | 200                               | 1
Table 3: Computational expenses on ResNet with CIFAR-10

Table 3 shows the average number of epochs required to train a ResNet model on CIFAR-10. The comparison of computational expense is similar for the other models and datasets.

Above all, Table 3 summarizes the ordering of ensemble size and time budget across methods. It is worth mentioning that the average epoch count of SSE is not fixed: the cycle length is 40, but we collect the last 5 models for better convergence, so the average number of epochs per model exceeds 40 and reaches up to 80 (10 snapshots are collected and the last 5 are ensembled). For CE, we select the first M best models.

4.5 Study of Auto-Ensemble

4.5.1 Learning Rate Boundary

The learning rate boundary in the decline phase helps the model converge, whereas the learning rate is not bounded during the rise phase. We need to set the learning rate boundary and its change rate. The choice of the learning rates follows the method proposed by Smith (2015): let the learning rate increase linearly within a rough range and plot accuracy versus learning rate. Note the minimum learning rate at which training accuracy improves significantly, and the learning rate at which the accuracy slows, becomes ragged, or even starts to fall. This range is a good choice for the boundary: set the latter as the maximum learning rate lr_max and the former as the minimum learning rate lr_min.
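A sketch of this range test, assuming a tf.keras model compiled with metrics=["accuracy"]; the function name, defaults and the per-batch increase are illustrative choices rather than the paper's exact procedure.

import numpy as np
import tensorflow as tf

def lr_range_test(model, x_train, y_train, lr_lo=1e-4, lr_hi=1.0,
                  epochs=5, batch_size=128):
    """Smith-style LR range test: increase the learning rate linearly over
    training, record accuracy at each step, and pick (lr_min, lr_max) by
    inspecting the resulting curve."""
    steps = epochs * int(np.ceil(len(x_train) / batch_size))
    lrs = np.linspace(lr_lo, lr_hi, steps)
    history = []
    step = {"i": 0}

    class LinearLR(tf.keras.callbacks.Callback):
        def on_train_batch_begin(self, batch, logs=None):
            tf.keras.backend.set_value(
                self.model.optimizer.learning_rate, lrs[min(step["i"], steps - 1)])

        def on_train_batch_end(self, batch, logs=None):
            history.append((lrs[min(step["i"], steps - 1)],
                            logs.get("accuracy", 0.0)))
            step["i"] += 1

    model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size,
              callbacks=[LinearLR()], verbose=0)
    return history  # list of (learning_rate, accuracy) pairs to plot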

Figure 5: Accuracy versus learning rate in the learning rate range test
Learning Rate Boundary | Ensemble Result (%) | Distribution of Collected Models (%)
0.5-                   | 93                  |
0.4-0.01               | 93.68               |
0.6-0.01               | 93.08               |
Table 4: VGG16 Auto-Ensemble performance on CIFAR-10 for varying learning rate boundaries.

In Figure 5, the model starts converging right away, so it is reasonable to set lr_min to a small value such as 0.01; the accuracy rise becomes ragged after the learning rate increases to 0.4, so lr_max = 0.4. Inappropriate learning rate ranges hurt the accuracy of the collected models. Table 4 shows the accuracy of the collected VGG models on CIFAR-10 with different learning rate boundaries. Clearly, an appropriate learning rate boundary yields the best accuracy of the collected models; 0.4-0.01 is the best learning rate boundary for VGG.

The whole learning rate schedule is divided into three phases: the decline phase, rise phase 1 (the rapid rise phase) and rise phase 2 (the exploring phase). Each phase has its own change rate of the learning rate: in the decline phase the change rate is k1, determined by the boundary and the number of iterations n; in rise phase 1 it is k2 and in rise phase 2 it is k3, with k2 > k3, so that the model first escapes the current optimum quickly and then explores the loss surface more slowly.

4.5.2 Pretrain

Pretrain Epochs (rounds) | Best Model Accuracy (%) | Ensemble Result (%)
0                        | 92.08                   | 92.31
40                       | 92.75                   | 93
80                       | 92.49                   | 93.21
120                      | 92.71                   | 92.43
Table 5: VGG16 Auto-Ensemble performance on CIFAR-10 for varying pretrain epochs

It is noteworthy how the number of pretraining epochs affects model collection: if the pretrained model is underfitting, it becomes harder to collect the first converged checkpoint, which affects the subsequent ensemble; if we start from an overfitted model, the accuracy and diversity of the collected models also suffer. Table 5 shows the effect of different pretraining budgets on the ensemble accuracy of VGG models; 80 epochs is a suitable choice.

4.5.3 Parameter of Diversity

Value of λ | Ensemble Result (%) | Distribution of Collected Models (%)
1.2        | 93.16               |
1.4        | 93.02               |
1.6        | 93                  |
1.8        | 93.16               |
Table 6: Distribution of model accuracy with different values of λ

The learning rate stops increasing when d2/d1 ≥ λ. Table 6 shows the effect of different values of λ on the collection of VGG models. The experiments show that the model is not sensitive to the value of λ. However, if the required distance is too large, the model escapes too far to converge, and the loss and accuracy become unpredictable; we therefore keep λ small (at most 2 in our experiments).

4.5.4 Conditions for Stop Training

The stopping condition of the training is worth discussing, because the number of model checkpoints to be collected during training is unknown in advance.

Figure 6: Training process of ResNet and Wide ResNet models

Experiments show that it is easy to collect models for VGG and ResNet110: as long as training does not stop, models can be collected indefinitely, so the number of checkpoints can be used as the stopping condition. Figure 6(a) is the training curve of ResNet: the two nearly coincident curves are the loss and accuracy on the training and test sets. Collecting Wide ResNet and Siamese Network checkpoints is tougher. Figure 6(b) shows the Wide ResNet training process: as the number of training epochs grows, the learning rate has to rise to a very high level to escape the local optimum, and the accuracy of the collected checkpoints drops considerably. In this case, training can be stopped once the learning rate has risen beyond a certain value, since the checkpoints collected afterwards would not be accurate enough.

Inoue (2019) proposed an early-exit method based on the confidence level of the local prediction; we reimplemented it and found that the adaptive ensemble result is not as good as ensembling all models (RIE). The result is shown in Table 1.

4.6 Auto-Ensemble on Few-shot Learning

Our main contribution also includes the application to few-shot learning, which indicates that the AE method applies to some non-traditional supervised problems. Few-shot learning is not a new research topic, and many state-of-the-art methods exist. To verify the effectiveness of Auto-Ensemble, we choose the Siamese Network as a baseline because of its relatively poor performance. The experimental results show that the AE method brings a significant improvement.

We used the Siamese Neural Network Koch et al. (2015) but added a dense layer before the last layer, since the weights of this dense layer are used to measure the distance between models during training. The learning rate boundary is 0.005-0.03 and the distance threshold λ is set to 2. The pretraining phase contains 45,000 tasks. We experimented on the Omniglot dataset, split into training and test sets according to Vinyals et al. (2016).

Model                                    | 20-way 1-shot acc (%)
Matching Network Vinyals et al. (2016)   | 93.8
Siamese Network Koch et al. (2015)       | 88.0
Siamese Network with ensemble method     | 92.2
Table 7: Performance of Auto-Ensemble on Siamese Networks
Figure 7: The comparison of piecewise linear cyclic learning rate schedule and auto adaptive learning rate schedule.
Figure 8: Different situations of two learning rate schedules

Table 7 illustrates the performance of Auto-Ensemble: the accuracy obtained with weighted averaging (92.2%) is a large improvement over the original Siamese Network model and almost catches up with the more complex Matching Networks Vinyals et al. (2016) (93.8%). Our experiment collected 10 checkpoints of the model.

Figure 7 illustrates the two ways of collecting checkpoints of the model. With the cyclic learning rate schedule of 10,000 tasks per cycle, the accuracy of the model does not change much with the learning rate. Due to the instability of the model, it often gets stuck in saddle points during training, where it is difficult to converge; the cyclic learning rate may be ineffective when the model reaches a saddle point (Figure 8(a)). In Auto-Ensemble, however, the learning rate jitters to let the model jump out of the saddle point automatically (Figure 8(b)). The adaptive learning rate schedule can thus fully mine different checkpoints of the model.

5 Conclusions

This paper proposed an adaptive learning rate schedule for ensemble learning: by scheduling the learning rate, the model can converge to and then escape local optima. We focus on the improvement in performance rather than absolute performance: by collecting checkpoints of the model, the ensemble accuracy can greatly exceed that of a single model. We also proposed a method to measure the diversity among models, which guarantees the diversity of the collected models. In non-traditional supervised problems such as few-shot learning, this method can improve model performance simply and quickly. We compared our method with various related approaches and analyzed the results. To further verify the effectiveness of our method, more experiments on other networks are needed, e.g. DenseNet and the Matching Networks used for few-shot learning. In future work we will focus on optimizing Auto-Ensemble, in particular shortening the unpredictable training process: since the goal of training is to collect as many models as possible but the required time and resources cannot be predicted, the training process needs to be simplified to save computing resources.

References

  • Blum and Rivest (1992) Blum AL, Rivest RL (1992) Training a 3-node neural network is NP-complete. Neural Networks 5(1):117–127
  • Elsken et al. (2019) Elsken T, Metzen JH, Hutter F (2019) Neural architecture search: A survey. J Mach Learn Res 20:55:1–55:21, URL http://jmlr.org/papers/v20/18-598.html
  • Fang et al. (2019) Fang J, Chen Y, Zhang X, Zhang Q, Huang C, Meng G, Liu W, Wang X (2019) EAT-NAS: elastic architecture transfer for accelerating large-scale neural architecture search. CoRR abs/1901.05884, URL http://arxiv.org/abs/1901.05884, 1901.05884
  • Jin et al. (2019) Jin H, Song Q, Hu X (2019) Auto-keras: An efficient neural architecture search system. In: Teredesai A, Kumar V, Li Y, Rosales R, Terzi E, Karypis G (eds) Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, ACM, pp 1946–1956, DOI 10.1145/3292500.3330648, URL https://doi.org/10.1145/3292500.3330648
  • Zoph and Le (2017) Zoph B, Le QV (2017) Neural architecture search with reinforcement learning. In: DBL (2017), URL https://openreview.net/forum?id=r1Ue8Hcxg
  • Zhou (2012) Zhou ZH (2012) Ensemble methods: foundations and algorithms. Chapman and Hall/CRC
  • Weill et al. (2019) Weill C, Gonzalvo J, Kuznetsov V, Yang S, Yak S, Mazzawi H, Hotaj E, Jerfel G, Macko V, Adlam B, Mohri M, Cortes C (2019) Adanet: A scalable and flexible framework for automatically learning ensembles. 1905.00080
  • Bengio (2012) Bengio Y (2012) Practical recommendations for gradient-based training of deep architectures. In: Neural networks: Tricks of the trade, Springer, pp 437–478
  • Webb and Zheng (2004) Webb GI, Zheng Z (2004) Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques. IEEE Transactions on Knowledge and Data Engineering 16(8):980–991
  • Li et al. (2018) Li H, Xu Z, Taylor G, Studer C, Goldstein T (2018) Visualizing the loss landscape of neural nets. In: Advances in Neural Information Processing Systems, pp 6389–6399
  • Goodfellow and Vinyals (2015) Goodfellow IJ, Vinyals O (2015) Qualitatively characterizing neural network optimization problems. In: Bengio Y, LeCun Y (eds) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, URL http://arxiv.org/abs/1412.6544
  • Huang et al. (2017) Huang G, Li Y, Pleiss G, Liu Z, Hopcroft JE, Weinberger KQ (2017) Snapshot ensembles: Train 1, get M for free. In: DBL (2017), URL https://openreview.net/forum?id=BJYwwY9ll
  • Smith (2015) Smith LN (2015) Cyclical learning rates for training neural networks. Computer Science pp 464–472
  • Xie et al. (2013) Xie J, Xu B, Zhang C (2013) Horizontal and vertical ensemble with deep representation for classification. CoRR abs/1306.2759, URL http://arxiv.org/abs/1306.2759, 1306.2759
  • Chen et al. (2017) Chen H, Lundberg S, Lee S (2017) Checkpoint ensembles: Ensemble methods from a single training process. CoRR abs/1710.03282, URL http://arxiv.org/abs/1710.03282, 1710.03282
  • Loshchilov and Hutter (2017) Loshchilov I, Hutter F (2017) SGDR: Stochastic gradient descent with warm restarts. In: DBL (2017), URL https://openreview.net/forum?id=Skq89Scxx
  • Wen et al. (2019) Wen L, Gao L, Li X (2019) A new snapshot ensemble convolutional neural network for fault diagnosis. IEEE Access 7:32037–32047
  • Garipov et al. (2018) Garipov T, Izmailov P, Podoprikhin D, Vetrov DP, Wilson AG (2018) Loss surfaces, mode connectivity, and fast ensembling of dnns. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp 8789–8798, URL http://papers.nips.cc/paper/8095-loss-surfaces-mode-connectivity-and-fast-ensembling-of-dnns.pdf
  • Inoue (2019) Inoue H (2019) Adaptive ensemble prediction for deep neural networks based on confidence level. In: Chaudhuri K, Sugiyama M (eds) The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 16-18 April 2019, Naha, Okinawa, Japan, PMLR, Proceedings of Machine Learning Research, vol 89, pp 1284–1293, URL http://proceedings.mlr.press/v89/inoue19a.html
  • Ju et al. (2018) Ju C, Bibaut A, van der Laan M (2018) The relative performance of ensemble methods with deep convolutional neural networks for image classification. Journal of Applied Statistics 45(15):2800–2818
  • Krizhevsky and Hinton (2009) Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech Rep 1
  • Lake et al. (2011) Lake B, Salakhutdinov R, Gross J, Tenenbaum J (2011) One shot learning of simple visual concepts. In: Proceedings of the annual meeting of the cognitive science society, vol 33
  • He et al. (2016) He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • Zagoruyko and Komodakis (2016) Zagoruyko S, Komodakis N (2016) Wide residual networks. In: Wilson RC, Hancock ER, Smith WAP (eds) Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, BMVA Press, URL http://www.bmva.org/bmvc/2016/papers/paper087/index.html
  • Simonyan and Zisserman (2014) Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. Computer Science
  • Koch et al. (2015) Koch G, Zemel R, Salakhutdinov R (2015) Siamese neural networks for one-shot image recognition. In: ICML deep learning workshop, vol 2
  • Vinyals et al. (2016) Vinyals O, Blundell C, Lillicrap T, Wierstra D, et al. (2016) Matching networks for one shot learning. In: Advances in neural information processing systems, pp 3630–3638
  • DBL (2017) (2017) 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net, URL https://openreview.net/group?id=ICLR.cc/2017/conference