Recent breakthroughs in deep learning rely heavily on supervised learning (SL) with large annotated datasets [law2018cornernet, liu2018original]. In practical applications, however, collecting large numbers of labels is expensive and time-consuming [qu2018orienet], and the lack of labels is a major obstacle to adopting SL methods. To achieve accuracy similar to SL with fewer labels, (pool-based) active learning (AL) [lewis1994sequential] has become a possible solution. AL strategies have succeeded in many fields, such as image processing [zhou2017fine] and natural language processing (NLP) [tanguy2016natural].
The goal of active learning is to select the smallest possible number of typical samples and train the model to reach the same accuracy as one trained on all the samples. Most previous works select samples once after a complete training process on the existing labeled dataset. The core of an active learning method is therefore its sample selection strategy, called the acquisition function. In pool-based active learning, the selected samples are expected to be the most informative ones. In many works, the selected samples are the most uncertain ones. The basic ideas include using confidence, max-entropy [shannon1948mathematical], mutual information [houlsby2011bayesian], mean standard deviation [kampffmeyer2016semantic] or variation ratio [freeman1965elementary] of samples as a measurement. Recent AL works have adopted strategies based on Bayesian convolutional neural networks [gal2017deep] and generative adversarial nets (GAN) [zhu2017generative]. Although these networks work on different principles from typical classification convolutional neural networks (CNN), the methods still generate or choose the samples with the highest uncertainty. Another family of methods uses multiple outputs to stabilize the inputs of acquisition functions, but the idea behind these methods is still uncertainty-based. Another class of work selects samples by the expected model change. For instance, expected gradient length [settles2008analysis] chooses the samples expected to cause the largest gradients in the current model. After approximation of the algorithm, the selected samples are similar to adversarial examples [Goodfellow2014Explaining]. There are also works that concentrate on finding typical samples of the whole dataset. For example, core-set [sener2017active] chooses samples that lie at the center of a neighborhood and expects the selected samples to cover the whole feature space.
Present active learning methods differ in strategy and implementation, but we can classify all the methods mentioned above as spatial-based ones. That is, although different methods concentrate on different parts of the AL process (prediction, model updating, etc.), the information taken into account all comes from the predictions of well-trained models before selection. The whole process is flat, without information from the time course of training. Here we propose sequential-based methods and, as a verification, a new criterion for sample selection in AL called prediction stability, which describes the oscillation of predictions across epochs. Instead of starting from a well-trained model, we begin the selection process while the model is being trained. We assume that violent fluctuation of the prediction on a sample during training means that the fitting ability of the model is weak in the feature region of this sample. The results of our experiments agree with this assumption and show that the proposed method is effective.
The rest of this paper is organized as follows. The second and third sections present the relation to prior work and our methodology. The fourth section provides the experimental results, and the final section concludes the paper.
2 Relation to Prior Work
Compared with present AL algorithms, our proposed method has two major differences. First, our sequential-based method extracts features not only after training but also during the training process. Second, previously proposed measures of information are based on more apparent criteria, including uncertainty, influence on the model, and typicality of samples. They focus on the scale of features, whereas prediction stability is a new criterion that captures the indirect information in relative prediction changes.
We can define the dataset of all samples as , with representing the labeled set containing labels, and is the set of unlabeled samples. The budget of AL is defined as . For pool-based active learning, after initialization, in each round of AL, the model will select samples from for annotation and put the set of them into , then the model is retrained on the new set. In previous works, the acquisition functions can be concluded as (1). In this equation,
is the feature extracting function, andoutputs the scores of samples.
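The pool-based procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in this paper: `train_fn` and `score_fn` are hypothetical callables standing in for model training and for the acquisition function, and samples are represented only by their integer indices.

```python
from typing import Callable, List, Sequence, Set


def active_learning_loop(
    pool_size: int,
    initial_labeled: Sequence[int],
    budget_per_round: int,
    num_rounds: int,
    train_fn: Callable[[List[int]], object],
    score_fn: Callable[[object, List[int]], List[float]],
) -> List[int]:
    """Generic pool-based loop: train, score the unlabeled pool, label the top-k."""
    labeled: List[int] = list(initial_labeled)
    unlabeled: Set[int] = set(range(pool_size)) - set(labeled)
    for _ in range(num_rounds):
        model = train_fn(labeled)                # retrain on the current labeled set
        candidates = sorted(unlabeled)
        scores = score_fn(model, candidates)     # one acquisition score per candidate
        ranked = sorted(zip(scores, candidates), reverse=True)
        chosen = [idx for _, idx in ranked[:budget_per_round]]
        labeled.extend(chosen)                   # "annotate" and move to the labeled set
        unlabeled.difference_update(chosen)
    return labeled


# Toy usage: no real model, and the score of a sample is simply its index.
toy_result = active_learning_loop(
    pool_size=20,
    initial_labeled=[0, 1],
    budget_per_round=3,
    num_rounds=2,
    train_fn=lambda labeled: None,
    score_fn=lambda model, candidates: [float(i) for i in candidates],
)
```

A spatial-based method would compute `score_fn` from the final model only; a sequential-based method would additionally let `train_fn` record predictions at intermediate epochs and pass them to the scoring step.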
Past spatial-based methods concentrate on the quality of the final predictions, and their innovations all lie in how the final prediction is measured. In contrast, we propose sequential-based methods that make use of information gathered during training. Defining number of epochs in training as , and as the function in n-th epoch, the acquisition function can be rewritten as eq.2.
As an application of sequential-based methods, we propose prediction stability, a new criterion for selecting the subset in active learning. For implementation, we adopt a common CNN model as the feature extractor and classifier. An important distinction from former spatial-based methods is that this criterion focuses not on the actual scales of the feature vectors but on the fluctuation of those scales during training. As Fig.1 shows, over the whole training process, the features of samples like (a) tend to be relatively stable, whereas those of samples like (b) oscillate from beginning to end. An intuitive conjecture is that samples like Fig.1 (b) should be selected for labeling. For quantitative analysis, we test several commonly used measures of data fluctuation and choose the variance of the feature vectors across different epochs as the measure of prediction stability. The diagrams in Fig.1 also show that, due to under-fitting, predictions in the early epochs of training inevitably fluctuate severely. Therefore, only epochs in the later part of training should be included in the calculation. Our experiments show that the selected epochs actually lie in the over-fitting region, which is relatively stable. In addition, to limit the time complexity, only a few epochs are chosen in the end.
The definition of prediction stability can be written as eq.3:
Where is the length of predicted feature vector by , is the c-th element of feature vector , and is the set of index of selected epochs. The choice of is discussed in the experiment part. The whole framework is displayed in Algorithm1.
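As a rough sketch of the criterion described in words above (the variance of the feature vector across the selected epochs), the score of one sample can be computed as below. The function names and the dictionary layout are assumptions made for illustration. The sketch also assumes that the per-entry variances are summed over the entries of the feature vector and that the population variance is used; averaging over entries or using the sample variance would only rescale every score by the same constant and would not change the ranking.

```python
from statistics import pvariance
from typing import Dict, List, Sequence


def prediction_stability(
    outputs_by_epoch: Dict[int, Sequence[float]],
    selected_epochs: Sequence[int],
) -> float:
    """Sum over classes of the variance of each softmax entry across epochs."""
    vectors = [outputs_by_epoch[e] for e in selected_epochs]
    num_classes = len(vectors[0])
    return sum(
        pvariance([vec[c] for vec in vectors])  # variance of entry c over epochs
        for c in range(num_classes)
    )


def select_least_stable(
    pool_outputs: Dict[int, Dict[int, Sequence[float]]],
    selected_epochs: Sequence[int],
    k: int,
) -> List[int]:
    """Return the k sample ids whose predictions fluctuate the most."""
    scores = {
        sample_id: prediction_stability(outputs, selected_epochs)
        for sample_id, outputs in pool_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy pool: sample 0 is stable across epochs, sample 1 oscillates.
toy_pool = {
    0: {1: [0.9, 0.1], 2: [0.9, 0.1], 3: [0.9, 0.1]},
    1: {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.9, 0.1]},
}
most_unstable = select_least_stable(toy_pool, selected_epochs=[1, 2, 3], k=1)
```

In this toy pool the oscillating sample receives the larger score and is therefore the one proposed for labeling.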
4 Experimental Results
4.1 Implementation Details
CIFAR-10 and CIFAR-100 [krizhevsky2009learning] are used to evaluate our proposed method. The samples of both datasets are small image patches. Each dataset contains 50000 training samples and 10000 testing samples, distributed equally over all categories. The difference is that CIFAR-10 has only 10 classes, whereas CIFAR-100 contains 100 classes. Therefore, the number of samples per class in CIFAR-10 is 10 times that in CIFAR-100.
4.1.2 Architecture details
For feature extraction, we employ ResNet-18 [He_2016_CVPR], which is a relatively deep architecture and a popular choice among recent works on AL. This network mainly consists of a first convolution layer followed by 4 residual blocks. The implementation is based on an open source framework (https://github.com/bearpaw/pytorch-classification.git). The softmax output of the network, which is the final score vector over categories, is chosen as the feature vector in this work.
All the models in this work are run on an NVIDIA TITAN Xp GPU. During training, the batch size is 128, and each training process runs for 164 epochs. In our experiments, for each dataset, a subset containing 1000 samples is selected for the first training process. Since class imbalance in the initial labeled dataset may heavily influence the selection after the first training process, an equal number of samples is randomly selected from each class at the beginning. After each training process, 1000 samples are selected and labeled, and the final size of the labeled dataset is 10000. To reduce the influence of random factors and obtain objective results, we first generate 10 sets of labeled samples and run the first training process of every method on the same 10 sets. The final results are the average over the ten.
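The class-balanced initialization can be sketched as follows. The function name and the seeding scheme are assumptions made for illustration. With 1000 initial samples, this draws 100 samples per class on CIFAR-10 and 10 per class on CIFAR-100.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def balanced_initial_set(labels: Sequence[int], total: int, seed: int = 0) -> List[int]:
    """Draw the same number of sample indices at random from every class."""
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    per_class, remainder = divmod(total, len(by_class))
    if remainder:
        raise ValueError("total must be divisible by the number of classes")
    rng = random.Random(seed)  # a fixed seed makes each initial set reproducible
    chosen: List[int] = []
    for y in sorted(by_class):
        chosen.extend(rng.sample(by_class[y], per_class))
    return chosen


# Toy usage: 500 samples over 10 classes, 100 of them drawn for the initial set.
toy_labels = [i % 10 for i in range(500)]
toy_initial = balanced_initial_set(toy_labels, total=100, seed=1)
```

Calling the function with 10 different seeds would give 10 initial sets that can be shared by all compared methods.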
The results on CIFAR-10 are displayed in Fig.2. Because the output features are class probabilities, the entropy and least confidence measures can be calculated on the outputs directly. For the calculation of prediction stability, we select 5 epochs with an interval of 5.
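For reference, the two uncertainty baselines can be computed from a softmax vector as in the sketch below. It uses the standard textbook definitions of both measures; the function names are chosen for illustration only.

```python
import math
from typing import Sequence


def entropy_score(probs: Sequence[float]) -> float:
    """Shannon entropy of a softmax vector; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def least_confidence_score(probs: Sequence[float]) -> float:
    """One minus the top class probability; higher means more uncertain."""
    return 1.0 - max(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain prediction
peaked = [1.0, 0.0, 0.0, 0.0]        # fully confident prediction
```

Both scores depend only on the final prediction of the trained model, which is what makes them spatial-based in the terminology of this paper.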
The results show that although information about the values of the outputs is not included directly, the proposed prediction stability method still outperforms random selection and performs similarly to acquisition functions such as entropy and least confidence on CIFAR-10.
The performance of each method on CIFAR-100 is shown in Fig.3. To apply prediction stability to CIFAR-100, the interval of epoch selection is set to 1. Previous works rarely report results on this dataset, and our results on CIFAR-100 show a completely different tendency from those on CIFAR-10. Entropy and least confidence, especially least confidence, suffer a deterioration in performance: the accuracy of both acquisition functions is lower than that of random selection. In contrast, our proposed method performs better and clearly outperforms random selection.
We believe the better performance on CIFAR-100 than on CIFAR-10 is caused by the number of samples in the feature space. The major difference between the two datasets is that CIFAR-100 has fewer samples in each class, which means the feature space of each class is sparser and has fewer labels with which to delineate the border. The results on the two datasets indicate that prediction stability is better suited to datasets with fewer labels.
4.4 Ablation Study
4.4.1 Measure of prediction stability
Experiments are conducted to test the performance of different measures of prediction stability, as displayed in Fig.4. We test the absolute increase among the features of different epochs. This measure is given by eq.5.
The results show that absolute increase leads to a nearly 40% drop in performance. We take this to mean that it is not the tendency but the distribution of the output that determines the performance of prediction stability. We also test variance as the acquisition function while leaving the output features untransformed by a softmax layer. A clear deterioration of the results can again be observed. We believe this is caused by the normalizing function of the softmax layer: the output features of different samples are transformed into comparable probabilities, and therefore differences in the absolute scales of the output features do not influence the variances.
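One effect of softmax normalization that can be checked directly is its invariance to a common offset in the raw outputs. The toy example below is only an illustration of that property, with made-up numbers; it is not taken from our experiments and may not be the only mechanism at work.

```python
import math
from statistics import pvariance
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax (subtracts the max logit before exponentiating)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def summed_variance(vectors: Sequence[Sequence[float]]) -> float:
    """Sum over entries of the variance of each entry across epochs."""
    return sum(pvariance(column) for column in zip(*vectors))


# Hypothetical raw outputs of one sample at three epochs: the gaps between
# classes are identical, only a common offset drifts from epoch to epoch.
raw_outputs = [[2.0, 1.0, 0.0], [12.0, 11.0, 10.0], [7.0, 6.0, 5.0]]

raw_score = summed_variance(raw_outputs)
softmax_score = summed_variance([softmax(v) for v in raw_outputs])
```

Here the variance of the raw outputs is large even though the predicted class distribution never changes, whereas the variance of the softmax outputs is zero.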
4.4.2 Interval of epoch selection
Experiments are conducted to test the influence of epoch selection on the results of prediction stability. The epoch selection process is based on eq.4. The results on the two datasets differ, as shown in Fig.5. Although accuracy tends to be best when the interval is 5, CIFAR-10 is not sensitive to changes in the interval. On CIFAR-100, however, the accuracy declines as the interval between epochs increases. This may happen because models over-fit on CIFAR-100 later than on CIFAR-10. When the interval is 10, the outputs of some of the selected epochs on CIFAR-100 are still not stable enough, which causes the decrease in accuracy.
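To make the role of the interval concrete, the sketch below generates a set of selected epochs. It assumes that epochs are counted backwards from the final training epoch at a fixed interval; this is an illustrative assumption and the function name is hypothetical. Under this assumption, a larger interval reaches further back into earlier and less stable epochs.

```python
from typing import List


def select_epochs(total_epochs: int, num_selected: int, interval: int) -> List[int]:
    """Pick num_selected epochs ending at the last one, spaced by interval."""
    first = total_epochs - (num_selected - 1) * interval
    if first < 1:
        raise ValueError("not enough epochs for this interval")
    return list(range(first, total_epochs + 1, interval))


# 164 training epochs and 5 selected epochs, as in the implementation details.
epochs_interval_5 = select_epochs(164, 5, 5)     # [144, 149, 154, 159, 164]
epochs_interval_1 = select_epochs(164, 5, 1)     # [160, 161, 162, 163, 164]
epochs_interval_10 = select_epochs(164, 5, 10)   # [124, 134, 144, 154, 164]
```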
In this paper, we propose a new class of AL methods named sequential-based AL methods. A new criterion, prediction stability, is proposed as an application of the sequential-based approach. Test results of prediction stability on CIFAR-10 and CIFAR-100 demonstrate the feasibility of this class of methods. In future work, we will focus on fusing our proposed method with uncertainty-based AL methods, because the information extracted by the two kinds of methods is complementary.