ImageNet is one of the most influential benchmarks that have helped the progress of machine learning research. Tremendous progress has been made over the past decade in terms of model accuracy on the ImageNet dataset. While improving ImageNet accuracy has been the major focus in the past, little effort has been devoted to studying resource efficiency in ImageNet supervised learning. Most existing supervised learning methods assume the data is i.i.d., static and pre-existing. They train models for multiple epochs, i.e., multiple passes over the whole dataset. Specifically, a top-performing Residual Neural Network He2016DeepRL is trained for 90 epochs; the model needs to review each of the 1.2M examples 90 times. One natural question to ask is whether it is necessary to train a model with so many passes over the whole data.
We are also motivated by the fact that real world data often arrives hourly or daily in a stream and at a much larger scale. Maintaining all the data in storage can be expensive and is probably unnecessary. Additionally, real world data sometimes contains private information from human users, which further restricts the possibility of saving all the data into a separate storage. Without a pre-recorded dataset, the popular multi-epoch training method becomes impractical for these real world scenarios.
We propose the One Pass ImageNet (OPIN) problem to study the resource efficiency of deep learning in a streaming setting with constrained data storage, where space complexity is considered an important evaluation metric. The goal is to develop a system that trains a model where each example is passed to the system only once. There is a small memory budget but no restriction on how the system utilizes its memory; it could store and revisit some examples but not all. Unlike the task-incremental continual learning setting masana2021classincremental; taskonomycl, the One-Pass ImageNet problem does not have a special ordering of the data, nor is there a specific distribution shift as in task-free continual learning cai2021online; aljundi2019taskfree; DBLP:journals/corr/abs-2106-12772. That is, the data arrives in a fixed uniform random order. We use ResNet-50 He2016DeepRL in all our experiments, and leave the choice of architecture as future work.
We observe that training a ResNet-50 He2016DeepRL in a single pass leads to only 30.6% top-1 accuracy on the validation set, a significant drop from the 76.9% top-1 accuracy obtained from the common 90-epoch training (top-1 accuracy is obtained with 1-crop evaluation on ImageNet). Inspired by the effectiveness of memory-based continual learning taskonomycl; buzzega2020rethinking; rolnick2019experience; bangKYHC21, we propose an error-prioritized replay (EPR) method for One Pass ImageNet. The proposed approach utilizes a priority function based on predictive error. Results show that EPR achieves 65.0% top-1 accuracy, improving over naive one-pass training by 34.4 percentage points. Although its accuracy is still 11.9 percentage points lower than that of multi-epoch training, EPR shows superior resource efficiency, reducing total gradient update steps by 90% and total required data storage by 90%.
We believe OPIN is an important first step that allows us to understand how existing techniques can train models in terms of computation and storage efficiency, although it may not be the most realistic example for the data streaming setting. We hope our results could inspire future research on large scale benchmarks and novel algorithms on resource-efficient supervised learning.
2 One Pass ImageNet
The One Pass ImageNet problem assumes that examples are sent in mini-batches and do not repeat. The training procedure ends when the whole dataset has been revealed. No restriction is applied on how the trainer utilizes its own memory, so a memory buffer that records past examples is allowed. However, the amount of data storage is a major evaluation metric, which we refer to as space efficiency.
We perform our study on a commonly used ImageNet solution: A ResNet-50 He2016DeepRL trained over 90 epochs with cosine learning rate and augmented examples. We refer to this method as Multi-epoch throughout the paper. The images are preprocessed by resizing to and then performing augmentation into a size of . During training, the augmentation includes random horizontal flipping and random cropping. At test time, only center cropping is applied to the images.
| Method | Accuracy (%) | Storage (%) | Compute (%) |
|---|---|---|---|
| Multi-epoch (90 epochs) | 76.9 | 100 | 100 |
| One-Pass (Prioritized Replay) | 65.0 | 10 | 10 |
Evaluation metrics. While the standard ImageNet benchmark focuses on a model’s overall accuracy, the One-Pass ImageNet problem aims at studying learning capability under constrained space and computation, which makes it essentially a multi-objective problem. We propose to evaluate training methods using three major metrics: (1) accuracy, represented by the top-1 accuracy on the test set, (2) space, represented by the total additional data storage needed, and (3) compute, represented by the total number of global steps for back-propagation. The space and compute metrics are calculated relative to the multi-epoch training method, i.e., both metrics for the Multi-epoch method are 100%. The Multi-epoch method needs to save all the data into storage, so the space metric is measured by the size of the data storage divided by the size of the dataset. The Multi-epoch method needs to train a model for 90 epochs (or 100M global steps), so the compute metric is measured by the total number of back-propagation operations divided by 100M.
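As a rough illustration of how the two relative metrics can be computed, the following Python sketch divides a method's stored examples and back-propagation steps by those of the Multi-epoch reference. The function name, the dataset size and the number of steps per epoch are hypothetical placeholders chosen for readability, not values taken from the experiments.

```python
def relative_metrics(stored_examples, dataset_examples,
                     backprop_steps, multi_epoch_steps):
    """Return (space %, compute %) relative to the Multi-epoch reference."""
    space = 100.0 * stored_examples / dataset_examples
    compute = 100.0 * backprop_steps / multi_epoch_steps
    return space, compute


# Hypothetical numbers: 1,000,000 examples and 10,000 steps per epoch.
steps_per_epoch = 10_000
space, compute = relative_metrics(
    stored_examples=100_000,                 # buffer holds 10% of the data
    dataset_examples=1_000_000,
    backprop_steps=9 * steps_per_epoch,      # 9 effective epochs
    multi_epoch_steps=90 * steps_per_epoch,  # 90-epoch reference
)
print(f"space={space:.1f}%  compute={compute:.1f}%")
```

Under these assumed numbers, a method that stores a tenth of the data and performs a tenth of the updates scores 10% on both metrics.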
Naive baseline. A simple baseline method for the One Pass problem is to train a model with the same training configuration that multi-epoch training uses but with only a single epoch, which we call Naive One-Pass. Since each example is seen only once, we replace the random augmentation with center cropping (which is used in the model evaluation) in this Naive baseline. Table 1 shows a comparison between multi-epoch training and naive one-pass training measured in three metrics: accuracy, space and compute. All metrics are in percentage. While Naive One-Pass is significantly worse than multi-epoch in terms of accuracy, its space and compute efficiency are both significantly higher. Naive One-Pass does not need to save any data examples into memory, and because it trains for only one epoch, its total number of training steps is 1/90 of that of multi-epoch training.
Problem Characteristics. We list four properties of the OPIN problem below:
The cold-start problem: The model starts from random initialization, so representation learning becomes challenging in OPIN, especially during the early stage of training.
The forgetting problem: Each example is passed to the model only once. Even though the data is i.i.d., vanilla supervised learning is likely to incur forgetting of early examples.
A natural ordering of data: No artificial order of the data is enforced. So the data can be seen as i.i.d., which is different from many existing continual learning benchmarks.
Multiple objectives: The methods are evaluated using three metrics (accuracy, space and compute), so the goal is to improve all three metrics in a single training method.
3 A Prioritized Replay Baseline
Memory replay is a common approach in continual learning rolnick2019experience. Existing works have shown that memory replay is effective in sequential learning Hsu18_EvalCL; taskonomycl. As our first investigation into the One Pass ImageNet problem, we study how a replay buffer could improve the overall performance in OPIN.
3.1 Replay buffer
A replay buffer is an extra memory that explicitly stores data, and it usually has a very limited size. At each training step, the received mini-batch of examples is inserted into this replay memory. Since the buffer size is smaller than the whole dataset, the typical solution is to apply the reservoir sampling strategy reservoirsampling, where each example is inserted with probability $m/n$, where $m$ is the memory size and $n$ the total number of seen examples. In order to favor fresh examples (similar ideas on encouraging recent examples in reservoir sampling can be found in biased-reservoir; osborne-etal-2014-exponential), we incorporate a factor to the inclusion probability, i.e., (we choose ), so that more recent examples will be included in the memory.
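The insertion rule can be sketched in Python as follows. Standard reservoir sampling keeps an incoming example with probability equal to the memory size divided by the number of examples seen so far. The exact form and value of the freshness factor are not preserved in the text, so this sketch assumes a multiplicative factor (the hypothetical `freshness` argument, with an arbitrary illustrative value of 1.5) and caps the probability at 1; the class and attribute names are also illustrative.

```python
import random


class ReservoirBuffer:
    """Fixed-size buffer filled by reservoir sampling with a freshness factor."""

    def __init__(self, capacity, freshness=1.0, seed=0):
        self.capacity = capacity      # memory size
        self.freshness = freshness    # assumed multiplicative factor (hypothetical)
        self.seen = 0                 # total number of seen examples
        self.items = []
        self.rng = random.Random(seed)

    def insert(self, example):
        """Offer one example to the buffer; return True if it was stored."""
        self.seen += 1
        if len(self.items) < self.capacity:
            # Buffer not yet full: always keep the example.
            self.items.append(example)
            return True
        # Inclusion probability capacity / seen, scaled by the factor, capped at 1.
        p = min(1.0, self.freshness * self.capacity / self.seen)
        if self.rng.random() < p:
            # Evict a uniformly chosen slot to make room.
            slot = self.rng.randrange(self.capacity)
            self.items[slot] = example
            return True
        return False


buffer = ReservoirBuffer(capacity=100, freshness=1.5)
for example_id in range(10_000):
    buffer.insert(example_id)
print(len(buffer.items), min(buffer.items), max(buffer.items))
```

With `freshness=1.0` this reduces to plain reservoir sampling; values above 1 make later examples more likely to displace earlier ones.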
At each training step, extra examples are sampled from the replay buffer. These examples are trained by the model together with the incoming mini-batch. Existing research on continual learning has shown that uniform sampling from a replay buffer is effective in many cases taskonomycl. However, a common intuition is that examples are not equally important when being replayed. The idea of prioritized experience replay schaul2016prioritized is to add a priority score to each example and sample from the buffer according to the probability distribution normalized from the priority scores.
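A minimal Python sketch of this sampling step is shown below: priority scores are normalized into a probability distribution, and buffer indices are drawn from it. The function name and the example scores are hypothetical, and sampling with replacement is an assumption made for simplicity.

```python
import random


def sample_prioritized(priorities, num_samples, rng=None):
    """Sample buffer indices with probability proportional to priority."""
    rng = rng or random.Random()
    total = sum(priorities)
    probs = [p / total for p in priorities]
    # random.choices draws with replacement according to the given weights.
    indices = rng.choices(range(len(priorities)), weights=probs, k=num_samples)
    return indices, probs


priorities = [0.1, 2.0, 0.5, 1.4]   # hypothetical priority scores
indices, probs = sample_prioritized(priorities, num_samples=8,
                                    rng=random.Random(0))
print(indices)
print([round(p, 3) for p in probs])
```

Setting all priorities to the same value recovers uniform replay.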
In order to apply data augmentation to replay examples, we save JPEG-encoded image bytes into the replay buffer instead of image tensors, which turns out to be more space efficient (3x more examples can be saved under the same memory budget). We found that replaying multiple examples at each step could dramatically improve the model accuracy, although with a trade-off in compute. At each step, when we receive one mini-batch of images, we replay $k$ mini-batches from the replay buffer, which effectively leads to $k+1$ epochs of compute.
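The interaction between the incoming stream and replay can be sketched as the loop below, which counts how many updates are performed in one pass. This is a simplified illustration rather than the training code used in the experiments: `train_step` is a hypothetical callback standing in for one back-propagation update, the buffer stores whole mini-batches, replay is uniform, and a full buffer is overwritten at a random slot.

```python
import random


def one_pass_training(stream, replay_steps, buffer_capacity, train_step, seed=0):
    """Run one pass over `stream`, replaying `replay_steps` mini-batches per step.

    Returns the total number of calls made to `train_step`.
    """
    rng = random.Random(seed)
    buffer = []
    updates = 0
    for batch in stream:
        # Insert the incoming mini-batch (simplified: random overwrite when full).
        if len(buffer) < buffer_capacity:
            buffer.append(batch)
        else:
            buffer[rng.randrange(buffer_capacity)] = batch
        # One update on the new data, which is seen exactly once.
        train_step(batch)
        updates += 1
        # Extra updates on mini-batches drawn from the buffer.
        for _ in range(replay_steps):
            train_step(rng.choice(buffer))
            updates += 1
    return updates


stream = [[i] for i in range(1_000)]   # 1,000 toy mini-batches
updates = one_pass_training(stream, replay_steps=8, buffer_capacity=100,
                            train_step=lambda batch: None)
print(updates / len(stream))           # updates per incoming mini-batch
```

The ratio of updates to incoming mini-batches is the number of replay steps plus one, which is why replay raises the compute cost even though each new example is streamed only once.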
3.2 Priority function
We study a prioritized replay method that uses the predictive error (loss value) as the priority function, which we call Error-Prioritized Replay (EPR). The priority function is defined as
where is the loss value given input example , ground truth label and model parameter . The smoothing factor varies from 0 to 1 such that where is the current global step and is the maximum global step. We choose a smoothed
because the model’s prediction is not trustworthy at the early stage of training. When the loss function is cross-entropy, it can also be shown that the priority value , which is made of the model’s confidence on the ground truth label.
3.3 Importance weight
The examples sampled from the prioritized replay buffer change the data distribution. To simplify the notation, we omit the label from the equations in this section. Let $p(x)$ be the distribution of the original data and $q(x)$ be the distribution in the replay buffer. The original objective is $\min_\theta \mathbb{E}_{x \sim p}[\ell(x; \theta)]$. Supposing $k$ mini-batches are sampled from the replay buffer at each step, directly combining replay examples and current examples will lead to a minimization of $\mathbb{E}_{x \sim p}[\ell(x; \theta)] + k\,\mathbb{E}_{x \sim q}[\ell(x; \theta)]$. In order to correct the distribution shift, we apply an importance weight $w(x) = p(x)/q(x)$ to each replay example because $\mathbb{E}_{x \sim q}[w(x)\,\ell(x; \theta)] = \mathbb{E}_{x \sim p}[\ell(x; \theta)]$. Given that the original distribution can be assumed uniform ($p(x) \propto 1$), the importance weight of each replay example is $w(x) \propto 1/q(x)$, inversely proportional to its priority value. The weights of each mini-batch are normalized to mean 1.
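The weighting described above, with weights inversely proportional to priority and normalized to mean 1 within each mini-batch, can be sketched in Python as follows. The function names, the `eps` guard against zero priorities and the example values are assumptions introduced for illustration.

```python
def importance_weights(priorities, eps=1e-8):
    """Weights inversely proportional to priority, normalized to mean 1."""
    raw = [1.0 / max(p, eps) for p in priorities]   # eps guards zero priority
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def weighted_batch_loss(losses, weights):
    """Average of per-example losses after importance weighting."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


priorities = [0.5, 1.0, 2.0, 4.0]    # hypothetical priorities of replayed examples
losses = [0.4, 0.9, 1.7, 3.2]        # hypothetical per-example losses
weights = importance_weights(priorities)
print([round(w, 3) for w in weights])
print(round(weighted_batch_loss(losses, weights), 3))
```

Examples that are replayed more often because of a high priority receive a proportionally smaller weight, so they do not dominate the gradient.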
4.1 Experimental setup
The model in our study is a Residual Neural Network He2016DeepRL with 50 layers (a.k.a. ResNet-50). We use cosine learning rate decay for all experiments, with an initial learning rate of 0.1. The model is optimized using stochastic gradient descent with Nesterov momentum 0.9. The batch size is 128. We evaluate different approaches with 10 different random orders of the data.
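For reference, a common form of cosine learning rate decay is sketched below in Python. The initial learning rate of 0.1 comes from the setup above; the minimum learning rate of 0, the absence of warm-up and the schedule length used in the example are assumptions, since the text does not specify them.

```python
import math


def cosine_learning_rate(step, total_steps, initial_lr=0.1, min_lr=0.0):
    """Cosine decay from initial_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


total_steps = 10_000   # hypothetical schedule length
for step in (0, 2_500, 5_000, 7_500, 10_000):
    print(step, round(cosine_learning_rate(step, total_steps), 4))
```

Changing `total_steps` stretches or shortens the schedule so that the minimum is reached at the end of training.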
| Effective epochs | Computation | Storage 1% (Prioritized Replay) | Storage 5% (Prioritized Replay) | Storage 10% (Prioritized Replay) | Multi-epoch (100% storage) |
|---|---|---|---|---|---|
The results are shown in Table 2. We performed 12 experiments with the number of replay steps being 1, 3, 5 or 8 and the size of the replay buffer being 1%, 5% or 10% of the dataset. The number of effective epochs is the number of replay steps plus 1, because upon receiving each new mini-batch, $k$ mini-batches of the same size are sampled from the replay buffer, where $k$ is the number of replay steps. So effectively, $k+1$ back-propagation operations are performed when the model sees a new mini-batch. In addition to the One-Pass solutions, we also show the results of multi-epoch training with the same number of effective epochs. The learning rate schedule is adjusted accordingly to decay to its minimum at the end of the corresponding epoch. We observe multiple trends from this table:
Prioritized replay with 10% memory size achieves performance very close to the multi-epoch training method under the same computational cost, even though the multi-epoch method utilizes the full dataset, which requires a large data storage. Although it is unknown whether multi-epoch training gives a performance upper bound, we believe it is a strong target performance to reference.
1% data storage gives a strong starting point for prioritized replay. Having a 1% data storage (equivalent to 100 mini-batches) dramatically improves the naive One-Pass performance by 28.7%. According to the table, the accuracy increase from 1% to 10% storage is 1.0% for 2 epochs, 1.7% for 4 epochs, 3.3% for 6 epochs and 5.7% for 9 epochs.
As the buffer size grows, the accuracy gain is larger when the number of replay steps is larger. From 5% size to 10% size, the model accuracy increases by 0.6% and 0.1% for replay steps 1 and 3 respectively, while it increases by 0.9% for replay step 5. The model accuracy saturates quickly if one increases only the storage size or only the number of replay steps. Increasing both could potentially yield a much bigger accuracy boost.
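The effective-epoch accounting behind Table 2 can be sketched in Python as follows, using the replay step counts listed in the experiments and the 90-epoch Multi-epoch reference. The function and constant names are hypothetical.

```python
MULTI_EPOCH_REFERENCE = 90   # epochs used by the Multi-epoch method


def effective_epochs(replay_steps):
    """One update on the new mini-batch plus one per replayed mini-batch."""
    return replay_steps + 1


def compute_percent(replay_steps, reference_epochs=MULTI_EPOCH_REFERENCE):
    """Compute metric relative to the 90-epoch reference, in percent."""
    return 100.0 * effective_epochs(replay_steps) / reference_epochs


for replay_steps in (1, 3, 5, 8):
    print(replay_steps, effective_epochs(replay_steps),
          round(compute_percent(replay_steps), 1))
```

With 8 replay steps, the 9 effective epochs correspond to 10% of the compute of 90-epoch training.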
Priority function. As recent literature taskonomycl; chaudhry2020using suggests, we also observe competitive results using vanilla replay (with a uniformly sampled memory) in the One-Pass ImageNet problem. Specifically, a uniform memory leads to 64.7% top-1 accuracy with 8 replay steps and 10% storage, slightly worse than the prioritized replay method (65.0%). The standard errors of the mean for both are 0.07%. Although the improvement is small, we believe the potential of prioritized replay is not fully explored in this problem, and we leave the study of a better designed prioritization as future research.
Importance weights. To study the effectiveness of importance weights applied to the loss function, we evaluate prioritized replay without importance weights at 1% storage. With 5 replay steps (6 effective epochs), omitting the importance weights results in 58.0% top-1 accuracy, 0.9% lower than with importance weights. With 8 replay steps, omitting the importance weights results in 58.3% accuracy, 1.0% lower than with importance weights. This accuracy gap arises because the distribution of examples in the replay buffer differs from that of the evaluation data set.
Priority updating. An interesting and somewhat surprising result we obtained is that updating the priorities of the examples retrieved from the replay buffer does not lead to better accuracy. An experiment using 5 replay steps and 1% storage shows that, after an example is replayed from the buffer, immediately updating its priority in the replay buffer using the latest model parameters results in a top-1 accuracy of 58.8% while we can achieve 58.9% without updating.
5 Related Work
The OPIN problem is related to continual learning DBLP:journals/corr/abs-2106-12772; yin2020sola; borsos2020coresets; isele2018selective; van2020brain; chaudhry2018efficient; rusu2016progressive and incremental learning Castro_2018_ECCV; Hsu18_EvalCL; kirkpatrick2017overcoming; farajtabar2020orthogonal. Recent efforts have focused on benchmarks built upon large scale datasets such as ImageNet wu2019large and Taskonomy taskonomycl. Class-incremental ImageNet wu2019large is probably the closest to our benchmark in continual learning. It splits ImageNet into multiple tasks, where each task consists of examples labeled with a certain subset of classes. The data for each task is given to the model all at once, so the model is able to make multiple passes over the data in the same task. OPIN differs in that it has no task concept and the data follows a natural order instead of a manually designed class-incremental ordering.
The OPIN problem is also related to stream data learning journals/sigkdd/GomesRBBG19; Souza2020ChallengesIB; dieuleveut2016nonparametric and online learning 10.1109/TIT.2004.833339; 6842642; onlinesgd. While there have been many benchmarks utilizing real world streaming data Souza2020ChallengesIB, they often have a smaller scale, less complex features and a limited number of classes. Their solutions are often limited to linear or convex models. The One-Pass ImageNet problem could also contribute to the research of stream data learning from the perspective of deep neural architectures and large scale high dimensional data.
We presented the One-Pass ImageNet (OPIN) problem, which aims at studying not only the predictive accuracy but also the space and compute efficiency of a supervised learning algorithm. The problem is motivated by real world applications that come with a stream of extremely large scale data, which makes it impractical to save all the data into a dedicated storage and iterate over the data for many epochs. We proposed an Error-Prioritized Replay baseline for this problem which achieves 65.0% top-1 accuracy while reducing the required data storage by 90% and the total gradient update compute by 90%, compared against the popular 90-epoch training procedure on ImageNet. We hope OPIN could inspire future research on improving resource efficiency in supervised learning.
The authors would like to thank Razvan Pascanu, Murray Shanahan, Eren Sezener, Jianan Wang, Talfan Evans, Sam Smith, Soham De, Yazhe Li, Amal Rannen-Triki, Sven Gowal, Dong Yin, Arslan Chaudhry, Mehrdad Farajtabar, Timothy Nguyen, Nevena Lazic, Iman Mirzadeh, Timothy Mann and John Maggs for their meaningful discussions and contributions.