1 Introduction
ImageNet [imagenet_cvpr09] is one of the most influential benchmarks driving the progress of machine learning research. Tremendous progress has been made over the past decade in terms of model accuracy on the ImageNet dataset. While improving ImageNet accuracy has been the major focus of the past, little effort has been made to study resource efficiency in ImageNet supervised learning. Most existing supervised learning methods assume the data is i.i.d., static, and preexisting. They train models with multiple epochs, i.e., multiple passes over the whole dataset. Specifically, a top-performing Residual Neural Network [He2016DeepRL] is trained for 90 epochs; the model needs to review each of the 1.2M examples 90 times. One natural question to ask is whether it is necessary to train a model with so many passes over the whole data. We are also motivated by the fact that real-world data often arrives hourly or daily in a stream, and at a much larger scale. Maintaining all the data in storage can be expensive and is probably unnecessary. Additionally, real-world data sometimes contains private information from human users, which further restricts the possibility of saving all the data into separate storage. Without a prerecorded dataset, the popular multi-epoch training method becomes impractical in these real-world scenarios.
We propose the One Pass ImageNet (OPIN) problem to study the resource efficiency of deep learning in a streaming setting with constrained data storage, where space complexity is considered an important evaluation metric. The goal is to develop a system that trains a model where each example is passed to the system only once. There is a small memory budget but no restriction on how the system utilizes its memory; it may store and revisit some examples, but not all. Unlike the task-incremental continual learning setting [masana2021classincremental; taskonomycl], the One Pass ImageNet problem does not impose a special ordering of the data, nor is there a specific distribution shift as in task-free continual learning [cai2021online; aljundi2019taskfree; DBLP:journals/corr/abs210612772]. That is, the data comes in a fixed uniform random order. We use ResNet-50 [He2016DeepRL] in all our experiments and leave the choice of architecture as future work.

We observe that training a ResNet-50 in a single pass yields only 30.6% top-1 accuracy on the validation set, a significant drop from the 76.9% top-1 accuracy obtained with the common 90-epoch training (top-1 accuracy is obtained with 1-crop evaluation on ImageNet). Inspired by the effectiveness of memory-based continual learning [taskonomycl; buzzega2020rethinking; rolnick2019experience; bangKYHC21], we propose an error-prioritized replay (EPR) method for One Pass ImageNet. The proposed approach utilizes a priority function based on predictive error. Results show that EPR achieves 65.0% top-1 accuracy, improving over naive one-pass training by 34.4%. Although it still trails multi-epoch training by 11.9% in accuracy, EPR shows superior resource efficiency, reducing total gradient update steps by 90% and total required data storage by 90%.
We believe OPIN is an important first step toward understanding how existing techniques train models in terms of computation and storage efficiency, although it may not be the most realistic instantiation of the data streaming setting. We hope our results can inspire future research on large-scale benchmarks and novel algorithms for resource-efficient supervised learning.
2 One Pass ImageNet
The One Pass ImageNet problem assumes that examples arrive in minibatches and do not repeat. The training procedure ends when the whole dataset has been revealed. No restriction is applied to how the trainer utilizes its own memory, so a memory buffer that records past examples is allowed. However, the amount of data storage is a major evaluation metric, considered as space efficiency.
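As a minimal illustration of this protocol (not the authors' implementation; `train_step` and the buffer policy here are placeholders), a trainer consumes each minibatch exactly once and may retain only a bounded amount of past data:

```python
from collections import deque

def one_pass_training(stream, train_step, buffer_capacity):
    """Sketch of the One Pass protocol: each minibatch from `stream` is
    seen exactly once; a bounded buffer may retain some past data."""
    buffer = deque(maxlen=buffer_capacity)  # bounded storage within the budget
    for batch in stream:       # the stream never repeats a minibatch
        train_step(batch)      # learn from the fresh data immediately
        buffer.append(batch)   # optionally keep it for later replay
    return buffer              # whatever survived within the budget

# Usage: a toy stream of 5 minibatches with a 2-minibatch storage budget.
seen = []
leftover = one_pass_training(range(5), seen.append, buffer_capacity=2)
```

Here the FIFO buffer is only a placeholder; Section 3.1 replaces it with a reservoir-sampled replay buffer.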
We perform our study on a commonly used ImageNet solution: a ResNet-50 [He2016DeepRL] trained for 90 epochs with a cosine learning rate and augmented examples. We refer to this method as Multi-epoch throughout the paper. The images are preprocessed by resizing and then augmenting to the 224x224 network input size. During training, the augmentation includes random horizontal flipping and random cropping. At test time, only center cropping is applied to the images.
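The train-time augmentation can be sketched as follows (an illustrative NumPy version, not the paper's actual pipeline; the crop size is a parameter here):

```python
import numpy as np

def augment(img, rng, crop):
    """Random crop plus random horizontal flip, as used at train time
    (sketch; the real pipeline operates on resized ImageNet images)."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)    # random crop position
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:                 # flip horizontally half the time
        patch = patch[:, ::-1]
    return patch

# Usage: crop a 2x2 patch from a toy 4x4 RGB image.
rng = np.random.default_rng(0)
out = augment(np.zeros((4, 4, 3)), rng, crop=2)
```

At test time, only a deterministic center crop would be applied instead.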
Table 1: Comparison of training methods on the three OPIN metrics (all in %).

Method                          Accuracy (%)   Storage (%)   Compute (%)
Multi-epoch (90 epochs)         76.9           100           100
One-Pass (Naive)                30.6           0             1.1
One-Pass (Prioritized Replay)   65.0           10            10
Evaluation metrics. While the standard ImageNet benchmark focuses on a model's overall accuracy, the One Pass ImageNet problem aims at studying learning capability under constrained space and computation, so it is essentially a multi-objective problem. We propose to evaluate training methods using three major metrics: (1) accuracy, the top-1 accuracy on the test set; (2) space, the total additional data storage needed; and (3) compute, the total number of global steps of backpropagation. The space and compute metrics are calculated relative to the multi-epoch training method, i.e., both metrics for the Multi-epoch method are 100%. The Multi-epoch method saves all the data into storage, so the space metric is measured as the size of a method's data storage divided by the size of the dataset. The Multi-epoch method trains a model for 90 epochs (or 100M global steps), so the compute metric is measured as the total number of backpropagation operations divided by 100M.
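These relative metrics are simple ratios; a quick check (using the paper's figures of 1.2M training examples and roughly 100M backpropagation operations for 90 epochs) reproduces the naive one-pass numbers in Table 1:

```python
DATASET_SIZE = 1.2e6        # ImageNet training examples (from Sec. 1)
MULTI_EPOCH_STEPS = 100e6   # ~90 epochs of backprop operations (from Sec. 2)

def space_pct(stored_examples):
    """Space metric: stored examples relative to the full dataset."""
    return 100.0 * stored_examples / DATASET_SIZE

def compute_pct(backprop_ops):
    """Compute metric: backprop operations relative to 90-epoch training."""
    return 100.0 * backprop_ops / MULTI_EPOCH_STEPS

# Naive One-Pass: one epoch of compute, no extra storage.
print(round(compute_pct(MULTI_EPOCH_STEPS / 90), 1))  # 1.1, as in Table 1
print(space_pct(0))                                   # 0.0
```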
Naive baseline. A simple baseline for the One Pass problem is to train a model with the same training configuration as multi-epoch training but with only a single epoch, which we call Naive One-Pass. Since each example is seen only once, we replace the random augmentation with center cropping (as used in model evaluation) in this naive baseline. Table 1 compares multi-epoch training and naive one-pass training on the three metrics: accuracy, space, and compute (all in percent). While Naive One-Pass is significantly worse than multi-epoch training in accuracy, its space and compute efficiency are both far higher: it does not need to save any data examples into memory, and since it trains for only one epoch, its total number of training steps is 1/90 of that of multi-epoch training.
Problem Characteristics. Here we list four properties of the OPIN problem:

- The cold-start problem: the model starts from random initialization, so representation learning is challenging in OPIN, especially during the early stage of training.

- The forgetting problem: each example is passed to the model only once. Even though the data is i.i.d., vanilla supervised learning is likely to incur forgetting of early examples.

- A natural ordering of data: no artificial order of the data is enforced, so the data can be treated as i.i.d., which differs from many existing continual learning benchmarks.

- Multiple objectives: methods are evaluated using three metrics (accuracy, space, and compute), so the goal is to improve all three in a single training method.
3 A Prioritized Replay Baseline
Memory replay is a common approach in continual learning [rolnick2019experience]. Existing work has shown that memory replay is effective in sequential learning [Hsu18_EvalCL; taskonomycl]. As a first investigation of the One Pass ImageNet problem, we study how a replay buffer can improve overall performance in OPIN.
3.1 Replay buffer
The replay buffer is extra memory that explicitly stores data, usually with a very limited size. At each training step, the received minibatch of examples is inserted into this replay memory. Since the buffer size is smaller than the whole dataset, the typical solution is the reservoir sampling strategy [reservoirsampling], where each example is inserted with probability M/n, with M the memory size and n the total number of seen examples. In order to introduce a bias toward fresh examples (similar ideas on favoring recent examples in reservoir sampling can be found in [biasedreservoir; osborneetal2014exponential]), we incorporate a factor alpha > 1 into the inclusion probability, i.e., min(1, alpha * M/n), so that more recent examples are more likely to be included in the memory.
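This insertion rule can be sketched as follows (illustrative only; the value of alpha and the eviction of a uniformly random slot are assumptions of this sketch, since the paper's chosen constant is not reproduced here):

```python
import random

def maybe_insert(buffer, example, n_seen, capacity, alpha=2.0):
    """Reservoir sampling with a recency bias (Sec. 3.1 sketch): keep a new
    example with probability min(1, alpha * capacity / n_seen); alpha > 1
    favors fresh examples over plain reservoir sampling (alpha = 1)."""
    if len(buffer) < capacity:                        # buffer not yet full
        buffer.append(example)
    elif random.random() < min(1.0, alpha * capacity / n_seen):
        buffer[random.randrange(capacity)] = example  # evict a random slot

# Usage: stream 100 examples through a 5-slot buffer.
buf = []
for i in range(100):
    maybe_insert(buf, i, n_seen=i + 1, capacity=5)
```

With alpha = 1 this reduces to standard reservoir sampling, under which every seen example is retained with equal probability.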
At each training step, additional examples are sampled from the replay buffer and trained on by the model together with the incoming minibatch. Existing research on continual learning has shown that uniform sampling from a replay buffer is effective in many cases [taskonomycl]; however, a common intuition is that examples are not equally important when replayed. The idea of prioritized experience replay [schaul2016prioritized] is to assign each example a priority score and sample from the buffer according to the probability distribution obtained by normalizing the priority scores.
In order to apply data augmentation to replay examples, we save JPEG-encoded image bytes into the replay buffer instead of image tensors, which turns out to be more space efficient (3x more examples can be saved under the same memory budget). We found that replaying multiple examples at each step can dramatically improve model accuracy, albeit with a trade-off in compute. At each step, when we receive one minibatch of images, we replay k minibatches from the replay buffer, which effectively amounts to k + 1 epochs of compute.

3.2 Priority function
We study a prioritized replay method that uses the predictive error (loss value) as the priority function, which we call Error-Prioritized Replay (EPR). The priority function is defined as

    rho(x, y) = l(x, y; theta)^lambda,    (1)

where l(x, y; theta) is the loss value given input example x, ground-truth label y, and model parameters theta. The smoothing exponent lambda varies from 0 to 1 as lambda = t / T, where t is the current global step and T is the maximum global step. We choose a smoothed priority because the model's predictions are not trustworthy at the early stage of training. When the loss function is cross-entropy, l(x, y; theta) = -log p_theta(y|x), the priority value becomes rho(x, y) = (-log p_theta(y|x))^lambda, a function of the model's confidence in the ground-truth label.

3.3 Importance weight
The examples sampled from the prioritized replay buffer change the data distribution. To simplify notation, we omit the label y in this section. Let p(x) be the distribution of the original data and q(x) be the distribution in the replay buffer. The original objective is E_{x~p}[l(x; theta)]. Supposing k minibatches are sampled from the replay buffer at each step, directly combining replay examples and current examples leads to minimizing E_{x~p}[l(x; theta)] + k E_{x~q}[l(x; theta)]. To correct the distribution shift, we apply an importance weight w(x) = p(x)/q(x) to each replay example, because E_{x~q}[w(x) l(x; theta)] = E_{x~p}[l(x; theta)]. Given that the original distribution can be assumed uniform and q(x) is proportional to the priority rho(x), the importance weight of each replay example is w(x) proportional to 1/rho(x), inversely proportional to its priority value. The weights within each minibatch are normalized to mean 1.
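The sampling and weighting steps of Sections 3.2 and 3.3 can be sketched together as follows (illustrative; it assumes a priority of the form rho = loss^lambda with the smoothing schedule lambda = t / T, and mean-1 normalized weights proportional to 1 / rho):

```python
import numpy as np

def sample_replay(losses, t, T, k, rng):
    """Error-prioritized sampling with importance weights (sketch of
    Secs. 3.2-3.3).  Assumes priority rho = loss**lam with lam = t / T;
    weights are proportional to 1 / rho and normalized to mean 1."""
    lam = t / T                          # lam ~ 0 early: near-uniform sampling
    rho = np.asarray(losses) ** lam      # per-example priorities
    probs = rho / rho.sum()              # normalized sampling distribution
    idx = rng.choice(len(losses), size=k, p=probs)
    w = 1.0 / rho[idx]                   # corrects the replay distribution shift
    return idx, w / w.mean()             # importance weights, mean 1

# Usage: at the start of training (t = 0), sampling is uniform and weights are 1.
rng = np.random.default_rng(0)
idx, w = sample_replay([0.5, 1.0, 4.0], t=0, T=100, k=4, rng=rng)
```

At t = T the sketch samples in direct proportion to the loss, matching the intuition that high-error examples deserve more replay late in training.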
4 Experiments
4.1 Experimental setup
The model in our study is a Residual Neural Network [He2016DeepRL] with 50 layers (i.e., ResNet-50). We use cosine learning rate decay for all experiments, with an initial learning rate of 0.1. The model is optimized using stochastic gradient descent with Nesterov momentum 0.9. The batch size is 128. We evaluate different approaches with 10 different random orders of the data.
Table 2: Top-1 accuracy (%) for prioritized replay under different effective compute (epochs) and storage budgets; the last column is multi-epoch training with 100% storage.

Effective Epochs   1% Storage   5% Storage   10% Storage   Multi-epoch (100% Storage)
2                  44.7         45.1         45.7          46.1
4                  55.5         57.1         57.2          59.0
6                  58.9         61.3         62.2          64.1
9                  59.3         63.2         65.0          68.2
4.2 Results
The results are shown in Table 2. We performed experiments with the number of replay steps k in {1, 3, 5, 8} and replay buffer sizes of 1%, 5%, and 10% of the dataset (12 configurations in total). The effective epoch count is the number of replay steps plus 1: upon receiving each new minibatch, k minibatches of the same size are sampled from the replay buffer, so effectively k + 1 backpropagation operations are performed for each new minibatch the model sees. In addition to the one-pass solutions, we also show the results of multi-epoch training with the same number of effective epochs, with the learning rate schedule adjusted to decay to its minimum at the end of the corresponding epoch. We observe several trends from this table:

Prioritized replay with 10% memory achieves performance very close to multi-epoch training under the same computational cost, even though the multi-epoch method utilizes the full dataset and thus requires large data storage. While it is unknown whether multi-epoch training gives a performance upper bound, we believe it is a strong reference target.

1% data storage gives a strong starting point for prioritized replay. A 1% buffer (equivalent to 100 minibatches) dramatically improves naive one-pass performance by 28.7%. According to the table, the accuracy increase from 1% to 10% storage is 1.0% for 2 epochs, 1.7% for 4 epochs, 3.3% for 6 epochs, and 5.7% for 9 epochs.

When the buffer is bigger, the accuracy gain is larger when more replay steps are used. From 5% to 10% buffer size, model accuracy increases by 0.6% and 0.1% for replay steps 1 and 3 respectively, and by 0.9% for replay step 5. The model accuracy saturates quickly if one increases only the storage size or only the number of replay steps; increasing both can potentially yield a much bigger accuracy boost.
4.3 Discussion
Priority function. As recent literature [taskonomycl; chaudhry2020using] suggests, we also observe competitive results using vanilla replay (with a uniformly sampled memory) in the One Pass ImageNet problem. Specifically, a uniform memory leads to 64.7% top-1 accuracy with 8 replay steps and 10% storage, slightly worse than the prioritized replay method (65.0%); the standard error of the mean for both is 0.07%. Although the improvement is small, we believe the potential of prioritized replay is not fully explored in this problem, and we leave the study of better-designed prioritization to future research.
Importance weights. To study the effectiveness of importance weights applied to the loss function, we evaluate prioritized replay without importance weights at 1% storage. With 5 replay steps (6 effective epochs), removing importance weights results in 58.0% top-1 accuracy, 0.9% lower than with importance weights. With 8 replay steps, removing them results in 58.3% accuracy, 1.0% lower. This accuracy gap arises because the distribution of examples in the replay buffer differs from that of the evaluation dataset.
Priority updating. An interesting and somewhat surprising result is that updating the priorities of examples retrieved from the replay buffer does not lead to better accuracy. In an experiment using 5 replay steps and 1% storage, immediately updating an example's priority in the buffer with the latest model parameters after it is replayed results in 58.8% top-1 accuracy, versus 58.9% without updating.
5 Related Work
The OPIN problem is related to continual learning [DBLP:journals/corr/abs210612772; yin2020sola; borsos2020coresets; isele2018selective; van2020brain; chaudhry2018efficient; rusu2016progressive] and incremental learning [Castro_2018_ECCV]. Existing continual learning methods are often evaluated on benchmarks composed of standard supervised learning datasets such as MNIST and CIFAR [Hsu18_EvalCL; kirkpatrick2017overcoming; farajtabar2020orthogonal]. Recent efforts build benchmarks upon large-scale datasets such as ImageNet [wu2019large] and Taskonomy [taskonomycl]. Class-incremental ImageNet [wu2019large] is probably the closest continual learning benchmark to ours; it splits ImageNet into multiple tasks, where each task consists of examples labeled with a certain subset of classes. The data for each task is given to the model all at once, so the model can make multiple passes over the data within a task. OPIN differs in that it has no task concept, and the data follows a natural order instead of a manually designed class-incremental ordering.

The OPIN problem is also related to stream data learning [journals/sigkdd/GomesRBBG19; Souza2020ChallengesIB; dieuleveut2016nonparametric] and online learning [10.1109/TIT.2004.833339; 6842642; onlinesgd]. While there are many benchmarks utilizing real-world streaming data [Souza2020ChallengesIB], they often have smaller scale, less complex features, and a limited number of classes, and their solutions are often limited to linear or convex models. The One Pass ImageNet problem can also contribute to stream data learning research from the perspective of deep neural architectures and large-scale, high-dimensional data.
6 Conclusion
We presented the One Pass ImageNet (OPIN) problem, which aims at studying not only the predictive accuracy but also the space and compute efficiency of a supervised learning algorithm. The problem is motivated by real-world applications that come with a stream of extremely large-scale data, making it impractical to save all the data into dedicated storage and iterate over it for many epochs. We proposed an Error-Prioritized Replay baseline for this problem, which achieves 65.0% top-1 accuracy while reducing the required data storage by 90% and the total gradient-update compute by 90%, compared with the popular 90-epoch ImageNet training procedure. We hope OPIN can inspire future research on improving resource efficiency in supervised learning.
Acknowledgement.
The authors would like to thank Razvan Pascanu, Murray Shanahan, Eren Sezener, Jianan Wang, Talfan Evans, Sam Smith, Soham De, Yazhe Li, Amal RannenTriki, Sven Gowal, Dong Yin, Arslan Chaudhry, Mehrdad Farajtabar, Timothy Nguyen, Nevena Lazic, Iman Mirzadeh, Timothy Mann and John Maggs for their meaningful discussions and contributions.