
Label-Efficient Online Continual Object Detection in Streaming Video

06/01/2022
by   Jay Zhangjie Wu, et al.

To thrive in evolving environments, humans are capable of continual acquisition and transfer of new knowledge, from a continuous video stream, with minimal supervision, while retaining previously learnt experiences. In contrast to human learning, most standard continual learning benchmarks focus on learning from static iid images in fully supervised settings. Here, we examine a more realistic and challenging problem: Label-Efficient Online Continual Object Detection (LEOCOD) in video streams. Addressing this problem would greatly benefit many real-world applications by reducing annotation costs and retraining time. To tackle this problem, we seek inspiration from complementary learning systems (CLS) in human brains and propose a computational model, dubbed Efficient-CLS. Functionally correlated with the hippocampus and the neocortex in CLS, Efficient-CLS posits a memory encoding mechanism involving bidirectional interaction between fast and slow learners via synaptic weight transfers and pattern replays. We test Efficient-CLS and competitive baselines on two challenging real-world video stream datasets. Like humans, Efficient-CLS learns to detect new object classes incrementally from a continuous temporal stream of non-repeating video with minimal forgetting. Remarkably, with only 25% annotated video frames, Efficient-CLS outperforms baseline models that are trained with 100% annotations. Data and source code will be publicly available at https://github.com/showlab/Efficient-CLS.



1 Introduction

Humans have the ability to continuously learn from an ever-changing environment, while retaining previously learnt experiences. In contrast to human learning, recent works in continual learning aljundi2019gradient ; aljundi2019online ; fini2020online ; wang2021wanderlust ; caccia2022new show that deep neural networks (DNNs) are prone to catastrophic forgetting. Online Continual Learning (OCL) remains a challenging problem in which an agent learns from a never-ending stream of data in only a single pass. Existing works in OCL aljundi2019gradient ; aljundi2019online ; fini2020online ; caccia2022new primarily focus on image classification tasks. However, later works perez2020incremental ; wang2021wanderlust show that these methods underperform in more complex vision tasks, such as object detection. Moreover, these experiments in incremental image classification are conducted in an idealistic and simplified scenario in which a list of object classes is learnt in a fixed order and the learnt classes become inaccessible after being presented once. This deviates from OCL in the real world, where novel classes can co-occur with learnt classes in the same scene due to contextual regularities.

Here, to imitate what humans see and learn, we examine a more realistic and challenging problem of online continual object detection in real-world video streams. Although teaching machines to continuously detect objects on every video frame with full supervision would be ideal, annotating bounding boxes and class labels for all objects on every video frame is a daunting task. Our goal is therefore to study the online continual object detection problem in video streams with a minimal amount of annotation. We introduce our problem setting in Figure 1.

Figure 1: Label-efficient online continual object detection in video streams. (a) Problem introduction: As an agent continuously learns from a video stream, the ground truth labels from a certain percentage of the video frames (green boundary) are revealed to the agent, while the majority of frames (orange boundary) are annotation-free. We define annotation cost as the proportion of annotated video frames within a mini-batch of 16 consecutive video frames. To construct a test set, the last video frame (transparent) from every mini-batch is held out for testing and is not presented to the agent during training. We stamp every learning step from a mini-batch as a time step. After every 100 time steps, the agent is evaluated on all images from the test set for object detection. (b) Our proposed model (red) outperforms the best competitive baseline (blue) by a margin of 5% over all annotation costs. Remarkably, our model, trained at 25% annotation cost, surpasses the best baseline trained at 100% annotation cost (grey line). The orange cross denotes the performance of the state-of-the-art model, which is 15% lower than ours.

Our work shares similar motivations with works in semi-supervised learning. Over the past few years, several semi-supervised object detection approaches liu2021unbiased ; zhou2021instant ; tang2021humble have been proposed to reduce annotation costs. However, they follow an offline training protocol where labeled and unlabeled data can be used repeatedly across hundreds of epochs. Moreover, these studies focus on the object detection problem in static independent and identically distributed (iid) images and discard the temporal correlations in real-world videos. To the best of our knowledge, we are the first to study label-efficient online continual object detection in video streams.

Cognitive science works wang2020generalizing ; lake2017building show that humans are efficient at continuously learning from very few annotated data samples. We draw inspiration from the theory of Complementary Learning Systems (CLS) in human brains and propose a general framework for Label-Efficient Online Continual Object Detection, dubbed Efficient-CLS. In particular, CLS theory postulates that memories are first encoded via fast synaptic changes in the hippocampus, and these changes then support slow reinstatement of memories in the neocortex via accumulated experiences over time kumaran2016learning . To mimic the fast and slow synaptic changes in the hippocampus and neocortex, Efficient-CLS introduces two feed-forward neural networks as slow and fast learners. In the fast learner, memory is encoded in its synaptic weights, and these weights adapt rapidly to the current task. The synapses of the slow learner change a little on each reinstatement and are maintained by taking the exponential moving average of the fast learner's synapses over time. Though a few continual learning models in previous works arani2022learning ; pham2020contextual use a similar source of inspiration, they miss the effect of reciprocal connections from the neocortex to the hippocampus, which we intend to address. The neuroscience study ji2007coordinated has identified the importance of bidirectional interaction between the two complementary systems, whereby the reactivation of neural patterns in the neocortex triggers replays in the hippocampus, which in turn drive memory consolidation in the neocortex. Inspired by this mechanism, we reactivate the weights of the slow learner to predict meaningful pseudo labels from the unlabeled video frames and use these pseudo labels to guide the training of the fast learner, closing the loop between the two complementary learning systems. Specifically, pseudo labels predicted by the slow learner carry integrated semantic information over time, which encourages the fast learner to capture more holistic scene representations, alleviating catastrophic forgetting on sparsely annotated videos.

We benchmark our method on online continual object detection with video streams from the OAK and EgoObjects datasets. We conduct comprehensive experiments to evaluate the effectiveness of our method in alleviating catastrophic forgetting and reducing annotation costs. Our Efficient-CLS model consistently outperforms existing methods by a large margin of 5% across all annotation costs. At only 25% annotation cost, Efficient-CLS even surpasses all comparative baseline models trained with 100% annotation cost. The contributions of this paper are two-fold:

  • We introduce a new, challenging and important problem of label-efficient online continual object detection in video streams. Solving this problem would greatly benefit real-world applications in minimizing annotation costs and reducing model retraining time.

  • To tackle this problem, we propose a computational model inspired by the theory of Complementary Learning Systems. It beats all competitive baselines in object detection tasks by a large margin, with minimal forgetting and a minimal amount of annotation.

2 Related Work

2.1 Complementary Learning System (CLS)

The essence of fast and slow learning in CLS has benefited several continual learning applications in object recognition pham2020contextual ; pham2021dualnet ; rostami2019complementary ; arani2022learning ; kamra2017deep . However, these methods either require task boundaries, which are not applicable in our online problem setting, or they train the fast and slow learning systems with replay samples from the same replay buffer, which can easily lead to overfitting when the replay buffer has limited capacity. To mitigate overfitting, Rostami et al. rostami2019complementary and Kamra et al. kamra2017deep utilized generative replay models to couple sequential tasks in a latent embedding space. However, generative approaches have succeeded on artificial and simple datasets, but have failed in complex vision tasks, e.g., object detection. Based on the neuroscience evidence of bidirectional interaction between the hippocampus and the neocortex dudek1993bidirectional , we leverage the slow learner to exploit unlabeled video frames and generate pseudo labels for training the fast learner. These pseudo-label replays encourage the fast learner to capture more generic representations from the diverse data of the replay buffer and unlabeled video frames, which in turn contributes to reinstatement of memory in the slow learner, resulting in a positive feedback loop.

2.2 Online Continual Learning (OCL)

In contrast to classical Continual Learning (CL), where data are separated by task boundaries and models are trained with multiple iterations on every task, we examine a more realistic and challenging problem where data are provided in tiny batches and models are trained on these batches only once. OCL has gained increasing interest recently in computer vision aljundi2019gradient ; caccia2022new ; shim2021online ; chen2020mitigating ; wang2021wanderlust . Many OCL methods rely on representative memory replays. Aljundi et al. aljundi2019gradient utilized gradients of network parameters to select replay samples of maximum diversity. Subsequent works aljundi2019online ; shim2021online proposed to use losses and scoring functions as criteria for selecting the most representative samples for replay. However, these approaches tackled the image classification problem in an artificial setting, where new classes appear in a specific order. Their performance in real-world vision tasks remains unclear. Lately, Wang et al. wang2021wanderlust benchmarked OCL methods in the real-world setting with full supervision. As the video streams arrive endlessly in a real-time manner, annotating all video frames for training computational models is laborious and time-consuming. It becomes even more daunting in object detection tasks, where class labels and bounding boxes of all objects on a video frame have to be provided. Reducing the burdensome cost of labeling remains an under-explored and challenging problem in OCL. We propose Efficient-CLS, a self-sustaining model that is capable of exploiting unlabeled video frames via pseudo-labeling when the number of labeled frames is limited.

2.3 Semi-Supervised Object Detection (SSOD)

To reduce annotation costs in object detection, several methods jeong2019consistency ; sohn2020simple ; liu2021unbiased capitalize on teacher-student networks. In general, a teacher model predicts pseudo labels or enforces a consistency loss to guide the student network. Recently, Liu et al. liu2021unbiased proposed to use an Exponential Moving Average (EMA) to update the teacher model. It is worth noting that these previous works on SSOD mainly consider the offline setting on static image datasets, and none of them has been extended to online video streams.

3 Efficient-CLS: Efficient Complementary Learning System

We consider online continual object detection on a continuum of video streams where, at time step t, a learning agent receives a mini-batch B_t of continuous video frames from the current environment for online training (one single pass). To perform label-efficient object detection, only a subset of video frames within the batch B_t are labeled, while the remaining video frames are unlabeled. For each labeled data sample, its annotation contains the bounding box locations and their corresponding class labels.
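As a rough illustration of this protocol, the sketch below (PyTorch-style Python with hypothetical helper names, not the authors' actual data pipeline) splits a frame stream into mini-batches of 16 consecutive frames and separates labeled from unlabeled frames for a single online pass:

```python
from typing import Iterator, List, Tuple

BATCH_SIZE = 16  # consecutive frames per mini-batch B_t

def stream_minibatches(frames: List, annotation_mask: List[bool]
                       ) -> Iterator[Tuple[List, List]]:
    """Yield (labeled, unlabeled) frame subsets for each mini-batch, one pass only."""
    for start in range(0, len(frames), BATCH_SIZE):
        batch = frames[start:start + BATCH_SIZE]
        mask = annotation_mask[start:start + BATCH_SIZE]
        labeled = [f for f, m in zip(batch, mask) if m]
        unlabeled = [f for f, m in zip(batch, mask) if not m]
        yield labeled, unlabeled
```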

Efficient-CLS consists of two feed-forward modules: (i) the fast learner is designed to quickly encode new knowledge from the current data stream and then consolidate it to the slow learner; and (ii) the slow learner accumulates the acquired knowledge from the fast learner over time and guides the fast learner with meaningful pseudo labels when full supervision is not available. Following rebuffi2017icarl , we maintain an external episodic memory, as a replay buffer, to store exemplars that can be retrieved for replay alongside the ongoing video stream.

Figure 2: Overview of our Efficient-CLS. At each learning step, the system receives a batch of temporally continuous data B_t, including labeled (green) and unlabeled (orange) frames. The fast learner is trained on the labeled frames, alongside a small subset of labeled exemplars retrieved from episodic memory, with the supervised loss L_sup. Meanwhile, the fast learner leverages the pseudo labels generated by the slow learner to optimize a pseudo loss L_pseudo. To reinstate memory of the slow learner, the synaptic weights of the slow learner are updated by taking the Exponential Moving Average (EMA) of the fast learner's weights. The fast and slow learners are complementary to each other, forming a positive feedback loop.

3.1 Learning with Labeled Frames

The fast learner and slow learner use the same standard Faster-RCNN ren2015faster detector. Despite the same architecture, the weights of the fast and slow learners are not shared. We use θ_F and θ_S to denote the network parameters of the fast and slow learners, respectively. As shown in Figure 2, at each training step t, we use the labeled video frames to optimize the fast learner with the standard supervised loss L_sup in Faster-RCNN ren2015faster . It consists of four losses: the Region Proposal Network (RPN) classification loss L_rpn_cls, the RPN regression loss L_rpn_reg, the Region of Interest (ROI) classification loss L_roi_cls, and the ROI regression loss L_roi_reg. We define L_sup as:

L_sup = L_rpn_cls + L_rpn_reg + L_roi_cls + L_roi_reg     (1)
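For concreteness, the snippet below shows how these four loss terms can be obtained from torchvision's off-the-shelf Faster R-CNN in training mode; the paper's implementation may differ in detail (the class count of 104, i.e., 103 OAK categories plus background, is an assumption here):

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone; in training mode the forward pass
# returns the four standard loss terms that make up L_sup in Eq. (1).
fast_learner = torchvision.models.detection.fasterrcnn_resnet50_fpn(num_classes=104)
fast_learner.train()

images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[50., 60., 200., 220.]]),
            "labels": torch.tensor([3])}]

loss_dict = fast_learner(images, targets)
# 'loss_objectness' / 'loss_rpn_box_reg' come from the RPN,
# 'loss_classifier' / 'loss_box_reg' come from the ROI head.
l_sup = sum(loss_dict.values())
```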

3.2 Learning with Unlabeled Frames

We introduce a pseudo-labeling paradigm to capitalize the information from unlabeled video frames for training. In our early exploration, we intuitively use the fast learner for pseudo-labeling as it quickly adapts the knowledge of nearby frames. However, we observe that using the pseudo labels generated by the fast learner for self-replay exhibits biases towards recently seen objects, which is less effective in preventing forgetting. This has also been verified in our ablation study (Section 5.3). In contrast, the slow learner preserves the semantic knowledge over a longer time span which generates pseudo labels with fewer biases. This encourages the fast learner to capture more generic scene representations, hence, in turn, contributing to reinstatement of memory in the slow learner (Section 3.3), resulting in a positive feedback loop.

Given all these design considerations, the slow learner takes the unlabeled video frames x_u as inputs to estimate the possible objects of interest and their corresponding bounding box locations. For brevity, we refer to these "pseudo bounding boxes and their corresponding class labels" as "pseudo labels" in the paper. To get rid of false positives, we apply a threshold τ to filter out bounding boxes with low predicted confidence scores. Moreover, there also exist repetitive boxes, which negatively impact the quality of pseudo-labeling. To address this issue, we use class-wise non-maximum suppression (NMS) ren2015faster to remove the overlapping boxes and obtain high-quality pseudo labels. Formally, the procedure of pseudo label generation is summarized below:

ŷ_u = NMS( F_τ( f(x_u; θ_S) ) )     (2)

where F_τ denotes the bounding box selection with confidence score larger than τ.
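A minimal sketch of this filtering step, assuming a torchvision-style detector whose eval-mode output is a dict of boxes, scores, and labels (the NMS IoU threshold of 0.5 is an assumed value, not taken from the paper):

```python
import torch
from torchvision.ops import batched_nms

@torch.no_grad()
def generate_pseudo_labels(slow_learner, frame, tau=0.7, nms_iou=0.5):
    """Confidence thresholding followed by class-wise NMS on the slow learner's detections."""
    slow_learner.eval()
    pred = slow_learner([frame])[0]            # {'boxes', 'scores', 'labels'}
    keep = pred["scores"] > tau                # drop low-confidence boxes
    boxes, scores, labels = (pred["boxes"][keep],
                             pred["scores"][keep],
                             pred["labels"][keep])
    keep = batched_nms(boxes, scores, labels, nms_iou)  # class-wise NMS
    return boxes[keep], labels[keep]
```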

Given that the video streams are captured from an egocentric perspective in the real world, head and body motions may lead to undesired motion blur on some video frames. To encourage our model to learn invariant object representations from these video frames, following previous work zoph2020learning , we apply data augmentation techniques on the pseudo-labeled frames, including 2D image crops, rotations, and flipping. Note that, different from image classification, the predicted bounding box locations also need to be updated accordingly after image augmentations. We denote the pseudo-labeled video frames and their re-adjusted pseudo labels after data augmentation as (x̃_u, ỹ_u). We can then use these pseudo pairs (x̃_u, ỹ_u) to train the fast learner by optimizing the pseudo loss L_pseudo. Note that we only apply pseudo losses at the ROI module, as we empirically verified that the RPN module has no effect on pseudo training (see Section A.2.4).
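Because geometric augmentations move objects in the image, the pseudo boxes must be transformed together with the frame. A minimal example for a horizontal flip (one of the augmentations listed above):

```python
import torch

def hflip_with_boxes(image: torch.Tensor, boxes: torch.Tensor):
    """Horizontally flip an image tensor (C, H, W) and mirror its xyxy boxes accordingly."""
    _, _, width = image.shape
    flipped = torch.flip(image, dims=[-1])
    flipped_boxes = boxes.clone()
    flipped_boxes[:, [0, 2]] = width - boxes[:, [2, 0]]  # new x1 = W - x2, new x2 = W - x1
    return flipped, flipped_boxes
```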

Overall, our Efficient-CLS is jointly trained with the following losses: L = L_sup + λ · L_pseudo, where λ is the weight of L_pseudo.

3.3 Synapses Consolidation via Exponential Moving Average

To alleviate forgetting of acquired knowledge, we apply EMA to gradually update the slow learner with the fast learner's synaptic weights. The evolving synaptic changes in the slow learner are functionally correlated with the memory consolidation mechanism in the hippocampus and the neocortex arani2022learning . Formally, we define the EMA process as:

θ_S ← α · θ_S + (1 − α) · θ_F     (3)

where α is the EMA rate. According to the stability-plasticity dilemma, a smaller α means faster adaptation but less memorization. Empirically, we set α = 0.99, which leads to the best performance (see Section A.2.2 for a detailed analysis on the choice of α).
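Equation (3) amounts to a standard parameter-wise EMA update, sketched below (applied to parameters only; handling of buffers such as batch-norm statistics is left out):

```python
import torch

@torch.no_grad()
def ema_update(slow_learner, fast_learner, alpha=0.99):
    """Eq. (3): theta_S <- alpha * theta_S + (1 - alpha) * theta_F."""
    for p_slow, p_fast in zip(slow_learner.parameters(), fast_learner.parameters()):
        p_slow.mul_(alpha).add_(p_fast, alpha=1.0 - alpha)
```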

4 Experimental details

4.1 Datasets

We consider two challenging datasets, i.e., OAK wang2021wanderlust and EgoObjects grauman2021ego4d for online continual object detection on video streams.

The OAK dataset wang2021wanderlust is a large egocentric video stream dataset spanning nine months of a graduate student's life, consisting of 7.6 million frames from 460 video clips with a total length of 70.2 hours. The dataset contains 103 object categories. We follow wang2021wanderlust in the ordering of training and testing data splits. One frame out of every 16 consecutive video frames, which span roughly 30 seconds, is held out to construct a test set, and the remaining frames are used for training.

EgoObjects (https://sites.google.com/view/clvision2022/challenge) is one of the largest object-centric datasets focusing on the object detection task. It includes 40,000 videos (around 110 hours), covering 600 object categories. We take a subset of EgoObjects to benchmark LEOCOD (see Section A.1.3 for details). For consistency, we use the same protocol as for the OAK dataset to construct the train and test data splits.

4.2 Evaluation

Protocol. First, we define the annotation cost as the proportion of labeled frames out of the total 16 frames within a mini-batch B_t. For example, if 2 out of the 16 consecutive frames within B_t are labeled, the annotation cost is 12.5%. The frames to be labeled are randomly selected within each mini-batch B_t. Considering that different choices of labeled frames might influence model performance, for fair comparison between models, we fix the choice of randomly selected labeled frames and use the same labeled and unlabeled frames for training all models.
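One simple way to realize this fixed labeled/unlabeled split is to pre-compute the labeled frame indices per mini-batch from a shared random seed; the sketch below is illustrative and not taken from the released code:

```python
import random

def select_labeled_indices(num_batches, annotation_cost, batch_size=16, seed=0):
    """Pre-compute which frames are labeled in each mini-batch so that all models
    see exactly the same labeled/unlabeled split."""
    rng = random.Random(seed)
    n_labeled = max(1, round(annotation_cost * batch_size))
    return [sorted(rng.sample(range(batch_size), n_labeled))
            for _ in range(num_batches)]

# e.g., 12.5% annotation cost -> 2 labeled frames per 16-frame mini-batch
labeled_per_batch = select_labeled_indices(num_batches=1000, annotation_cost=0.125)
```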

Based on the various annotation costs, we introduce two training protocols: a fully supervised protocol (100% annotation cost) and a sparse annotation protocol (where the annotation cost is less than 100%). In the sparse annotation protocol, we further split the training experiments into 50%/25%/12.5%/6.25% annotation costs.

We use the same test set for evaluating all computational models. As shown in Figure 1, we always add the last video frame out of every 16 video frames within a mini-batch to our test set. Once the test set is constructed for each dataset, it is fixed. All frames in the test set are repeatedly used for evaluating computational models every 100 learning steps. It is possible that some test frames contain object classes which the model has not yet learnt to detect at the current learning step. In this case, we follow the evaluation paradigm from previous work wang2021wanderlust . As the model continuously learns from the ongoing video stream, it learns to detect more object classes. At the end of the video stream, the model should be able to detect all object classes present in the test set.

Metrics. We evaluate all models on the OAK and EgoObjects datasets with three standard metrics: continual average precision (CAP), final average precision (FAP), and forgetfulness (F) wang2021wanderlust . CAP shows the average performance of a continual learning algorithm over the time span of the entire video stream, while FAP denotes the final performance of a model after seeing the entire video stream. F estimates the forgetfulness of the model due to sequential training. It takes into account the time interval between the first presence of an object category and its subsequent presence. See Appendix (Section A.1.2) for their detailed definitions.

4.3 Baselines

We compare our model against the following baselines: Incremental is a naive baseline trained sequentially over the entire video stream without any measures to avoid catastrophic forgetting; EWC kirkpatrick2017overcoming is a weight-regularization method which prevents forgetting by penalizing changes to parameters that are important for previous tasks; iCaRL rebuffi2017icarl is a replay method where old video frames are stored in a replay buffer and replayed when the model learns to detect objects on new video frames; Offline Training is an upper bound that trains on the entire data stream over multiple epochs.

The iCaRL model implemented by Wang et al. wang2021wanderlust stands as the state-of-the-art method in online continual object detection. We reproduce their results using the released code (https://github.com/oakdata/benchmark). When calculating RPN and ROI losses for replay samples, their iCaRL model neglects the losses of background proposals and penalizes the foreground losses according to the proportion of current samples and replay samples. We empirically find that this trick hinders the model from effective video frame replay, resulting in severe forgetting. Therefore, we re-implement iCaRL by discarding the re-weighting trick used in wang2021wanderlust and reverting to the standard RPN and ROI losses. To distinguish the two implementations, we name Wang et al.'s version iCaRL(Wang et al.) and ours iCaRL(our impl.).

4.4 Implementation Details

For fair comparison, same as wang2021wanderlust , we use a Faster-RCNN ren2015faster with a ResNet-50 backbone he2016deep pre-trained on PASCAL VOC everingham2015pascal for all continual learning algorithms. Our replay buffer stores a total of 5 samples per class (around 500 frames for OAK, and 1400 for EgoObjects). This buffer size is comparable with the buffer size in iCaRL(Wang et al.) wang2021wanderlust and iCaRL(our impl.). We also fix the number of replay samples to 16 frames per time step, which requires less training time compared with wang2021wanderlust . We use a confidence threshold of τ = 0.7 to generate pseudo labels and apply α = 0.99 as the EMA rate for all our experiments. For our Efficient-CLS, we use the output of the slow learner at the inference stage, as it excels at avoiding catastrophic forgetting (see Section 5.3). More training and implementation details can be found in Appendix (Section A.1.1).
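The class-balanced episodic memory can be sketched as below; the eviction policy (random replacement once a class slot is full) is an assumption, as the paper only specifies the 5-samples-per-class budget and the 16 replayed frames per step:

```python
import random
from collections import defaultdict

class ClassBalancedReplayBuffer:
    """Episodic memory storing up to `per_class` exemplar frames per object class."""

    def __init__(self, per_class=5):
        self.per_class = per_class
        self.exemplars = defaultdict(list)  # class id -> list of (frame, annotation)

    def add(self, frame, annotation, class_ids):
        """Insert a labeled frame under every class it contains."""
        for c in class_ids:
            slot = self.exemplars[c]
            if len(slot) < self.per_class:
                slot.append((frame, annotation))
            else:  # assumed policy: replace a random exemplar of that class
                slot[random.randrange(self.per_class)] = (frame, annotation)

    def sample(self, k=16):
        """Draw k exemplars uniformly from the whole buffer for replay."""
        pool = [x for slot in self.exemplars.values() for x in slot]
        return random.sample(pool, min(k, len(pool)))
```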

5 Results

                                         OAK                          EgoObjects
Method              Annotation Cost  FAP (↑)  CAP (↑)  F (↓)     FAP (↑)  CAP (↑)  F (↓)
Incremental*        100%             8.38     7.72     0.03      10.21    3.55     1.48
EWC*                100%             7.73     7.02     -0.12     5.15     1.60     0.57
iCaRL(Wang et al.)  100%             22.89    16.60    -2.95     37.61    21.71    2.79
iCaRL(our impl.)    100%             36.14    26.26    -4.89     60.80    36.41    -0.60
Ours                12.5%            33.92    23.04    -7.71     53.33    32.88    -2.92
Ours                25%              38.36    26.64    -8.20     61.26    39.58    -3.48
Ours                100%             40.24    28.18    -8.10     67.05    40.36    -3.67
Offline Training    100%             48.28    35.23    -         86.18    59.81    -
Table 1: Performance of Efficient-CLS and other state-of-the-art methods on OAK and EgoObjects. iCaRL(Wang et al.) denotes the model presented in wang2021wanderlust , and iCaRL(our impl.) is the same method with our implementation. The best results are bold-faced. *The results on OAK are lower than those presented in wang2021wanderlust due to different training settings: wang2021wanderlust trains the data of each step for 10 epochs, whereas in our strictly online setting each step is seen only once.

5.1 Performance in Fully Supervised Protocol

As the previous work wang2021wanderlust focuses on online continual object detection in video streams, we first evaluated model performance in the fully supervised setting, where all video frames are paired with ground truth labels. We reported the results in the standard metrics (CAP, FAP, and F; Section 4.2) in Table 1. Our proposed method, Efficient-CLS, surpasses all existing baselines.

An ideal video stream learning method should avoid catastrophic forgetting while adapting to new tasks. A trivial algorithm that only adapts to the current task, without any measure to prevent catastrophic forgetting, leads to the lowest FAP and CAP scores and the largest Forgetfulness (F) values. Indeed, Incremental is a lower bound among all computational models. As previous works have shown that weight-regularization methods are less effective than replay methods in many continual learning tasks zhang2021hypothesis ; yoon2021online , we included the weight-regularization method EWC for comparison. We observed that EWC is inferior to other competitive baselines in stream learning. Its performance is not significantly different from Incremental on OAK and is even worse than Incremental on EgoObjects. One possible reason is that EWC has to utilize task boundaries to compute the weight importance, which are not available in video stream learning. During our implementation, we also noted that the running time of EWC is significantly longer than that of our method, partly because calculating the Fisher Information Matrix to select important weights in EWC is computationally expensive.

iCaRL(Wang et al.) outperforms EWC by 15.16% in FAP, 9.58% in CAP, and 2.83% in F on the OAK dataset. A significant performance boost is also consistently observed on EgoObjects in terms of FAP and CAP. This demonstrates that a naive image replay strategy can still play an important role in stream learning. We introduced several variations to the original design of iCaRL(Wang et al.) (see Section 4.3). Compared with iCaRL(Wang et al.), we observed a performance boost of 13.25% in FAP, 9.66% in CAP, and 1.94% in F for iCaRL(our impl.) on the OAK dataset. Since semantic contextual information is more important in the indoor environments of EgoObjects than in the outdoor environments of the OAK dataset, the improvement of iCaRL(our impl.) is even greater on EgoObjects, with an increase of 23.19% in FAP, 14.70% in CAP, and 3.39% in F.

Inheriting the benefit of naive image replays, our Efficient-CLS replays the images stored in the episodic memory buffer. In addition, we introduced a strategy of synaptic weight transfer from the fast to the slow learner. Our performance in the fully supervised setting beats iCaRL(our impl.) by 4.10% and 1.92% in terms of FAP and CAP, respectively, and favorably reduces the Forgetfulness (F) by 3.21% on the OAK dataset. Our model also consistently leads all comparative methods in all evaluation metrics on the EgoObjects dataset. It is worth noting that Offline Training provides the upper-bound performance, as the model is trained on the entire video streams multiple times. Despite a performance sacrifice of 8.04% in FAP and 7.05% in CAP compared with Offline Training, our method reduces the retraining time by a factor of 40 and the storage size by a factor of 50.

Figure 3: Evaluation of online continual object detection in video streams with three metrics (FAP, CAP and F; Section 4.2) on the OAK dataset (first row) and the EgoObjects dataset (second row). The higher the bars, the better. The x-axis denotes the percentage of video frames that are labeled in the video stream, ranging from 6.25% to 100% (full supervision). The y-axis indicates the performance under each evaluation metric. Our method (red) consistently beats the SOTA (iCaRL(our impl.), blue) in all evaluation metrics.

5.2 Performance in Sparse Annotation Protocol

The sparse annotation protocol is more challenging than the fully supervised protocol, as shown by the performance differences when the number of annotated video frames decreases (compare the performance of each colored bar along the x-axis within each subplot in Figure 3). We compared Efficient-CLS with the best previous method, iCaRL(our impl.), and reported the performance in Figure 3. On both the OAK and EgoObjects datasets, Efficient-CLS consistently beats the SOTA approach in all three evaluation metrics regardless of the annotation cost. Thanks to the useful information from pseudo labels predicted by the slow learner, our method is more robust to varying annotation costs than iCaRL(our impl.) (compare the rate of change of blue bars vs. red bars over different annotation costs). Most remarkably, Efficient-CLS with 25% annotation cost already outperforms iCaRL(our impl.) with 100% annotation cost (compare the red bar at 25% with the blue bar at 100% within each subplot).

5.3 Ablation Study

We assessed the importance of our design choices by evaluating ablated versions of the Efficient-CLS model in the fully supervised protocol (Table 2) and the sparse annotation protocol (Table 3). The complementary learning system design in Efficient-CLS is the key to rapidly adapting to new tasks while retaining previously learnt knowledge. It consists of two memory reinstatement mechanisms: synaptic weight transfer from the fast to the slow learner via exponential moving average (EMA), and pseudo-labeling and replays from the slow learner to the fast learner. Here we ablated each mechanism and studied its effect on the OAK dataset.

EMA   FAP (↑)  CAP (↑)  F (↓)
✗     36.14    26.26    -4.89
✓     40.24    28.18    -8.10
Table 2: Effectiveness of Exponential Moving Average (EMA) on OAK dataset in fully supervised protocol. The best results are bold-faced.

EMA   Pseudo-labeling   FAP (↑)  CAP (↑)  F (↓)
✗     ✗                 28.76    19.80    -5.48
✓     ✗                 31.72    21.16    -7.24
✗     ✓                 31.60    22.44    -4.83
✓     ✓                 33.92    23.04    -7.71
Table 3: Effectiveness of Exponential Moving Average (EMA) and Pseudo-labeling on OAK dataset at annotation cost 12.5%. The best results are bold-faced.

Effect of Exponential Moving Average (EMA). We removed EMA by sharing the model weights of the fast learner and slow learner throughout the learning process. From Table 2, in the fully supervised protocol, we observed that removing EMA leads to a performance drop of 4.10% in FAP, 1.92% in CAP, and 3.21% in F on the OAK dataset (compare Row 2 vs. Row 1). This shows that the slow learner can effectively consolidate the knowledge from the fast learner and constructively alleviate catastrophic forgetting through synapse consolidation over time. Given that the slow learner is better at preventing forgetting than the fast learner, we used the output of the slow learner at the inference stage. Similar observations were made in Table 3 for the sparse annotation protocol (compare Row 4 vs. Row 3). It is worth noting that the performance difference between our full model and the EMA-ablated model is slightly larger in the fully supervised protocol than in the sparse annotation protocol (compare Table 2 vs. Table 3). One reason is that, compared with the sparse annotation protocol, the fast learner achieves better performance in fully supervised training; hence, the effect of removing EMA becomes stronger in the fully supervised setting, again highlighting the importance of EMA.

To further explore the role of EMA, we varied α from 0.5 to 0.999 and reported the performance in Section A.2.2. We observed that the performance is relatively insensitive to the choice of α. For example, a nearby choice of α leads to a performance change of less than 1% in FAP and CAP and 1.12% in F, compared with our default α = 0.99. However, we did observe a huge performance drop when α is very close to 1, where there is almost no synaptic weight transfer between the slow and fast learners.

Effect of Pseudo-labeling. We ablated Efficient-CLS by removing the pseudo-labeling and replays of the slow learner at 12.5% annotation cost (Ablated Pseudo, Row 2) and reported the results in Table 3. Ablated Pseudo leads to a performance drop of about 2% in FAP and CAP and 0.5% in F, compared with our full Efficient-CLS (Row 4). This implies that the slow learner captures useful semantic information from unlabeled video frames and that the predicted pseudo labels are helpful in training the fast learner.

To investigate whether pseudo labels predicted by the fast learner could help stream learning, we conducted another ablation experiment in which we removed both the EMA and pseudo-labeling mechanisms in Efficient-CLS, dubbed Naive Ablated (Row 1, Table 3). Compared with Ablated EMA (Row 3), we observed a performance drop from 31.60% to 28.76% in FAP and from 22.44% to 19.80% in CAP. This indicates that pseudo labels predicted by the fast learner can serve as informative supervision for training the fast learner itself. However, replaying the self-predicted pseudo labels on the fast learner fails to prevent forgetting, as indicated by F worsening from -5.48% to -4.83%. It is possible that the pseudo labels generated by the fast learner are biased towards classes which have already been learnt well and fail to push the fast learner to improve on poorly-learnt classes. Different from the fast learner, the slow learner integrates semantic information over time. Its predicted pseudo labels carry more semantic information, which helps the fast learner capture more generic object representations during pseudo-label replays. Again, this emphasizes that the reciprocal replay from the slow learner to the fast learner is critical for memory reinstatement, which has been missing in the computational modeling literature of CLS.

6 Conclusion

To imitate what humans see and learn in the real world, we introduced a more realistic and challenging problem of label-efficient online continual object detection in video streams. Addressing this problem would greatly benefit real-world applications by reducing model retraining time and data annotation costs. Inspired by the complementary learning systems (CLS) in human brains, we proposed Efficient-CLS. Just like humans, it is capable of continuously learning to detect objects from both fully and sparsely annotated video streams, while retaining previously learnt knowledge. We rigorously evaluated Efficient-CLS and competitive baselines on two challenging real-world video stream datasets and verified the effectiveness of our method in reducing annotation costs and avoiding catastrophic forgetting. Although our Efficient-CLS only uses 25% of the annotations, it beats all comparative models that require fully supervised training on all video streams. Despite the promising results, we still observed a large performance gap between Efficient-CLS and object detection algorithms trained in the offline setting, highlighting the challenges of our introduced LEOCOD problem.

References

Appendix A Appendix

A.1 Additional Experimental Details

A.1.1 Training

For a fair comparison, we follow the prior work wang2021wanderlust and use Faster-RCNN ren2015faster with a ResNet-50 backbone he2016deep as our object detection network, initialized with weights pre-trained on PASCAL VOC everingham2015pascal . We use the Adam optimizer with a learning rate of 0.0001, and the batch size is set to 16 frames. Same as wang2021wanderlust , we maintain a replay buffer with 5 samples per class. At each time step t, we first randomly retrieve 16 video frames from the replay buffer for joint training. We use a confidence threshold of τ = 0.7 to generate pseudo labels for unlabeled frames. Data augmentation, including random image crops, rotations, and horizontal flips, is applied to these pseudo-labeled frames. We introduce λ as a hyper-parameter to balance the contributions of the two losses L_sup and L_pseudo. After updating the weights of the fast learner via backpropagation of the incurred losses, we update the slow learner by taking the EMA of the fast learner's weights with an EMA rate α = 0.99. Finally, the replay buffer is updated with the labeled frames at the current time step t. Each model is trained with a single pass over the entire video stream. Training is carried out on 2 NVIDIA RTX 3090 GPUs.
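Putting the pieces together, one online learning step can be sketched as follows, built from the illustrative helpers introduced earlier (generate_pseudo_labels, hflip_with_boxes, ema_update, and the class-balanced buffer). For brevity the pseudo loss here sums all detector losses, whereas the paper applies it only at the ROI head:

```python
import torch

def train_step(fast, slow, optimizer, labeled, unlabeled, buffer,
               lam=1.0, alpha=0.99, tau=0.7):
    """One Efficient-CLS step. `labeled` is a list of (frame, target) pairs,
    `unlabeled` a list of frames; `buffer` is the episodic replay memory."""
    fast.train()

    # (1) Supervised loss on labeled frames plus 16 replayed exemplars.
    replay = buffer.sample(16)
    images = [f for f, _ in labeled + replay]
    targets = [t for _, t in labeled + replay]
    l_sup = sum(fast(images, targets).values())

    # (2) Pseudo loss on unlabeled frames; labels come from the slow learner.
    l_pseudo = torch.zeros(())
    for frame in unlabeled:
        boxes, labels = generate_pseudo_labels(slow, frame, tau=tau)
        if len(boxes) == 0:
            continue
        aug_frame, aug_boxes = hflip_with_boxes(frame, boxes)  # one example augmentation
        l_pseudo = l_pseudo + sum(
            fast([aug_frame], [{"boxes": aug_boxes, "labels": labels}]).values())

    # (3) Update the fast learner, then consolidate into the slow learner via EMA.
    optimizer.zero_grad()
    (l_sup + lam * l_pseudo).backward()
    optimizer.step()
    ema_update(slow, fast, alpha=alpha)
```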

A.1.2 Evaluation Metrics

Following wang2021wanderlust , we evaluate all methods with three standard metrics: continual average precision (CAP), final average precision (FAP), and forgetfulness (F). We adopt AP50, i.e., the average precision (AP) at an IoU threshold of 0.5, as the measurement of AP.

CAP shows the average performance of a continual learning algorithm over the time span of the entire video stream. As shown in Figure 1, the model is evaluated on the test set every 100 time steps. At evaluation step t, the reported CAP_t is defined as

CAP_t = (1 / |C|) · Σ_{c ∈ C} AP_{c,t}     (S1)

where AP_{c,t} is the average precision (AP) of class c on the test set and C is the set of object classes. CAP is then defined as the average of CAP_t over all evaluation steps:

CAP = (1 / T) · Σ_{t=1}^{T} CAP_t     (S2)

where T is the total number of evaluation steps.

FAP is the final performance of a model after seeing the entire video. That is, FAP = CAP_T, where T denotes the last evaluation step.
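A small sketch of how CAP and FAP can be computed from a log of per-class AP50 values at each evaluation step (assumed data layout: one dict of class id to AP50 per evaluation step):

```python
def cap_at_step(ap_per_class):
    """Eq. (S1): mean AP50 over classes at one evaluation step."""
    return sum(ap_per_class.values()) / max(len(ap_per_class), 1)

def cap_and_fap(ap_history):
    """Eq. (S2): CAP averages the per-step values over all T evaluation steps;
    FAP is the value at the last step."""
    per_step = [cap_at_step(step) for step in ap_history]
    cap = sum(per_step) / max(len(per_step), 1)
    fap = per_step[-1] if per_step else 0.0
    return cap, fap
```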

F estimates the forgetfulness of the model due to sequential training. It takes into account the time interval between the presence of an object category and its subsequent presence. For a class c, we sort the AP_{c,t} values according to the time interval Δt between the evaluation time t and the last time the model was trained on class c. After sorting, all AP_{c,t} are divided into bins according to the time interval Δt. The average AP of the k-th bin, denoted AP_c^k, is defined as the model's performance for detecting class c after the model has not been trained on c for the corresponding interval. The forgetfulness (F) of class c is defined as the weighted sum of the performance decrease at each interval:

F_c = Σ_k w_k · (AP_c^0 − AP_c^k)     (S3)

where w_k denotes the weight of the k-th bin. The overall forgetfulness is then defined as the average over classes:

F = (1 / |C|) · Σ_{c ∈ C} F_c     (S4)

A.1.3 Datasets

OAK. We follow wang2021wanderlust in the ordering of training and testing data splits, i.e., one frame every 16 consecutive video frames is held out to construct a test set and the remaining frames are used for training. However, as the original test set curated by wang2021wanderlust is not publicly available, we re-split the training and testing data using the video streams from the original training set. The model trained and evaluated on our dataset shows comparable results with the original one.

EgoObjects. The original data can be downloaded from https://sites.google.com/view/clvision2022/challenge. This dataset consists of 6076 videos taken in 1110 realistic indoor environments (around 6 videos per environment). The videos in each environment contain the same objects but feature a great variety of lighting conditions, scales, camera motions, and background complexities. We first temporally downsample the original videos by a factor of 2 to make the length of the entire video stream comparable with the OAK dataset. We shuffle the ordering of the 6076 videos (but not the video frames within the same video) from different environments, which makes the stream more realistic since previously seen environments can be revisited in the real world. We concatenate the videos into one long video stream, and use the same procedure as for OAK to construct the train and test data splits.
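A rough sketch of this stream construction (hypothetical function and variable names; `videos` is assumed to be a list of per-video frame lists):

```python
import random

def build_stream(videos, downsample=2, batch_size=16, seed=0):
    """Shuffle video order (not frames), temporally downsample, concatenate into one
    long stream, and hold out the last frame of every 16-frame mini-batch for testing."""
    rng = random.Random(seed)
    order = list(range(len(videos)))
    rng.shuffle(order)                       # previously seen environments may be revisited
    stream = [f for i in order for f in videos[i][::downsample]]
    train_frames, test_frames = [], []
    for start in range(0, len(stream) - batch_size + 1, batch_size):
        batch = stream[start:start + batch_size]
        train_frames.extend(batch[:-1])      # first 15 frames are available for training
        test_frames.append(batch[-1])        # last frame is held out for the test set
    return train_frames, test_frames
```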

A.2 Additional Ablation Study

In addition to the ablation studies provided in the main paper, we further study the effectiveness of each component in our proposed Efficient-CLS in the following sections.

A.2.1 Effect of EMA and Pseudo-labeling

In the main text, we studied the effect of EMA and pseudo-labeling at 12.5% annotation cost (Section 5.3). Here we performed the same experiments at annotation costs of 6.25% (Table S1), 25% (Table S2), and 50% (Table S3). We found that, by integrating EMA and pseudo-labeling, our Efficient-CLS (4th row) improves over the state-of-the-art model (1st row) and the other ablated models (2nd and 3rd rows) by a significant margin. Please refer to Section 5.3 for more analysis.

EMA   Pseudo-labeling   FAP (↑)  CAP (↑)  F (↓)
✗     ✗                 23.04    17.75    -3.31
✓     ✗                 27.84    20.03    -3.96
✗     ✓                 26.39    19.50    -1.99
✓     ✓                 29.72    20.31    -5.36
Table S1: Effectiveness of Exponential Moving Average (EMA) and Pseudo-labeling on OAK dataset at annotation cost 6.25%. The best results are bold-faced.

EMA   Pseudo-labeling   FAP (↑)  CAP (↑)  F (↓)
✗     ✗                 33.70    24.57    -4.30
✓     ✗                 34.79    25.62    -4.35
✗     ✓                 34.95    25.65    -3.65
✓     ✓                 38.36    26.64    -8.20
Table S2: Effectiveness of Exponential Moving Average (EMA) and Pseudo-labeling on OAK dataset at annotation cost 25%. The best results are bold-faced.

EMA   Pseudo-labeling   FAP (↑)  CAP (↑)  F (↓)
✗     ✗                 34.68    25.78    -4.15
✓     ✗                 35.74    25.77    -4.82
✗     ✓                 35.61    25.56    -3.76
✓     ✓                 38.61    26.90    -7.29
Table S3: Effectiveness of Exponential Moving Average (EMA) and Pseudo-labeling on OAK dataset at annotation cost 50%. The best results are bold-faced.

A.2.2 Effect of EMA Rates

As mentioned in Section 5.3, we varied α from 0.5 to 0.999 and present the performance in Table S4. We observed that the performance is relatively insensitive to the choice of α. For example, a nearby choice of α leads to a performance change of less than 1% in FAP and CAP and 1.12% in F, compared with our default α = 0.99. However, we did observe a huge performance drop when α is very close to 1.0, where there is almost no synaptic weight transfer between the slow and fast learners.

EMA rate α   0.5     0.9     0.95    0.99    0.995   0.999
FAP (↑)      36.93   37.57   38.25   40.24   40.59   33.02
CAP (↑)      26.70   28.35   28.61   28.18   27.11   15.15
F (↓)        -6.66   -5.54   -6.20   -8.10   -9.70   -5.72
Table S4: Ablation study of varying EMA rates on OAK dataset in fully supervised protocol.

A.2.3 Effect of Pseudo-labeling Threshold

As mentioned in Section 3.2, we apply confidence thresholding to remove predicted bounding boxes that have low confidence scores. To show the effectiveness of thresholding, we varied the confidence threshold τ from 0.1 to 0.9 (see Table S5). We observed that the model using a high threshold (e.g., 0.7) yields satisfactory results, as it produces more reliable pseudo labels with high confidence. On the other hand, using a low threshold results in lower performance, since the model generates too many bounding boxes, which are likely to be false positives.

Threshold τ   0.1     0.3     0.5     0.6     0.7     0.8     0.9
FAP (↑)       30.33   31.16   32.17   32.54   33.92   32.81   32.07
CAP (↑)       21.45   21.67   22.38   22.51   23.04   22.69   22.70
F (↓)         -7.11   -7.29   -6.88   -6.98   -7.71   -6.60   -6.94
Table S5: Ablation study of varying confidence threshold τ at annotation cost 12.5%.

A.2.4 Effect of RPN Loss in Pseudo Training

In Section 3.2, we mentioned that pseudo losses are only applied at the ROI module and not at the RPN module. As shown in Table S6, models trained with and without the RPN loss on pseudo-labeled frames show similar performance. We assume that the RPN module is less likely to suffer catastrophic forgetting, since its primary function is to produce general, class-agnostic proposals. As a result, we removed the RPN loss during pseudo training, which also reduces the overall computational cost.

RPN Loss   FAP (↑)  CAP (↑)  F (↓)
✗          33.92    23.04    -7.71
✓          33.64    22.68    -7.62
Table S6: Performance of our Efficient-CLS with and without RPN loss in pseudo training on OAK dataset at annotation cost 12.5%. The best results are bold-faced.

A.2.5 Effect of Pseudo Loss Weights

As mentioned in Section 3.2, λ is a hyperparameter balancing the importance of the supervised loss (L_sup) and the pseudo loss (L_pseudo). To examine the effect of λ, we varied λ from 0.5 to 4.0 at 12.5% annotation cost on the OAK dataset. As shown in Table S7, the model performs best with λ = 1.0 and shows a moderate performance drop for other values of λ (0.5, 1.5, and 2.0). However, when λ is set to 4.0, the model performance deteriorates.

λ          0.5     1.0     1.5     2.0     4.0
FAP (↑)    33.19   33.92   33.01   32.67   30.08
CAP (↑)    22.92   23.04   22.52   22.02   20.17
F (↓)      -6.55   -7.71   -7.71   -7.86   -7.50
Table S7: Ablation study of varying pseudo loss weights λ at annotation cost 12.5%.

A.2.6 Effect of Data Augmentation in Pseudo Training

As mentioned in Section 3.2, we use data augmentation techniques when training on the pseudo-labeled frames. Here we ablated Efficient-CLS by removing data augmentation in pseudo training. From Table S8, at 12.5% annotation cost, we observed that removing data augmentation in pseudo training leads to a performance drop of 3.60% in FAP, 1.96% in CAP, and 1.21% in F on the OAK dataset. This indicates that applying data augmentation to pseudo-labeled frames encourages the model to learn invariant object representations from these video frames.

Data Augmentation   FAP (↑)  CAP (↑)  F (↓)
✗                   30.32    21.08    -6.50
✓                   33.92    23.04    -7.71
Table S8: Effectiveness of Data Augmentation in Pseudo Training on OAK dataset at annotation cost 12.5%. The best results are bold-faced.

A.3 Analysis of Unlabeled Frames Selection

See caption in Table S9.

Annotation Cost   50%            25%            12.5%          6.25%
FAP (↑)           38.45 (0.68)   38.00 (1.17)   34.29 (0.76)   30.50 (1.07)
CAP (↑)           26.85 (0.24)   26.38 (0.32)   23.47 (0.73)   20.60 (0.43)
F (↓)             -8.01 (0.77)   -8.32 (0.98)   -7.30 (0.82)   -6.28 (0.69)
Table S9: Performance of our Efficient-CLS on OAK dataset in sparse annotation protocol.

The table header denotes the percentage of frames that are labeled in the video stream. We conducted each experiment with 5 runs. Each run has a different random seed. The means and standard deviations in brackets are reported. We find that our Efficient-CLS shows reliable and robust performance against different selections of unlabeled frames in the video stream.

A.4 Analysis of AP Changes over Time

See caption in Figure S1.

Figure S1: The changes of per-category AP50 for sampled categories on OAK dataset in fully supervised protocol. The x-axis denotes the time step across the entire video stream. The y-axis denotes the AP50 of the category at a specific time step (i.e., AP_{c,t}). The grey line indicates the existence of the category. Our method (light blue) consistently outperforms existing approaches with minimal forgetting, even when categories appear infrequently (e.g., sculpture), and exhibits the smallest gap against the upper bound (Offline).

A.5 Analysis of Inference Model

See caption in Figure S2.

Figure S2: The changes of AP50 over time on OAK dataset at different annotation costs. The x-axis denotes the time steps across the entire video stream. The y-axis denotes the AP50 at a specific time step. This plot shows that the slow learner (orange) of our Efficient-CLS is better at preventing forgetting than the fast learner (blue); hence we use the slow learner at the inference stage (Section 5.3).

A.6 Visualization of Pseudo-labeling

See caption in Figure S3.

Figure S3: Visualization of example pseudo labels predicted by our Efficient-CLS and by Naive Pseudo-labeling. The white box with a dashed line denotes the ground truth label. Boxes with solid lines denote the pseudo labels (green ones are correct, while red ones are wrong). Naive Pseudo-labeling has only one learner and uses the pseudo labels generated by itself for training. This plot shows that the pseudo labels generated by our Efficient-CLS (2nd column) capture more ground truth objects and contain fewer false positives than the Naive Pseudo-labeling model (1st column).