Humans have the remarkable ability to learn continuously from the external environment and from inner experience. One of the grand goals of robotics is likewise to build an artificial “lifelong learning” agent that can shape a cultivated understanding of the world from the current scene and its previous knowledge through autonomous lifelong development. It is challenging for a robot to retain earlier knowledge when it encounters new tasks or information. Recent advances in computer vision and deep learning have been very impressive, owing to large-scale datasets such as ImageNet [1] and COCO [10]. However, robotic vision poses unique new challenges for applying visual algorithms developed on these computer vision datasets, because those algorithms implicitly assume a fixed set of categories and time-invariant task distributions [2], whereas semantic concepts change dynamically over time [22, 19, 21]. Thus, sizeable robotic vision datasets collected from real-time changing environments are crucial for accelerating the research and evaluation of robotic vision algorithms. To bridge the gap between robotic vision and stationary computer vision, we utilize a real robot mounted with multiple high-resolution sensors (e.g., monocular/RGB-D from a RealSense D435i, dual fisheye images from a RealSense T265, and LiDAR; see Fig. 1) to actively collect data on real-world objects in several kinds of typical scenarios, such as homes, offices, campuses, and malls.
Lifelong learning approaches can be divided into: 1) methods that retrain the whole network while regularizing the model parameters learned from previous tasks, e.g., Learning without Forgetting (LwF) [9], Elastic Weight Consolidation (EWC) [7] and Synaptic Intelligence (SI) [29]; 2) methods that dynamically expand or adjust the network architecture when learning new tasks, e.g., Context-dependent Gating (XdG) [13] and Dynamic Expandable Network (DEN) [28]; 3) rehearsal approaches, which gather all methods that save raw samples as a memory of past tasks. These samples are used to maintain knowledge about the past in the model and are replayed together with samples drawn from the new task when training the model, e.g., Incremental Classifier and Representation Learning (iCaRL) [16]; and generative replay approaches, which train generative models on the data distribution [27, 25, 26] and can afterwards sample data from past experience when learning new data, e.g., Deep Generative Replay (DGR) [23], DGR with dual memory [5] and feedback [24].
This report summarizes the IROS 2019 Lifelong Robotic Vision Competition (Lifelong Object Recognition challenge), including the dataset, rules, methods and results from the top finalists (out of over teams). Individual reports, dataset information, rules, and released source code can be found at the competition homepage.
II IROS 2019 Lifelong Robotic Vision - Object Recognition Challenge
This challenge aimed to explore how to leverage the knowledge learned from previous tasks so that it generalizes effectively to new tasks, and how to efficiently retain previously learned tasks. The work paved the way for robots to behave like humans in terms of knowledge transfer, association, and combination capabilities.
To the best of our knowledge, the provided lifelong object recognition dataset OpenLORIS-Object-v [20] is the first one that explicitly indicates the task difficulty under the incremental setting, which can foster lifelong/continual/incremental learning in a supervised or semi-supervised manner. Unlike previous instance- or class-incremental tasks, difficulty-incremental learning tests the model’s capability for continuous learning when faced with multiple environmental factors, such as illumination, occlusion, camera-object distances/angles, clutter, and context information, in both low- and high-dynamic scenes.
II-A OpenLORIS-Object Dataset
The IROS 2019 competition provided the version of the OpenLORIS-Object dataset for the participants. Note that our dataset has since been updated to twice the size; the updated content is available at the project homepage, with detailed information, visualization, downloading instructions and benchmarks on SOTA lifelong learning methods [20].
We included the common challenges that the robot is usually faced with, such as illumination, occlusion, camera-object distance, etc. Furthermore, we explicitly decompose these factors from real-life environments and have quantified their difficulty levels. In summary, to better understand which characteristics of robotic data negatively influence the results of the lifelong object recognition, we independently consider: 1) illumination, 2) occlusion, 3) object size, 4) camera-object distance, 5) camera-object angle, and 6) clutter.
Illumination. The illumination can vary significantly over time, e.g., between day and night. We repeat the data collection under weak, normal, and strong lighting conditions. The task becomes challenging when the light is very weak.
Occlusion. Occlusion happens when a part of an object is hidden by other objects, or only a portion of the object is visible in the field of view. Occlusion significantly increases the difficulty for recognition.
Object size. Small objects, such as dry batteries or glue sticks, make the task challenging.
Camera-object distance. It affects the number of pixels that the object actually occupies in the image.
Camera-object angle. The angles between the cameras and objects affect the attributes detected from the object.
Clutter. The presence of other objects in the vicinity of the considered object may interfere with the classification task.
| Level | Illumination | Occlusion (percentage) | Object Pixel Size (pixels) | Clutter | Context | #Classes | #Instances |
The version of OpenLORIS-Object for this competition is a collection of instances, including categories of daily necessities objects under scenes. For each instance, a seconds video (at fps) has been recorded with a depth camera delivering RGB-D frames (with distinguishable object views picked and provided in the dataset). Four environmental factors, each with three difficulty levels, are considered explicitly: illumination variation during recording, occlusion percentage of the objects, object pixel size in each frame, and the clutter of the scene. Note that the variables 3) object size and 4) camera-object distance are combined, because in real-world scenarios it is hard to separate the effects of these two factors on the data collected from mobile robots, but we can roughly identify their joint effect on the actual pixel size of the objects in the frames. The variable 5) is treated as the different recorded views of the objects. The three difficulty levels defined for each factor are shown in Table I (in total, 4 × 3 = 12 levels w.r.t. the environment factors across all instances). The levels are ranked in order of increasing difficulty.
For each instance at each level, we provided samples, each with both RGB and depth images. Thus, the total number of images provided is around (RGB and depth) × (samples per instance) × (instances) × (factors per level) × (difficulty levels) = images. We have also provided bounding boxes and masks for each RGB image, annotated with Labelme [17]. The size of images under the illumination, occlusion and clutter factors is 424 × 240 pixels, and the sizes of images under the object pixel size factor are 424 × 240, 320 × 180 and 1280 × 720 pixels (for the three difficulty levels). Selected samples are shown in Fig. 2.
II-B Challenge Phases and Evaluation Rules
The challenge was held in two phases: a preliminary contest and a final round. In the preliminary contest, we provided batches of datasets that contain different factors and difficulty levels; for each batch, we have train/validation/test data splits. The core of this incremental learning setting is that the model is first trained on the first batch of the dataset, then on each following batch in turn until the last batch, and the final model is then used to obtain the test accuracy on all encountered tasks (batches). The training/validation datasets can only be accessed during model optimization. We hosted the evaluation platform on Codalab. There were over participants during the preliminary contest, and we chose the teams with the highest testing accuracies over all testing batches as our finalists.
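The sequential protocol described above can be sketched as follows. This is a minimal illustration, not the organizers' evaluation code: the toy `NearestMeanClassifier`, the function `run_protocol`, and the dictionary layout of a batch are all assumptions introduced here, and scalar features stand in for images.

```python
from collections import defaultdict


class NearestMeanClassifier:
    """Toy incremental learner (hypothetical): one running mean per class."""

    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def fit_batch(self, samples):
        # Update the per-class running sums with the new batch only.
        for x, y in samples:
            self.sums[y] += x
            self.counts[y] += 1

    def predict(self, x):
        # Return the class whose mean is closest to x.
        return min(self.counts, key=lambda c: abs(x - self.sums[c] / self.counts[c]))


def run_protocol(model, batches):
    """Train on every batch in order, then test the final model on all batches."""
    for batch in batches:
        model.fit_batch(batch["train"])
    accuracies = []
    for batch in batches:
        correct = sum(model.predict(x) == y for x, y in batch["test"])
        accuracies.append(correct / len(batch["test"]))
    return accuracies, sum(accuracies) / len(accuracies)
```

The key point is that the test accuracy is measured only once, with the final model, so any forgetting of early batches lowers the overall score.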
For the final round, unlike standard computer vision challenges [1, 10], we evaluated not only the overall accuracy on all tasks but also the model efficiency, including model size, memory cost, and replay size (the number of old-task samples used for learning new tasks; smaller is better). Meanwhile, instead of directly asking the participants to submit prediction results on the test dataset as in standard deep learning challenges [1, 10], the organizers received either source code or binaries and evaluated each team’s whole lifelong learning process to make a fair comparison. The finalists’ methods were tested by the organizers on an Intel Core i9 CPU and 1 × Nvidia GTX 1080 Ti GPU. For the final-round dataset, we randomly shuffled the dataset with multiple factors. The data is split into batches/tasks, and the samples of each batch/task come from one subdirectory (there are subdirectories in total, factors level/factor). Each batch includes 69 instances from scenes, with about test samples, validation samples and training samples. The metrics and corresponding grading weights are shown in Table II. As can be seen, we also provided a bonus test set, which was recorded under different context backgrounds with some deformation. Adapting to this bonus testing data is a challenging task.
| Metric | Accuracy | Model Size | Inference Time | Replay Size | Oral Presentation | Accuracy on Bonus Dataset |
II-C Challenge Results
From more than registered participants, teams entered the final phase and submitted results, code, posters, slides and abstract papers (available here). Table III reports the details of all metrics (except oral presentation) for each team.
Architectures and main ideas: All the proposed methods use end-to-end deep learning models and employ GPU(s) for training. For lifelong learning strategies: teams applied regularization methods, teams utilized knowledge distillation methods and team used a network expansion method. teams applied a resampling mechanism to alleviate catastrophic forgetting. Meanwhile, other computer vision methods, including saliency maps, Single Shot multi-box Detection (SSD) and data augmentation, are also utilized in their solutions.
III Challenge Methods and Teams
III-A HIK_ILG Team
The team developed a dynamic neural network comprising two parts: dynamic network expansion for data across dissimilar domains and knowledge distillation for data in similar domains (see Figure 3). They froze the shared convolutional layers and trained new heads for new tasks. The domain gap was determined by measuring the accuracy of the previous model on the current task before training on it. To increase the generalization ability of the trained model, they used an ImageNet pre-trained model for the shared convolutional layers, and used more data augmentation and more batches to train head1 for the base model. Without using previous data, they discovered known instances in the current task by a single forward pass through the previous model: the samples that were correctly classified were treated as known samples and used for knowledge distillation. For distillation they utilized the best head among the multiple heads, a choice verified by experimental results.
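The decision between expansion and distillation can be illustrated with a short sketch. It assumes that the domain gap is turned into a decision by comparing the previous model's accuracy with a fixed threshold; the threshold value, the function names and the string labels are hypothetical, since the text does not state how the team mapped accuracy to a decision.

```python
def split_known_samples(samples, previous_predict):
    """One forward pass of the previous model over the current task's data.

    Samples that the previous model classifies correctly are treated as known.
    """
    known, unknown = [], []
    for x, y in samples:
        (known if previous_predict(x) == y else unknown).append((x, y))
    return known, unknown


def choose_strategy(samples, previous_predict, gap_threshold=0.5):
    """Pick a strategy from the previous model's accuracy on the new task.

    gap_threshold is an assumed value, not one reported by the team.
    """
    known, _ = split_known_samples(samples, previous_predict)
    accuracy = len(known) / len(samples)
    if accuracy < gap_threshold:
        # Large domain gap: add a new head on the frozen shared layers.
        return "expand", accuracy, known
    # Small domain gap: distill, using the known samples.
    return "distill", accuracy, known
```

For example, a previous model that still classifies three of four new samples correctly would lead to distillation on those three samples, while one that gets a single sample right would trigger a new head.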
III-B Unibo Team
The team proposed a new continual learning approach based on latent rehearsal, namely the replay of latent neural network activations instead of raw images at the input level. The algorithm can be deployed on the edge with low latency. By latent rehearsal (see Figure 4) they denoted an approach where, instead of maintaining copies of input patterns in the external memory in the form of raw data, they stored the pattern activations at a given layer (denoted as the latent rehearsal layer). The algorithm can be summarized as follows: 1) take patterns from the current batch; 2) forward them through the network until the rehearsal layer; 3) select patterns from the rehearsal memory; 4) concatenate the original and the replay patterns; 5) forward all the patterns through the rest of the network; 6) backpropagate the loss only until the rehearsal layer.
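The six steps can be sketched on a toy model. This is only an illustration under strong assumptions: the frozen lower layers are replaced by a fixed two-dimensional feature map `front`, the upper layers by a single logistic unit `Head`, and the memory policy simply keeps every latent activation. None of these names or choices come from the team's implementation.

```python
import math
import random


def front(x):
    """Frozen layers below the rehearsal layer (hypothetical fixed feature map)."""
    return [x[0] + x[1], x[0] - x[1]]


class Head:
    """Trainable layers above the rehearsal layer: one logistic unit."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def forward(self, h):
        z = sum(wi * hi for wi, hi in zip(self.w, h)) + self.b
        return 1.0 / (1.0 + math.exp(-z))

    def sgd_step(self, h, y, lr):
        # Log-loss gradient; it stops at h, so front() is never updated.
        err = self.forward(h) - y
        self.w = [wi - lr * err * hi for wi, hi in zip(self.w, h)]
        self.b -= lr * err


def train_batch(head, memory, batch, n_replay, rng, lr=0.1, epochs=20):
    """Run one training batch with latent rehearsal; returns the replay count."""
    latents = [(front(x), y) for x, y in batch]              # steps 1-2
    replay = rng.sample(memory, min(n_replay, len(memory)))  # step 3
    mixed = latents + replay                                 # step 4
    for _ in range(epochs):
        for h, y in mixed:                                   # steps 5-6
            head.sgd_step(h, y, lr)
    memory.extend(latents)  # simplification: store all latent activations
    return len(replay)
```

Because the memory holds activations rather than raw inputs, replayed patterns skip the lower layers entirely, which is where the savings in computation and storage come from.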
Specifically, they applied the AR1*, AR1*free and LwF continual learning approaches on top of MobileNet-v1 and MobileNet-v2 [18, 14, 12, 4]. Meanwhile, they opted for simplicity and used the trivial rehearsal approach summarized in Algorithm 1 for memory management.
III-C Guinness Team
The core backend of the approach was Learning without Forgetting (LwF) [8]. Figure 5 illustrates its training strategy. They deployed a pretrained MobileNet-v2 [18], in which the weights up to the bottleneck were retained (and fine-tuned during training), and they trained the bottleneck weights from scratch. Based on LwF, they kept the weights trained on previous tasks to construct the regularization term for training the new weights. It should be noted that there was no replay of previous task images in this structure, and only the updated weights were retained after training. Empirically, they loaded the initial pretrained weights when processing a new task, and these weights were then fine-tuned during training. Details of the training scheme are included in Algorithm 2.
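As a rough illustration of an LwF-style objective, the sketch below adds a distillation term, which pulls the new model's outputs toward those of the model kept from previous tasks, to the usual cross-entropy on the new labels. It assumes a single shared output layer, a weighting factor `lam` and a softening `temperature`; these are illustrative choices, not the values or the exact loss used by the team.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def lwf_loss(new_logits, label, old_logits, lam=1.0, temperature=2.0):
    """New-task cross-entropy plus a distillation term toward the old model."""
    one_hot = [1.0 if i == label else 0.0 for i in range(len(new_logits))]
    task_term = cross_entropy(one_hot, softmax(new_logits))
    distill_term = cross_entropy(
        softmax(old_logits, temperature), softmax(new_logits, temperature)
    )
    return task_term + lam * distill_term
```

The old model's outputs are computed on the new task's images, so no images from previous tasks need to be stored, which matches the no-replay property noted above.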
III-D Neverforget Team
The approach was based on Elastic Weight Consolidation (EWC) [6]. As shown in Figure 6, a darker area means a smaller loss, i.e., a better solution to the task. First, the parameters of the model are initialized and then fine-tuned for Task A. If the model then continues to learn Task B and is simply fine-tuned on it, the loss on Task A becomes much larger and the model suffers from the forgetting problem. Instead, the Fisher Information Matrix is utilized to measure the importance of each parameter. If a parameter is important for the previous task, its adjustment is constrained and stays relatively small; if it is less important for the previous task, there is more room to adjust it. Suppose the importance (the second derivative of the log-likelihood function) of one parameter is larger than that of another; then, in Task B, the network adjusts more along the direction of the less important parameter. Thus, the model gains knowledge of Task B while preserving the knowledge of Task A. ResNet-101 [3] was used as the backbone network. The tasks were trained sequentially on the training set.
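The effect of the importance weights can be seen in a two-parameter toy example. The sketch assumes a diagonal Fisher approximation given as a list `fisher`, a quadratic stand-in for the Task B loss, and hypothetical values for `lam`, the learning rate and the number of steps; estimating the Fisher information itself is not shown.

```python
def ewc_penalty(theta, theta_a, fisher, lam):
    """Quadratic penalty that anchors theta to the Task A solution theta_a."""
    return 0.5 * lam * sum(f * (t - a) ** 2 for f, t, a in zip(fisher, theta, theta_a))


def train_task_b(theta_a, theta_b, fisher, lam=1.0, lr=0.05, steps=200):
    """Gradient descent on a toy Task B loss plus the EWC penalty.

    Toy assumption: the Task B loss is 0.5 * ||theta - theta_b||^2,
    so its gradient is simply (theta - theta_b).
    """
    theta = list(theta_a)  # start from the Task A solution
    for _ in range(steps):
        theta = [
            t - lr * ((t - b) + lam * f * (t - a))
            for t, b, f, a in zip(theta, theta_b, fisher, theta_a)
        ]
    return theta
```

With `theta_a = [0, 0]`, `theta_b = [1, 1]` and `fisher = [10, 0.1]`, the first parameter stays close to its Task A value while the second moves almost all the way to the Task B optimum, which is the behaviour described in the paragraph above.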
III-E SDU_BFA_PKU Team
The approach split the problem into two parts: a background removal problem (see Figure 8) and a classification problem.
First, they utilized a saliency detection method to remove the background noise. A cascaded partial decoder framework containing two branches is applied to obtain the image saliency map. In each branch, they used a fast and effective partial decoder. The first branch generates an initial saliency map, which is utilized to refine the features of the second branch. For the classification problem, they utilized knowledge distillation to prevent catastrophic forgetting. They used an auto-encoder as the teacher translator and an encoder as the student translator, which has the same architecture as the encoder of the teacher translator. The model aims to project the saliency maps from the teacher network and the student network into the same space. Specifically, for each task, they regarded the model from the previous task as the teacher network and the model for the current task as the student network. To extract the factor from the teacher network, they trained the teacher translator in an unsupervised way by assigning the reconstruction loss at the beginning of every task’s training process. Then they utilized the student translator to translate the student network’s saliency map output, and computed the loss between the teacher network output and the student network output. To save computation and storage, they used MobileNet-v2 as the backbone model [18].
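The background removal step can be pictured with a very small sketch. It assumes that a saliency map with scores in [0, 1] is already available (in the team's pipeline it comes from the cascaded partial decoder, which is not reproduced here) and that background pixels are suppressed by a fixed threshold; the function name and the threshold are hypothetical.

```python
def remove_background(image, saliency, threshold=0.5):
    """Zero out pixels whose saliency score is below the threshold.

    image and saliency are row-major lists of rows with the same shape.
    """
    return [
        [pixel if score >= threshold else 0 for pixel, score in zip(img_row, sal_row)]
        for img_row, sal_row in zip(image, saliency)
    ]
```

The masked image, rather than the raw frame, would then be passed to the classifier, so that background clutter contributes less to the learned features.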
III-F Vidit98 Team
This approach sampled validation data from the buffer and used it as replay data, intelligently creating the replay memory for each task. Suppose a network is trained on one task and learns a feature representation of the images in that task. When it is then trained on the next task, it learns the feature representation for the images in the new task, but because the data distribution of the new task is different, accuracy drops for images in the earlier task. The replay memory was an efficient representation of the previous tasks’ data whose information was lost. The replay data was sampled from the validation sets of all the previous tasks. The network is trained on a task, and its accuracy on each batch of that task’s validation data is saved. Next, after training on a later task, the accuracy on the same validation batches of the earlier task is calculated again. They then stored the top batches from the validation data of the earlier task whose accuracy had dropped the most. This is done for all previous tasks. When training on a new task, they combined the replay data and the training data of that task. The procedure is given in Algorithm 3. The backbone model they used is MobileNet-v2 [18]. Code is made available.
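The selection rule for the replay memory can be sketched in a few lines. The function name and the parameter `k` (how many batches to keep) are assumptions for illustration; the per-batch accuracies are taken as given.

```python
def select_replay_batches(acc_before, acc_after, k):
    """Return the indices of the k validation batches with the largest accuracy drop.

    acc_before[i]: accuracy on batch i right after training on its own task.
    acc_after[i]:  accuracy on the same batch after training on a later task.
    Ties are broken by the lower batch index.
    """
    drops = [
        (before - after, idx)
        for idx, (before, after) in enumerate(zip(acc_before, acc_after))
    ]
    drops.sort(key=lambda d: (-d[0], d[1]))
    return [idx for _, idx in drops[:k]]
```

For instance, with accuracies `[0.9, 0.8, 0.9, 0.7]` before and `[0.8, 0.3, 0.9, 0.5]` after, keeping two batches selects batches 1 and 3, the ones the network has forgotten the most.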
III-G HYDRA-DI-ETRI Team
The team proposed a selective feature learning method to eliminate irrelevant objects in target images. A Single Shot multibox Detection (SSD) algorithm selected the desired objects [11]. The SSD algorithm alleviated the performance degradation caused by noisy objects. The SSD weights were trained with annotated images in task , and the refined dataset was fed into a traditional MobileNet [4].
The team also analyzed the OpenLORIS-Object dataset to design the object recognition software (see Figure 10), and found that target objects in the dataset coexist with unlabeled objects. The region-of-interest analysis is illustrated in Figure 9. Therefore, they proposed a selective feature learning method that eliminates irrelevant features in the training dataset. The selective learning procedure is as follows: 1) extract target objects from the training dataset with an object detection algorithm; 2) feed the refined dataset into a deep neural network to predict labels. In their software, they adopted SSD as the object detection algorithm because of its flexible feature network design and adequate detection performance.
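The first step of the procedure, refining the training set with a detector, can be sketched as below. The detector here is a deliberately trivial stand-in (the bounding box of the non-zero pixels) rather than SSD, and all function names are hypothetical; the classifier of the second step is left out.

```python
def nonzero_bounding_box(image):
    """Stand-in detector (not SSD): bounding box of all non-zero pixels.

    Returns (top, left, bottom, right), end-exclusive, or None if nothing is found.
    """
    coords = [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v != 0]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def crop(image, box):
    """Crop a row-major image (list of rows) to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def refine_dataset(samples, detect):
    """Step 1: keep only the detected target region of every training image."""
    refined = []
    for image, label in samples:
        box = detect(image)
        if box is not None:
            refined.append((crop(image, box), label))
    return refined
```

The refined samples would then be fed to the classification network in step 2, so that unlabeled objects around the target no longer influence training.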
III-H NTU_LL Team
IV Finalists Information
Title: Dynamic Neural Network for Incremental Learning
Members: Liang Ma, Jianwen Wu, Qiaoyong Zhong, Di Xie and Shiliang Pu
Affiliation: Hikvision Research Institute, Hangzhou, China.
Title: Efficient Continual Learning with Latent Rehearsal
Members: Gabriele Graffieti, Lorenzo Pellegrini, Vincenzo Lomonaco and Davide Maltoni
Affiliation: University of Bologna, Bologna, Italy.
Title: Learning Without Forgetting Approaches for Lifelong Robotic Vision
Members: Zhengwei Wang, Eoin Brophy and Tomás E. Ward
Affiliation: Zhengwei Wang is with V-SENSE, School of Computer Science and Statistics, Trinity College Dublin, Dublin, Ireland; Eoin Brophy and Tomás E. Ward are with the Insight Centre for Data Analytics, School of Computing, Dublin City University, Dublin, Ireland.
Title: A Small Step to Remember: Study of Single Model VS Dynamic Model
Members: Liguang Zhou
Affiliation: The Chinese University of Hong Kong (Shenzhen), Shenzhen, China; Shenzhen Institute of Artificial Intelligence and Robotics for Society, China.
Title: SDKD: Saliency Detection with Knowledge Distillation
Members: Lin Yang
Affiliation: Peking University, Beijing, China, Shandong University, Qingdao, China, Beijing Film Academy, Beijing, China.
Title: Intelligent Replay Sampling for Lifelong Object Recognition
Members: Vidit Goel, Debdoot Sheet and Somesh Kumar
Affiliation: Indian Institute of Technology, Kharagpur, India.
Title: Selective Feature Learning with Filtering Out Noisy Objects in Background Images
Members: Soonyong Song, Heechul Bae, Hyonyoung Han and Youngsung Son
Affiliation: Electronics and Telecommunications Research Institute (ETRI), Korea.
Title: Lifelong Learning with Regularization and Data Augmentation
Members: Duvindu Piyasena, Sathursan Kanagarajah, Siew-Kei Lam and Meiqing Wu
Affiliation: Nanyang Technological University, Singapore.
The work was partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 11215618). The authors would like to thank Hong Pong Ho from Intel RealSense Team for the technical support of RealSense cameras for recording the high-quality RGB-D data sequences.
[1] (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
[2] (2020) Challenges in task incremental learning for assistive robotics. IEEE Access 8, pp. 3434–3441.
[3] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[4] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[5] (2017) Deep generative dual memory network for continual learning. arXiv preprint arXiv:1710.10368.
[6] (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS), pp. 3521–3526.
[7] (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (PNAS), pp. 3521–3526.
[8] (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1.
[9] (2017) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
[10] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755.
[11] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), pp. 21–37.
[12] (2019) Continuous learning in single-incremental-task scenarios. Neural Networks 116, pp. 56–73.
[13] (2018) Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proceedings of the National Academy of Sciences (PNAS) 115 (44), pp. 467–475.
[14] (2019) ContCap: a comprehensive framework for continual image captioning. arXiv preprint arXiv:1909.08745.
[15] (2019) Latent replay for real-time continual learning. arXiv preprint arXiv:1912.01100v2.
[16] (2017) iCaRL: incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2001–2010.
[17] (2008) LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision 77 (1-3), pp. 157–173.
[18] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520.
[19] (2018) Stochastic dynamical systems based latent structure discovery in high-dimensional time series. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 886–890.
[20] (2019) OpenLORIS-Object: a robotic vision dataset and benchmark for lifelong deep learning.
[21] (2018) Reduced-rank linear dynamical systems. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI).
[22] (2019) Neural dynamics discovery via Gaussian process recurrent neural networks. arXiv preprint arXiv:1907.00650.
[23] (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems (NIPS), pp. 2990–2999.
[24] (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635.
[25] (2019) Neuroscore: a brain-inspired evaluation metric for generative adversarial networks. arXiv preprint arXiv:1905.04243.
[26] (2020) A neuro-AI interface for evaluating generative adversarial networks. arXiv preprint arXiv:2003.03193.
[27] (2019) Generative adversarial networks: a survey and taxonomy. arXiv preprint arXiv:1906.01529.
[28] (2017) Lifelong learning with dynamically expandable networks. arXiv preprint arXiv:1708.01547.
[29] (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning (ICML), Vol. 70, pp. 3987–3995.