SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations

by Tanvir Mahmud, et al.

Despite significant progress in semi-supervised learning for image object detection, several key issues are yet to be addressed for video object detection: (1) Achieving good performance in supervised video object detection depends heavily on the availability of annotated frames. (2) Because inter-frame correlations within a video are large, collecting annotations for a large number of frames per video is expensive, time-consuming, and often redundant. (3) Existing semi-supervised techniques designed for static images can hardly exploit the temporal motion dynamics inherently present in videos. In this paper, we introduce SSVOD, an end-to-end semi-supervised video object detection framework that exploits the motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations. To selectively assemble robust pseudo-labels across groups of frames, we introduce flow-warped predictions from nearby frames for temporal-consistency estimation. In particular, we introduce cross-IoU-based and cross-divergence-based selection methods over a set of estimated predictions to include robust pseudo-labels for bounding boxes and class labels, respectively. To strike a balance between confirmation bias and uncertainty noise in pseudo-labels, we propose a confidence-threshold-based combination of hard and soft pseudo-labels. Our method achieves significant performance improvements over existing methods on the ImageNet-VID, Epic-KITCHENS, and YouTube-VIS datasets. Code and pre-trained models will be released.
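The abstract does not give formulas for the selection rules, so the following is only a rough sketch of how two of the ideas might look in code: keeping a box when it agrees with flow-warped boxes from nearby frames, and switching between hard and soft class pseudo-labels by confidence. The function names, the use of a mean over IoU values, and the threshold values 0.5 and 0.9 are all assumptions made for illustration, not details taken from SSVOD.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Standard intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cross_iou(box: Box, warped_boxes: Sequence[Box]) -> float:
    """Mean IoU between a predicted box and boxes warped from nearby frames.

    Averaging is an assumption; the paper may aggregate differently.
    """
    if not warped_boxes:
        return 0.0
    return sum(iou(box, w) for w in warped_boxes) / len(warped_boxes)


def keep_box(box: Box, warped_boxes: Sequence[Box],
             iou_threshold: float = 0.5) -> bool:
    """Keep a box as a pseudo-label if it is temporally consistent.

    The threshold value is hypothetical.
    """
    return cross_iou(box, warped_boxes) >= iou_threshold


def pseudo_label(probs: Sequence[float],
                 conf_threshold: float = 0.9) -> List[float]:
    """Return a hard (one-hot) label when confident, else the soft distribution.

    The threshold value is hypothetical.
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    if probs[top] >= conf_threshold:
        return [1.0 if i == top else 0.0 for i in range(len(probs))]
    return list(probs)
```

In this sketch, a confident prediction such as `[0.95, 0.05]` becomes a one-hot target, while an uncertain one such as `[0.6, 0.4]` is passed through unchanged as a soft target, which is one simple way to trade confirmation bias against label noise.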


