Existing matching-based approaches perform video object segmentation (VO...
Vision-language navigation is the task of directing an embodied agent to...
Since context modeling is critical for estimating depth from a single im...
3D visual grounding aims to locate the referred target object in 3D poin...
Very recently, a variety of vision transformer architectures for dense
p...
In this work we present SwiftNet for real-time semi-supervised video obj...