Where were my keys? – Aggregating Spatial-Temporal Instances of Objects for Efficient Retrieval over Long Periods of Time

by Ifrah Idrees, et al.

Robots equipped with situational awareness can help humans efficiently find their lost objects by leveraging spatial and temporal structure. Existing approaches to video and image retrieval do not take into account the unique constraints imposed by a moving camera with a partial view of the environment. We present a Detection-based 3-level hierarchical Association approach, D3A, to create an efficient query-able spatial-temporal representation of unique object instances in an environment. D3A performs online incremental and hierarchical learning to identify keyframes that best represent the unique objects in the environment. These keyframes are learned based on both spatial and temporal features and, once identified, their corresponding spatial-temporal information is organized in a key-value database. D3A allows for a variety of query patterns, such as querying for objects with or without the following: 1) specific attributes, 2) spatial relationships with other objects, and 3) time slices. For a given set of 150 queries, D3A returns a small set of candidate keyframes (which occupy only 0.17% of the total sensory data) in 11.7 ms. This is 47x faster and 33% more accurate than a baseline that naively stores the object matches (detections) in the database without associating spatial-temporal information.
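To make the three query patterns concrete, here is a minimal sketch of a keyframe store that supports filtering by attributes, by proximity to another object, and by time slice. It is an illustration only and not the D3A implementation: the names `Keyframe`, `KeyframeStore`, `query`, `near`, and `max_dist`, the 2D map positions, and the in-memory dictionary standing in for the key-value database are all assumptions made for this example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Keyframe:
    """Hypothetical record for one keyframe of one object instance."""
    frame_id: int
    timestamp: float        # seconds since the start of recording (assumed unit)
    label: str              # detected object class, e.g. "keys"
    attributes: frozenset   # e.g. frozenset({"silver"})
    position: tuple         # (x, y) in an assumed map frame, in metres


class KeyframeStore:
    """In-memory stand-in for a key-value database keyed by object label."""

    def __init__(self):
        self._by_label = {}  # label -> list of Keyframe

    def add(self, kf):
        self._by_label.setdefault(kf.label, []).append(kf)

    def _is_near(self, kf, other_label, max_dist):
        # Simplified spatial relationship: within max_dist of any stored
        # keyframe of the other object, regardless of when it was seen.
        return any(
            math.dist(kf.position, other.position) <= max_dist
            for other in self._by_label.get(other_label, [])
        )

    def query(self, label, attributes=(), time_slice=None, near=None, max_dist=1.0):
        """Return matching keyframes, most recent first."""
        wanted = set(attributes)
        results = []
        for kf in self._by_label.get(label, []):
            if not wanted <= kf.attributes:
                continue  # 1) attribute filter
            if near is not None and not self._is_near(kf, near, max_dist):
                continue  # 2) spatial-relationship filter
            if time_slice is not None:
                start, end = time_slice
                if not (start <= kf.timestamp <= end):
                    continue  # 3) time-slice filter
            results.append(kf)
        return sorted(results, key=lambda k: k.timestamp, reverse=True)


store = KeyframeStore()
store.add(Keyframe(1, 10.0, "keys", frozenset({"silver"}), (0.0, 0.0)))
store.add(Keyframe(2, 50.0, "keys", frozenset({"silver"}), (5.0, 5.0)))
store.add(Keyframe(3, 12.0, "mug", frozenset({"red"}), (0.5, 0.0)))

near_mug = store.query("keys", near="mug")
recent = store.query("keys", time_slice=(40.0, 60.0))
```

Here `near_mug` holds only the keyframe taken beside the mug, and `recent` holds only the keyframe inside the 40 to 60 second window. Keying the store by object label keeps each lookup limited to keyframes of the queried class, which is the kind of saving the abstract attributes to storing a small set of keyframes rather than every detection.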
