XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

07/14/2022
by   Ho Kei Cheng, et al.

We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact and thus sustained long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (that do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/XMem
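
The three-store idea can be illustrated with a toy sketch. This is not the XMem implementation: the abstract does not specify how consolidation is performed, so "actively used" is approximated here by a simple read counter, and every name and parameter (ToyMemory, Entry, working_capacity, keep_top) is hypothetical. Integer frame ids stand in for feature keys.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    key: int        # stand-in for a feature key (hypothetical)
    usage: int = 0  # number of times this entry has been read


@dataclass
class ToyMemory:
    working_capacity: int = 4  # assumed limit, not taken from the paper
    keep_top: int = 2          # entries promoted per consolidation (assumed)
    sensory: Optional[int] = None                         # overwritten every frame
    working: List[Entry] = field(default_factory=list)    # detailed, bounded
    long_term: List[Entry] = field(default_factory=list)  # compact, persistent

    def observe(self, frame_id: int) -> None:
        """Update sensory memory and add the frame to working memory."""
        self.sensory = frame_id
        self.working.append(Entry(key=frame_id))
        if len(self.working) > self.working_capacity:
            self.consolidate()

    def read(self, key: int) -> Optional[Entry]:
        """Look up a key in working and long-term memory, counting the access."""
        for entry in self.working + self.long_term:
            if entry.key == key:
                entry.usage += 1
                return entry
        return None

    def consolidate(self) -> None:
        """Promote the most-read older entries to long-term memory.

        The newest entry stays in working memory; older entries that are
        not promoted are dropped, which keeps total memory bounded.
        """
        *older, newest = self.working
        ranked = sorted(older, key=lambda e: e.usage, reverse=True)
        self.long_term.extend(ranked[: self.keep_top])
        self.working = [newest]
```

In this sketch, working memory never grows past working_capacity, and each consolidation adds at most keep_top entries to long-term memory, which is one simple way to decouple memory growth from video length.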


research
07/31/2023

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

Recently, integrating video foundation models and large language models ...
research
05/22/2023

READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation

We present READMem (Robust Embedding Association for a Diverse Memory), ...
research
11/18/2022

LVOS: A Benchmark for Long-term Video Object Segmentation

Existing video object segmentation (VOS) benchmarks focus on short-term ...
research
03/28/2018

Memory Warps for Learning Long-Term Online Video Representations

This paper proposes a novel memory-based online video representation tha...
research
05/15/2019

Automatic Long-Term Deception Detection in Group Interaction Videos

Most work on automated deception detection (ADD) in video has two restri...
research
01/20/2022

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

While today's video recognition systems parse snapshots or short clips a...
research
11/08/2016

Cognitive Discriminative Mappings for Rapid Learning

Humans can learn concepts or recognize items from just a handful of exam...
