Unsupervised Object-Level Video Summarization with Online Motion Auto-Encoder
Unsupervised video summarization plays an important role in digesting, browsing, and searching the ever-growing volume of online videos. Despite the great progress achieved by prior work (e.g., frame-level video summarization), the underlying fine-grained semantic and motion information (i.e., objects of interest and their key motions) in online videos has barely been touched, although it is more essential and beneficial for many downstream tasks (e.g., object retrieval) in an intelligent system. In this paper, we investigate a pioneering research direction: fine-grained unsupervised object-level video summarization. It is distinguished from existing pipelines in two aspects: extracting key motions of participating objects, and learning to summarize in an unsupervised and online manner, which is more applicable to continuously growing online videos. To achieve this goal, we propose a novel online motion Auto-Encoder (online motion-AE) framework that operates on super-segmented object motion clips. The online motion-AE mimics online dictionary learning, memorizing past states of object motions by continuously updating a tailored recurrent auto-encoder network. This online updating scheme enables differentiable, joint optimization of online feature learning and dictionary learning to discriminate key object-motion clips. Finally, the key object-motion clips are mined using the reconstruction errors produced by the online motion-AE. Comprehensive experiments on a newly collected surveillance dataset and the public Base jumping, SumMe, and TVSum datasets demonstrate the effectiveness of the online motion-AE and the application potential of fine-grained object-level video summarization.
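To make the core idea concrete, below is a minimal PyTorch sketch of a recurrent auto-encoder that is updated online over a stream of object-motion clips and scores each clip by its reconstruction error. This is an illustration only, not the authors' implementation: the network sizes, the GRU encoder/decoder structure, the MSE loss, the Adam optimizer, and the names `MotionAutoEncoder` and `online_update_and_score` are all assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn

class MotionAutoEncoder(nn.Module):
    """Minimal recurrent auto-encoder over per-frame motion features of a clip
    (a hypothetical stand-in for the paper's tailored recurrent auto-encoder)."""
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def forward(self, clip):                      # clip: (B, T, feat_dim)
        _, h = self.encoder(clip)                 # h: (1, B, hidden_dim)
        T = clip.size(1)
        # Feed the clip encoding at every decoding step to reconstruct the sequence.
        dec_in = h.transpose(0, 1).expand(-1, T, -1).contiguous()
        dec_out, _ = self.decoder(dec_in)
        return self.out(dec_out)                  # (B, T, feat_dim)

def online_update_and_score(model, optimizer, clip_stream):
    """Process clips in arrival order: score each clip by reconstruction error
    under the current model, then take one gradient step so the model
    'memorizes' past motion states (an online dictionary-learning analogue)."""
    criterion = nn.MSELoss()
    scores = []
    for clip in clip_stream:                      # clip: (1, T, feat_dim)
        with torch.no_grad():
            scores.append(criterion(model(clip), clip).item())
        # High reconstruction error marks a candidate key (unmemorized) motion.
        optimizer.zero_grad()
        loss = criterion(model(clip), clip)
        loss.backward()
        optimizer.step()
    return scores

# Toy usage with random features standing in for object-motion clip descriptors.
model = MotionAutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clips = [torch.randn(1, 16, 512) for _ in range(8)]
print(online_update_and_score(model, opt, clips))
```

Under this reading, clips the model reconstructs poorly are those it has not yet memorized, so thresholding or ranking the returned scores would select key object-motion clips for the summary.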