Temporal Query Networks for Fine-grained Video Understanding

by Chuhan Zhang, et al.

Our objective in this work is fine-grained classification of actions in untrimmed videos, where the actions may be temporally extended or may span only a few frames of the video. We cast this into a query-response mechanism, where each query addresses a particular question and has its own response label set. We make the following four contributions: (i) We propose a new model, the Temporal Query Network (TQN), which enables the query-response functionality and a structural understanding of fine-grained actions. It attends to relevant segments for each query with a temporal attention mechanism, and can be trained using only the labels for each query. (ii) We propose a new way, stochastic feature bank update, to train a network on videos of various lengths with the dense sampling required to respond to fine-grained queries. (iii) We compare the TQN to other architectures and text supervision methods, and analyze their pros and cons. Finally, (iv) we evaluate the method extensively on the FineGym and Diving48 benchmarks for fine-grained action classification and surpass the state-of-the-art using only RGB features.
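To make the query-response idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a single dot-product attention step per query and a linear classifier per query, and every name in it (`answer_queries`, the `num_turns` and `position` queries, their labels, the toy feature values) is hypothetical and chosen for illustration only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def answer_queries(features, queries, classifiers):
    """Toy query-response step (an assumption, not the paper's architecture).

    features:    list of T per-segment feature vectors, each of dimension D
    queries:     dict mapping query name -> query vector of dimension D
    classifiers: dict mapping query name -> list of (label, weight vector)
                 pairs, so each query has its own response label set
    """
    responses = {}
    for name, q in queries.items():
        scale = math.sqrt(len(q))
        # Temporal attention: one weight per video segment for this query.
        attn = softmax([dot(q, f) / scale for f in features])
        # Attention-weighted pooling of the segment features.
        pooled = [sum(a * f[d] for a, f in zip(attn, features))
                  for d in range(len(q))]
        # Score only the labels that belong to this query.
        scores = {label: dot(w, pooled) for label, w in classifiers[name]}
        responses[name] = (max(scores, key=scores.get), attn)
    return responses

# Hypothetical toy input: three segments with 2-D features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
queries = {"num_turns": [0.0, 4.0], "position": [4.0, 0.0]}
classifiers = {
    "num_turns": [("one", [1.0, 0.0]), ("two", [0.0, 1.0])],
    "position": [("tuck", [1.0, 0.0]), ("pike", [0.0, 1.0])],
}
responses = answer_queries(features, queries, classifiers)
```

In this toy setup each query concentrates its attention on different segments and is scored against its own label set, which mirrors the abstract's description that the model can be trained using only the labels for each query.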



Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

We address the problem of fine-grained action localization from temporal...

Robust Dialogue State Tracking with Weak Supervision and Sparse Data

Generalising dialogue state tracking (DST) to new data is especially cha...

V2CNet: A Deep Learning Framework to Translate Videos to Commands for Robotic Manipulation

We propose V2CNet, a new deep learning framework to automatically transl...

Hand Guided High Resolution Feature Enhancement for Fine-Grained Atomic Action Segmentation within Complex Human Assemblies

Due to the rapid temporal and fine-grained nature of complex human assem...

Action Recognition from Single Timestamp Supervision in Untrimmed Videos

Recognising actions in videos relies on labelled supervision during trai...

Fine-Grained Few-Shot Classification with Feature Map Reconstruction Networks

In this paper we reformulate few-shot classification as a reconstruction...

Sports Video: Fine-Grained Action Detection and Classification of Table Tennis Strokes from Videos for MediaEval 2021

Sports video analysis is a prevalent research topic due to the variety o...
