Fine-grained Video Classification and Captioning

04/24/2018
by Farzaneh Mahdisoltani, et al.

We describe a DNN for fine-grained action classification and video captioning. It gives state-of-the-art performance on the challenging Something-Something dataset, with over 220,000 videos and 174 fine-grained actions. Classification and captioning on this dataset are challenging because of the subtle differences between actions, the use of thousands of different objects, and the diversity of captions penned by crowd actors. The model architecture shares features for classification and captioning, and is trained end-to-end. It performs much better than the existing classification benchmark for Something-Something, with impressive fine-grained results, and it yields a strong baseline on the new Something-Something captioning task. Our results reveal that there is a strong correlation between the degree of detail in the task and the ability of the learned features to transfer to other tasks.

