The "something something" video database for learning and evaluating visual common sense

06/13/2017
by Raghav Goyal, et al.

Neural networks trained on datasets such as ImageNet have led to major advances in visual object classification. One obstacle that prevents networks from reasoning more deeply about complex scenes and situations, and from integrating visual knowledge with natural language as humans do, is their lack of common sense knowledge about the physical world. Videos, unlike still images, contain a wealth of detailed information about the physical world. However, most labelled video datasets represent high-level concepts rather than detailed physical aspects of actions and scenes. In this work, we describe our ongoing collection of the "something-something" database of video prediction tasks whose solutions require a common sense understanding of the depicted situation. The database currently contains more than 100,000 videos across 174 classes, which are defined as caption templates. We also describe the challenges in crowd-sourcing this data at scale.
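
To make the idea of classes defined as caption templates concrete, here is a minimal Python sketch. It assumes a template is a sentence with "[something]" placeholders, that the template itself serves as the class label, and that a full caption is obtained by filling the placeholders with object names. The two templates, the placeholder syntax, and the function names are hypothetical illustrations and are not taken from the abstract.

```python
import re

# Hypothetical templates for illustration only; the abstract does not list any.
TEMPLATES = [
    "Putting [something] on [something]",
    "Opening [something]",
]

# Assumed placeholder syntax: the literal text "[something]".
PLACEHOLDER = re.compile(r"\[something\]")


def class_index(template):
    """Return the class label (an integer index) for a caption template."""
    return TEMPLATES.index(template)


def fill_template(template, objects):
    """Fill each placeholder, left to right, with the given object names."""
    slots = PLACEHOLDER.findall(template)
    if len(slots) != len(objects):
        raise ValueError(
            f"template has {len(slots)} placeholders, got {len(objects)} objects"
        )
    remaining = iter(objects)
    return PLACEHOLDER.sub(lambda _match: next(remaining), template)


caption = fill_template(TEMPLATES[0], ["a cup", "a table"])
label = class_index(TEMPLATES[0])
```

Under these assumptions, a classifier would predict only the template index, while the filled caption keeps the specific objects that appear in a given video.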


