Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation

11/20/2015
by Anirudh Goyal, et al.

Integrating higher-level visual and linguistic interpretations is at the heart of human intelligence. While automatic visual category recognition in images is approaching human performance, high-level understanding in the dynamic spatiotemporal domain of videos and its translation into natural language are still far from being solved. Whereas most work on vision-to-text translation uses pre-learned or pre-established computational linguistic models, in this paper we present an approach that uses vision alone to efficiently learn how to translate video content into language. We discover, in simple form, the story played out by the main actors, using only visual cues to represent objects and their interactions. Our method learns, in a hierarchical manner, higher-level representations for recognizing the subjects, actions and objects involved, their relevant contextual background, and their interactions with one another over time. Our approach has three stages: first, we take into account features of the individual entities at the local level of appearance; second, we consider the relationships between these objects and actions and their video background; and third, we use their spatiotemporal relations as inputs to classifiers at the highest level of interpretation. Thus, our approach finds a coherent linguistic description of a video in the form of a subject, verb and object, based on the roles they play in the overall visual story learned directly from training data, without relying on a known language model. We test our approach on a large-scale dataset of YouTube clips taken in the wild and demonstrate state-of-the-art performance, often superior to current approaches that use more complex, pre-learned linguistic knowledge.
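To make the three-stage pipeline concrete, the sketch below shows one way the data could flow from per-entity appearance features, through entity-background context, to pooled spatiotemporal interaction features that feed subject/verb/object classifiers. All function names, feature dimensions, vocabularies, and the random linear classifiers are assumptions introduced for illustration only; the paper's actual features and models are not reproduced here.

```python
# Illustrative sketch of the three-stage pipeline described in the abstract.
# Every name, dimension, and classifier here is a placeholder assumption,
# not the paper's actual implementation.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabularies for the (subject, verb, object) output triplet.
SUBJECTS = ["person", "dog", "car"]
VERBS = ["run", "ride", "play"]
OBJECTS = ["ball", "bike", "road"]

def stage1_local_features(video, n_entities=3, dim=128):
    """Stage 1: per-entity appearance descriptors (placeholder features)."""
    # In the paper these would come from visual recognition of each entity;
    # here random vectors stand in for pooled appearance features.
    return rng.standard_normal((n_entities, dim))

def stage2_context_features(entity_feats, background_feat):
    """Stage 2: relate each entity to the shared video background context."""
    bg = np.tile(background_feat, (entity_feats.shape[0], 1))
    return np.concatenate([entity_feats, bg], axis=1)

def stage3_interaction_features(context_feats):
    """Stage 3: pairwise interactions between entities, pooled per clip."""
    pair_sums = context_feats[:, None, :] + context_feats[None, :, :]
    return pair_sums.mean(axis=(0, 1))  # one descriptor for the whole clip

def classify_svo(interaction_feat, weights):
    """Top-level classifiers map the interaction descriptor to (S, V, O)."""
    s = SUBJECTS[int(np.argmax(weights["subject"] @ interaction_feat))]
    v = VERBS[int(np.argmax(weights["verb"] @ interaction_feat))]
    o = OBJECTS[int(np.argmax(weights["object"] @ interaction_feat))]
    return s, v, o

# Toy usage with a stand-in "video" and untrained (random) classifiers.
video = None  # placeholder for decoded frames
entity_feats = stage1_local_features(video)
background_feat = rng.standard_normal(128)
context_feats = stage2_context_features(entity_feats, background_feat)
interaction_feat = stage3_interaction_features(context_feats)

dim = interaction_feat.shape[0]
weights = {
    "subject": rng.standard_normal((len(SUBJECTS), dim)),
    "verb": rng.standard_normal((len(VERBS), dim)),
    "object": rng.standard_normal((len(OBJECTS), dim)),
}
print(classify_svo(interaction_feat, weights))
```

In this sketch the subject, verb and object classifiers would be trained directly on (interaction feature, caption triplet) pairs from the training videos, which mirrors the abstract's claim that the description is learned from visual data without an external language model.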
