Exploiting Semantic Contextualization for Interpretation of Human Activity in Videos

08/11/2017
by Sathyanarayanan N. Aakur, et al.

We use large-scale commonsense knowledge bases, e.g. ConceptNet, to provide context cues that establish semantic relationships among entities hypothesized directly from the video signal, such as putative object and action labels, and to infer a deeper interpretation of events than what is directly sensed. One alternative is to learn semantic relationships between objects and actions from training annotations of videos; such an approach depends largely on the statistics of the vocabulary in these annotations. The use of prior encoded commonsense knowledge sources alleviates this dependence on large annotated training datasets. We represent interpretations using a connected structure of basic detected (grounded) concepts, such as objects and actions, that are bound by semantics to other background concepts not directly observed, i.e. contextualization cues. We express this mathematically using the language of Grenander's pattern generator theory: concepts are the basic generators, and the bonds are defined by the semantic relationships between concepts. We formulate an inference engine based on energy minimization, using an efficient Markov chain Monte Carlo (MCMC) procedure that uses ConceptNet in its move proposals to find these structures. Using three publicly available datasets, Breakfast, CMU Kitchen and MSVD, whose distribution of possible interpretations spans more than 150,000 possible solutions for over 5,000 videos, we show that the proposed model can generate video interpretations whose quality is comparable to or better than that reported for discriminative approaches, hidden Markov models, context-free grammars, deep learning models, and prior pattern theory approaches, all of which rely on learning from domain-specific training data.
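
To make the energy-minimization idea concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. The detector confidences, the affinity table standing in for ConceptNet edge weights, and the names `CANDIDATES`, `AFFINITY`, `energy` and `metropolis` are all assumptions made for illustration. The sketch also simplifies in two ways: it uses uniform move proposals rather than ConceptNet-guided ones, and it includes only grounded generators, omitting the ungrounded background concepts the abstract describes.

```python
import itertools
import math
import random

# Hypothetical detector confidences for each grounded slot (not from the paper).
CANDIDATES = {
    "action": {"stir": 0.50, "pour": 0.40, "cut": 0.10},
    "object": {"milk": 0.45, "bowl": 0.30, "knife": 0.25},
}

# Hypothetical semantic affinities standing in for ConceptNet edge weights.
AFFINITY = {
    frozenset({"pour", "milk"}): 0.9,
    frozenset({"stir", "bowl"}): 0.3,
    frozenset({"cut", "knife"}): 0.9,
}


def energy(config):
    """Lower is better: reward confident labels and semantically bonded pairs."""
    labels = list(config.values())
    support = sum(CANDIDATES[slot][label] for slot, label in config.items())
    bonds = sum(AFFINITY.get(frozenset(pair), 0.0)
                for pair in itertools.combinations(labels, 2))
    return -(support + bonds)


def metropolis(steps=2000, temperature=1.0, seed=0):
    """Metropolis sampling with uniform single-slot proposals; returns the best state seen."""
    rng = random.Random(seed)
    # Start from the detector's top-scoring label in every slot.
    current = {slot: max(scores, key=scores.get)
               for slot, scores in CANDIDATES.items()}
    best = dict(current)
    for _ in range(steps):
        # Propose relabeling one randomly chosen slot.
        proposal = dict(current)
        slot = rng.choice(sorted(CANDIDATES))
        proposal[slot] = rng.choice(sorted(CANDIDATES[slot]))
        delta = energy(proposal) - energy(current)
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = proposal
            if energy(current) < energy(best):
                best = dict(current)
    return best


if __name__ == "__main__":
    result = metropolis()
    print(result, round(energy(result), 2))
```

In this toy setup the detector's top labels ("stir", "milk") are not semantically bonded, so the sampler can trade a little detector confidence for a strongly bonded pair. That is the role the abstract assigns to commonsense context.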

research
05/26/2023

Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning

Learning to infer labels in an open world, i.e., in an environment where...
research
11/20/2015

Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation

Integrating higher level visual and linguistic interpretations is at the...
research
05/27/2019

Commonsense Properties from Query Logs and Question Answering Forums

Commonsense knowledge about object properties, human behavior and genera...
research
08/07/2016

Spacetimes with Semantics (III) - The Structure of Functional Knowledge Representation and Artificial Reasoning

Using the previously developed concepts of semantic spacetime, I explore...
research
08/07/2020

A Context-based Disambiguation Model for Sentiment Concepts Using a Bag-of-concepts Approach

With the widespread dissemination of user-generated content on different...
research
06/04/2019

Natural Vocabulary Emerges from Free-Form Annotations

We propose an approach for annotating object classes using free-form tex...
