Boosting Video Representation Learning with Multi-Faceted Integration

01/11/2022
by Zhaofan Qiu, et al.

Video content is multifaceted, consisting of objects, scenes, interactions and actions. Existing datasets mostly label only one of these facets for model training, so the learnt video representation is biased toward a single facet depending on the training dataset. There has been no study yet on how to learn a video representation from multifaceted labels, or on whether multifaceted information is helpful for video representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), which aggregates facets from different datasets to learn a representation that reflects the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps the video representation into a rich semantic embedding space and jointly optimizes the video representation from two perspectives: one capitalizes on the intra-facet supervision between each video and its own label descriptions, while the other predicts the "semantic representation" of each video from the facets of the other datasets as inter-facet supervision. Extensive experiments demonstrate that learning a 3D CNN via our MUFI framework on a union of four large-scale video datasets plus two image datasets leads to a superior video representation. The 3D CNN pre-learnt with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1 recognition and 101.5 …
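The two supervision signals the abstract describes can be sketched as a joint objective: a contrastive intra-facet term that pulls each video embedding toward its own facet's label embedding, plus an inter-facet term that regresses the video embedding toward a "semantic representation" derived from the facets of the other datasets. The sketch below is a minimal illustration under assumed shapes and loss forms (function names, the squared-error inter-facet term, and the weight `alpha` are our assumptions, not the paper's exact formulation):

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Project embeddings onto the unit sphere before comparison."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def intra_facet_loss(video_emb, label_emb, temperature=0.07):
    """Contrastive loss: each video should match its own label description.

    video_emb, label_emb: (batch, dim); row i of label_emb is the text
    embedding of video i's own label (matching pairs on the diagonal).
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(label_emb)
    logits = v @ t.T / temperature
    # numerically stable log-softmax over candidate labels per video
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def inter_facet_loss(video_emb, other_facet_emb):
    """Regress each video embedding toward the semantic representation
    predicted from the facets of the other datasets (assumed squared error)."""
    v = l2_normalize(video_emb)
    s = l2_normalize(other_facet_emb)
    return np.mean(np.sum((v - s) ** 2, axis=1))

def mufi_style_loss(video_emb, label_emb, other_facet_emb, alpha=0.5):
    """Joint objective combining both supervision perspectives;
    alpha is a hypothetical trade-off weight."""
    return (intra_facet_loss(video_emb, label_emb)
            + alpha * inter_facet_loss(video_emb, other_facet_emb))
```

In this toy form, minimizing the first term aligns each video with its own facet's description against the other labels in the batch, while the second term injects supervision from facets the video's source dataset never labeled.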
