Bi-Calibration Networks for Weakly-Supervised Video Representation Learning

06/21/2022
by   Fuchen Long, et al.
0

The leverage of large volumes of web videos paired with the searched queries or surrounding texts (e.g., title) offers an economic and extensible alternative to supervised video representation learning. Nevertheless, modeling such weakly visual-textual connection is not trivial due to query polysemy (i.e., many possible meanings for a query) and text isomorphism (i.e., same syntactic structure of different text). In this paper, we introduce a new design of mutual calibration between query and text to boost weakly-supervised video representation learning. Specifically, we present Bi-Calibration Networks (BCN) that novelly couples two calibrations to learn the amendment from text to query and vice versa. Technically, BCN executes clustering on all the titles of the videos searched by an identical query and takes the centroid of each cluster as a text prototype. The query vocabulary is built directly on query words. The video-to-text/video-to-query projections over text prototypes/query vocabulary then start the text-to-query or query-to-text calibration to estimate the amendment to query or text. We also devise a selection scheme to balance the two corrections. Two large-scale web video datasets paired with query and title for each video are newly collected for weakly-supervised video representation learning, which are named as YOVO-3M and YOVO-10M, respectively. The video features of BCN learnt on 3M web videos obtain superior results under linear model protocol on downstream tasks. More remarkably, BCN trained on the larger set of 10M web videos with further fine-tuning leads to 1.6 gains in top-1 accuracy on Kinetics-400, and Something-Something V2 datasets over the state-of-the-art TDN, and ACTION-Net methods with ImageNet pre-training. Source code and datasets are available at <https://github.com/FuchenUSTC/BCN>.

READ FULL TEXT

page 1

page 4

page 7

page 8

page 9

page 12

research
01/16/2020

Learning Spatiotemporal Features via Video and Text Pair Discrimination

Current video representations heavily rely on learning from manually ann...
research
03/22/2023

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Sequential video understanding, as an emerging video understanding task,...
research
05/07/2023

Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization

Weakly-supervised temporal action localization aims to identify and loca...
research
05/01/2023

Boosting Weakly-Supervised Temporal Action Localization with Text Information

Due to the lack of temporal annotation, current Weakly-supervised Tempor...
research
07/29/2020

Learning Video Representations from Textual Web Supervision

Videos found on the Internet are paired with pieces of text, such as tit...
research
02/08/2023

Weakly-supervised Representation Learning for Video Alignment and Analysis

Many tasks in video analysis and understanding boil down to the need for...
research
02/21/2023

HierCat: Hierarchical Query Categorization from Weakly Supervised Data at Facebook Marketplace

Query categorization at customer-to-customer e-commerce platforms like F...

Please sign up or login with your details

Forgot password? Click here to reset