Revisiting 3D ResNets for Video Recognition

09/03/2021
by   Xianzhi Du, et al.

A recent work from Bello et al. shows that training and scaling strategies may matter more than model architecture for visual recognition. This short note studies effective training and scaling strategies for video recognition models. We propose a simple scaling strategy for 3D ResNets, combined with improved training strategies and minor architectural changes. The resulting models, termed 3D ResNet-RS, attain competitive performance of 81.0 on Kinetics-400 and 83.8 on Kinetics-600 without pre-training. When pre-trained on a large Web Video Text dataset, our best model achieves 83.5 and 84.3 on Kinetics-400 and Kinetics-600, respectively. The proposed scaling rule is further evaluated in a self-supervised setup using contrastive learning, where it also improves performance. Code is available at: https://github.com/tensorflow/models/tree/master/official.

research
03/13/2021

Revisiting ResNets: Improved Training and Scaling Strategies

Novel computer vision architectures monopolize the spotlight, but the im...
research
08/13/2020

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation...
research
06/09/2022

PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies

PointNet++ is one of the most influential neural architectures for point...
research
09/14/2022

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

The pre-trained image-text models, like CLIP, have demonstrated the stro...
research
03/23/2022

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Pre-training video transformers on extra large-scale datasets is general...
research
07/07/2023

CLIPMasterPrints: Fooling Contrastive Language-Image Pre-training Using Latent Variable Evolution

Models leveraging both visual and textual data such as Contrastive Langu...
research
06/19/2023

Road Barlow Twins: Redundancy Reduction for Road Environment Descriptors and Motion Prediction

Anticipating the future motion of traffic agents is vital for self-drivi...
