PyTorchVideo: A Deep Learning Library for Video Understanding

11/18/2021 ∙ by Haoqi Fan, et al.

We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at







1. Introduction

Recording, storing, and viewing videos has become an ordinary part of our lives; in 2022, video traffic will amount to 82 percent of all Internet traffic (cisco_christoph). With the increasing amount of video material readily available (e.g. on the web), it is now more important than ever to develop ML frameworks for video understanding.

With the rise of deep learning, significant progress has been made in video understanding research, with novel neural network architectures, better training recipes, advanced data augmentation, and model acceleration techniques. However, the sheer amount of data that video brings often makes these tasks computationally demanding; therefore efficient solutions are non-trivial to implement.

To date, there exist several popular video understanding frameworks, which provide implementations of advanced state-of-the-art video models, including PySlowFast (fan2020pyslowfast), MMAction (mmaction2019), MMAction2 (2020mmaction2), and Gluon-CV (gluoncvnlp2020). However, unlike a modularized library that can be imported into different projects, all of these frameworks are designed around training workflows, which limit their adoption beyond applications tailored to one specific codebase.

More specifically, we see the following limitations in prior efforts. First, reproducibility – an important requirement for deep learning software – varies across frameworks; e.g. identical models are reproduced with varying accuracy on different frameworks (fan2020pyslowfast; mmaction2019; 2020mmaction2; gluoncvnlp2020). Second, regarding input modalities, the frameworks are mainly focused on visual-only data streams. Third, supported tasks only encompass human action classification and detection. Fourth, none of the existing codebases support on-device acceleration for real-time inference on mobile hardware.

We believe that a modular, component-focused video understanding library that addresses the aforementioned limitations will strongly support the video research community. Our intention is to develop a library that aims to provide fast and easily extensible components to benefit researchers and practitioners in academia and industry.

We present PyTorchVideo – an efficient, modular and reproducible deep learning library for video understanding which supports the following (see Fig. 1 for an overview):

  • a modular design with extendable interface for video modeling using Python

  • a full stack of video understanding machine learning components from established datasets to state-of-the-art models

  • real-time video classification through hardware accelerated on-device support

  • multiple tasks, including human action classification and detection, self-supervised learning, and low-level vision tasks

  • reproducible models and datasets, benchmarked in a comprehensive model zoo

  • multiple input modalities, including visual, audio, optical-flow and IMU data

PyTorchVideo is distributed under an Apache 2.0 license, and is available on GitHub at

2. Library Design

Our library follows four design principles, outlined next (§2.1-2.4).

2.1. Modularity

PyTorchVideo is built to be component centric: it provides independent components that are plug-and-play and ready to mix-and-match for any research or production use case. We achieve this by designing models, datasets, and data transformations (transforms) independently, enforcing consistency only through general argument naming guidelines. For example, in the pytorchvideo.data module all datasets provide a data_path argument, and, in the pytorchvideo.models module, any reference to input dimensions uses the name dim_in. This form of duck-typing provides flexibility and straightforward extensibility for new use cases.
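The naming-guideline convention can be pictured with a small sketch. The classes below are illustrative stand-ins, not actual PyTorchVideo components: consistency comes from shared argument names (data_path, dim_in) rather than from a common base class.

```python
class ToyFrameDataset:
    """Illustrative dataset: per the guideline, it exposes `data_path`."""
    def __init__(self, data_path):
        self.data_path = data_path


class ToyLinearHead:
    """Illustrative model head: input dimensions are always named `dim_in`."""
    def __init__(self, dim_in, num_classes):
        self.dim_in = dim_in
        self.num_classes = num_classes


# Components mix-and-match because callers can rely on the shared names.
dataset = ToyFrameDataset(data_path="/datasets/kinetics")
head = ToyLinearHead(dim_in=2048, num_classes=400)
```

Because there is no mandatory base class, a new dataset or model only needs to follow the same argument names to interoperate with the rest of the library.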

2.2. Compatibility

PyTorchVideo is designed to be compatible with other frameworks and domain specific libraries. In contrast to existing video frameworks (fan2020pyslowfast; mmaction2019; 2020mmaction2; gluoncvnlp2020), PyTorchVideo does not rely on a configuration system. To maximize the compatibility with Python based frameworks that can have arbitrary config-systems, PyTorchVideo uses keyword arguments in Python as a “naive configuration” system.
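The "naive configuration" idea reduces any external config system to dict unpacking over keyword arguments. The factory below is a hypothetical stand-in with made-up parameters, not the real create_slowfast signature:

```python
def create_toy_model(dim_in=3, num_classes=400, dropout_rate=0.5):
    # Stand-in factory: every option is a plain keyword argument with a
    # sensible default, so no config framework is required.
    return {"dim_in": dim_in, "num_classes": num_classes,
            "dropout_rate": dropout_rate}


# Any framework-specific config (YAML, argparse, Hydra, ...) parsed into a
# dict plugs straight into the factory via ** unpacking.
config = {"num_classes": 174, "dropout_rate": 0.2}
model = create_toy_model(**config)
```

Unspecified options fall back to the factory defaults, so partial configs from any framework remain valid.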

PyTorchVideo is designed to be interoperable with other standard domain specific libraries by setting canonical modality based tensor types. For videos, we expect a tensor of shape C × T × H × W, where T × H × W are the spatiotemporal dimensions and C is the number of color channels, allowing any TorchVision model or transform to be used together with PyTorchVideo. For raw audio waveforms, we expect a tensor of shape T, where T is the temporal dimension, and for spectrograms, we expect a tensor of shape T × F, where T is time and F is frequency, aligning with TorchAudio.
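The modality conventions can be expressed as simple shape checks. These helpers are illustrative only, not library API:

```python
# Canonical tensor shapes per modality (illustrative validators):
#   video:       (C, T, H, W)  channels, time, height, width
#   waveform:    (T,)          audio samples over time
#   spectrogram: (T, F)        time by frequency


def is_video_shape(shape):
    return len(shape) == 4


def is_waveform_shape(shape):
    return len(shape) == 1


def is_spectrogram_shape(shape):
    return len(shape) == 2


video_shape = (3, 8, 224, 224)  # e.g. an 8-frame RGB clip at 224x224
assert is_video_shape(video_shape)
```

Fixing these conventions is what lets TorchVision transforms consume PyTorchVideo clips, and TorchAudio features feed PyTorchVideo acoustic models, without adapter code.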

2.3. Customizability

    # Create a customized SlowFast network.
    # CustomizedNorm, CustomizedActivation, and
    # create_customized_head are user-defined classes
    # and functions injected into the model.
    customized_slowfast = create_slowfast(
        norm=CustomizedNorm,
        activation=CustomizedActivation,
        head=create_customized_head,
    )
Algorithm 1 Code for a SlowFast network with customized norm and activation layer classes, and a custom head function.

One of PyTorchVideo’s primary use cases is supporting the latest research methods; we want researchers to easily contribute their work without requiring refactoring and architecture modifications. To achieve this, we designed the library to reduce the overhead of adding new components or sub-modules. Notably, in the pytorchvideo.models module, we use a dependency-injection inspired API. We have a composable interface, which contains injectable skeleton classes, and a factory function interface that builds reproducible implementations from the composable classes. We anticipate this injectable class design to be useful for researchers who want to easily plug new sub-components (e.g. a new type of convolution) into the structure of larger models such as a ResNet (He2016) or SlowFast (Feichtenhofer2019). The factory functions are more suitable for reproducible benchmarking of complete models, or for usage in production. An example for a customized SlowFast network is in Algorithm 1.
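The two-level API style described above can be sketched in miniature. The names and the arithmetic "layers" below are illustrative, not PyTorchVideo code: a skeleton class receives its sub-components by injection, and a factory fixes a reproducible default composition.

```python
class ToyBlock:
    """Injectable skeleton: sub-components are passed in, so swapping,
    say, a new convolution type needs no change to this class."""
    def __init__(self, conv, norm, activation):
        self.conv = conv
        self.norm = norm
        self.activation = activation

    def __call__(self, x):
        return self.activation(self.norm(self.conv(x)))


def create_toy_block(scale=2):
    """Factory interface: assembles a reproducible default composition."""
    return ToyBlock(
        conv=lambda x: x * scale,   # stand-in for a conv layer
        norm=lambda x: x - 1,       # stand-in for a norm layer
        activation=lambda x: max(x, 0),  # ReLU-like clamp
    )


block = create_toy_block()
result = block(3)  # (3 * 2 - 1), clamped at 0
```

Researchers customize by instantiating the skeleton with their own sub-components; benchmarks and production use the factory for a fixed, reproducible model.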

2.4. Reproducibility

PyTorchVideo maintains reproducible implementations of all models and datasets. Each component is benchmarked against the performance reported in the respective original publication. We report performance and release model files online as well as on PyTorch Hub. We rely on test coverage and recurring benchmark jobs to verify and monitor performance and to detect potential regressions introduced by codebase updates.

3. Library Components

PyTorchVideo allows training of state-of-the-art models on multi-modal input data, and deployment of an accelerated real-time model on mobile devices. Example components are shown in Algorithm 2.

    import torch
    from pytorchvideo import data, models, accelerator
    # Create visual and acoustic models.
    visual_model = models.slowfast.create_slowfast(
        model_num_class=400)
    acoustic_model = models.resnet.create_acoustic_resnet(
        model_num_class=400)
    # Create Kinetics dataloader.
    kinetics_loader = torch.utils.data.DataLoader(
        data.Kinetics(
            data_path=...,
            clip_sampler=data.make_clip_sampler(
                "random", 2.0)))
    # Deploy model.
    visual_net_inst_deploy = accelerator.deployment.\
        convert_to_deployable_form(
            visual_model, input_tensor)
Algorithm 2 Code snippet to train, run inference, and deploy a state-of-the-art video model with PyTorchVideo.

3.1. Data

Video contains rich information streams from various sources, and, in comparison to image understanding, video is more computationally demanding. PyTorchVideo provides a modular and efficient data loader to decode visual, motion (optical-flow), acoustic, and Inertial Measurement Unit (IMU) information from raw video.

PyTorchVideo supports a growing list of data loaders for various popular video datasets and tasks: video classification for UCF-101 (Soomro2012), HMDB-51 (Kuehne2011), Kinetics (Kay2017), Charades (Sigurdsson2016), and Something-Something (ssv2); egocentric tasks for Epic Kitchens (Damen2018EPICKITCHENS) and DomSev (Silva2018); as well as video detection in AVA (Gu2018).

All data loaders support several file formats and are data storage agnostic. For encoded video datasets (e.g. videos stored in mp4 files), we provide PyAV, TorchVision, and Decord decoders. For long videos – where decoding is an overhead – PyTorchVideo provides support for pre-decoded video datasets in the form of image files.
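Whether a dataset is decoded on the fly or pre-decoded, training typically samples short clips from each long video. A minimal uniform clip sampler can be sketched as follows; this is an illustrative helper, not the library's clip-sampler API:

```python
def uniform_clip_starts(video_duration, clip_duration, num_clips):
    """Return `num_clips` start times (seconds) evenly covering a video.

    The first clip starts at 0 and the last clip ends exactly at the
    end of the video.
    """
    if num_clips == 1:
        return [0.0]
    last_start = video_duration - clip_duration
    step = last_start / (num_clips - 1)
    return [i * step for i in range(num_clips)]


# Five 2-second clips evenly spanning a 10-second video.
starts = uniform_clip_starts(10.0, 2.0, 5)  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

"Random" sampling for training draws a single start time instead; uniform sampling like the above is the usual choice for multi-clip evaluation.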

3.2. Transforms

Transforms, as key components for improving model generalization, are designed to be flexible and easy to use in PyTorchVideo. PyTorchVideo provides factory transforms that include common recipes for training state-of-the-art video models (Feichtenhofer2019; feichtenhofer2020x3d; mvit). Recent data augmentations are also provided by the library (e.g. MixUp, CutMix, RandAugment (cubuk2020randaugment), and AugMix (hendrycks2020augmix)). Finally, users can create custom transforms by composing individual ones.
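Composition of individual transforms reduces to function chaining. The minimal Compose class below is a sketch of the pattern (TorchVision and PyTorchVideo ship real equivalents); the toy "clip" and transforms are illustrative:

```python
class Compose:
    """Apply a sequence of transforms to a clip, in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, clip):
        for transform in self.transforms:
            clip = transform(clip)
        return clip


# Toy clip represented as a list of frame values; toy transforms:
subsample = lambda clip: clip[::2]            # keep every other frame
normalize = lambda clip: [f / 10 for f in clip]

pipeline = Compose([subsample, normalize])
out = pipeline([0, 1, 2, 3, 4, 5])  # [0.0, 0.2, 0.4]
```

A user-defined transform only needs to be a callable taking and returning a clip, so custom augmentations slot into the same pipeline.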

3.3. Models

PyTorchVideo contains highly reproducible implementations of popular models and backbones for video classification, acoustic event detection, human action localization (detection) in video, as well as self-supervised learning algorithms.

The current set of models includes standard single stream video backbones such as C2D (Wang2018), I3D (Wang2018), Slow-only (Feichtenhofer2019) for RGB frames and acoustic ResNet (fanyi2020) for audio signal, as well as efficient video networks such as SlowFast (Feichtenhofer2019), CSN (tran2019), R2+1D (tran2018), and X3D (feichtenhofer2020x3d) that provide state-of-the-art performance. PyTorchVideo also provides multipathway architectures such as Audiovisual SlowFast networks (fanyi2020) which enable state-of-the-art performance by disentangling spatial, temporal, and acoustic signals across different pathways.

It further supports methods for low-level vision tasks for researchers to build on the latest trends in video representation learning.

PyTorchVideo models can be used in combination with different downstream tasks: supervised classification and detection of human actions in video (Feichtenhofer2019), as well as self-supervised (i.e. unsupervised) video representation learning with Momentum Contrast (moco), SimCLR (simclr), and Bootstrap your own latent (byol).

3.4. Accelerator

Figure 2. Acceleration and deployment pipeline.

PyTorchVideo provides a complete environment (Accelerator) for hardware-aware design and deployment of models for fast inference on device, including efficient blocks and kernel optimization. The deployment flow is illustrated in Figure 2.

Specifically, we perform kernel-level latency optimization for common kernels in video understanding models (e.g. conv3d). This optimization brings two-fold benefits: (1) latency of these kernels for floating-point inference is significantly reduced; (2) quantized operation (int8) of these kernels is enabled, which is not supported for mobile devices by vanilla PyTorch. Our Accelerator provides a set of efficient blocks that build upon these optimized kernels, with their low latency validated by on-device profiling. In addition, our Accelerator provides kernel optimization in deployment flow, which can automatically replace modules in the original model with efficient blocks that perform equivalent operations. Overall, the PyTorchVideo Accelerator provides a complete environment for hardware-aware model design and deployment for fast inference.
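The automatic module replacement in the deployment flow can be pictured as a recursive walk over a model's structure, swapping modules for efficient equivalents. The schematic below uses plain strings and lists as a stand-in for an nn.Module tree; the mapping and names are illustrative:

```python
# Illustrative mapping from original ops to efficient, kernel-optimized
# equivalents (the real flow rewrites torch.nn modules in place).
EFFICIENT_EQUIVALENTS = {
    "conv3d": "efficient_conv3d",
}


def convert(model):
    """Recursively replace each op with its efficient equivalent."""
    if isinstance(model, list):  # a "container" of sub-modules
        return [convert(sub_module) for sub_module in model]
    return EFFICIENT_EQUIVALENTS.get(model, model)


model = ["conv3d", ["relu", "conv3d"]]
deployable = convert(model)
# ['efficient_conv3d', ['relu', 'efficient_conv3d']]
```

Because each efficient block performs an equivalent operation, the conversion preserves the model's outputs while reducing on-device latency.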

4. Benchmarks

PyTorchVideo supports popular video understanding tasks, such as video classification (Kay2017; ssv2; Feichtenhofer2019), action detection (Gu2018), video self-supervised learning (feichtenhofer2021large), and efficient video understanding on mobile hardware. We provide comprehensive benchmarks for different tasks and a large set of model weights for state-of-the-art methods. This section provides a snapshot of the benchmarks; a comprehensive listing can be found in the model zoo. The models in PyTorchVideo's model zoo reproduce the performance of the original publications and can seed future research that builds upon them, bypassing the need for re-implementation and costly re-training.

4.1. Video classification

model Top-1 Top-5 FLOPs (G) Param (M)
I3D R50, 8×8 (Wang2018) 73.3 90.7 37.5 × 3 × 10 28.0
X3D-L, 16×5 (feichtenhofer2020x3d) 77.4 93.3 26.6 × 3 × 10 6.2
SlowFast R101, 16×8 (Feichtenhofer2019) 78.7 93.6 215.6 × 3 × 10 53.8
MViT-B, 16×4 (mvit) 78.8 93.9 70.5 × 3 × 10 36.6
Table 1. Video classification results on Kinetics-400.
PyTorchVideo implements classification for various datasets, including UCF-101 (Soomro2012), HMDB-51 (Kuehne2011), Kinetics (Kay2017), Charades (Sigurdsson2016), Something-Something (ssv2), and Epic-Kitchens (Damen2018EPICKITCHENS). Table 1 shows a benchmark snapshot on Kinetics-400 for four popular state-of-the-art methods, measured in Top-1 and Top-5 accuracy. Further classification models with pre-trained weights, which can be used for a variety of downstream tasks, are available online.

4.2. Video action detection

The video action detection task aims to perform spatiotemporal localization of human actions in videos. Table 2 shows detection performance in mean Average Precision (mAP) on the AVA dataset (Gu2018) using Slow and SlowFast networks (Feichtenhofer2019).

model mAP Param (M)
Slow R50, 4×16 (Feichtenhofer2019) 19.50 31.78
SlowFast R50, 8×8 (Feichtenhofer2019) 24.67 33.82
Table 2. Video action detection results on the AVA dataset (Gu2018).

4.3. Video self-supervised learning

We provide reference implementations of popular self-supervised learning methods for video (feichtenhofer2021large), which can be used to perform unsupervised spatiotemporal representation learning on large-scale video data. Table 3 summarizes the results on 5 downstream datasets.

SSL method Kinetics UCF101 AVA Charades SSv2
SimCLR (simclr) 62.0 87.9 17.6 11.4 52.0
BYOL (byol) 68.3 93.8 23.4 21.0 55.8
MoCo (moco) 67.3 92.8 20.3 33.5 54.4
Table 3. Performance of 4 video self-supervised learning (SSL) methods on 5 downstream datasets (feichtenhofer2021large).

4.4. Efficient mobile video understanding

The Accelerator environment for efficient video understanding on mobile devices is benchmarked in Table 4. We show several efficient X3D models (feichtenhofer2020x3d) on a Samsung Galaxy S9 mobile phone. With the efficient optimization strategies in PyTorchVideo, X3D achieves a 4.6× - 5.6× inference speed-up compared to the default PyTorch implementation. With quantization (int8), it can be further accelerated by 1.4×. The resulting PyTorchVideo-accelerated X3D model runs around 6× faster than real time, requiring roughly 165 ms to process one second of video, directly on the mobile phone. The source code for on-device demos on iOS and Android is available.
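The speed-up figures follow directly from the measured latencies, e.g. for X3D-XS, 1067 ms / 233 ms ≈ 4.6×, and at 165 ms per one-second clip the quantized model runs about 1000 / 165 ≈ 6× faster than real time:

```python
def speedup(baseline_ms, optimized_ms):
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


x3d_xs_fp32 = round(speedup(1067, 233), 1)  # 4.6
x3d_s_fp32 = round(speedup(4249, 764), 1)   # 5.6
realtime_factor = round(1000 / 165, 1)      # ~6x faster than real time
```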

model Vanilla PyTorch Latency (ms) PyTorchVideo Latency (ms) Speed Up
X3D-XS (fp32) (feichtenhofer2020x3d) 1067 233 4.6×
X3D-S (fp32) (feichtenhofer2020x3d) 4249 764 5.6×
X3D-XS (int8) (feichtenhofer2020x3d) (not supported) 165 -
Table 4. Speedup of PyTorchVideo models on mobile CPU.

5. Conclusion

We introduce PyTorchVideo, an efficient, flexible, and modular deep learning library for video understanding that scales to a variety of research and production applications. Our library welcomes contributions from the research and open-source community, and will be continuously updated to support further innovation. In future work, we plan to continue enhancing the library with optimized components and reproducible state-of-the-art models.

We thank the PyTorchVideo contributors: Aaron Adcock, Amy Bearman, Bernard Nguyen, Bo Xiong, Chengyuan Yan, Christoph Feichtenhofer, Dave Schnizlein, Haoqi Fan, Heng Wang, Jackson Hamburger, Jitendra Malik, Kalyan Vasudev Alwala, Matt Feiszli, Nikhila Ravi, Ross Girshick, Tullie Murrell, Wan-Yen Lo, Weiyao Wang, Yanghao Li, Yilei Li, Zhengxing Chen, Zhicheng Yan.