Recording, storing, and viewing videos has become an ordinary part of our lives; by 2022, video traffic was projected to account for 82 percent of all Internet traffic (cisco_christoph). With the increasing amount of video material readily available (e.g. on the web), it is now more important than ever to develop ML frameworks for video understanding.
With the rise of deep learning, significant progress has been made in video understanding research, with novel neural network architectures, better training recipes, advanced data augmentation, and model acceleration techniques. However, the sheer volume of data in video makes these tasks computationally demanding, and efficient solutions are therefore non-trivial to implement.
To date, there exist several popular video understanding frameworks that provide implementations of advanced state-of-the-art video models, including PySlowFast (fan2020pyslowfast), MMAction (mmaction2019), MMAction2 (2020mmaction2), and Gluon-CV (gluoncvnlp2020). However, unlike a modularized library that can be imported into different projects, all of these frameworks are designed around training workflows, which limits their adoption beyond applications tailored to each specific codebase.
More specifically, we see the following limitations in prior efforts. First, reproducibility – an important requirement for deep learning software – varies across frameworks; e.g. identical models are reproduced with varying accuracy on different frameworks (fan2020pyslowfast; mmaction2019; 2020mmaction2; gluoncvnlp2020). Second, regarding input modalities, the frameworks are mainly focused on visual-only data streams. Third, supported tasks only encompass human action classification and detection. Fourth, none of the existing codebases support on-device acceleration for real-time inference on mobile hardware.
We believe that a modular, component-focused video understanding library addressing the aforementioned limitations will strongly support the video research community. Our library aims to provide fast and easily extensible components to benefit researchers and practitioners in academia and industry.
We present PyTorchVideo – an efficient, modular, and reproducible deep learning library for video understanding which supports the following (see Fig. 1 for an overview):
- a modular design with an extendable interface for video modeling using Python
- a full stack of video understanding machine learning components, from established datasets to state-of-the-art models
- real-time video classification through hardware-accelerated on-device support
- multiple tasks, including human action classification and detection, self-supervised learning, and low-level vision tasks
- reproducible models and datasets, benchmarked in a comprehensive model zoo
- multiple input modalities, including visual, audio, optical-flow, and IMU data
PyTorchVideo is distributed under the Apache 2.0 License and is available on GitHub at https://github.com/facebookresearch/pytorchvideo.
2. Library Design
PyTorchVideo is built to be component centric: it provides independent components that are plug-and-play and ready to mix-and-match for any research or production use case. We achieve this by designing models, datasets, and data transformations (transforms) independently, enforcing consistency only through general argument-naming guidelines. For example, in the pytorchvideo.data module all datasets provide a data_path argument, and in the pytorchvideo.models module any reference to input dimensions uses the name dim_in. This form of duck-typing provides flexibility and straightforward extensibility for new use cases.
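The argument-naming convention can be illustrated with a small sketch. The classes below are hypothetical stand-ins (not the library's actual code); only the argument names data_path and dim_in reflect the guideline described above.

```python
# Sketch of the naming-convention style (hypothetical classes): components
# stay independent, but shared concepts reuse the same argument names,
# e.g. datasets take `data_path` and model blocks call their input
# channel count `dim_in`.

class HypotheticalFrameDataset:
    def __init__(self, data_path: str, clip_duration: float = 2.0):
        self.data_path = data_path
        self.clip_duration = clip_duration

class HypotheticalResBlock:
    def __init__(self, dim_in: int, dim_out: int):
        self.dim_in = dim_in
        self.dim_out = dim_out

# Because the names are consistent, generic code can configure either
# component without knowing its concrete class (duck typing).
def build(component_cls, **kwargs):
    return component_cls(**kwargs)

dataset = build(HypotheticalFrameDataset, data_path="/data/kinetics")
block = build(HypotheticalResBlock, dim_in=64, dim_out=256)
```

The generic build helper works for any component precisely because naming, not a shared base class, carries the contract.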
PyTorchVideo is designed to be compatible with other frameworks and domain specific libraries. In contrast to existing video frameworks (fan2020pyslowfast; mmaction2019; 2020mmaction2; gluoncvnlp2020), PyTorchVideo does not rely on a configuration system. To maximize the compatibility with Python based frameworks that can have arbitrary config-systems, PyTorchVideo uses keyword arguments in Python as a “naive configuration” system.
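The "naive configuration" idea can be sketched as follows; the factory name and its parameters here are hypothetical, not the library's API. Any host framework that can produce a plain dict from its own config system can drive such a factory via keyword-argument unpacking.

```python
# Sketch (hypothetical factory): components are configured through plain
# keyword arguments, so any external config system that yields a dict
# plugs in via ** unpacking.

def create_hypothetical_net(*, depth: int = 50, dim_in: int = 3,
                            num_classes: int = 400) -> dict:
    # Stand-in for a model factory; returns its resolved "config".
    return {"depth": depth, "dim_in": dim_in, "num_classes": num_classes}

# e.g. values parsed from YAML/JSON/argparse by the host framework:
external_cfg = {"depth": 101, "num_classes": 174}
net = create_hypothetical_net(**external_cfg)  # kwargs as "naive config"
```

Unspecified options fall back to the factory's defaults, so the library never needs to know which config system, if any, the caller uses.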
PyTorchVideo is designed to be interoperable with other standard domain-specific libraries by setting canonical modality-based tensor types. For videos, we expect a tensor of shape (C, T, H, W), where T, H, W are the spatiotemporal dimensions and C is the number of color channels, allowing any TorchVision model or transform to be used together with PyTorchVideo. For raw audio waveforms, we expect a tensor of shape (T,), where T is the temporal dimension, and for spectrograms we expect a tensor of shape (T, F), where T is time and F is frequency, aligning with TorchAudio.
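These per-modality shape conventions can be made concrete with a small sketch that interprets shape tuples (plain Python, no tensors; the helper is illustrative, not part of the library):

```python
# Sketch of the modality shape conventions described above:
# video (C, T, H, W), raw audio (T,), spectrogram (T, F).

def describe(shape: tuple) -> str:
    if len(shape) == 4:
        c, t, h, w = shape
        return f"video: {c} channels, {t} frames, {h}x{w} pixels"
    if len(shape) == 2:
        t, f = shape
        return f"spectrogram: {t} time steps, {f} frequency bins"
    if len(shape) == 1:
        return f"waveform: {shape[0]} samples"
    raise ValueError(f"unrecognized shape {shape}")

video_desc = describe((3, 8, 224, 224))  # RGB clip of 8 frames
spec_desc = describe((128, 80))          # e.g. a log-mel spectrogram
wave_desc = describe((16000,))           # 1 s of audio at 16 kHz
```

Agreeing on these layouts is what lets TorchVision transforms and TorchAudio features be dropped into a PyTorchVideo pipeline without reshaping glue code.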
One of PyTorchVideo’s primary use cases is supporting the latest research methods; we want researchers to contribute their work easily, without refactoring and architecture modifications. To achieve this, we designed the library to reduce the overhead of adding new components or sub-modules. Notably, in the pytorchvideo.models module we use a dependency-injection-inspired API: a composable interface of injectable skeleton classes, plus a factory-function interface that builds reproducible implementations from the composable classes. We anticipate this injectable class design to be useful for researchers who want to easily plug new sub-components (e.g. a new type of convolution) into the structure of larger models such as ResNet (He2016) or SlowFast (Feichtenhofer2019). The factory functions are more suitable for reproducible benchmarking of complete models, or for usage in production. An example of a customized SlowFast network is in Algorithm 1.
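The two-level design can be sketched in miniature. Everything below is hypothetical (toy classes and lambdas standing in for real layers), intended only to show the pattern: a skeleton class that accepts injected sub-components, and a factory that fixes a reproducible default composition.

```python
# Sketch of a dependency-injection-style API (hypothetical, not the
# library's actual signatures).

class Net:  # injectable skeleton: knows the structure, not the layers
    def __init__(self, *, stem, stages, head):
        self.stem, self.stages, self.head = stem, stages, head

    def forward(self, x):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
        return self.head(x)

def create_demo_net(*, stage_fn=lambda x: x + 1, depth: int = 3) -> Net:
    # factory function: assembles a fixed, reproducible composition
    return Net(stem=lambda x: x * 2,
               stages=[stage_fn] * depth,
               head=lambda x: -x)

net = create_demo_net()
# A researcher swaps in a new sub-component (e.g. a new "convolution")
# without touching the Net skeleton itself:
custom = create_demo_net(stage_fn=lambda x: x * 10, depth=1)
```

In the real library the injected pieces are nn.Module instances rather than lambdas, but the division of labor is the same: skeletons for flexibility, factories for reproducibility.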
PyTorchVideo maintains reproducible implementations of all models and datasets. Each component is benchmarked against the performance reported in the respective original publication. We report performance and release model files online (numbers and model weights: https://github.com/facebookresearch/pytorchvideo/blob/master/docs/source/model_zoo.md) as well as on PyTorch Hub (https://pytorch.org/hub/). We rely on test coverage and recurrent benchmark jobs to verify and monitor performance and to detect potential regressions introduced by codebase updates.
3. Library Components
PyTorchVideo allows training of state-of-the-art models on multi-modal input data, and deployment of an accelerated real-time model on mobile devices. Example components are shown in Algorithm 2.
Video contains rich information streams from various sources, and, in comparison to image understanding, video is more computationally demanding. PyTorchVideo provides a modular and efficient data loader to decode visual, motion (optical-flow), acoustic, and Inertial Measurement Unit (IMU) information from raw video.
PyTorchVideo supports a growing list of data loaders for popular video datasets across different tasks: video classification on UCF-101 (Soomro2012), HMDB-51 (Kuehne2011), Kinetics (Kay2017), Charades (Sigurdsson2016), and Something-Something (ssv2); egocentric tasks on Epic-Kitchens (Damen2018EPICKITCHENS) and DomSev (Silva2018); as well as video detection on AVA (Gu2018).
All data loaders support several file formats and are agnostic to the underlying data storage. For encoded video datasets (e.g. videos stored in mp4 files), we provide PyAV, TorchVision, and Decord decoders. For long videos, where decoding is an overhead, PyTorchVideo provides support for pre-decoded video datasets in the form of image files.
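The efficiency argument for clip-based loading can be sketched with a toy clip sampler (a hypothetical helper, not the library's API): only the sampled span of a long video needs to be decoded, rather than the whole file.

```python
import random

# Sketch of a uniform random clip sampler (hypothetical helper): pick a
# clip start time so the clip lies fully inside the video, so that only
# that span of a long video ever needs decoding.

def random_clip_start(video_duration: float, clip_duration: float,
                      rng: random.Random) -> float:
    if clip_duration >= video_duration:
        return 0.0  # clip covers the whole video
    return rng.uniform(0.0, video_duration - clip_duration)

rng = random.Random(0)
start = random_clip_start(video_duration=300.0, clip_duration=2.0, rng=rng)
# only the [start, start + 2.0) span of the 5-minute video is decoded
```

A dataset built this way touches a few seconds of video per training sample, which is what makes training on long, encoded videos tractable.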
Transforms, key components for improving model generalization, are designed to be flexible and easy to use in PyTorchVideo. PyTorchVideo provides factory transforms that include common recipes for training state-of-the-art video models (Feichtenhofer2019; feichtenhofer2020x3d; mvit). Recent data augmentations are also provided by the library (e.g. MixUp, CutMix, RandAugment (cubuk2020randaugment), and AugMix (hendrycks2020augmix)). Finally, users have the option to create custom transforms by composing individual ones.
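Composition works because each transform is simply a callable from sample to sample. A minimal Compose sketch (illustrative, not the library's implementation, with toy "frames" as a list of pixel values):

```python
# Sketch of composing transforms as plain callables: each transform maps
# a sample to a sample, so custom and built-in transforms chain freely.

class Compose:
    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)
        return sample

# toy transforms over a list of pixel values standing in for frames
center_crop = lambda frames: frames[1:-1]         # drop border entries
normalize = lambda frames: [x / 255 for x in frames]

pipeline = Compose([center_crop, normalize])
out = pipeline([0, 51, 102, 255])
```

A user-defined augmentation only has to honor the same callable contract to slot into an existing pipeline.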
PyTorchVideo contains highly reproducible implementations of popular models and backbones for video classification, acoustic event detection, human action localization (detection) in video, as well as self-supervised learning algorithms.
The current set of models includes standard single stream video backbones such as C2D (Wang2018), I3D (Wang2018), Slow-only (Feichtenhofer2019) for RGB frames and acoustic ResNet (fanyi2020) for audio signal, as well as efficient video networks such as SlowFast (Feichtenhofer2019), CSN (tran2019), R2+1D (tran2018), and X3D (feichtenhofer2020x3d) that provide state-of-the-art performance. PyTorchVideo also provides multipathway architectures such as Audiovisual SlowFast networks (fanyi2020) which enable state-of-the-art performance by disentangling spatial, temporal, and acoustic signals across different pathways.
It further supports methods for low-level vision tasks, enabling researchers to build on the latest trends in video representation learning.
PyTorchVideo models can be used in combination with different downstream tasks: supervised classification and detection of human actions in video (Feichtenhofer2019), as well as self-supervised (i.e. unsupervised) video representation learning with Momentum Contrast (moco), SimCLR (simclr), and Bootstrap your own latent (byol).
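The contrastive objective shared by MoCo and SimCLR can be sketched in pure Python (a toy InfoNCE computation over scalar similarities; the function and its inputs are illustrative, not the library's training code): the loss is cross-entropy over a query's similarities to one positive key and many negatives, with the positive as the target.

```python
import math

# Sketch of the InfoNCE objective underlying MoCo/SimCLR-style training
# (toy scalar similarities, hypothetical helper).

def info_nce(pos_sim: float, neg_sims: list, temperature: float = 0.07):
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max to stabilize the log-sum-exp
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)  # cross-entropy, positive is class 0

# a well-separated positive yields a small loss ...
low = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.0, -0.2])
# ... while a confusable positive yields a larger one
high = info_nce(pos_sim=0.2, neg_sims=[0.3, 0.1, 0.0])
```

In practice the similarities are dot products of encoder embeddings of two clips from the same video (positive) and of other videos (negatives).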
PyTorchVideo provides a complete environment (Accelerator) for hardware-aware design and deployment of models for fast inference on device, including efficient blocks and kernel optimization. The deployment flow is illustrated in Figure 2.
Specifically, we perform kernel-level latency optimization for kernels common in video understanding models (e.g. conv3d). This optimization brings two-fold benefits: (1) the latency of these kernels for floating-point inference is significantly reduced; (2) quantized (int8) operation of these kernels is enabled, which vanilla PyTorch does not support on mobile devices. Our Accelerator provides a set of efficient blocks built upon these optimized kernels, with their low latency validated by on-device profiling. In addition, it provides a deployment flow that can automatically replace modules in the original model with efficient blocks that perform equivalent operations.
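The automatic module-substitution step can be sketched generically (hypothetical stand-in classes and a nested-dict "model", not the Accelerator's actual API): walk the sub-module tree and swap every module that has a registered efficient equivalent.

```python
# Sketch of automatic module substitution for deployment (hypothetical).

class Conv3d:          # stand-in for a vanilla kernel
    name = "conv3d"

class EfficientConv3d: # stand-in for the latency-optimized equivalent
    name = "efficient_conv3d"

# registry of original-module -> efficient-equivalent replacements
REPLACEMENTS = {Conv3d: EfficientConv3d}

def convert(model: dict) -> dict:
    """Model represented as a nested dict of named sub-modules
    (a simplification of a module tree)."""
    out = {}
    for name, child in model.items():
        if isinstance(child, dict):
            out[name] = convert(child)          # recurse into sub-tree
        elif type(child) in REPLACEMENTS:
            out[name] = REPLACEMENTS[type(child)]()  # swap in equivalent
        else:
            out[name] = child                   # leave unknown modules
    return out

model = {"stem": Conv3d(), "head": {"proj": Conv3d(), "scale": 1.0}}
deployable = convert(model)
```

Because each replacement computes an equivalent operation, the converted model's outputs match the original's while benefiting from the optimized kernels.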
4. Benchmarks
PyTorchVideo supports popular video understanding tasks, such as video classification (Kay2017; ssv2; Feichtenhofer2019), action detection (Gu2018), video self-supervised learning (feichtenhofer2021large), and efficient video understanding on mobile hardware. We provide comprehensive benchmarks for different tasks and a large set of model weights for state-of-the-art methods. This section provides a snapshot of the benchmarks; a comprehensive listing can be found in the model zoo (https://pytorchvideo.readthedocs.io/en/latest/model_zoo.html). The models in PyTorchVideo’s model zoo reproduce the performance of the original publications and can seed future research that builds upon them, bypassing the need for re-implementation and costly re-training.
4.1. Video classification
| model | Top-1 | Top-5 | FLOPs (G) | Param (M) |
| I3D R50, 8×8 (Wang2018) | 73.3 | 90.7 | 37.5 × 3 × 10 | 28.0 |
| X3D-L, 16×5 (feichtenhofer2020x3d) | 77.4 | 93.3 | 26.6 × 3 × 10 | 6.2 |
| SlowFast R101, 16×8 (Feichtenhofer2019) | 78.7 | 93.6 | 215.6 × 3 × 10 | 53.8 |
| MViT-B, 16×4 (mvit) | 78.8 | 93.9 | 70.5 × 3 × 10 | 36.6 |
Table 1: Video classification results on Kinetics-400 (Kay2017).
PyTorchVideo implements classification for various datasets, including UCF-101 (Soomro2012), HMDB-51 (Kuehne2011), Kinetics (Kay2017), Charades (Sigurdsson2016), Something-Something (ssv2), and Epic-Kitchens (Damen2018EPICKITCHENS). Table 1 shows a benchmark snapshot on Kinetics-400 for four popular state-of-the-art methods, measured in Top-1 and Top-5 accuracy. Further classification models with pre-trained weights, usable for a variety of downstream tasks, are available online.
4.2. Video action detection
The video action detection task aims to perform spatiotemporal localization of human actions in videos. Table 2 shows detection performance in mean Average Precision (mAP) on the AVA dataset (Gu2018) using Slow and SlowFast networks (Feichtenhofer2019).
| model | mAP | Param (M) |
| Slow R50, 4×16 (Feichtenhofer2019) | 19.50 | 31.78 |
| SlowFast R50, 8×8 (Feichtenhofer2019) | 24.67 | 33.82 |
Table 2: Video action detection results on AVA (Gu2018).
4.3. Video self-supervised learning
We provide reference implementations of popular self-supervised learning methods for video (feichtenhofer2021large), which can be used to perform unsupervised spatiotemporal representation learning on large-scale video data. Table 3 summarizes the results on 5 downstream datasets.
4.4. Efficient mobile video understanding
The Accelerator environment for efficient video understanding on mobile devices is benchmarked in Table 4. We show several efficient X3D models (feichtenhofer2020x3d) on a Samsung Galaxy S9 mobile phone. With the efficient optimization strategies in PyTorchVideo, X3D achieves a 4.6×-5.6× inference speedup compared to the default PyTorch implementation. With quantization (int8), it can be further accelerated by 1.4×. The resulting PyTorchVideo-accelerated X3D model runs around 6× faster than real time, requiring roughly 165 ms to process one second of video, directly on the mobile phone. Source code for on-device demos on iOS (https://github.com/pytorch/ios-demo-app/tree/master/TorchVideo) and Android (https://github.com/pytorch/android-demo-app/tree/master/TorchVideo) is available.
| model | Latency, PyTorch (ms) | Latency, PyTorchVideo (ms) | Speed Up |
| X3D-XS (fp32) (feichtenhofer2020x3d) | 1067 | 233 | 4.6× |
| X3D-S (fp32) (feichtenhofer2020x3d) | 4249 | 764 | 5.6× |
| X3D-XS (int8) (feichtenhofer2020x3d) | (not supported) | 165 | - |
Table 4: On-device inference latency on a Samsung Galaxy S9 mobile phone.
5. Conclusion
We introduce PyTorchVideo, an efficient, flexible, and modular deep learning library for video understanding that scales to a variety of research and production applications. Our library welcomes contributions from the research and open-source community and will be continuously updated to support further innovation. In future work, we plan to continue enhancing the library with optimized components and reproducible state-of-the-art models.