Current state-of-the-art models for video action recognition are mostly based on expensive 3D ConvNets. This results in a need for large GPU clusters to train and evaluate such architectures. To address this problem, we present a lightweight and memory-friendly architecture for action recognition that performs on par with or better than current architectures by using only a fraction of resources. The proposed architecture is based on a combination of a deep subnet operating on low-resolution frames with a compact subnet operating on high-resolution frames, allowing for high efficiency and accuracy at the same time. We demonstrate that our approach achieves a reduction by 3∼4 times in FLOPs and ∼2 times in memory usage compared to the baseline. This enables training deeper models with more input frames under the same computational budget. To further obviate the need for large-scale 3D convolutions, a temporal aggregation module is proposed to model temporal dependencies in a video at very small additional computational costs. Our models achieve strong performance on several action recognition benchmarks including Kinetics, Something-Something and Moments-in-time. The code and models are available at https://github.com/IBM/bLVNet-TAM.
Current state-of-the-art approaches for video action recognition are based on convolutional neural networks (CNNs). These include the best performing 3D models, such as I3D I3D:carreira2017quo and ResNet3D ResNet3D:hara2017learning , and some effective 2D models, such as Temporal Relation Networks (TRN) TRN:zhou2018temporal and Temporal Shift Modules (TSM) TSM:lin2018temporal . A CNN-based model usually considers a sequence of frames as input, obtained through either uniform or dense sampling from a video I3D:carreira2017quo ; TSN:wang2016temporal . In general, longer input sequences yield better recognition results. However, one problem arising for a model requiring more input frames is that the GPU resources required for training and inference also increase significantly in both memory and time. For example, the top-performing I3D models I3D:carreira2017quo on the Kinetics Kinetics:kay2017kinetics dataset were trained with 64 frames on a cluster of 32 GPUs, and the non-local network Wang2018NonLocal even uses 128 frames as input. Another problem for action recognition is the lack of effective methods for temporal modeling when moving away from 3D spatiotemporal convolutions. While 2D convolutional models are more resource-friendly than their 3D counterparts, they lack expressiveness over time and thus cannot benefit much from richer input data.
In this paper, we present an efficient and memory-friendly spatio-temporal representation for action recognition, which enables training of deeper models while allowing for more input frames. The first part of our approach is inspired by the Big-Little-Net architecture (bLNet chen2018biglittle ). We propose a new video architecture that has two network branches with different complexities: one branch processing low-resolution frames in a very deep subnet, and another branch processing high-resolution frames in a compact subnet. The two branches complement each other through merging at the end of each network layer. With such a design, our approach can process twice as many frames as the baseline model without compromising efficiency. We refer to this architecture as “Big-Little-Video-Net” (bLVNet).
In light of the limited ability of bLVNet to capture temporal dependencies, we further develop an effective method to exploit temporal relations across frames by a so-called “Depthwise Temporal Aggregation Module” (TAM). The method enables the exchange of temporal information between frames by weighted channel-wise aggregation. This aggregation is made learnable with 1×1 depthwise convolution, and implemented as an independent network module. The temporal aggregation module can be easily integrated into the proposed network architecture to progressively learn spatio-temporal patterns in a hierarchical way. Moreover, the module is extremely compact and adds only negligible computational costs and parameters to bLVNet.
Our main contributions lie in the following two interconnected aspects: (1) we propose a lightweight video architecture based on a dual-path network to learn video features, and (2) we develop a temporal aggregation module to enable effective temporal modeling without the need for computationally expensive 3D convolutions.
We evaluate our approach on the Kinetics-400 Kinetics:kay2017kinetics , Something-Something Something:goyal2017something and Moments-in-time Moments:monfort2019moments datasets. The evaluation shows that bLVNet-TAM successfully allows us to train action-classification models with deeper backbones (i.e., ResNet-101) as well as more (up to 64) input frames, using a single compute node with 8 Tesla V100 GPUs. Our comprehensive experiments demonstrate that our approach achieves highly competitive results on all datasets while maintaining efficiency. In particular, it establishes a new state-of-the-art result on Something-Something and Moments-in-time by outperforming previous approaches in the literature by a large margin.
Activity classification has always been a challenging research topic, with first attempts reaching back almost two decades Aggarwal2011Review ; deep-learning architectures nowadays achieve tremendous recognition rates on various challenging tasks, such as Kinetics I3D:carreira2017quo , ActivityNet caba2015activitynet , or Thumos THUMOS14 .
Most successful architectures in the field are usually based on the so-called two-stream model Simonyan14TwoStream , processing a single RGB frame and optical-flow input in two separate CNNs with a late fusion in the upper layers. Over the last years, many approaches have extended this idea by processing a stack of input frames in both streams, thus extending the temporal window of the architecture from 1 to up to 128 input frames per stream. To further capture the temporal correlation in the input over time, those architectures usually make use of 3D convolutions as, e.g., in I3D I3D:carreira2017quo , S3D S3D:xie2018rethinking , and ResNet3D ResNet3D:hara2017learning , usually leading to a large-scale parameter space to train.
Another way to capture temporal relations has been proposed by TSN:wang2016temporal , TRN:zhou2018temporal , and TSM:lin2018temporal . Those architectures mainly build on the idea of processing videos in the form of multiple segments, and then fusing them at the higher layers of the networks. The first approach with this pattern was the so-called Temporal Segment Networks (TSN) proposed by Wang et al. TSN:wang2016temporal . The idea of TSN has been extended by Temporal Relation Networks (TRN) TRN:zhou2018temporal , which apply the idea of relational networks to the modeling of temporal relations between observations in videos. Another approach for capturing temporal contexts has been proposed by Temporal Shift Modules (TSM) TSM:lin2018temporal . This approach shifts part of the channels along the temporal dimension, thereby allowing for information to be exchanged among neighboring frames. More complex approaches have been tried as well, e.g. in the context of non-local neural networks Wang2018NonLocal . Our temporal aggregation module is based on depthwise 1×1 convolutions to capture temporal dependencies across frames effectively.
Separable convolutions are considered in approaches such as S3D:xie2018rethinking ; R(2+1)D:tran2018closer to reduce costly computation in 3D convolutional models. More recently, the SlowFast Network SlowFast:feichtenhofer2018slowfast uses a dual-pathway network to process a video at both slow and fast frame rates. The fast pathway is made lightweight, similar to the Little-Net in our proposed architecture. However, our approach reduces computation based on both a lightweight architecture and low image resolution. Furthermore, the recent work Timeception Timeception applies the concept of “Inception” to the temporal domain for capturing long-range temporal dependencies in a video. The Timeception layers involve group convolutions at different time scales while our TAM layers only use depthwise convolution. As a result, Timeception has significantly more parameters than the TAM (10% vs. 0.1% of the total model parameters).
We aim at developing efficient and effective video representations for video understanding. To address the computational challenge imposed by the desired long input to a model, we propose a new video architecture based on the Big-Little network (bLNet) chen2018biglittle for learning video features. We first give a brief recap of bLNet in Section 3.1. We then show, in Section 3.2, how to extend bLNet to an efficient video architecture that allows for seeing more frames with less computation and memory. An example of the proposed network architecture can be found in the supplementary material (Section A).
To make temporal modeling more effective in our approach, we further develop a temporal aggregation module (TAM) to capture short-term as well as long-term temporal dependencies across frames. Our method is implemented as a separate network module and integrated with the proposed architecture seamlessly to learn a hierarchical temporal representation for action recognition. We detail this method in Section 3.3.
The Big-Little Net, abbreviated as bLNet in chen2018biglittle , is a CNN architecture for learning strong feature representations by combining multi-scale image information. The bLNet processes an image at different resolutions using a dual-path network, but with low computational loads based on a clever design. The key idea is to have a high-complexity subnet (Big-Net) along with a low-cost one (Little-Net) operate on the low-scale and high-scale parts of an image in parallel. By such a design, the two subnets learn features complementary to each other while using less computation. The two branches are merged at the end of each network layer to fuse the low-scale and high-scale information so as to form a stronger image representation. The bLNet approach demonstrates improvement of model efficiency and performance on both object and speech recognition, using popular architectures such as ResNet, ResNeXt and SEResNeXt. More details on bLNet can be found in the original paper. In this work, we mainly adopt bLResNet-50 and bLResNet-101 as backbone for our proposed architecture.
We describe our architecture in the context of 2D convolutions. However, our approach is not specific to 2D convolutions and is potentially extendable to any architecture based on 3D convolutions.
The approach of Temporal Segment Networks (TSN) TSN:wang2016temporal provides a generic framework for learning video representations. With a shared 2D ConvNet as backbone, TSN performs frame-level predictions and then aggregates the results into a final video-level prediction (Fig. 1a)). The framework of TSN is efficient and has been successfully adopted by some recent approaches for action recognition such as TRN TRN:zhou2018temporal and TSM TSM:lin2018temporal . Given its efficiency, we also choose TSN as the underlying video framework for our work.
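The TSN scheme of frame-level prediction followed by aggregation can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: averaging is assumed as the aggregation function, and `tsn_video_prediction` and `toy_model` are hypothetical names standing in for the shared 2D ConvNet.

```python
from typing import Callable, List, Sequence


def tsn_video_prediction(
    frames: Sequence[object],
    frame_model: Callable[[object], List[float]],
) -> List[float]:
    """Average per-frame class scores into one video-level score vector."""
    # The same (shared) model is applied to every sampled frame.
    per_frame = [frame_model(f) for f in frames]
    num_classes = len(per_frame[0])
    return [
        sum(scores[c] for scores in per_frame) / len(per_frame)
        for c in range(num_classes)
    ]


def toy_model(frame: float) -> List[float]:
    """Hypothetical stand-in for a 2D ConvNet producing two class scores."""
    return [float(frame), 1.0 - float(frame)]


video_scores = tsn_video_prediction([0.0, 0.5, 1.0], toy_model)
```

Because each frame is scored independently, this baseline has no interaction between frames, which is the limitation the rest of this section addresses.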
Let $\mathcal{F} = \{f_i \mid i = 1, \dots, n\}$ be a set of $n$ sampled input frames from a video. We divide $\mathcal{F}$ into two groups, namely odd frames $\mathcal{F}_{\text{big}} = \{f_i \mid i \bmod 2 = 1\}$ at half of the input image resolution, and even frames $\mathcal{F}_{\text{little}} = \{f_i \mid i \bmod 2 = 0\}$ at the input image resolution. For convenience, from now on, $\mathcal{F}_{\text{big}}$ is referred to as big frames and $\mathcal{F}_{\text{little}}$ as little frames. Note that the big branch can take either frame of a pair as input, with the other frame going to the little branch.
In TSN, all input frames are ordered as a batch of size $n$, where the $i$-th element corresponds to the $i$-th frame. We denote the input and output feature maps of the $i$-th frame at the $k$-th layer of the model by $x_i^k$ and $y_i^k$, respectively. Whenever possible, we omit $k$ for clarity.
The bLNet can be directly plugged into TSN as the backbone network for learning video-level representation. We refer to this architecture as TSN-bLNet to differentiate it from the vanilla TSN (Fig. 1b)). This network fully enjoys the efficiency of bLNet, cutting down the computational costs as reported in chen2018biglittle . Mathematically, the output can be written as
$$y_i = F\big(S(G_{\text{big}}(S(x_i, 1/2); \theta_{\text{big}}), 2) + G_{\text{little}}(x_i; \theta_{\text{little}})\big), \qquad (1)$$
where $S(\cdot, s)$ is an operator scaling a tensor up or down by a factor of $s$ in the spatial domain; $G_{\text{big}}$ and $G_{\text{little}}$ are the Big-Net and Little-Net in the bLNet aforementioned; and $\theta_{\text{big}}$ and $\theta_{\text{little}}$ are the model parameters. Following chen2018biglittle , $F$ indicates an additional residual block applied after merging the big and little branches to stabilize and enhance the combined feature representation.
The architecture described above only learns features from a single frame, so there are no interactions between frames. Alternatively, we can feed the odd and even frames separately into the big and little branches so that each branch obtains complementary information from different frames. This idea is illustrated in Fig. 1c) and the output in this case can be expressed by
$$y_i = F\big(S(G_{\text{big}}(x_{2i-1}; \theta_{\text{big}}), 2) + G_{\text{little}}(x_{2i}; \theta_{\text{little}})\big), \qquad (2)$$
where $x_{2i-1}$ is a big frame at half resolution and $x_{2i}$ is the little frame paired with it.
While the modification proposed above is simple, it leads to a new video architecture, which is called Big-Little-Video-Net, or bLVNet for short. The bLVNet differs from TSN-bLNet in two distinct ways. Firstly, without any increase in computation, it can take twice as many input frames as TSN-bLNet. We shall demonstrate the benefit of leveraging more frames for temporal modeling in Section 4. Furthermore, the bLVNet has fewer FLOPs than TSN while seeing twice as many frames as TSN, thanks to the efficiency of the dual-path network. Secondly, the merging of the two branches in bLVNet now happens on two different frames carrying temporal information. We call this type of temporal interaction local fusion, since it only captures temporal relations between two adjacent frames. In spite of that, local fusion gives rise to a significant performance boost for recognition, as shown later in Section 4.3.
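The frame split and the branch merge behind local fusion can be sketched as follows. This is an illustrative toy under stated assumptions rather than the released code: feature maps are single-channel nested lists, the two subnets are replaced by the identity, downsampling keeps every second pixel, and nearest-neighbour upsampling stands in for the bilinear upsampling used in the network. All function names are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # one single-channel H x W map


def split_big_little(frames: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Odd frames (1st, 3rd, ...) feed the big branch, even frames the little branch."""
    return frames[0::2], frames[1::2]


def downsample2(x: Frame) -> Frame:
    """Halve the resolution by keeping every second pixel (stand-in for resizing)."""
    return [row[::2] for row in x[::2]]


def upsample2(x: Frame) -> Frame:
    """Nearest-neighbour upsampling by 2 (a simplification of bilinear upsampling)."""
    out: Frame = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def merge(big_out: Frame, little_out: Frame) -> Frame:
    """Upsample the big-branch output and add it to the little-branch output."""
    up = upsample2(big_out)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(up, little_out)]


# Four 4x4 frames whose pixels all equal the frame number 1..4.
frames = [[[float(t)] * 4 for _ in range(4)] for t in range(1, 5)]
big, little = split_big_little(frames)
big = [downsample2(f) for f in big]                   # big frames at half resolution
fused = [merge(b, l) for b, l in zip(big, little)]    # one output per frame pair
```

Each fused map mixes information from two adjacent frames, and the expensive branch only ever touches quarter-size inputs, which is where the savings come from.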
Temporal modeling is a challenging problem for video understanding. Theoretically, adding a recurrent layer such as LSTM lstm:donahue2015longterm on top of a 2D ConvNet seems like a promising means to capture temporal ordering and long-term dependencies in actions. Nonetheless, such approaches are not competitive in practice with 3D ConvNets I3D:carreira2017quo , which use spatio-temporal filters to learn hierarchical feature representations. One issue with 3D models is that they are heavy in parameters and costly in computation, making them hard to train. Even though some approaches like S3D S3D:xie2018rethinking and R(2+1)D R(2+1)D:tran2018closer alleviate this issue by separating a 3D convolution filter into a 2D spatial component followed by a 1D temporal component, they are in general still more expensive than 2D ConvNet models.
With the efficient bLVNet architecture described above, our goal is to further improve its spatio-temporal representation by effective temporal modeling. The local fusion in bLVNet only exploits temporal relations between neighboring frames. To address this limitation, we develop a method to capture short-term as well as long-term dependencies across frames. Our basic idea is to fuse temporal information at each time instance by weighted channel-wise aggregation. As detailed below, this idea can be efficiently implemented as a network module to progressively learn spatio-temporal patterns in a hierarchical way.
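Let $y_t$ be the output (i.e. neural activation) of the $t$-th frame at a layer of the network (see Eq. 2). To model the temporal dependencies between $y_t$ and its neighbors, we aggregate the activations of all the frames within a temporal range $r$ around $t$. A weight is learned for each channel of the activations to indicate its relevance. Specifically, the aggregation results can be written as
$$\hat{y}_t = \sum_{j=-r}^{r} w_j \otimes y_{t+j}, \qquad (3)$$
where $\otimes$ indicates the channel-wise multiplication and $w_j$ are the weights for temporal offset $j$. The operator $\otimes$ is defined as: for a vector $w = [w^1, \dots, w^C]$ and a tensor $x$ with $C$ feature channels $x^1, \dots, x^C$, $w \otimes x = \mathrm{concat}(w^1 x^1, \dots, w^C x^C)$.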
We implement the temporal aggregation as a network module (Fig. 2). It involves the following three steps:
(1) apply 1×1 depthwise convolution $2r+1$ times to the $n$ input tensors to form an output matrix of size $(2r+1) \times n$, with one row per temporal offset $j$ ($-r \le j \le r$);
(2) shift the $j$-th row left (or right) by $|j|$ positions if $j > 0$ (or $j < 0$) and, if needed, pad zero tensors at the end (or in the front);
(3) perform temporal aggregation along each column to generate the output.
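The three steps above can be sketched in plain Python. This is a minimal sketch under simplifying assumptions, not the released implementation: each frame's activation is collapsed to one value per channel (a 1×1 depthwise convolution applies the same per-channel weight at every spatial position), `r` denotes the temporal range, and `weights[j + r][c]` holds the weight for temporal offset `j` and channel `c`. The function name and data layout are hypothetical.

```python
from typing import List

Tensor = List[float]  # one frame's activation, one value per channel


def temporal_aggregation(frames: List[Tensor], weights: List[List[float]]) -> List[Tensor]:
    """Weighted channel-wise aggregation over offsets -r..r with zero padding."""
    n, channels = len(frames), len(frames[0])
    r = (len(weights) - 1) // 2
    zero = [0.0] * channels

    # Step 1: per-channel scaling of every frame, one row per temporal offset.
    rows = [[[w[c] * f[c] for c in range(channels)] for f in frames] for w in weights]

    # Step 2: shift the row of offset j so that column t holds the term of frame t + j.
    shifted = []
    for idx, row in enumerate(rows):
        j = idx - r
        k = min(abs(j), n)
        if j > 0:
            row = row[k:] + [zero] * k          # shift left, pad at the end
        elif j < 0:
            row = [zero] * k + row[: n - k]     # shift right, pad at the front
        shifted.append(row)

    # Step 3: sum each column over all offsets.
    return [
        [sum(shifted[i][t][c] for i in range(2 * r + 1)) for c in range(channels)]
        for t in range(n)
    ]


# Three frames, two channels, r = 1; channel 1 ignores its neighbours.
activations = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
offset_weights = [[0.5, 0.0], [1.0, 1.0], [0.25, 0.0]]   # offsets -1, 0, +1
aggregated = temporal_aggregation(activations, offset_weights)
```

In the toy input, channel 0 of the middle frame becomes 0.5·1 + 1·2 + 0.25·3 = 3.25, while channel 1 passes through unchanged because its neighbour weights are zero.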
The aggregation module (TAM), highlighted as a red box in Fig. 1d), is inserted as a separate layer after the local temporal fusion in the bLVNet, resulting in the final bLVNet-TAM architecture. None of the steps in the implementation above involves costly computation, so the module is fairly fast. A node in the network initially only sees the neighbors within its temporal range. As the network goes deeper, the amount of context that the node involves in the input grows quickly, similar to how the receptive field of a neuron is enlarged in a CNN. In such a manner, long-range temporal dependencies are potentially captured. For this reason, the temporal aggregation is also called global temporal fusion here, as opposed to the local temporal fusion discussed above.
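The growth of temporal context can be made concrete with a short calculation. As a sketch, assume every TAM layer uses the same temporal range $r$, and ignore both the truncation at clip boundaries and the extra context contributed by local fusion. Here $T_d$, a symbol introduced only for this illustration, denotes the number of frames a node can see after $d$ stacked TAM layers.

```latex
T_1 = 2r + 1, \qquad T_d = T_{d-1} + 2r \;\;\Longrightarrow\;\; T_d = 2rd + 1 .
% Example: r = 1 and d = 4 give T_4 = 9 frames.
```

Under these assumptions the context grows linearly with depth, so a handful of stacked modules already spans a short clip.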
The work of TSM TSM:lin2018temporal has also applied temporal shifting to swap feature channels between neighboring frames. In such a case, TSM can be treated as a special case of our method where the weights are empirically set rather than learned from data. In Section 4.3, we demonstrate that the proposed TAM is more effective than TSM for temporal modeling under different video architectures. TAM is also related to S3D S3D:xie2018rethinking and R(2+1)D R(2+1)D:tran2018closer in that TAM is independent of spatial convolutions. However, TAM is based on depthwise convolution, thus has fewer parameters and less computation than S3D and R(2+1)D.
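The parameter argument can be illustrated with a rough count. The sketch below assumes a TAM with temporal range `r` learns one weight per channel per offset, and compares it with a full (non-depthwise) temporal convolution of kernel size `kernel` that keeps the channel count unchanged and has no bias, as a stand-in for the 1D temporal component of a factorized 3D filter. Both helper names are hypothetical.

```python
def tam_params(channels: int, r: int) -> int:
    """One learned weight per channel for each of the 2r + 1 temporal offsets."""
    return (2 * r + 1) * channels


def full_temporal_conv_params(channels: int, kernel: int) -> int:
    """A (kernel x 1 x 1) temporal convolution mixing all channels, bias omitted."""
    return kernel * channels * channels


c = 256
tam = tam_params(c, r=1)
full = full_temporal_conv_params(c, kernel=3)
ratio = full // tam
```

With equal temporal extent, the gap is exactly the channel count, since the depthwise variant drops the mixing across channels.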
The TAM can also be integrated into 3D convolutions such as C3D C3D:Tran2015learning and I3D I3D:carreira2017quo to further enhance the temporal modeling capability that already exists in these models. Due to the difference in how temporal data is presented between 2D-based and 3D-based models, the temporal shifting now needs to operate on feature channels within a tensor instead of on tensors themselves.
Datasets. We evaluate our approach on three large-scale datasets for video recognition, including the widely used Something-Something (Version 1 and Version 2) Something:goyal2017something , Kinetics-400 Kinetics:kay2017kinetics and the recent Moments-in-time dataset Moments:monfort2019moments . They are herein referred to as SS-V1, SS-V2, Kinetics-400 and Moments, respectively.
Something-Something is a dataset containing videos of 174 types of predefined human-object interactions with everyday objects. Versions 1 and 2 include 108k and 220k videos, respectively. This dataset focuses on human-object interactions in a rather simple setup with no scene contexts to be exploited for recognition. Instead, temporal relationships are as important as appearance for reasoning about the interactions. Because of this, the dataset serves as a good benchmark for evaluating the efficacy of temporal modeling, such as proposed in our approach. Kinetics-400 Kinetics:kay2017kinetics has emerged as a standard benchmark for action recognition after UCF101 ucf101:Soomro2012 and HMDB HMDB:Kuehne2011 , but on a significantly larger scale. The dataset consists of 240k training videos and 20k validation videos, with each video trimmed to around 10 seconds. It has a total of 400 human action categories.
Moments-in-time Moments:monfort2019moments is a recent collection of one million labeled videos, involving actions from people, animals, objects or natural phenomena. It has 339 classes and each video clip is trimmed to 3 seconds long.
Data Augmentation. During training, we follow the data augmentation used in TSN TSN:wang2016temporal to augment the video with different sizes spatially and flip the video horizontally with 50% probability. Furthermore, since our models are finetuned from models pretrained on ImageNet, we normalize the data with the mean and standard deviation of the ImageNet images. The model input is formed by uniform sampling, which first divides a video into uniform segments and then selects one random frame from each segment as the input.
During inference, we resize the smaller side of an image to 256 and then crop a centered 224×224 region. The center frame of each segment in uniform sampling is picked as the input. On Something-Something and Moments, our results are based on the single-crop and single-clip setting. On Kinetics-400, we use the common practice of multi-crop and multi-clip for evaluation.
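The uniform sampling used for training and inference can be sketched as below. This is a minimal sketch, assuming segments are formed by integer division of the frame range; the function name and its arguments are hypothetical and not taken from the released code.

```python
import random
from typing import List, Optional


def sample_frame_indices(
    num_frames: int,
    num_segments: int,
    training: bool,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick one frame index from each of num_segments roughly equal segments."""
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        start = (s * num_frames) // num_segments
        # Exclusive end; forced to contain at least one frame.
        end = max(start + 1, ((s + 1) * num_frames) // num_segments)
        if training:
            indices.append(rng.randrange(start, end))   # random frame in the segment
        else:
            indices.append((start + end - 1) // 2)      # center frame of the segment
    return indices


test_indices = sample_frame_indices(80, 8, training=False)
train_indices = sample_frame_indices(80, 8, training=True, rng=random.Random(0))
```

For an 8×2 input, the same routine would be called with 16 segments, after which the odd and even picks are routed to the big and little branches.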
Training Details. Since all three datasets are large-scale, we train the models in a progressive way. For each type of backbone (for example, bLResNet-50), we first finetune a base model from ImageNet-pretrained weights with a minimum input length (i.e. 8×2 in our case) for 50 epochs. We adopt the Nesterov momentum optimizer with an initial learning rate of 0.01, a weight decay of 0.0005 and a momentum of 0.9. We then finetune a new model with longer input (for example, 16×2) on top of the corresponding base model, but for 25 epochs only. In this case, the initial learning rate is set to 0.01 on Something-Something and 0.005 on Kinetics and Moments. The learning rate is decreased at the 10th and 20th epochs.
This strategy allows us to significantly reduce the training time needed for all the models evaluated in our experiments. All our models were trained on a server with 8 GPU cards and a total of 128 GB of GPU memory. We set the total batch size to 64 whenever possible. For models that require more memory to train, we adjust the batch size accordingly to the maximum number allowed.
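The step schedule for the second training stage can be sketched as follows. The decay factor is not recoverable from the text here, so `gamma=0.1` is purely an assumed placeholder, as is the convention that epochs are counted from zero; `learning_rate` is a hypothetical helper name.

```python
from typing import Sequence


def learning_rate(
    epoch: int,
    base_lr: float,
    milestones: Sequence[int] = (10, 20),
    gamma: float = 0.1,  # assumed decay factor, not stated in the text
) -> float:
    """Step schedule: scale base_lr by gamma once for each milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# 25-epoch schedule starting from 0.01, as used on Something-Something.
schedule = [learning_rate(e, 0.01) for e in range(25)]
```

Swapping in a different factor only requires changing `gamma`; the milestones follow the 10th and 20th epochs named above.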
|Model||Backbone||Pretrain||Frames||Modality||Param ($10^6$)||FLOPs ($10^9$)||Val Top-1 (%)||Val Top-5 (%)||Test Top-1 (%)|
|NL I3D + GCN GCN:wang2018gcn||ResNet-50||ImageNet||32+32||RGB||62.2||303||46.1||76.8|
|Model||Backbone||Pretrain||Frames||Modality||Param ($10^6$)||FLOPs ($10^9$)||Val Top-1 (%)||Val Top-5 (%)||Test Top-1 (%)||Test Top-5 (%)|
|Results obtained with the authors' pretrained models and code, evaluated under the 1-crop and 1-clip setting for fair comparison.|
|Ensemble of an RGB model and a Flow model, each evaluated with 3 crops and 10 clips and using 256 as the shorter side.|
Something-Something. We first report our results on the validation set of the Something-Something datasets in Table 1 and Table 2. With a moderately deep backbone bLResNet-50, our approach outperforms all 3D models on SS-V1 while using far fewer input frames (8×2) and being substantially more efficient. TSM TSM:lin2018temporal was previously the best approach on Something-Something. Under the same backbone (i.e. ResNet-50), our approach is better than TSM on both SS-V1 and SS-V2 while being more efficient (i.e. our 8×2 model requires fewer FLOPs than an 8-frame TSM model).
When empowered with a stronger backbone bLResNet-101, our approach achieves even better results at 32×2 frames (53.1% top-1 accuracy on SS-V1, and 65.2% on SS-V2), establishing a new state-of-the-art on Something-Something. Notably, these results, while based on RGB information only, are superior to those obtained from the best two-stream models at no more computational cost. This strongly demonstrates the effectiveness of our approach for temporal modeling. We further evaluated our models on the test set of Something-Something. Our results are consistently better than the best results reported by the other approaches in comparison, including two-stream models.
|Net||Backbone||Pretrain||FLOPs ($10^9$)||Top-1 (%)||Top-5 (%)|
|Net||Backbone||Pretrain||Frames||Modality||Top-1 (%)||Top-5 (%)|
Kinetics-400. Kinetics-400 is one of the most popular benchmarks for action recognition. Currently the best-performing models on this dataset are all based on 3D convolutions. However, it has been shown in the literature that temporal ordering in this dataset does not seem to be as crucial as RGB information for recognition. For example, as shown experimentally in S3D S3D:xie2018rethinking , the model trained on normal time-order data performs well on the time-reversed data on Kinetics. In accordance with this, our approach (3 crops and 3 clips) mainly performs on par with or better than the current large-scale architectures, but without outperforming them as clearly as on the Something-Something datasets, where the temporal relations are more essential for an overall understanding of the video content.
Moments. We finally evaluate the proposed architecture on the Moments dataset Moments:monfort2019moments , a large-scale action dataset with about three times more training samples than Kinetics-400. Since Moments is relatively new and results reported on it are limited, we only compare our results with those reported in the Moments paper Moments:monfort2019moments . As can be seen from Table 4, our approach outperforms all the single-stream models as well as the ensemble one. We hope our models provide stronger baseline results for future reference on this challenging dataset.
It is also noted that our model trained with more input frames produces only slightly better top-1 accuracy than the model trained with fewer frames. We speculate that this has to do with the fact that the Moments clips are only 3 seconds long, so that choosing a finer temporal granularity has only a limited impact on this dataset.
In this section, we conduct ablation studies to provide more insights about our main ideas.
Is temporal aggregation effective? We validate the efficacy of the proposed temporal aggregation module (TAM), which is considered as a global fusion method (Section 3.3). Local fusion here refers to the branch merging in the dual-path network (Section 3.2). We compare TAM with the temporal shift module used in TSM TSM:lin2018temporal in Table 5 under two different video architectures: TSN and the bLVNet proposed in this work. TAM demonstrates clear advantages over TSM, outperforming TSM by over 2% under both architectures. Interestingly, the proposed bLVNet baseline with local temporal fusion almost doubles the performance of the TSN baseline, improving the accuracy from 17.4% to 33.6%. On top of that, TAM boosts the performance by another 13% in both cases, suggesting that TAM is complementary to local fusion. This further confirms the significance of temporal reasoning on the Something-Something dataset.
|Net||Backbone||Local Fusion||Global Fusion||Top-1 (%)|
Does seeing more frames help? One of the main contributions of this work is an efficient video architecture that makes it possible to train deeper models with more input frames using moderate GPU resources. Fig. 3a) shows consistent improvement of our approach on SS-V1 as the number of input frames increases. A similar trend in our results can be observed on Kinetics-400 in Table 3. On the other hand, the almost flat line for TSN suggests that a model without effective temporal modeling cannot benefit much from a longer input.
Memory Usage. We compare the memory usage between our approach based on bLResNet-50 and TSN based on ResNet-50. As shown in Fig. 3b), our approach is more memory friendly than TSN, achieving a saving of 2 times at the same number of input frames. The larger batch size allowed for training under the same computational budget is critical for our approach to obtain better models and reduce training time.
We presented an efficient and memory-friendly video architecture for learning video representations. The proposed architecture allows for twice as many input frames as the baseline while using less computation and memory. This enables training of deeper models with richer input under the same GPU resources. We further developed a temporal aggregation method to capture temporal dependencies effectively across frames. Our models achieve strong performance on several action recognition benchmarks, and establish a state-of-the-art on the Something-Something dataset.
Here we detail our network architecture for bLVNet-TAM-50 in Table 6. We follow the notation used in bLNet chen2018biglittle ($\alpha$ and $\beta$), but add the proposed TAM module before branching out to Big-Net and Little-Net and the last shared residual block. As noted before, the two branches work on different frames and are then merged at every stage; in addition, within the ResBlock, the TAM module is placed on the non-shortcut path.
|Layers||Spatial output size||bLVNet-TAM-50|
|TAM-module||Temporal Aggregation Module ()|
|TAM-module||Temporal Aggregation Module ()|
|ResBlock, , s2|
|TAM-module||Temporal Aggregation Module ()|
|ResBlock, , s2|
|TAM-module||Temporal Aggregation Module ()|
|ResBlock||ResBlock, , s2|
|Average pool||average pooling|
|FC, softmax||# of classes|
|ResBlock: the first convolution uses stride 2, and the size is then restored via bilinear upsampling.|
|ResBlock: a convolution is applied at the end to align the channel size.|
|ResBlock: a residual block embedded with the temporal aggregation module.|
|s2: the stride is set to 2 for the convolution in the ResBlock.|
Here we describe how we convert the video data into images for our training and inference. For the Something-Something dataset, we resize the smaller side of an image to 256 while keeping the aspect ratio. For the Kinetics dataset, we resize the smaller side of an image to 331 since its original resolution is higher. For the Moments dataset, we resize an image to 256×256.
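The resizing rule, together with the centered 224×224 crop used at inference, reduces to a little integer arithmetic. The sketch below only computes target sizes and crop coordinates and leaves the actual image resampling to whatever library is used; rounding to the nearest integer is an assumption, and both function names are hypothetical.

```python
from typing import Tuple


def resize_shorter_side(width: int, height: int, target: int) -> Tuple[int, int]:
    """Scale (width, height) so the shorter side equals target, keeping aspect ratio."""
    if width <= height:
        return target, round(height * target / width)
    return round(width * target / height), target


def center_crop_box(width: int, height: int, crop: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered crop x crop region."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


# A 320x240 frame: the shorter side (height) becomes 256, then a 224x224 center crop.
new_w, new_h = resize_shorter_side(320, 240, 256)
box = center_crop_box(new_w, new_h, 224)
```

For Kinetics the same helper would be called with a target of 331, while Moments frames are resized directly to 256×256 without preserving the aspect ratio.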