Lightweight Network Architecture for Real-Time Action Recognition

05/21/2019 ∙ by Alexander Kozlov, et al. ∙ Intel

In this work we present a new efficient approach to Human Action Recognition called Video Transformer Network (VTN). It leverages the latest advances in Computer Vision and Natural Language Processing and applies them to video understanding. The proposed method allows us to create lightweight CNN models that achieve high accuracy and real-time speed using just a monocular RGB camera and a general-purpose CPU. Furthermore, we explain how to improve accuracy by distilling multiple models with different modalities into a single model. We compare our approach with state-of-the-art methods and show that it performs on par with most of them on well-known Action Recognition datasets. We benchmark the inference time of the models using a modern inference framework and argue that our approach compares favorably with other methods in terms of the speed/accuracy trade-off, running at 56 FPS on a CPU. The models and the training code are available.



1 Introduction

Figure 1: Accuracy vs complexity trade-off for different methods on Kinetics-400 validation set. First three models are the variants of the proposed VTN method. ResNet-34 3D with a similar number of MAC (accepts smaller resolution inputs) is presented for comparison. We also included several state-of-the-art methods: I3D [9], R(2+1)D [47], S3D-G [53], NL-C2D [52].

The latest advances in the Computer Vision domain are closely tied to the development of Deep Learning (DL) methods [42, 45, 23], which show great results on many tasks such as Image Classification [14], Segmentation [12], and Object Detection [16, 32]. There is a tendency nowadays to create increasingly sophisticated pipelines [22, 57, 7] that combine quite complex components; these solve the task well but require a massive amount of computation and power. On the other hand, since the times of AlexNet [30] and VGG [42], where a vanilla convolution was used as the basic building block, new lightweight primitives have been proposed [27, 10, 26, 56] that reduce the theoretical complexity while retaining or even improving the final accuracy. However, video-level tasks such as Human Action Recognition, which is the subject of this work, require considering the temporal structure of the input data by aggregating information from multiple frames in order to resolve action ambiguities (e.g. opening versus closing a door). This inevitably incurs extra computational cost during inference. Nevertheless, few studies [8] pay attention to the complexity of the algorithm while maximizing accuracy. Therefore, creating a solution that achieves high accuracy at a fast inference speed is a relevant task, especially for low-power devices used in edge computing.

Following this idea, we propose a lightweight architecture for Action Recognition (AR) that runs in real time on a regular CPU while performing on par with heavy methods such as 3D CNNs [46, 9, 47]. In support of this, we compare our model with state-of-the-art methods (see Fig. 1 and Section 4.3) and verify its accuracy on modern benchmarks such as Kinetics [28], UCF-101 [43], and HMDB-51 [31].

In short, our contributions can be summarized as follows:

  • A new lightweight CNN architecture for real-time Action Recognition that achieves results comparable to state-of-the-art methods.

  • Comparison of modern approaches to Action Recognition.

  • A method for improving the accuracy of an existing model by incorporating information from an additional modality without a discernible increase in complexity.

Figure 2: Overview of the VTN architecture. Input frames are fed to the CNN encoder and globally pooled to get frame embeddings. Then the decoder block (see Fig. 3 for details) is applied several times. In the end, the clip logits are produced by averaging all frame logits.

2 Related Work

Multiple methods currently address the AR problem with varying degrees of success.

One example is the two-stream framework, which fuses information from spatial and temporal nets [41, 18]. The spatial net takes an RGB frame as input and is an ordinary classification CNN working at the frame level, whereas the temporal net receives multiple stacked optical flow (OF) frames. Calculating OF with traditional algorithms such as TVL1 [55] requires extra resources, but there are several ways to avoid it. For example, OF can be extracted with an additional sub-network [44], or RGB difference [51] can be used as an alternative motion representation.

Another popular group of methods relies on 3D primitives such as 3D Convolution, 3D Batch Normalization, and 3D Pooling. They generalize the original operations by introducing an additional temporal dimension that indexes the sequence of frames. One of the first architectures that leveraged these primitives for AR is C3D [46]. Another famous 3D CNN, which saturated the UCF-101 benchmark [43], is I3D [9]. It benefits from pre-training on the large-scale ImageNet [14] dataset by inflating trained 2D filters into 3D. Although methods based on 3D convolutions improve accuracy, their computational cost may reach dozens of GFLOPs. Another substantial drawback is that, at some levels of the network, only a small number of weights inside the convolutional kernels contribute significantly to the absolute value of the activations, which makes the utilization of resources inefficient. This problem was mentioned in [47, 53], where the authors proposed decomposition techniques and mixed architectures that combine 3D and 2D operations at different levels of the network.

Recurrent neural networks, LSTMs [25], and GRUs [11] have been regarded as the default starting point for many sequence modeling problems, such as machine translation or language modeling [20]. Significant results have been achieved on several challenging tasks by employing recurrent networks and the attention mechanism [40, 4]. Not surprisingly, several approaches to video classification that model sequences with recurrent connections or gated units have been proposed [54, 39, 15]. These models, while showing comparable results on many benchmarks [9], seem to be more suitable for online prediction and thus real-time applications, because the feature vector computed for a frame can be reused when predicting the classification label for multiple time windows containing this frame.

Several viable alternative approaches to sequence modeling have been proposed recently. These approaches, for example convolutional [5] or fully-attentional (e.g. Transformer [48]) networks, achieve better results on many tasks while addressing significant shortcomings of RNNs such as sequential computation and vanishing gradients.

We adopt the recently proposed Transformer network in our work as a more elaborate way of modeling sequences. This allowed us to attain high accuracy while retaining a speed that is sufficient for real-time applications.

3 Approach

In this section, we describe our approach to the AR problem in detail and discuss some improvements that help to boost the accuracy of our baseline architecture without significantly increasing its complexity.

3.1 Architecture overview

Figure 3: Detailed overview of the decoder block used in VTN. Only a small number of self-attention heads is shown in the scheme for simplicity. Each head independently transforms the input sequence embeddings into its query, key, value triplet using three trainable linear transformations and applies the self-attention operation. To produce the output sequence, the resulting vectors are concatenated and passed to a block of two convolutions with a kernel of size 1 and a residual connection around those convolutions.

Video Transformer Network (see Fig. 2) consists of two parts: the first is the encoder, which processes each frame of the input sequence independently with a 2D CNN in order to get frame embeddings, and the second is the decoder, which integrates inter-frame temporal information in a fully-attentional feed-forward fashion, producing the classification label for the given clip. ResNet-34 [23] is used as the baseline architecture for the encoder in most of our experiments. We reuse the parameters of all convolutional layers to maximize the benefit of transfer learning from image classification tasks. Global average pooling is then applied to the resulting feature maps to get frame embeddings of size $d_{model}$ (equal to 512 in our case), which are then transformed by the decoder by repeatedly applying multi-head self-attention and convolutional blocks. In the multi-head self-attention block, the temporal interrelationship between frames is modeled by informing each frame representation with the representations of the other frames through the attention mechanism. The block consists of several sequential operations. First, the vectors of frame representations are mapped to multiple key, value, and query spaces using different learned affine transformations. Each triple of query $Q \in \mathbb{R}^{T \times d_k}$, key $K \in \mathbb{R}^{T \times d_k}$, and value $V \in \mathbb{R}^{T \times d_v}$ matrices (where $T$ is the sequence size and $d_k$, $d_v$ are the dimensions of the key and value spaces respectively) is then transformed to the corresponding head output using scaled multiplicative attention as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

The head outputs are then concatenated and passed to the convolutional block, which consists of two convolutions with a kernel of size 1 (a position-wise feed-forward network) and a residual connection. The resulting frame representations are then refined by applying the same procedure multiple times. We found experimentally that four stacked decoder blocks are sufficient to maximize classification accuracy; a further increase in the number of blocks did not lead to improvement. To produce action confidences for the current clip, a fully-connected layer is applied to all elements of the sequence. The resulting scores are then averaged and normalized with the softmax function, producing the clip prediction.
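The attention step of a single head and the final averaging of per-frame logits can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and toy values are hypothetical, and the frame embeddings are used directly as queries, keys, and values instead of passing through learned linear transformations.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q and K are T x d_k, V is T x d_v, each given as a list of rows.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this frame's query to the key of every frame.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


def clip_prediction(frame_logits):
    """Average per-frame logits over the clip, then apply softmax."""
    num_frames = len(frame_logits)
    mean = [sum(col) / num_frames for col in zip(*frame_logits)]
    return softmax(mean)


# Toy example: 3 frames with 2-dimensional embeddings (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(X, X, X)
probs = clip_prediction(attended)
```

In a full decoder block, several such heads would run in parallel on differently projected inputs, and their concatenated outputs would pass through the two kernel-size-1 convolutions with a residual connection.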

3.2 Multimodal knowledge distillation

As discussed above, fusing the results of models that receive inputs of different modalities is a common way to improve the accuracy of an Action Recognition algorithm. In most cases, however, it substantially increases computational complexity, for two reasons. First, it requires computing a new modality, which may itself be a hard task, especially in the case of optical flow, where commonly used algorithms perform costly iterative energy minimization. Second, since the same architecture is used for prediction on the second modality, the complexity of the method is doubled. Both issues make multimodal solutions difficult to apply in real-world applications.

On the other hand, using the RGB difference in place of the optical flow results in almost the same performance [51], which our experiments confirm. At the same time, it requires far fewer computational resources, which makes this modality more suitable for use in conjunction with plain RGB data.

Knowledge distillation [24] is a procedure designed to help the optimization of a student network by providing extra supervision from a larger model or an ensemble of models (the teacher). This technique has been applied successfully to reduce the complexity of a larger teacher network [36] and to integrate the performance of an ensemble of models into a single student [6, 24]. We ask whether it is also possible to transfer knowledge from multiple models working on different modalities (a two-stream teacher) to a single student. To investigate this, we ran several experiments in which knowledge from two ResNet-34-based VTN models working with RGB and RGB difference is distilled into a single RGB model and into a model that receives stacked RGB and RGB difference inputs. We also trained a model that operates on stacked input without extra supervision from knowledge distillation. The results are summarized in Table 1. The model working on stacked inputs outperforms the single-modality model when trained with knowledge distillation. We suppose the main reason is that the motion representations learned by the RGB-difference subnetwork of the two-stream teacher are not discovered by the RGB-only model, yet they contribute significantly to model performance. Note that this technique does not match the performance of the two-stream model. However, it significantly reduces the complexity compared with the original two-stream solution.

Model                             Video@1   GMAC*
Fused RGB + RGB-diff (teacher)    78.2      7.51
RGB                               75.2      3.77
RGB with KD                       75.2      3.77
Stacked RGB + RGB-diff            75.2      3.88
Stacked RGB + RGB-diff with KD    76.0      3.88
* Billions of multiply-accumulate operations.
Table 1: Results of knowledge distillation (KD) from the two-stream (fusion of two models) ResNet-34-VTN teacher on the Mini-Kinetics dataset. The single model that works with stacked modalities improves its accuracy when trained as a student in the knowledge distillation setup. However, the RGB-only model does not benefit from KD.
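The text does not spell out the distillation objective, so the following is only a sketch of one common formulation (soft targets with a temperature, as in [24]) applied to a two-stream teacher. The fusion rule (averaging teacher logits), the temperature, the weighting factor alpha, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def two_stream_teacher(rgb_logits, diff_logits):
    """Fuse the RGB and RGB-difference teachers by averaging logits (assumed rule)."""
    return [(a + b) / 2.0 for a, b in zip(rgb_logits, diff_logits)]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target KL divergence."""
    # Cross-entropy with the ground-truth label.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions.
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    # The temperature**2 factor keeps gradient magnitudes comparable.
    return alpha * ce + (1.0 - alpha) * temperature ** 2 * kl


teacher = two_stream_teacher([2.0, 0.0, 0.0], [4.0, 0.0, 0.0])
loss_matched = distillation_loss([3.0, 0.0, 0.0], teacher, label=0)
loss_mismatched = distillation_loss([0.0, 3.0, 0.0], teacher, label=0)
```

A student that reproduces the teacher's logits pays only the cross-entropy term, while a student that disagrees with the teacher is penalized by both terms.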

4 Experiments

In this section we present a study of the proposed method. Kinetics-400 is considered as the primary benchmark. However, the smaller Mini-Kinetics subset that was introduced in [53] is also used for faster experimentation. We also evaluated our models on UCF-101 and HMDB-51 and evaluated the inference speed on CPU.

4.1 Implementation details

We train and validate our models on 16-frame input sequences formed by sampling every second frame from the original video; the total temporal receptive field of our model therefore equals 32 frames. We tried longer sequences, obtained by adding frames or skipping more frames, but this only increased the clip-level accuracy, not the video-level accuracy. In order to calculate video classification accuracy (Video@1), we extracted all non-overlapping 32-frame segments and averaged the predictions over these segments.
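The sampling and video-level evaluation described above can be sketched as follows. The function names are hypothetical, and the handling of videos shorter than 32 frames (which yield no clips here) is an assumption, since the text does not cover that case.

```python
def clip_frame_indices(num_frames, clip_len=16, stride=2):
    """Frame indices of all non-overlapping clips in a video.

    Each clip takes every `stride`-th frame from a segment of
    clip_len * stride consecutive frames (32 with the defaults).
    """
    span = clip_len * stride
    clips = []
    for start in range(0, num_frames - span + 1, span):
        clips.append(list(range(start, start + span, stride)))
    return clips


def video_prediction(clip_scores):
    """Average per-clip class scores and return the arg-max class index."""
    num_clips = len(clip_scores)
    mean = [sum(col) / num_clips for col in zip(*clip_scores)]
    return max(range(len(mean)), key=mean.__getitem__)


# A 70-frame video yields two 32-frame segments; the last 6 frames are unused.
clips = clip_frame_indices(70)
predicted_class = video_prediction([[0.2, 0.8], [0.6, 0.4]])
```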

Frames are scaled so that the shorter side equals 256. We randomly crop with four different scales during training, as described in [50], and use a central crop at test time. The Adam optimizer [29] with a momentum of 0.9 and a weight decay of 0.0001 is used. The initial learning rate is decayed by a factor of 10 when the validation loss reaches a plateau. Models are trained until the validation loss stops decreasing, which usually happens within 50 epochs.
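The plateau-based decay can be sketched with a small helper like the one below. The class name, the patience of two epochs, and the starting learning rate of 1e-4 are placeholders chosen for illustration, not values reported in the text. PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` implements a similar rule.

```python
class PlateauDecay:
    """Divide the learning rate by 10 when validation loss stops improving."""

    def __init__(self, lr, factor=0.1, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed number of tolerated bad epochs
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Update the state with the latest validation loss; return the current lr."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


# Placeholder starting rate; the loss improves twice, then stalls for three epochs.
scheduler = PlateauDecay(lr=1e-4, patience=2)
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95]]
```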

4.2 Model hyperparameters

We varied the structure of our decoder block in order to find the one that maximizes performance on the Mini-Kinetics dataset, and assumed that the same parameter settings would also work best on other datasets.

First of all, we evaluated how the number of stacked decoder blocks affects accuracy. We trained models with 1, 3, 4, 5, and 6 blocks and determined that 4 blocks yield the maximal accuracy; a higher number of blocks does not further boost the metric. We also experimented with sharing parameters between blocks by applying one block recurrently, as suggested in [13], but it did not improve performance. We also varied the number of heads in multi-head self-attention and the dimension of the query, key, and value spaces to find the best-performing configuration. Finally, we tried adding a trainable linear transformation after the concatenation of heads and using layer normalization in different locations, but these changes did not affect the accuracy.

4.3 Comparison with other methods

Model                     Mini-Kinetics   UCF-101   GMAC    FPS   Parameters
3D CNN                    72.9            86.4      50.2    5     63.5M
Fused RGB and OF          74.3            89.8      8.5*    32    42.8M
Fused RGB and RGB-diff    73.7            88.3      9.1     30    42.9M
Stacked LSTMs             72.0            86.6      3.7     55    27.6M
VTN (ours)                75.2            89.0      3.8     56    29.0M
* Optical flow calculation is not included in the complexity estimation.
Table 2: Comparison of different approaches to Action Recognition on the Mini-Kinetics dataset with further fine-tuning on UCF-101 split 1 (accuracy, Video@1). All models are based on ResNet-34, with an input resolution of 224x224 and 16-frame inputs. Inference time was measured on an Intel Core™ i7-8700 CPU @ 2.90GHz and is expressed in frames per second.

In order to better understand capabilities of the proposed approach, we compare it with methods described in Section 2. For a fair comparison, we take ResNet-34 architecture and extend it to the case of 3D networks and two-stream methods in the way described below.

The first model we compare with is ResNet-34 3D, described in [21]. It follows the common ResNet architecture but, instead of 2D convolution and pooling layers, uses their 3D analogs. A global average pooling operation over the three dimensions is applied at the end of the network to get a representation vector, which is fed to a fully-connected layer producing the CNN output. A vanilla ResNet-34 pre-trained on ImageNet is used to initialize its 3D analog, with convolutional kernels repeated along the temporal dimension as proposed in [9].
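Kernel inflation in the style of [9] can be sketched for a single-channel kernel as below. The 1/t rescaling follows the I3D recipe, which keeps the response on a static clip equal to the 2D response; whether the comparison model here applies exactly this rescaling is an assumption, and the helper names are hypothetical.

```python
def inflate_kernel(kernel_2d, t):
    """Repeat a k_h x k_w kernel t times along time, rescaled by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def response_2d(kernel, patch):
    """Response of a 2D kernel at one spatial position."""
    return sum(w * x
               for k_row, p_row in zip(kernel, patch)
               for w, x in zip(k_row, p_row))


def response_3d(kernel, clip):
    """Response of a 3D kernel on a stack of patches (one per time step)."""
    return sum(response_2d(k, p) for k, p in zip(kernel, clip))


kernel = [[1.0, 2.0], [3.0, 4.0]]
patch = [[1.0, 0.0], [0.0, 1.0]]
inflated = inflate_kernel(kernel, t=3)
# A "boring" clip that repeats the same patch gives the same response as in 2D.
static_clip = [patch, patch, patch]
```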

The next approach that we consider is a two-stream model represented by a fusion of two ResNet-34 CNNs trained on RGB and OF inputs. The OF model is almost the original ResNet-34, but its first convolutional layer receives a 32-channel input formed by the horizontal and vertical components of pre-calculated optical flow for 16 sequential frames. To initialize this layer, we average the first convolutional kernel of the RGB model pre-trained on ImageNet over the channel dimension and repeat it 32 times.
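The first-layer initialization for the optical flow model can be sketched for one output filter as follows. The function name and the nested-list layout (channels, height, width) are assumptions made for illustration.

```python
def init_flow_filter(rgb_filter, in_channels=32):
    """Build a multi-channel filter from a pre-trained 3-channel RGB filter.

    rgb_filter has layout [3][k_h][k_w]; the result has layout
    [in_channels][k_h][k_w], where every channel is the RGB channel mean.
    """
    num_rgb = len(rgb_filter)
    k_h, k_w = len(rgb_filter[0]), len(rgb_filter[0][0])
    mean = [[sum(rgb_filter[c][i][j] for c in range(num_rgb)) / num_rgb
             for j in range(k_w)]
            for i in range(k_h)]
    # Copy the rows so that channels do not share the same list objects.
    return [[row[:] for row in mean] for _ in range(in_channels)]


# Toy 1x1 filter with one weight per RGB channel (hypothetical values).
flow_filter = init_flow_filter([[[1.0]], [[2.0]], [[3.0]]])
```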

We also tried a two-stream model in which the two fused CNNs were trained on RGB and RGB difference inputs, since the latter is much cheaper to calculate than the optical flow. In this case, the motion model receives a 48-channel input of RGB differences from 16 consecutive frames.
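Stacking RGB differences along the channel axis can be sketched as below. The text does not say how 16 difference images are obtained at the clip boundary, so this sketch assumes 17 input frames producing 16 differences of 3 channels each (48 channels in total); the function name and the nested-list frame layout are also assumptions.

```python
def rgb_difference_stack(frames):
    """Channel-stacked differences of adjacent frames.

    Each frame has layout [C][H][W]; the result has (len(frames) - 1) * C
    channels of size H x W.
    """
    stacked = []
    for prev, cur in zip(frames, frames[1:]):
        for ch_prev, ch_cur in zip(prev, cur):
            stacked.append([[b - a for a, b in zip(row_prev, row_cur)]
                            for row_prev, row_cur in zip(ch_prev, ch_cur)])
    return stacked


# 17 toy frames of size 3 x 1 x 1 whose intensity grows by 1 per frame.
frames = [[[[float(t)]] for _ in range(3)] for t in range(17)]
motion_input = rgb_difference_stack(frames)
```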

The last model examined in our comparison is the ResNet-34 followed by three stacked LSTM cells operating on independent frame embeddings. As before, we use the ImageNet pre-trained model for initialization, but learn LSTM parameters from scratch. We found this model quite simple but representative at the same time. We also tried to apply a visual attention mechanism, as suggested in [39], but it did not improve the performance.

The comparison of the described models with our proposed method is shown in Table 2. For convenience, we also provide the theoretical complexity and inference time of all models. The input resolution is set to 224x224 and the sequence size is 16 frames for all models. The models were trained with the Adam optimizer until the validation loss reached a plateau. The results show that our VTN model outperforms the others on the Mini-Kinetics dataset and works on par with the two-stream method on UCF-101. We find this surprising: we expected the 3D convolutional model to perform better, because it consists of operations that can learn temporal dependencies at every layer and has a higher capacity in terms of the number of parameters.

Another interesting result is that the two-stream RGB-difference model performs close to the OF-based model while saving a large amount of computation. These findings agree with the results of [21, 35]. Overall, our VTN approach is attractive in terms of the speed/accuracy trade-off.

4.4 Comparison with state-of-the-art

Method                             Video@1
BNInception+TSN-RGB [51]           69.1*
I3D-RGB [9]                        72.1
I3D-TwoStream [9]                  75.7
S3D-G [53]                         74.7
R(2+1)D-TwoStream [47]             75.4
R(2+1)D-RGB [47]                   74.3
NL-I3D-ResNet-101-RGB [52]         77.7
MobileNetV2-VTN-RGB                62.5
ResNet-34-VTN-RGB                  68.3
ResNet-34-VTN-RGB+RGBDiff          71.0
SE-ResNeXt-101-VTN-RGB             69.5
SE-ResNeXt-101-VTN-RGB+RGBDiff     73.5
* Author's implementation, which uses 10-crop TTA during testing.
Table 3: Comparison with the state-of-the-art on Kinetics-400 dataset.

To compare with other state-of-the-art models we assessed our approach on Kinetics-400 dataset. In addition to the baseline ResNet-34-VTN, we used a larger model employing SE-ResNeXt-101 (32x4d) architecture for the encoder, which is, however, still very cheap in terms of a number of multiply-accumulates in comparison with 3D CNNs. Another interesting question is the potential of the proposed method in optimizing a model for mobile devices and what associated drop in accuracy it would incur. To tackle this question we tested our approach with the lightweight MobileNetV2 [38] encoder.

Since fusing predictions from streams with different modalities (e.g. RGB and optical flow, or RGB and RGB difference) has improved results in many published works, we experimented with enhancing the results of our RGB model by combining it with the analogous RGB difference model. We subtracted normalized adjacent frames and trained the ResNet-34-VTN model on this data. This allowed us to improve the results of the ResNet-34-VTN model by a margin of 2.4%.

The results for the Kinetics-400 validation set are presented in Table 3. The breakthrough I3D model [9] outperforms ResNet-34 VTN and SE-ResNeXt-101 (32x4d) VTN only by a small margin of 3.5% and 2.1% respectively; our method thus still shows competitive results while being significantly cheaper computationally in online prediction scenarios.

We also provide results on the popular UCF-101 and HMDB-51 datasets. We fine-tuned models trained on Kinetics-400 for 20 epochs with a smaller learning rate. Mean video accuracies over three validation splits are presented in Table 4.

Computational complexity versus accuracy on Kinetics-400 for some state-of-the-art methods and several variants of VTN is shown in Fig. 1. Since we primarily focus on the online prediction scenario (i.e. when the classification label is required for every subsequent frame), we count the operations needed to execute the encoder on one frame plus the operations of the whole decoder. 3D convolutional models, on the other hand, extract features from adjacent frames jointly and must execute the entire network for each new frame. Thus our method is more attractive in terms of accuracy/complexity for real-time applications.
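The online scenario can be sketched with a sliding window of cached frame embeddings, so that each new frame costs one encoder call plus one decoder call. The class name and the `encoder` and `decoder` callables are placeholders for the real networks, and the sketch ignores the every-second-frame sampling for brevity.

```python
from collections import deque


class OnlineRecognizer:
    """Sliding-window prediction that reuses cached frame embeddings."""

    def __init__(self, encoder, decoder, window=16):
        self.encoder = encoder
        self.decoder = decoder
        # The deque drops the oldest embedding once the window is full.
        self.embeddings = deque(maxlen=window)

    def push(self, frame):
        """Encode only the new frame, then decode the whole cached window."""
        self.embeddings.append(self.encoder(frame))
        if len(self.embeddings) < self.embeddings.maxlen:
            return None  # not enough frames for a prediction yet
        return self.decoder(list(self.embeddings))


encoder_calls = []


def toy_encoder(frame):
    """Stand-in for the CNN encoder; records each call."""
    encoder_calls.append(frame)
    return frame * 2


recognizer = OnlineRecognizer(toy_encoder, decoder=sum, window=16)
outputs = [recognizer.push(frame) for frame in range(18)]
```

In this toy run the encoder is called 18 times for 18 frames, whereas re-encoding every full window from scratch would require 16 encoder calls per prediction.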

Method                             UCF-101   HMDB-51
IDT [49]                           86.4      61.7
C3D [46]                           85.2      -
Two-Stream [41]                    88.0      59.4
Two-Stream Fusion + IDT [19]       93.5      69.2
BNInception+TSN-RGB [51]           91.1      -
P3D [51]                           88.6      -
ST-ResNet + IDT [17]               94.6      70.3
--------------------------------------------------
I3D-RGB [9]                        95.6      74.8
I3D-TwoStream [9]                  98.0      80.7
S3D-G [53]                         96.8      75.9
R(2+1)D-TwoStream [47]             97.3      78.7
ResNet-34-VTN-RGB                  90.8      63.5
SE-ResNeXt-101-VTN-RGB             92.2      67.2
ResNet-34-VTN-RGB+RGBDiff          95.0      71.3
SE-ResNeXt-101-VTN-RGB+RGBDiff     95.0      71.6
Table 4: Comparison with other methods on UCF-101 and HMDB-51 (average metric over all splits). Methods of the first set of rows do not use Kinetics pre-training.

4.5 Inference speed

Since theoretically faster models do not necessarily run faster in practice [34, 33, 37], we also evaluate the actual inference time to confirm the feasibility of the proposed method for real-time applications. Several frameworks are currently available, such as NVIDIA TensorRT [1] or the Intel® OpenVINO™ Toolkit [3], which can highly optimize a DL model for particular hardware. Since we primarily focus on models suitable for edge computing, we chose OpenVINO and its DL Deployment Toolkit as the inference engine for our solution. OpenVINO can import models from many DL frameworks as well as from the ONNX [2] representation, which we use to convert models from the PyTorch framework used in all our experiments.

Model                                FPS    GMAC
ResNet-34-VTN-RGB                    56     3.77
Stacked RGB+RGBDiff ResNet-34 VTN    51     4.2
ResNet-50-VTN-RGB                    49     4.25
MobileNetV2-VTN-RGB                  177    0.4
Table 5: Inference time of various Video Transformer Networks with OpenVINO on an Intel Core™ i7-8700 CPU @ 2.90GHz.

Table 5 shows the inference time on CPU of several models that employ the proposed approach. Faster than real-time speed is achieved for all models, making this method promising for edge computing.

5 Conclusions

In this work, we have proposed a new Video Transformer Network architecture for real-time Action Recognition. We have shown that adopting methods from Natural Language Processing, together with an appropriate CNN for Image Classification, helps to achieve accuracy on par with state-of-the-art methods. Moreover, we have demonstrated that the proposed approach compares favorably with other approaches, such as 3D convolution-based models or two-stream methods. Specifically, it uses computational resources more effectively by embedding each input frame into a lower-dimensional high-level feature vector and then inferring the action from the embedding vectors alone by means of self-attention. This method achieves real-time inference on a general-purpose CPU, making it possible to use AR algorithms at the edge. Our research also demonstrates that the self-attention mechanism is quite universal and can be applied to many tasks, such as Natural Language Processing, Speech Recognition, or Computer Vision.