MixTConv: Mixed Temporal Convolutional Kernels for Efficient Action Recognition

01/19/2020 ∙ by Kaiyu Shan, et al. ∙ Peking University

To efficiently extract spatiotemporal features of video for action recognition, most state-of-the-art methods integrate 1D temporal convolution into a conventional 2D CNN backbone. However, they all use 1D temporal convolution with a fixed kernel size (i.e., 3) in the network building block, and thus have suboptimal temporal modeling capability for handling both long-term and short-term actions. To address this problem, we first investigate the impact of different kernel sizes for the 1D temporal convolutional filters. Then, we propose a simple yet efficient operation called Mixed Temporal Convolution (MixTConv), which consists of multiple depthwise 1D convolutional filters with different kernel sizes. By plugging MixTConv into the conventional 2D CNN backbone ResNet-50, we further propose an efficient and effective network architecture named MSTNet for action recognition, and achieve state-of-the-art results on multiple benchmarks.




1 Introduction

Action recognition, which aims at assigning corresponding labels to the given videos, is a fundamental task for many real-world applications such as human-computer interaction and urban security systems. Temporal information is very important for action recognition. For example, it is hard to distinguish between “pulling something right to left” and “pulling something left to right” without temporal information[4]. Hence, how to model both spatial and temporal information of video, i.e., extracting spatiotemporal features of video, is crucial for action recognition.

2D CNN-based action recognition approaches[8, 19, 23, 3] extract spatial features from each sampled frame individually, which is efficient but struggles to model temporal information. In contrast, 3D CNN-based methods[1, 17] jointly learn spatiotemporal features and achieve higher recognition accuracy, but incur a huge computational cost. To address these problems, most state-of-the-art methods[22, 18, 11] integrate 1D temporal convolution into a conventional 2D CNN to achieve a good trade-off between efficiency and accuracy. Despite their success, their ability to model both long-term and short-term actions is not optimal, since they all use 1D temporal convolution with a fixed kernel size (i.e., 3) in the network building block. Moreover, they employ ordinary 1D convolution along the temporal dimension, so their efficiency can be further improved by depthwise convolution.

To tackle the issues mentioned above, we first study different lightweight temporal convolution methods and the impact of different kernel sizes along the temporal dimension. We find that: 1) depthwise 1D convolution performs better than ordinary 1D convolution with large computational savings; 2) a large kernel size for depthwise 1D temporal convolution does not always bring higher accuracy; 3) using both large and small kernels along the temporal dimension can capture long-term and short-term temporal information simultaneously, leading to better accuracy and efficiency.

According to these findings, we propose a simple yet efficient temporal operation for spatiotemporal feature extraction, i.e., Mixed Temporal Convolution (MixTConv). This operation partitions the input channels into groups and applies depthwise 1D convolution with a different kernel size to each group, so that it can extract temporal features at different scales. It has several advantages: 1) compared with 3D convolution and ordinary 1D convolution, it achieves both higher accuracy and higher efficiency by extracting multi-scale temporal features; 2) it keeps the input and output the same size, so it is plug-and-play and can be flexibly inserted into any 2D CNN. For instance, we insert the proposed MixTConv into the residual block of ResNet-50[5] to build a Mixed Spatiotemporal Network (MSTNet) for action recognition. Experimental results on various large-scale public datasets show that: 1) the proposed MixTConv operation significantly improves the recognition accuracy of the 2D CNN baseline, by 27.6% (from 20.5% to 48.1%) on Something-Something v1 and by 31.4% (from 30.4% to 61.8%) on Something-Something v2; 2) compared with the state-of-the-art method TSM[9], MSTNet improves recognition accuracy by more than 1.1% with negligible additional parameters and computational cost.

To sum up, the contributions of this work are threefold:

  • We propose a novel Mixed Temporal Convolution operation (MixTConv) for efficient spatiotemporal feature extraction, which mixes up multiple depthwise 1D convolutional filters with different kernel sizes.

  • We further propose a very efficient and effective network architecture named MSTNet for action recognition by plugging MixTConv into conventional 2D CNN, which significantly improves recognition accuracy of the 2D CNN baseline.

  • We achieve state-of-the-art results on multiple benchmarks, including Something-Something v1, Something-Something v2 and Jester.

2 Related Work

Spatiotemporal modeling. 2D CNN-based methods[19, 3] directly aggregate frame-wise predictions. For example, Feichtenhofer et al.[3] design a two-stream CNN that combines RGB input and optical flow. TSN[19] divides the video into N segments and samples one frame from each segment, then forms a consensus by averaging the per-frame results. Despite their high efficiency, 2D CNN-based methods perform poorly on action videos because of their weak temporal modeling. 3D CNN-based methods[7, 17, 1] jointly learn spatiotemporal features in an elegant way. Tran et al.[17] propose C3D, based on the VGG backbone[14], to capture temporal features from a frame sequence. I3D[1] introduces a 3D ConvNet based on 2D ConvNet inflation, expanding the filters and pooling kernels of an Inception V1 model[15] into 3D convolutional kernels, so that it can leverage 2D network architectures designed for image classification and even their parameters pre-trained on ImageNet[2]. However, due to their model complexity, pure 3D convolutional networks are resource-intensive and prone to overfitting[22]. Hence, several methods focus on decomposing 3D convolutions into separate 2D spatial and 1D temporal filters[22, 18, 11]. However, these methods still suffer from high computational cost because they use ordinary 1D convolution.

Efficient operations and modules for temporal modeling. Some methods attempt to trade off performance and computation by proposing different modules or temporal operations[6, 23, 9]. TRN[23] adds temporal fusion after feature extraction, which leads to limited performance improvement. TSM[9] uses a shift operation that shifts a portion of the channels along the temporal dimension. Essentially, this operation is a fixed-weight depthwise 1D convolution, which is not flexible enough for temporal modeling. Timeception[6] is another module that uses depthwise 1D convolution with different kernel sizes. However, it has a more complex structure and differs substantially from our MixTConv. Specifically, a Timeception layer divides the input channels into several groups, and each group consists of multiple branches. Each branch is composed of a depthwise 1D convolution with its own kernel size, followed by a 2D convolution layer. In contrast, our MixTConv is much more efficient: we divide the input channels into multiple groups and apply depthwise 1D convolution with a different kernel size to each group. Moreover, the way [6] integrates its module differs from ours. In the network of [6], four Timeception layers are stacked on top of the last convolution layer of a 3D CNN or 2D CNN, which is late fusion, the same as in TRN. In contrast, MixTConv is inserted into all the blocks of the 2D CNN backbone to build our MSTNet.

Mixed Convolution. MixConv[16] uses 2D spatial convolution filters with different kernel sizes to extract spatial features at various resolutions, in order to improve image recognition accuracy. In contrast, our MixTConv mixes multiple depthwise 1D convolutional filters with different kernel sizes to capture both long-term and short-term temporal information, in order to boost action recognition performance.

3 Methodology

In this section, we first introduce the proposed Mixed Temporal Convolution operation (MixTConv) in Sec. 3.1. Then, the Mixed Spatiotemporal Block (MST Block), which integrates MixTConv into the 2D residual block, is presented in Sec. 3.2. Finally, our proposed video recognition network, the Mixed Spatiotemporal Network (MSTNet), is introduced in Sec. 3.3.

Figure 1: The pipeline of the proposed video action recognition network, Mixed Spatiotemporal Network (MSTNet), based on the Mixed Temporal Convolution. “Ks” means kernel size, and “DW” means depthwise.

Figure 2: Comparison of different temporal operations. (a) Shift temporal operation with fixed kernel weights and kernel size. (b) Learnable temporal operation using depthwise 1D convolution with a fixed kernel size. (c) Mixed Temporal Convolution (MixTConv) using depthwise 1D convolution with different kernel sizes.

Figure 3: Comparison of MST Block head and MST Block inner.

3.1 MixTConv: Mixed Temporal Convolution

MixTConv is designed for efficient and effective temporal modeling. To achieve that, it has three appealing properties: 1) Unlike 3D CNNs, which convolve the spatial and temporal dimensions simultaneously, MixTConv decomposes spatiotemporal modeling, models these subspaces separately, and focuses on temporal modeling. 2) Unlike existing 2+1D methods that use ordinary 1D convolution, MixTConv applies depthwise 1D convolution, which reduces the computation by a factor of C, where C is the number of input channels. 3) The depthwise fashion allows us to mix multiple depthwise 1D convolutions with different kernel sizes, and thus to extract multi-scale temporal features and significantly improve on the baseline with minor additional computation.
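The factor-of-C saving in property 2 can be checked with a simple multiply-accumulate count. The sketch below is only illustrative: the function name `temporal_conv_macs` and the feature-map sizes are assumptions chosen for the example, and it assumes stride 1, unchanged channel count and unchanged temporal length.

```python
def temporal_conv_macs(channels, kernel_size, frames, height, width, depthwise):
    """Multiply-accumulate count of a 1D temporal convolution that keeps the
    channel count and the temporal length unchanged (stride 1, same padding)."""
    positions = frames * height * width  # output locations per channel
    # An ordinary convolution reads every input channel for each output value;
    # a depthwise convolution reads only its own channel.
    inputs_per_output = kernel_size if depthwise else kernel_size * channels
    return channels * positions * inputs_per_output


# Illustrative sizes (assumptions): 256 channels, 8 frames, 56x56 feature map.
C, K, T, H, W = 256, 3, 8, 56, 56
ordinary = temporal_conv_macs(C, K, T, H, W, depthwise=False)
depthwise = temporal_conv_macs(C, K, T, H, W, depthwise=True)
print(ordinary // depthwise)  # prints 256, i.e. the number of channels C
```

The ratio does not depend on the kernel size or the spatial resolution, which is why mixing several kernel sizes in the depthwise setting stays cheap.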

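Here and after, we denote the input feature map of the MixTConv operation as $X \in \mathbb{R}^{(B \times T) \times C \times H \times W}$, where $B$, $H$, $W$, $T$ and $C$ are the batch size, height, width, number of sampled frames and number of channels, respectively. As illustrated in Figure 1, we first reshape $X$ as $\hat{X} \in \mathbb{R}^{(B \times H \times W) \times C \times T}$, and then apply depthwise 1D convolution with different kernel sizes $\{k_1, \dots, k_g\}$ along the temporal dimension. Let $W^{(k_m)}$ denote a depthwise 1D convolutional kernel with kernel size $k_m$. Unlike vanilla depthwise convolution, MixTConv partitions the channels into $g$ groups $\{\hat{X}^{(1)}, \dots, \hat{X}^{(g)}\}$ and applies depthwise 1D convolution with a different kernel size to each group, where $C_m$ denotes the number of channels in the $m$-th group. Formally, the mixed 1D convolution is defined as:

$$Y^{(m)}_{t,c} = \sum_{i=-\lfloor k_m/2 \rfloor}^{\lfloor k_m/2 \rfloor} W^{(k_m)}_{i,c}\, \hat{X}^{(m)}_{t+i,c}, \qquad m = 1, \dots, g,$$

where $c \in \{1, \dots, C_m\}$ and $\hat{X}^{(m)}_{t,c}$ is the value of $\hat{X}^{(m)}$ at the $t$-th frame and $c$-th channel.

The final output tensor $Y$ is the concatenation of all the output tensors $\{Y^{(1)}, \dots, Y^{(g)}\}$:

$$Y = \mathrm{Concat}\left(Y^{(1)}, \dots, Y^{(g)}\right)$$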
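The operation defined in this subsection can be sketched in plain Python for a single spatial position. This is a minimal illustration rather than the authors' implementation: the helper names are hypothetical, and the contiguous near-equal channel split, the zero padding at the sequence ends and the cross-correlation index convention are all assumptions not stated in the text.

```python
def depthwise_conv1d(signal, kernel):
    """Same-size 1D cross-correlation of one channel over time.
    Assumption: frames outside the clip are treated as zeros."""
    half = len(kernel) // 2
    num_frames = len(signal)
    out = []
    for t in range(num_frames):
        acc = 0.0
        for i, weight in enumerate(kernel):
            s = t + i - half  # input frame read by this kernel tap
            if 0 <= s < num_frames:
                acc += weight * signal[s]
        out.append(acc)
    return out


def split_channels(num_channels, num_groups):
    """Assumption: contiguous groups of near-equal size."""
    base, extra = divmod(num_channels, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def mixtconv(x, kernels_per_group):
    """x: list of C channels, each a list of T values (one spatial position).
    kernels_per_group[m][j] is the temporal kernel of the j-th channel of
    group m; kernels in the same group share one odd kernel size."""
    groups = split_channels(len(x), len(kernels_per_group))
    y = []
    for channel_ids, kernels in zip(groups, kernels_per_group):
        for c, kernel in zip(channel_ids, kernels):
            y.append(depthwise_conv1d(x[c], kernel))
    return y  # groups concatenated along channels, same shape as x


# Four channels, four frames, two groups with kernel sizes 1 and 3.
x = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]
kernels = [
    [[1.0], [2.0]],                       # group 1: kernel size 1
    [[1.0, 1.0, 1.0], [0.5, 0.0, -0.5]],  # group 2: kernel size 3
]
y = mixtconv(x, kernels)
print(y[2])  # prints [3.0, 6.0, 9.0, 7.0]
```

Because every channel keeps its own temporal kernel and the output has the same number of channels and frames as the input, the operation can be dropped into a 2D backbone without changing any surrounding layer shapes.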


Discussion. Our method is related to the current state-of-the-art method TSM[9]. In fact, the shift operation is a special case of our proposed MixTConv: it is equivalent to a fixed-weight depthwise 1D convolution with a fixed kernel size of 3, where the temporal kernel is fixed as [0, 1, 0] for static channels (3/4 of the total channels), [1, 0, 0] for backward-shift channels (1/8 of the total channels), and [0, 0, 1] for forward-shift channels (1/8 of the total channels), as shown in Figure 2(a). Our experiments show that depthwise 1D convolution with learnable weights (Figure 2(b)) and multiple kernel sizes (Figure 2(c)) along the temporal dimension captures pyramidal temporal context more effectively than these hand-crafted temporal kernels.
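The equivalence between shifting and a fixed-weight convolution of kernel size 3 can be verified on a toy sequence. The sketch below assumes the cross-correlation convention (kernel index 0 multiplies frame t-1) and zero padding at the clip boundaries; which kernel counts as the backward or forward shift depends on that convention, so the constants are named by what they copy rather than by direction.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding (assumed convention):
    kernel[0] multiplies frame t-1, kernel[1] frame t, kernel[2] frame t+1."""
    half = len(kernel) // 2
    num_frames = len(signal)
    return [
        sum(kernel[i] * signal[t + i - half]
            for i in range(len(kernel))
            if 0 <= t + i - half < num_frames)
        for t in range(num_frames)
    ]


STATIC = [0, 1, 0]         # output at t copies frame t
FROM_PREVIOUS = [1, 0, 0]  # output at t copies frame t-1
FROM_NEXT = [0, 0, 1]      # output at t copies frame t+1

frames = [10, 20, 30, 40]  # one channel observed over four frames

static = conv1d_same(frames, STATIC)
from_previous = conv1d_same(frames, FROM_PREVIOUS)
from_next = conv1d_same(frames, FROM_NEXT)

# The same results obtained by explicitly shifting the sequence.
print(static == frames)                       # prints True
print(from_previous == [0] + frames[:-1])     # prints True
print(from_next == frames[1:] + [0])          # prints True
```

Replacing these three fixed kernels with learnable weights gives the setting of Figure 2(b), and allowing the kernel length to differ between channel groups gives Figure 2(c).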

3.2 Mixed Spatiotemporal Block

Our proposed MixTConv can be flexibly plugged into any existing 2D architecture at limited computational cost, and thus extracts spatiotemporal features efficiently. As illustrated in Figure 3, taking the ResNet block[5] as an example, a straightforward way to apply MixTConv is to plug it in after the first convolution, denoted MST Block inner (Figure 3(a)). However, this harms the capability of spatiotemporal feature learning, since the number of channels is reduced in the bottleneck. To address this issue, we propose MST Block head (Figure 3(b)), which plugs MixTConv between the residual operation and the first convolution. The computational cost is negligible for both MST Block head and MST Block inner (0.18G FLOPS and 0.05G FLOPS, respectively) due to the depthwise fashion. As shown in Table 3, MST Block head achieves better recognition accuracy, verifying our assumption. We therefore use MST Block head as our MST Block.

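| Dataset | Model | MixTConv | Top-1 | Top-5 | ΔTop-1 |
|---|---|---|---|---|---|
| Something v1 | TSN[19] | no | 20.5 | 47.5 | |
| Something v1 | Ours | yes | 48.1 | 77.3 | +27.6 |
| Something v2 | TSN[19] | no | 30.4 | 61.0 | |
| Something v2 | Ours | yes | 61.8 | 87.8 | +31.4 |
| Jester | TSN[19] | no | 83.9 | 99.6 | |
| Jester | Ours | yes | 96.9 | 99.9 | +13.0 |

Table 1: Comparisons between the proposed MSTNet and the 2D CNN baseline TSN (protocol: ResNet-50, 8-frame input, 2 clips for all datasets, full resolution).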

3.3 Network Design

Based on the Mixed Spatiotemporal Block, we build a network named MSTNet for action recognition. To keep the framework efficient, we choose the 2D ResNet-50[5] as our backbone, which achieves a good trade-off between accuracy and speed. We replace all residual blocks with the proposed MST Blocks. The pipeline follows the popular TSN[19] framework, which samples frames sparsely and passes them through the 2D CNN, followed by a consensus aggregation function (e.g., average pooling). For both TSN and MSTNet, with average pooling as the consensus function, the final output of a video is:

$$P = \frac{1}{T} \sum_{t=1}^{T} F_t$$

where $T$ is the number of sampled frames (segments) in the video, and $F_t$ is the output feature of the $t$-th frame produced by the network. Because it applies only a simple consensus aggregation to the final score of each frame, TSN lacks the capability to model temporal relationships. The results in Table 1 show that, with the MixTConv operation, MSTNet significantly boosts the performance of TSN.

| Method | Kernel Size | Dilation | Learnable | Top-1 | FLOPS |
|---|---|---|---|---|---|
| TSN (baseline)[19] | - | - | - | 19.7 | 33G |
| TSN+Ordinary 1D | 3 | 1 | yes | 41.0 | 43G |
| TSM*[9] | 3* | 1 | no | 45.6 | 33G |
| TSN+ks3 | 3 | 1 | yes | 45.9 | 33.13G |
| TSN+ks5 | 5 | 1 | yes | 46.3 | 33.23G |
| TSN+ks7 | 7 | 1 | yes | 45.8 | 33.32G |
| TSN+ks13 | 1,3 | 1 | yes | 45.8 | 33.09G |
| TSN+ks135 | 1,3,5 | 1 | yes | 46.4 | 33.13G |
| TSN+ks1357 | 1,3,5,7 | 1 | yes | 46.7 | 33.18G |
| TSN+ks357 | 3 | 1,2,3 | yes | 46.4 | 33.13G |

Table 2: Comparisons of different temporal operations and configurations (i.e., the kernel size and the combinations of the filters) on Something-Something v1. “ks” denotes kernel size and * denotes shifting convolution.

| Method | Insert place | Top-1 | Top-5 | FLOPS |
|---|---|---|---|---|
| MST Block inner | after 1x1 | 45.8 | 74.4 | 33.05G |
| MST Block head | before 1x1 | 46.7 | 75.6 | 33.18G |

Table 3: Comparisons of two blocks that integrate MixTConv on Something-Something v1.
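The consensus step of Sec. 3.3 can be illustrated with a few lines of Python, assuming average pooling as the aggregation function. The function name `average_consensus` and the toy scores are assumptions made for the example; it also shows why averaging alone carries no temporal order information, since permuting the frames leaves the video-level output unchanged.

```python
def average_consensus(frame_scores):
    """Average per-frame class scores into one video-level score vector.
    frame_scores: list of T lists, each with one score per class."""
    num_frames = len(frame_scores)
    num_classes = len(frame_scores[0])
    return [
        sum(scores[k] for scores in frame_scores) / num_frames
        for k in range(num_classes)
    ]


# Toy per-frame scores for T = 4 frames and 2 classes (illustrative values).
scores = [[0.25, 0.75], [0.5, 0.5], [0.0, 1.0], [0.75, 0.25]]
video_score = average_consensus(scores)
print(video_score)  # prints [0.375, 0.625]

# Reordering the frames does not change the consensus output.
shuffled = [scores[2], scores[0], scores[3], scores[1]]
print(average_consensus(shuffled) == video_score)  # prints True
```

Any temporal reasoning therefore has to happen inside the backbone, which is where MixTConv is inserted.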

4 Experiment

4.1 Dataset and Implementation details

Dataset. Something-Something v1 and v2[4] are two large-scale video datasets for action recognition. They contain 110k (v1) and 220k (v2) videos, respectively, each with around 50 frames. The videos are annotated with 174 fine-grained human action classes involving various objects and viewpoints. In these two datasets, videos with similar labels, e.g., “opening the door” vs. “closing the door”, are indistinguishable without exploiting temporal information. Jester[10] is a large collection of densely labeled video clips that show humans performing pre-defined hand gestures in front of a laptop camera or webcam. It contains 27 classes of hand gestures, each with around 5,000 instances, which makes it possible to train robust models for gesture recognition.

Training. All experiments in this paper adopt ResNet-50[5] pre-trained on ImageNet[2] as the backbone, which is then fine-tuned on the target datasets. For an apples-to-apples comparison with state-of-the-art methods[9], we strictly follow the same training protocols. The initial learning rate is set to 0.01 and decays by a factor of 0.1 at epochs 30, 40 and 45. We train the networks for 50 epochs with a weight decay of 5e-4, a batch size of 64 and a dropout of 0.5. For data augmentation, we follow TSN[19] and sample one frame from each of 8 or 16 segments. We then resize the short side of each frame to 256 while keeping the aspect ratio at 4:3. After that, we apply corner cropping and scale jittering.
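The learning-rate schedule and the sparse frame sampling described above can be sketched as follows. Both functions are hypothetical helpers written for illustration: the 0-indexed epoch convention, the equal-length segment boundaries and the choice of the centre frame when no random generator is passed are assumptions, not details given in the text.

```python
import random


def learning_rate(epoch, base_lr=0.01, milestones=(30, 40, 45), gamma=0.1):
    """Step schedule: the rate is multiplied by gamma at each milestone.
    Assumption: epochs are counted from 0 and the decay applies from the
    milestone epoch onwards."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def sample_frame_indices(num_frames, num_segments, rng=None):
    """TSN-style sparse sampling: one frame index from each segment.
    With rng=None the centre frame of each segment is returned, otherwise a
    random frame inside the segment is drawn."""
    indices = []
    for s in range(num_segments):
        start = s * num_frames // num_segments
        end = max(start + 1, (s + 1) * num_frames // num_segments)
        if rng is None:
            indices.append((start + end - 1) // 2)
        else:
            indices.append(rng.randrange(start, end))
    return indices


# A clip of about 50 frames split into 8 segments, as in the 8-frame setting.
print(sample_frame_indices(50, 8))  # prints [2, 8, 14, 21, 27, 33, 39, 46]

# Random variant, one frame per segment.
train_indices = sample_frame_indices(50, 8, rng=random.Random(0))

# Learning rate at the start of training.
print(learning_rate(0))  # prints 0.01
```

Sampling one frame per segment keeps the number of processed frames fixed at 8 or 16 regardless of the clip length, which is what makes the FLOPs figures in the tables comparable across methods.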

| Method | Backbone | Modality | Frames | Params | FLOPs | v1 Val Top-1 | v1 Val Top-5 | v2 Val Top-1 | v2 Val Top-5 |
|---|---|---|---|---|---|---|---|---|---|
| TSN[19] (ECCV’16) | BNInception | RGB | 8 | 10.7M | 16G | 19.5 | - | - | - |
| TSN (baseline)[19] (ECCV’16) | ResNet-50 | RGB | 8 | 24.3M | 33G | 19.7 | 46.6 | 27.8 | 57.6 |
| TRN Multiscale[23] (ECCV’18) | BNInception | RGB | 8 | 18.3M | 16G | 34.4 | - | 44.8 | 77.6 |
| TRN Two-stream[23] (ECCV’18) | BNInception | RGB+Flow | 8+8 | 36.6M | - | 42.0 | - | 55.5 | 83.1 |
| I3D[1] (CVPR’17) | 3D ResNet-50 | RGB | 32×2 clips | 28.0M | 153G×2 | 41.6 | 72.2 | - | - |
| NL*+I3D[20] (CVPR’18) | 3D ResNet-50 | RGB | 32×2 clips | 35.3M | 168G×2 | 44.4 | 76.0 | - | - |
| NL*+I3D+GCN[21] (ECCV’18) | 3D ResNet-50+GCN | RGB | 32×2 clips | 62.2M | 303G×2 | 46.1 | 76.8 | - | - |
| ECO[24] (ECCV’18) | BNIn+Res3D | RGB | 8 | 47.5M | 32G | 39.6 | - | - | - |
| ECO[24] (ECCV’18) | BNIn+Res3D | RGB | 16 | 47.5M | 64G | 41.4 | - | - | - |
| ECO$_{En}$Lite[24] (ECCV’18) | BNIn+Res3D | RGB | 92 | 150M | 267G | 46.4 | - | - | - |
| TSM[9] (ICCV’19) | ResNet-50 | RGB | 8 | 24.3M | 33G | 45.6 | 74.2 | 58.7* | 85.4 |
| TSM[9] (ICCV’19) | ResNet-50 | RGB | 16 | 24.3M | 65G | 47.2 | 77.1 | 61.0* | 86.8 |
| MSTNet | ResNet-50 | RGB | 8 | 24.3M | 33.2G | 46.7 | 75.4 | 59.5 | 86.0 |
| MSTNet | ResNet-50 | RGB | 16 | 24.3M | 65.3G | 48.4 | 78.8 | 61.8 | 87.3 |

  • BNInc means BNInception, Res3D18 means 3D ResNet-18, NL means Non-Local[20].

  • Using the officially released pre-trained weights and testing with one clip and center crop.

Table 4: Comparisons with state-of-the-art methods on Something-Something v1 and Something-Something v2.
| Method | Modality | Frames | FLOPS | Top-1 | Top-5 |
|---|---|---|---|---|---|
| TSN[19] | RGB | 8 | 33G | 81.0 | 99.0 |
| TSN[19] | RGB | 16 | 65G | 82.3 | 99.2 |
| TRN-MS*[23] | RGB | 8 | 16G | 93.7 | - |
| TSM*[9] | RGB | 8 | 33G | 94.5* | 99.7 |
| TSM*[9] | RGB | 16 | 65G | 95.3* | 99.8 |
| MSTNet | RGB | 8 | 33.2G | 96.0 | 99.8 |
| MSTNet | RGB | 16 | 65.3G | 96.8 | 99.8 |

  • Using the officially released pre-trained weights and testing with one clip and center crop.

  • MS means multi-scale.

Table 5: Comparison of state-of-the-art methods on Jester.

Testing. A common practice for testing is to apply 10 crops to each frame[19, 1]. Moreover, many state-of-the-art methods[7, 18] use dense sampling of 64 or 128 frames in multiple clips (e.g., 10), leading to a huge computational cost. For efficiency, unless otherwise specified, we use only one clip per video and the center 224×224 crop for evaluation, based on RGB only. For direct comparison with the 2D CNN baseline[19], we sample 2 clips per video and use the full-resolution image with a shorter side of 256 for evaluation (as in Table 1).

4.2 Ablation Study

Improving 2D CNN Baselines. MixTConv can be plugged into any standard 2D CNN to boost its performance on video recognition. We verify its effectiveness by comparing the performance of MSTNet and the 2D CNN baseline, TSN[19], under the same training and testing protocol. Note that the only difference between MSTNet and TSN is the presence or absence of MixTConv. Table 1 shows that the 2D CNN baseline cannot achieve good accuracy on the datasets that depend on temporal information (i.e., Something-Something v1 and v2), but once equipped with MixTConv, its performance improves significantly with negligible increases in computational cost and parameters. These results demonstrate that MixTConv is very effective and efficient for action recognition.

Comparison of different temporal operations and configurations. We further compare temporal aggregation under different configurations (i.e., the kernel sizes and the combinations of the filters). As shown in Table 2: 1) depthwise 1D convolution achieves better performance than ordinary 1D convolution with large computational savings (45.9% at 33.13G FLOPS vs. 41.0% at 43G FLOPS); 2) depthwise 1D convolution with kernel size 3 performs better than the shifting operation, which implies that fixed-weight temporal convolution is not good enough for temporal modeling; 3) a larger kernel size along the temporal axis does not always lead to higher accuracy (TSN+ks7 is 0.5% less accurate than TSN+ks5); 4) combining multiple kernel sizes achieves better performance than a single kernel size (46.7% for TSN+ks1357 vs. 46.3% for TSN+ks5), demonstrating that the design of MixTConv is effective and reasonable.

Plugging Position. We further explore where to plug MixTConv into the ResNet building block. Table 3 shows that MST Block head performs better than MST Block inner at nearly the same computational complexity. We conjecture that this is because depthwise convolution needs more channels to model features, as shown in MobileNetV2[13]. Hence, we choose MST Block head as the network building block.

4.3 Comparison with the State-of-the-Art

Something-Something v1. We compare MSTNet with state-of-the-art methods on Something-Something v1 in Table 4. The comparison details are as follows: 1) TRN[23] and TSN[19] are based on 2D CNNs. TSN achieves poor performance due to its lack of temporal modeling. Notably, our single-stream network outperforms two-stream TRN[23] by 5% absolute, which implies the importance of temporal fusion in all layers. 2) Non-local I3D[20] with GCN[21] is the state-of-the-art 3D CNN-based model. It is worth noting that the GCN needs a Region Proposal Network (RPN)[12] trained on other object detection datasets to obtain the bounding boxes, which incurs extra training cost. Compared with Non-local I3D+GCN, our MSTNet achieves 0.8% better accuracy with 20× fewer FLOPs on the validation set. 3) ECO[24] and TSM[9] are two state-of-the-art efficient action recognition methods. Compared to ECO, our method achieves 0.3% better accuracy with 9× less computation and 6× fewer parameters. Compared to TSM, we achieve 1.1% and 1.2% better accuracy with little extra computational cost (0.005%). These results demonstrate that our proposed Mixed Temporal Convolution (MixTConv) is a better way to model temporal information than other temporal operations such as 3D convolution and shifting.

Something-Something v2. As shown in Table 4, on this newer and larger successor to v1, our model also achieves better results than state-of-the-art methods using the RGB modality only. These results again demonstrate the effectiveness of the proposed MixTConv operation and MSTNet for action recognition.

Jester. As shown in Table 5, on the Jester gesture recognition benchmark, our MSTNet also gains a large improvement over the TSN baseline (+15%) and outperforms all the recent state-of-the-art methods.

5 Conclusion

In this work, we propose a lightweight and plug-and-play operation named Mixed Temporal Convolution (MixTConv) for action recognition, which partitions the input channels into groups and applies depthwise 1D convolution with different kernel sizes to capture multi-scale temporal information. It can be flexibly inserted into any 2D CNN backbone to enable temporal modeling with negligible extra computational cost. We further design a Mixed Spatiotemporal Network (MSTNet) for action recognition by plugging MixTConv into the building block of ResNet-50. Experimental results on the Something-Something v1, Something-Something v2 and Jester benchmarks consistently indicate the superiority of the proposed MSTNet with the MixTConv operation. Additional ablation studies further demonstrate that the designs of the proposed MixTConv operation and MSTNet are effective and reasonable.


  • [1] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR.
  • [2] J. Deng et al. (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
  • [3] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In CVPR.
  • [4] R. Goyal et al. (2017) The “something something” video database for learning and evaluating visual common sense. In ICCV.
  • [5] K. He, X. Zhang, et al. (2016) Deep residual learning for image recognition. In CVPR.
  • [6] N. Hussein, E. Gavves, and A. W. Smeulders (2019) Timeception for complex action recognition. In CVPR.
  • [7] S. Ji, W. Xu, et al. (2013) 3D convolutional neural networks for human action recognition. TPAMI.
  • [8] A. Karpathy et al. (2014) Large-scale video classification with convolutional neural networks. In CVPR.
  • [9] J. Lin, C. Gan, and S. Han (2019) TSM: temporal shift module for efficient video understanding. In ICCV.
  • [10] J. Materzynska, G. Berger, I. Bax, and R. Memisevic (2019) The Jester dataset: a large-scale video dataset of human gestures. In ICCVW.
  • [11] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
  • [12] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
  • [13] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In CVPR.
  • [14] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. ICLR.
  • [15] C. Szegedy, W. Liu, et al. (2015) Going deeper with convolutions. In CVPR.
  • [16] M. Tan and Q. V. Le (2019) MixConv: mixed depthwise convolutional kernels. arXiv:1907.09595.
  • [17] D. Tran et al. (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV.
  • [18] D. Tran et al. (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR.
  • [19] L. Wang et al. (2018) Temporal segment networks for action recognition in videos. TPAMI.
  • [20] X. Wang et al. (2018) Non-local neural networks. In CVPR.
  • [21] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV.
  • [22] S. Xie, C. Sun, et al. (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV.
  • [23] B. Zhou, A. Andonian, et al. (2018) Temporal relational reasoning in videos. In ECCV.
  • [24] M. Zolfaghari, K. Singh, and T. Brox (2018) ECO: efficient convolutional network for online video understanding. In ECCV.