Video classification has witnessed much progress in the last few years. Most of the accuracy improvements have resulted from the introduction of new powerful architectures [3, 29, 22, 36, 34]. However, many of these architectures are built on relatively expensive 3D spatiotemporal convolutions. Furthermore, these convolutions are typically computed across all the channels in each layer. As a result, 3D CNNs are considerably more expensive than 2D CNNs, because their cost also grows with the temporal extent of the input. For both foundational and practical reasons, it is natural to ask which parameters in these large 4D kernels matter the most.
Kernel factorizations have been applied in several settings to reduce compute and improve accuracy. For example, several recent video architectures factor 3D convolution in space and time: examples include P3D [22], R(2+1)D [29], and S3D [36]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation. In the still-image domain, separable convolution [7] is used to factor the convolution of 2D $k \times k$ filters into a pointwise 1×1 convolution followed by a depthwise $k \times k$ convolution. When the number of channels is large compared to $k^2$, which is usually the case, this reduces FLOPs by a factor of roughly $k^2$ for images. For the case of 3D video kernels, the FLOP reduction is even more dramatic: roughly $k^3$.
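The reduction factors quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes equal input and output channel counts, counts one multiply-add per weight per output position, and ignores biases. The function names are hypothetical and not taken from the paper.

```python
def conv_cost(channels: int, k: int, dims: int) -> int:
    """Multiply-adds per output position for a dense convolution with k^dims taps."""
    return channels * channels * k ** dims


def separable_cost(channels: int, k: int, dims: int) -> int:
    """Pointwise convolution followed by a depthwise convolution with k^dims taps."""
    pointwise = channels * channels      # every output channel sees every input channel
    depthwise = channels * k ** dims     # each channel is filtered on its own
    return pointwise + depthwise


def reduction(channels: int, k: int, dims: int) -> float:
    """FLOP ratio dense / separable; approaches k^dims as channels grows."""
    return conv_cost(channels, k, dims) / separable_cost(channels, k, dims)


if __name__ == "__main__":
    for c in (64, 256, 1024):
        # dims=2 is the image case, dims=3 is the video case
        print(c, round(reduction(c, 3, 2), 2), round(reduction(c, 3, 3), 2))
```

The ratio equals $C k^d / (C + k^d)$, so it stays below $k^d$ and only approaches it when the channel count $C$ is much larger than $k^d$.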
Inspired by the accuracy gains and computational savings demonstrated by 2D separable convolutions in image classification [7, 15, 38], this paper proposes a set of architectures for video classification, 3D Channel-Separated Networks (CSN), in which all convolutional operations are separated into either pointwise 1×1×1 or depthwise 3×3×3 convolutions. Our experiments reveal the crucial importance of channel interaction in the design of CSNs. In particular, we show that excellent accuracy/cost balances can be obtained with CSNs by leveraging channel separation to reduce FLOPs and parameters, as long as a high amount of channel interaction is retained. We propose two factorizations, which we call interaction-reduced and interaction-preserved. Compared to 3D CNNs, our interaction-reduced and interaction-preserved CSNs provide both higher accuracy and FLOP savings of about 6-7× when there is enough channel interaction. We experimentally show that the channel factorization in CSNs acts as a regularizer, leading to a higher training error but better generalization. Finally, we show that our proposed CSNs significantly outperform current state-of-the-art methods on Sports1M and Kinetics while being 11 times faster.
2 Related Work
Group convolution. Group convolution was adopted in AlexNet [18] as a way to overcome GPU memory limitations. Depthwise convolution was introduced in MobileNet [15] as an attempt to optimize model size and computational cost for mobile applications. Chollet [7] built an extreme version of Inception [27] based on 2D depthwise convolution, named Xception, where the Inception block was redesigned to include multiple separable convolutions. Concurrently, Xie et al. proposed ResNeXt [35] by equipping ResNet [14] bottleneck blocks with groupwise convolution. Further architecture improvements have also been made for mobile applications. ShuffleNet [38] further reduced the computational cost of the bottleneck block with both depthwise and group convolution. MobileNetV2 [24] improved MobileNet [15] by switching from a VGG-style to a ResNet-style network, and introducing a “reverted bottleneck” block. All of these architectures are based on 2D CNNs and are applied to image classification, while our work focuses on 3D group CNNs for video classification.
to deep network approaches that learn features and classify end-to-end [28, 16, 25, 10, 32, 33, 11]. This transformation was enabled by the introduction of large-scale video datasets [16, 17] and massively parallel computing hardware, i.e., GPUs. Carreira and Zisserman [3] recently proposed to inflate 2D convolutional networks pre-trained on images to 3D for video classification. Wang et al. [34] proposed non-local neural networks to capture long-range dependencies in videos. ARTNet [31] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [22] or factorized convolutions as in R(2+1)D [29] or S3D [36]. 3D group convolution was also applied to video classification in ResNeXt [13] and Multi-Fiber Networks [5] (MFNet).
Among previous approaches, our work is most closely related to the following architectures. First, our CSNs are similar to Xception [7] in the idea of using channel-separated convolutions. Xception factorizes 2D convolution in channel and space for object classification, while our CSNs factorize 3D convolution in channel and space-time for action recognition. In addition, Xception uses simple blocks, while our CSNs use bottleneck blocks. The ir-CSN variant of our model shares similarities with ResNeXt [35] and its 3D version [13] in the use of bottleneck blocks with group/depthwise convolution. The main difference is that ResNeXt [35, 13] uses group convolution in its 3×3×3 layers with a fixed group size, while our ir-CSN uses depthwise convolutions in all 3×3×3 layers, which makes our architecture fully channel-separated. As we will show in section 4.2, making our network fully channel-separated helps not only to reduce a significant amount of compute, but also to improve model accuracy through better regularization. We emphasize that our contribution includes not only the design of CSN architectures, but also a systematic empirical study of the role of channel interactions in the accuracy of CSNs.
3 Channel-Separated Convolutional Networks
In this section, we discuss the concept of 3D channel-separated networks. Since channel-separated networks use group convolution as their main building block, we first provide some background about group convolution.
Group convolution. Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets. Filters in a subset receive signal only from the channels within their group (see Figure 1(b)). Depthwise convolution is the extreme version of group convolution where the number of groups is equal to the number of input and output channels (see Figure 1(c)). Xception [7] and MobileNet [15] were among the first networks to use depthwise convolutions. Figure 1 presents an illustration of conventional, group, and depthwise convolutional layers.
Counting FLOPs, parameters, and interactions. Dividing a conventional convolutional filter into $G$ groups reduces compute and parameter count by a factor of $G$. These reductions occur because each filter in a group receives input from only a fraction $1/G$ of the channels from the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact. If multiple group convolutional layers are stacked directly on top of each other, this feature segregation is further amplified, as each channel becomes a function of small channel subsets in all preceding layers. So, while group convolution saves compute and parameters, it also reduces feature interactions.
We propose to quantify the amount of channel interaction as the number of pairs of two input channels that are connected through any output filter. If the convolutional layer has $C_{in}$ input channels and $G$ groups of filters, then each filter is connected to $C_{in}/G$ input channels. Therefore each filter will have $\binom{C_{in}/G}{2}$ interacting feature pairs. According to this definition, the number of channel interaction pairs shrinks from the conventional convolution in Figure 1(a) to the group convolution in Figure 1(b), and the depthwise convolution in Figure 1(c) has none.
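To make the definition concrete, the following sketch counts interaction pairs by brute-force enumeration rather than by a closed form. It assumes the count is accumulated per output filter, and it uses a hypothetical layer with 4 input and 4 output channels chosen purely for illustration (these sizes are not read from Figure 1).

```python
from itertools import combinations


def interaction_pairs(c_in: int, c_out: int, groups: int) -> int:
    """Count (filter, input-channel pair) combinations by explicit enumeration."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    in_per_group = c_in // groups
    out_per_group = c_out // groups
    total = 0
    for f in range(c_out):
        g = f // out_per_group  # index of the group that filter f belongs to
        inputs = range(g * in_per_group, (g + 1) * in_per_group)
        # every unordered pair of inputs seen by this filter can interact
        total += sum(1 for _ in combinations(inputs, 2))
    return total


if __name__ == "__main__":
    for groups in (1, 2, 4):  # conventional, group, depthwise
        print(groups, interaction_pairs(4, 4, groups))
```

For this toy layer the conventional, two-group, and depthwise variants yield 24, 4, and 0 pairs, which shows how quickly grouping removes interactions.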
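Consider a convolutional layer with a spatiotemporal kernel size of $k \times k \times k$ (e.g. 3×3×3), $G$ groups of filters, $C_{in}$ input channels, and $C_{out}$ output channels, applied to a spatiotemporal tensor of $T \times H \times W$ voxels. Its number of parameters, FLOPs (floating-point operations), and number of channel interactions can be measured as:
\begin{align}
\#\text{parameters} &= C_{out} \cdot \frac{C_{in}}{G} \cdot k^3 \\
\#\text{FLOPs} &= C_{out} \cdot \frac{C_{in}}{G} \cdot k^3 \cdot THW \\
\#\text{interactions} &= C_{out} \cdot \binom{C_{in}/G}{2}
\end{align}
Recall that $\binom{n}{2} = n(n-1)/2 = O(n^2)$. We note that while FLOPs and parameter count are popularly used to characterize a layer, the “amount” of channel interaction is typically overlooked. Our study will reveal the importance of this factor.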
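The three per-layer quantities can be computed directly. This is a minimal sketch under stated assumptions: parameters are `c_out * (c_in / groups) * k^3`, FLOPs multiply that by the number of output voxels (stride 1, no bias, one operation per multiply-add), and interactions are `c_out` times the number of input pairs seen by each filter. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class ConvLayer:
    c_in: int
    c_out: int
    k: int           # the kernel is k x k x k
    groups: int = 1  # groups == c_in == c_out gives a depthwise layer

    def params(self) -> int:
        return self.c_out * (self.c_in // self.groups) * self.k ** 3

    def flops(self, t: int, h: int, w: int) -> int:
        # one filter application per output voxel
        return self.params() * t * h * w

    def interactions(self) -> int:
        return self.c_out * comb(self.c_in // self.groups, 2)


if __name__ == "__main__":
    dense = ConvLayer(64, 64, 3)
    depthwise = ConvLayer(64, 64, 3, groups=64)
    for layer in (dense, depthwise):
        print(layer.params(), layer.flops(8, 56, 56), layer.interactions())
```

With 64 channels, going from dense to depthwise divides parameters and FLOPs by 64 while the interaction count falls to zero, which is why the 1×1×1 layers matter so much in channel-separated designs.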
3.2 Channel Separation
We define channel-separated convolutional networks (CSN) as 3D CNNs in which all convolutional layers (except for conv1) are either 1×1×1 conventional convolutions or $k \times k \times k$ depthwise convolutions (where, typically, $k = 3$). Conventional convolutional networks model channel interactions and local interactions (i.e., spatial or spatiotemporal) jointly in their 3D convolutions. Instead, channel-separated networks decompose these two types of interactions into two distinct layers: 1×1×1 conventional convolutions for channel interaction (but no local interaction) and $k \times k \times k$ depthwise convolutions for local spatiotemporal interactions (but no channel interaction). Channel separation may be applied to any traditional $k \times k \times k$ convolution by decomposing it into a 1×1×1 convolution and a depthwise $k \times k \times k$ convolution.
We introduce the term “channel-separated” to highlight the importance of channel interaction; we also point out that the existing term “depth-separable” is only a good description when applied to tensors with two spatial dimensions and one channel dimension. We note that channel-separated networks have been proposed in Xception [7] and MobileNet [15] for image classification. In video classification, separated convolutions have been used in P3D [22], R(2+1)D [29], and S3D [36], but to decompose 3D convolutions into separate temporal and spatial convolutions. The network architectures presented in this work are designed to separate channel interactions from spatiotemporal interactions.
3.3 Example: Channel-Separated Bottleneck Block
Figure 2 presents two ways of factorizing a 3D bottleneck block using channel-separated convolutional networks. Figure 2(a) presents a standard 3D bottleneck block, while Figure 2(b) and 2(c) present interaction-preserved and interaction-reduced channel-separated bottleneck blocks, respectively.
Interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing the 3×3×3 convolution in (a) with a 1×1×1 traditional convolution and a 3×3×3 depthwise convolution (shown in Figure 2(b)). This block reduces parameters and FLOPs of the traditional 3×3×3 convolution significantly, but preserves all channel interactions via a newly-added 1×1×1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).
Interaction-reduced channel-separated bottleneck block is derived from the preserved bottleneck block by removing the extra 1×1×1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1×1×1 convolutions (usually interpreted respectively as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared to that shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).
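The difference between the three bottleneck designs can be illustrated by summing parameters and interaction pairs over their layers. The sketch below assumes a block with outer width `c` and inner width `m`; the values `c=256` and `m=64` in the usage lines are hypothetical examples, not settings quoted from the paper, and the helper names are illustrative.

```python
from math import comb


def layer(c_in, c_out, k, groups=1):
    """Return (parameters, interaction pairs) of one k x k x k convolution."""
    inputs_per_filter = c_in // groups
    return c_out * inputs_per_filter * k ** 3, c_out * comb(inputs_per_filter, 2)


def block_totals(layers):
    """Sum parameters and interaction pairs over the layers of a block."""
    return sum(p for p, _ in layers), sum(i for _, i in layers)


def bottleneck(c, m):
    """Standard block: 1x1x1, dense 3x3x3, 1x1x1."""
    return block_totals([layer(c, m, 1), layer(m, m, 3), layer(m, c, 1)])


def ip_bottleneck(c, m):
    """Interaction-preserved: the dense 3x3x3 becomes 1x1x1 plus depthwise 3x3x3."""
    return block_totals(
        [layer(c, m, 1), layer(m, m, 1), layer(m, m, 3, groups=m), layer(m, c, 1)]
    )


def ir_bottleneck(c, m):
    """Interaction-reduced: the dense 3x3x3 becomes a depthwise 3x3x3 only."""
    return block_totals([layer(c, m, 1), layer(m, m, 3, groups=m), layer(m, c, 1)])


if __name__ == "__main__":
    for name, fn in [("standard", bottleneck), ("ip", ip_bottleneck), ("ir", ir_bottleneck)]:
        print(name, fn(256, 64))
```

Under this counting, the interaction-preserved block has exactly as many interaction pairs as the standard block with far fewer parameters, while the interaction-reduced block loses only the pairs contributed by the middle layer.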
3.4 Channel Interactions in Convolutional Blocks
The interaction-preserving and interaction-reducing blocks in section 3.3 are just two architectures in a large spectrum. In this subsection we present a number of convolutional block designs, obtained by progressively increasing the amount of grouping. The blocks differ in terms of compute cost, parameter count and, more importantly, channel interactions.
Group convolution applied to ResNet blocks. Figure 3(a) presents a ResNet [14] simple block consisting of two 3×3×3 convolutional layers. Figure 3(b) shows the simple-G block, where the 3×3×3 layers now use grouped convolution. Likewise, Figure 3(c) presents simple-D, with two depthwise layers. Because depthwise convolution requires the same number of input and output channels, we optionally add a 1×1×1 convolutional layer (shown in the dashed rectangle) in blocks that change the number of channels.
Figure 4(a) presents a ResNet bottleneck block consisting of two 1×1×1 and one 3×3×3 convolutional layers. Figures 4(b-c) present bottleneck-G and bottleneck-D, where the 3×3×3 convolutions are grouped and depthwise, respectively. If we further apply group convolution to the two 1×1×1 convolutional layers, the block becomes a bottleneck-DG, as illustrated in Figure 4(d). In all cases, the 3×3×3 convolutional layers always have the same number of input and output channels.
There are some deliberate analogies to existing architectures here. First, bottleneck-G (Figure 4(b)) is exactly a ResNeXt block [35], and bottleneck-D is its depthwise variant. Bottleneck-DG (Figure 4(d)) resembles the ShuffleNet block [38], without the channel shuffle and without the downsampling projection by average pooling and concatenation. The progression from simple to simple-D is similar to moving from ResNet to Xception (though Xception has many more 1×1×1 convolutions). We omit certain architecture-specific features in order to better understand the role of grouping and channel interactions.
4 Ablation Experiment
This empirical study will allow us to cast some light on the important factors in the performance of channel-separated networks and will lead us to two main findings:
We will empirically demonstrate that within the family of architectures we consider, similar depth and similar channel interaction count imply similar performance. In particular, the interaction-preserving blocks reduce compute by significant margins but preserve channel interactions, with only a slight loss in accuracy for shallow networks and an increase in accuracy for deeper networks.
In traditional 3×3×3 convolutions all feature maps interact with each other. Particularly for deeper networks, this causes overfitting.
4.1 Experimental setup
Dataset. We use Kinetics-400 [17] for all ablation experiments in this section. Kinetics is a standard benchmark for action recognition in videos. It contains about 260k videos of 400 different human action categories. We use the train split (240k videos) for training and the validation split (20k videos) for evaluating different models.
Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of our ablation experiments in this section. More specifically, our model takes clips with a size of L×224×224, where L is the number of frames and 224 is the height and width of the cropped frame. Two spatial downsampling layers (1×2×2) are applied at conv1 and at pool1, and three spatiotemporal downsampling layers (2×2×2) are applied at conv3_1, conv4_1 and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size L/8×7×7 is applied to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification. We note that Table 1 also lists the hyper-parameters that define the network width and those that control the network depth.
|layer name||output size||ResNet3D-simple||ResNet3D-bottleneck|
|conv1||L×112×112||3×7×7, 64, stride 1×2×2|
|pool1||L×56×56||max, 1×3×3, stride 1×2×2|
|pool5||1×1×1||spatiotemporal avg pool, fc layer with softmax|
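The tensor sizes implied by the downsampling schedule can be traced with a short helper. This sketch assumes padding is chosen so that every stride divides the size exactly (hence integer division); the stage list mirrors the description of the base architecture, and the function name is hypothetical.

```python
def output_sizes(num_frames: int, crop: int = 224):
    """Track the (T, H, W) tensor size after each downsampling stage."""
    stages = [
        ("conv1", (1, 2, 2)),    # spatial downsampling only
        ("pool1", (1, 2, 2)),    # spatial downsampling only
        ("conv3_1", (2, 2, 2)),  # spatiotemporal downsampling
        ("conv4_1", (2, 2, 2)),
        ("conv5_1", (2, 2, 2)),
    ]
    t, h, w = num_frames, crop, crop
    sizes = {"input": (t, h, w)}
    for name, (st, sh, sw) in stages:
        t, h, w = t // st, h // sh, w // sw
        sizes[name] = (t, h, w)
    return sizes


if __name__ == "__main__":
    for name, size in output_sizes(8).items():
        print(name, size)
```

For an 8-frame clip the final tensor is 1×7×7, i.e. L/8×7×7, which is the extent that the global average pooling has to cover.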
Data augmentation. We use both spatial and temporal jittering for augmentation. More specifically, video frames are scaled such that the shorter edge of the frames becomes $s$ while we maintain the original aspect ratio of the frame. During training, $s$ is picked uniformly at random from a preset range. Each clip is then generated by randomly cropping windows of size 224×224. Temporal jittering is also applied during training by randomly selecting a starting frame and decoding L frames. For the ablation experiments in this section we train and evaluate models with clips of 8 frames (L = 8) by skipping every other frame (all videos are pre-processed to 30fps, so the newly-formed clips are effectively at 15fps).
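The temporal and spatial jittering steps can be sketched as index arithmetic, without any video decoding. This is an illustration under assumptions: the frame is taken to be already rescaled (the scale jittering range is not reproduced here), and the function name, argument names, and example frame sizes are hypothetical.

```python
import random


def sample_clip(num_video_frames, frame_h, frame_w,
                clip_len=8, stride=2, crop=224, rng=random):
    """Pick frame indices and a crop window for one training clip."""
    span = (clip_len - 1) * stride + 1  # number of source frames the clip covers
    if num_video_frames < span or min(frame_h, frame_w) < crop:
        raise ValueError("video too short or frame too small for this clip")
    # temporal jittering: random starting frame, then every `stride`-th frame
    start = rng.randrange(num_video_frames - span + 1)
    frames = [start + i * stride for i in range(clip_len)]
    # spatial jittering: random crop window inside the rescaled frame
    top = rng.randrange(frame_h - crop + 1)
    left = rng.randrange(frame_w - crop + 1)
    return frames, (top, left, crop, crop)


if __name__ == "__main__":
    frames, box = sample_clip(300, 256, 340, rng=random.Random(0))
    print(frames, box)
```

With `stride=2` on 30fps video, the eight selected frames are effectively sampled at 15fps, matching the setting used for the ablations.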
Training. We train our models with synchronous distributed SGD on GPU clusters using Caffe2 [2] (with 16 machines, each equipped with multiple GPUs). The total mini-batch is the per-GPU mini-batch of clips multiplied by the number of GPUs. Following prior practice, we set the epoch size to 1M clips due to temporal jittering augmentation, even though the number of training examples is only about 240K. We use the half-cosine period learning rate schedule as presented in [21], in which the learning rate at the $i$-th iteration is set to $\eta \cdot 0.5\,[\cos(\frac{i}{n}\pi) + 1]$, where $n$ is the maximum number of training iterations and $\eta$ is the initial learning rate. Training uses model warm-up [12] in the first epochs, and the remaining epochs follow the cosine learning rate schedule.
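A half-cosine schedule with a warm-up phase can be written in a few lines. The sketch below is not the authors' training code: the linear shape of the warm-up, the decay over the post-warm-up iterations, and all argument names and values are assumptions made for illustration.

```python
import math


def learning_rate(i, n, base_lr, warmup_iters=0, warmup_start=0.0):
    """Half-cosine schedule with an optional linear warm-up phase.

    i: current iteration, n: total number of iterations,
    base_lr: learning rate reached at the end of warm-up.
    """
    if i < warmup_iters:
        frac = i / warmup_iters
        return warmup_start + frac * (base_lr - warmup_start)
    # cosine decay from base_lr down to 0 over the remaining iterations
    progress = (i - warmup_iters) / max(1, n - warmup_iters)
    return base_lr * 0.5 * (math.cos(math.pi * progress) + 1.0)


if __name__ == "__main__":
    for step in (0, 5, 10, 55, 100):
        print(step, learning_rate(step, 100, 0.1, warmup_iters=10))
```

Without warm-up (`warmup_iters=0`) the function reduces to `base_lr * 0.5 * (cos(pi * i / n) + 1)`, which starts at `base_lr` and reaches zero at the final iteration.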
Testing. We report clip top-1 accuracy and video top-1 accuracy. For video top-1, we use center crops of clips uniformly sampled from the video and average these clip-predictions to obtain the final video prediction.
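Video-level prediction by averaging clip predictions can be sketched as follows. The helper names, the rounding used to place clips, and the toy scores are hypothetical; the sketch only shows uniform clip placement and score averaging, not model inference.

```python
def uniform_clip_starts(num_frames, num_clips, clip_span):
    """Start indices of num_clips clips spread evenly over the video."""
    last_start = max(0, num_frames - clip_span)
    if num_clips == 1:
        return [last_start // 2]  # a single clip is taken from the middle
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def video_prediction(clip_scores):
    """Average per-clip class scores and return (top-1 class index, mean scores)."""
    num_classes = len(clip_scores[0])
    mean = [
        sum(scores[c] for scores in clip_scores) / len(clip_scores)
        for c in range(num_classes)
    ]
    top1 = max(range(num_classes), key=mean.__getitem__)
    return top1, mean


if __name__ == "__main__":
    print(uniform_clip_starts(100, 5, 16))
    print(video_prediction([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]))
```

Clip top-1 accuracy scores each clip on its own, whereas video top-1 accuracy scores the averaged prediction, so a single misclassified clip (the second one above) does not necessarily change the video-level label.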
4.2 Reducing FLOPs, preserving interactions
In this ablation, we use CSNs to vary both FLOPs and channel interactions. Within this architectural family, channel interactions are a good predictor of performance, whereas FLOPs are not. In particular, FLOPs can be reduced significantly while preserving interaction count.
Table 2 presents results of our interaction-reduced CSNs (ir-CSNs) and interaction-preserved CSNs (ip-CSNs) and compares them with the ResNet3D baseline using different numbers of layers. In the shallow network setting (with 26 layers), both the ir-CSN and the ip-CSN have lower accuracy than ResNet3D. The ir-CSN provides a computational savings of 7x but causes a drop in accuracy. The ip-CSN yields a savings of 6x in FLOPs with a much smaller drop in accuracy. We note that all of the shallow models have a very low count of channel interactions: ResNet3D and ip-CSN have about the same number of interaction pairs (counted in giga-pairs), while ir-CSN retains only about 64% of the original count. This observation suggests that shallow instances of ResNet3D benefit from their extra parameters, but that preserving channel interactions reduces the gap for ip-CSN.
In deeper settings, both ir-CSNs and ip-CSNs actually outperform ResNet3D. Furthermore, the gap between ir-CSN and ip-CSN becomes smaller. We attribute this shrinking of the gap to the fact that, in the 50-layer and 101-layer configurations, ir-CSN has nearly the same number of channel interactions as ip-CSN, since most interactions stem from the 1×1×1 layers. One may wonder if ip-CSNs outperform ResNet3D and ir-CSNs because of having more nonlinearities (ReLU). To answer this question, we trained ip-CSNs without ReLUs between the 1×1×1 and the 3×3×3 layers and we observed no notable difference in performance. We can observe that traditional 3×3×3 convolutions contain many parameters which can be removed without an accuracy penalty in the deeper models. We investigate this next.
4.3 What makes CSNs outperform ResNet3D?
In section 4.2 we found that both ir-CSNs and ip-CSNs consistently outperform the ResNet3D baseline when there are enough channel interactions, while having fewer parameters and greatly reducing FLOPs. It is natural to ask “what helps CSNs in these scenarios?”. Figure 5 helps us answer this question. The plot shows the evolution of the training and validation errors of ip-CSN and ResNet3D in both the 50-layer and the 101-layer configurations. Compared to ResNet3D, ip-CSN has higher training errors but lower testing errors. This suggests that the channel-separated convolutions of CSN regularize the model and prevent overfitting.
4.4 The effects of different blocks in group convolutional networks
In this section we start from our base architecture (shown in Table 1) then ablatively replace the convolutional blocks with the blocks presented in section 3.4. We again find that channel interaction plays an important role in understanding the results.
Naming convention. Since the ablation in this section will be considering several different convolutional blocks, to simplify the presentation, we name each architecture by block type (as presented in section 3.4) and total number of blocks, as shown in the last column of Table 3.
|architecture||block type||blocks per stage||name|
|ResNet3D-18||simple||[2, 2, 2, 2]||simple-8|
|ResNet3D-26||bottleneck||[2, 2, 2, 2]||bottleneck-8|
|ResNet3D-34||simple||[3, 4, 6, 3]||simple-16|
|ResNet3D-50||bottleneck||[3, 4, 6, 3]||bottleneck-16|
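The naming convention can be expressed as a small function that maps a block type and a per-stage block configuration to both names. The depth rule used here (layers per block times number of blocks, plus conv1 and the fc layer) is an assumption that happens to reproduce the four rows of Table 3; it applies to the base simple and bottleneck blocks, and the function name is hypothetical.

```python
LAYERS_PER_BLOCK = {"simple": 2, "bottleneck": 3}


def describe(block_type, blocks_per_stage, variant=""):
    """Return (depth-based name, block-based name) for a base architecture."""
    total_blocks = sum(blocks_per_stage)
    # depth counts conv1 and the final fc layer in addition to the block layers
    depth = LAYERS_PER_BLOCK[block_type] * total_blocks + 2
    suffix = f"-{variant}" if variant else ""
    return f"ResNet3D-{depth}", f"{block_type}{suffix}-{total_blocks}"


if __name__ == "__main__":
    for block_type in ("simple", "bottleneck"):
        for config in ([2, 2, 2, 2], [3, 4, 6, 3]):
            print(describe(block_type, config))
    # grouped variants keep the block count and insert the variant letter
    print(describe("bottleneck", [3, 4, 6, 3], variant="D")[1])
```

The block-based name is the one used in the ablation plots, e.g. `bottleneck-D-16` for a 16-block network built from bottleneck-D blocks.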
Figure 6 presents the results of our convolutional block ablation study. It plots the video top-1 accuracy on the Kinetics validation set against the model computational cost (# FLOPs). We note that, in this experiment, we use our base architecture with two different numbers of blocks (8 and 16) and just vary the type of convolutional block and number of groups to study the tradeoffs. Figure 6(a) presents our ablation experiment with simple-X-8 and bottleneck-X-8 architectures (where X can be none, G, or D, or even DG in the case of the bottleneck block). Similarly, Figure 6(b) presents our ablation experiment with simple-X-16 and bottleneck-X-16 architectures. We can observe the computation/accuracy effects of the group convolution transformation on our base architectures. Reading each curve from right to left (i.e. in decreasing accuracy), we see simple-X transforming from the simple block to simple-G (with an increasing number of groups), then to the simple-D block. For bottleneck-X, reading right to left shows the bottleneck block, which then transforms to bottleneck-G (with increasing groups), bottleneck-D, and finally bottleneck-DG (again with increasing groups).
While the general downward trend is expected as we decrease parameters and FLOPs, the shape of the simple and bottleneck curves is quite different. The simple-X models degrade smoothly, whereas bottleneck-X stays relatively flat (particularly bottleneck-16, which actually increases slightly as we decrease FLOPs) before dropping sharply.
In order to better understand the different behaviors of the simple-X-Y and bottleneck-X-Y curves (blue vs. red curves) in Figure 6 and the main reason behind the turning points at the bottleneck-D block (green star markers in Figure 6), we further plot all of these models together in another view: accuracy as a function of channel interactions (Figure 7).
As shown in Figure 7, the number of channel interactions in simple-X-Y models (blue squares and red diamonds) drops quadratically when group convolution is applied to their 3×3×3 layers. In contrast, the number of channel interactions in bottleneck-X-Y models (green circles and purple triangles) drops only marginally when group convolution is applied to their 3×3×3 layers, since they still have many 1×1×1 layers (this can be seen in the presence of two marker clusters which are circled in red: the first cluster includes purple triangles near the top-right corner and the other one includes green circles near the center of the figure). The channel interaction in bottleneck-X-Y starts to drop significantly when group convolution is applied to their 1×1×1 layers, and this causes a sharp drop in model accuracy. This fact explains why there is no turning point in the simple-X-Y curves and why there are turning points in the bottleneck-X-Y curves. It also confirms the important role of channel interactions in group convolutional networks.
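The contrasting behavior of the two block families can be reproduced with the per-layer interaction count of section 3. The sketch below assumes the count `c_out * C(c_in / groups, 2)` per layer and uses hypothetical widths (64 channels for the simple block, 256/64 for the bottleneck block); it ignores the optional 1×1×1 projection layers.

```python
from math import comb


def pairs(c_in, c_out, groups=1):
    """Interaction pairs of one convolutional layer."""
    return c_out * comb(c_in // groups, 2)


def simple_block_pairs(c, g=1):
    """Two 3x3x3 layers, both using g groups."""
    return 2 * pairs(c, c, g)


def bottleneck_block_pairs(c, m, g3=1, g1=1):
    """1x1x1 (c->m, g1 groups), 3x3x3 (m->m, g3 groups), 1x1x1 (m->c, g1 groups)."""
    return pairs(c, m, g1) + pairs(m, m, g3) + pairs(m, c, g1)


if __name__ == "__main__":
    base = simple_block_pairs(64)
    for g in (1, 2, 4, 64):
        print("simple", g, simple_block_pairs(64, g) / base)
    base = bottleneck_block_pairs(256, 64)
    for g3, g1 in ((1, 1), (4, 1), (64, 1), (64, 2), (64, 4)):
        print("bottleneck", g3, g1, bottleneck_block_pairs(256, 64, g3, g1) / base)
```

Doubling the groups in a simple block cuts its interactions to roughly a quarter, whereas a bottleneck block keeps most of its interactions until grouping reaches the 1×1×1 layers, which mirrors the flat-then-sharp-drop shape of the bottleneck curves.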
Bottleneck-D block (also known as ir-CSN) provides the best computation/accuracy tradeoff. For simple blocks, increasing the number of groups causes a continuous drop in accuracy. However, in the case of the bottleneck block (i.e. bottleneck-X-Y) the accuracy curve remains almost flat as we increase the number of groups until arriving at the bottleneck-D block, at which point the accuracy degrades dramatically when the block is turned into a bottleneck-DG (group convolution applied to 1×1×1 layers). We conclude that a bottleneck-D block (or ir-CSN) gives the best computation/accuracy tradeoff in this family of ResNet-style blocks, due to its high channel-interaction count.
5 Comparison with the State-of-the-Art
In this section, we evaluate our proposed architectures, i.e., ir-CSNs and ip-CSNs, and compare them with state-of-the-art methods.
Datasets. We evaluate our CSNs on two public benchmarks: Sports-1M [16] and Kinetics [17] (version 1 with 400 action categories). Sports-1M is a large-scale action recognition dataset which consists of about 1.1 million videos from 487 classes of fine-grained sports. Kinetics is a medium-size dataset which includes about 300K videos of 400 different human action categories. For Sports-1M, we use the public train and test splits provided with the dataset. For Kinetics, we use the train split for training and the validation set for testing.
Training and testing. Differently from our ablation experiments in the previous section, here we train our CSNs with 32-frame clip inputs (L = 32) with a sampling rate of 2 (skipping every other frame), following prior practice. All the other training settings, such as data augmentation and optimization parameters, are the same as those described in our previous section. For testing, we uniformly sample clips from each testing video. Each clip is scaled such that its shorter edge becomes 256, then cropped to 256×256 (i.e., each input clip has a size of 32×256×256). Each crop is passed through the network to be evaluated as in a fully-convolutional network (FCN). Since our network was trained with a fully-connected layer, during FCN inference this FC layer is transformed into an equivalent 1×1×1 convolutional layer with weights copied from the FC layer.
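The equivalence between the FC layer and a 1×1×1 convolution can be checked on a toy feature map. In this sketch the feature map is a flat list of per-position channel vectors, the weights are made-up numbers, and averaging the per-position outputs is an assumed way of aggregating them; none of these details are taken from the paper.

```python
def fc(weights, bias, x):
    """Fully-connected layer; weights has shape [num_classes][channels]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv1x1x1(weights, bias, feature_map):
    """Apply the FC weights independently at every (t, h, w) position."""
    return [fc(weights, bias, voxel) for voxel in feature_map]


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


if __name__ == "__main__":
    weights = [[1.0, 2.0], [0.5, -1.0]]  # 2 classes, 2 channels
    bias = [0.1, -0.2]
    feature_map = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]  # 3 positions
    pooled_then_fc = fc(weights, bias, mean(feature_map))
    conv_then_pooled = mean(conv1x1x1(weights, bias, feature_map))
    print(pooled_then_fc, conv_then_pooled)
```

Because both operations are linear, pooling then classifying gives the same scores as classifying every position and then averaging, which is what allows the trained FC weights to be reused on larger test-time inputs.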
Results on Sports-1M. Table 4 reports the results of our ir-CSNs and compares them with current state-of-the-art methods on Sports-1M. Our ir-CSN-152 outperforms C3D [28], P3D [22], Conv Pooling [37], and R(2+1)D [29] on video top-1 accuracy while being 2-4x faster than R(2+1)D. Our ir-CSN-101, even with a smaller number of FLOPs, still outperforms all previous work by good margins.
|Conv pool ||RGB+OF||71.7||90.4||NA|
Results on Kinetics. We train our proposed CSN models on Kinetics and compare them with current state-of-the-art methods. Besides training from scratch, we also fine-tune our CSNs with weights initialized from models pre-trained on Sports1M. For a fair comparison, we compare our CSNs with the methods that use only RGB as input. Table 5 presents the results of our CSNs and compares them with current methods. Our CSNs, even trained from scratch, already outperform all of the previously published work, except for non-local networks [34]. Our ir-CSN-152, pre-trained on Sports1M, significantly outperforms I3D [3], R(2+1)D [29], and S3D-G [36]. It also outperforms recent work: $A^2$-Net [4] and Global-reasoning networks [6]. Finally, our ir-CSN-152 slightly outperforms non-local networks [34] and SlowFast networks [9] while being 11x and 3.5x faster than non-local and SlowFast networks, respectively. Our ir-CSN-152 is still less accurate than SlowFast networks augmented with non-local networks. We note that our CSNs use only 10 crops per testing video while other methods use dense sampling [3, 36, 29], i.e., sampling all possible overlapping clips, which normally requires running inference on a few hundred clips per testing video.
|NL I3D [34]||ImageNet||77.7||93.3||359×30|
6 Conclusion
We have presented Channel-Separated Convolutional Networks (CSN) as a way of factorizing 3D convolutions. The proposed CSN-based factorization not only helps to significantly reduce the computational cost, but also improves the accuracy when there are enough channel interactions in the networks. Our proposed architecture, ir-CSN, significantly outperforms existing methods and obtains state-of-the-art accuracy on two major benchmarks: Sports1M and Kinetics. The model is also multiple times faster than current competing networks.
Acknowledgement. The authors would like to thank Kaiming He for insightful discussions about the architectures, and Haoqi Fan for helping to improve our training infrastructure.
References
[1] Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. CoRR, abs/1708.03805, 2017.
[2] Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/.
[3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[4] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. A^2-nets: Double attention networks. In NeurIPS, pages 350–359, 2018.
[5] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. In ECCV, 2018.
[6] Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. ICCV VS-PETS, 2005.
[9] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. CoRR, abs/1812.03982, 2018.
[10] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
[11] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[12] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[19] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
[20] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. CoRR, abs/1811.08383, 2018.
[21] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts. In ICLR, 2017.
[22] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
[23] S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
[24] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[25] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[26] J. C. Stroud, D. A. Ross, C. Sun, J. Deng, and R. Sukthankar. D3D: Distilled 3D networks for video action recognition. CoRR, abs/1812.08249, 2018.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[29] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[30] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[31] L. Wang, W. Li, W. Li, and L. V. Gool. Appearance-and-relation networks for video classification. In CVPR, 2018.
[32] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[33] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In CVPR, 2016.
[34] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
[35] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[36] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. In ECCV, 2018.
[37] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
[38] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.