1 Introduction
Video classification has seen substantial progress in the last few years. Most of the accuracy improvements have resulted from the introduction of new, powerful architectures [3, 29, 22, 36, 34]. However, many of these architectures are built on relatively expensive 3D spatiotemporal convolutions. Furthermore, these convolutions are typically computed across all the channels in each layer. 3D CNNs therefore have a higher complexity than 2D CNNs, since their cost also scales with the temporal dimension. For both foundational and practical reasons, it is natural to ask which parameters in these large 4D kernels matter the most.
Kernel factorizations have been applied in several settings to reduce compute and improve accuracy. For example, several recent video architectures factor 3D convolution in space and time: examples include P3D [22], R(2+1)D [29], and S3D [36]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation. In the still-image domain, separable convolution [7] is used to factor the convolution of 2D $k \times k$ filters into a pointwise 1×1 convolution followed by a depthwise $k \times k$ convolution. When the number of channels is large compared to $k^2$, which is usually the case, this reduces FLOPs by a factor of about $k^2$ for images. For the case of 3D video kernels, the FLOP reduction is even more dramatic: about $k^3$.
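The reduction factors for separable convolution can be checked by counting multiply-accumulates per output position. The sketch below assumes equal input and output channel counts and a square or cubic kernel of side `k`; the function names are illustrative and do not come from the paper.

```python
def dense_macs(channels: int, k: int, dims: int) -> int:
    """Multiply-accumulates per output position for a dense k^dims convolution."""
    return channels * channels * k ** dims


def separable_macs(channels: int, k: int, dims: int) -> int:
    """Pointwise (1x1) convolution followed by a depthwise k^dims convolution."""
    pointwise = channels * channels   # every output channel sees every input channel
    depthwise = channels * k ** dims  # every channel is filtered on its own
    return pointwise + depthwise


def reduction(channels: int, k: int, dims: int) -> float:
    """Factor by which the separable form is cheaper than the dense form."""
    return dense_macs(channels, k, dims) / separable_macs(channels, k, dims)


if __name__ == "__main__":
    for channels in (64, 256, 1024):
        print(channels,
              round(reduction(channels, 3, dims=2), 2),   # images, approaches 9
              round(reduction(channels, 3, dims=3), 2))   # video, approaches 27
```

The ratio simplifies to C·k^d / (C + k^d), so it approaches k^d from below as the channel count C grows.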
Inspired by the accuracy gains and computational savings demonstrated by 2D separable convolutions in image classification [7, 15, 38], this paper proposes a set of architectures for video classification, 3D Channel-Separated Networks (CSN), in which all convolutional operations are separated into either pointwise 1×1×1 or depthwise 3×3×3 convolutions. Our experiments reveal the crucial importance of channel interaction in the design of CSNs. In particular, we show that excellent accuracy/cost balances can be obtained with CSNs by leveraging channel separation to reduce FLOPs and parameters, as long as a high amount of channel interaction is retained. We propose two factorizations, which we call interaction-reduced and interaction-preserved. Compared to 3D CNNs, both our interaction-reduced and interaction-preserved CSNs provide higher accuracy and FLOP savings of about 6-7x when there is enough channel interaction. We experimentally show that the channel factorization in CSNs acts as a regularizer, leading to a higher training error but better generalization. Finally, we show that our proposed CSNs significantly outperform current state-of-the-art methods on Sports1M and Kinetics while being 11 times faster.
2 Related Work
Group convolution. Group convolution was adopted in AlexNet [18] as a way to overcome GPU memory limitations. Depthwise convolution was introduced in MobileNet [15] as an attempt to optimize model size and computational cost for mobile applications. Chollet [7] built an extreme version of Inception [27] based on 2D depthwise convolution, named Xception, where the Inception block was redesigned to include multiple separable convolutions. Concurrently, Xie et al. proposed ResNeXt [35] by equipping ResNet [14] bottleneck blocks with group-wise convolution. Further architecture improvements have also been made for mobile applications. ShuffleNet [38] further reduced the computational cost of the bottleneck block with both depthwise and group convolution. MobileNetV2 [24] improved MobileNet [15] by switching from a VGG-style to a ResNet-style network and introducing an “inverted residual” block. All of these architectures are based on 2D CNNs and are applied to image classification, while our work focuses on 3D group CNNs for video classification.
Video classification. In the last few years, video classification has seen a major paradigm shift, moving from hand-designed features [19, 8, 23, 30] to deep network approaches that learn features and classify end-to-end [28, 16, 25, 10, 32, 33, 11]. This transformation was enabled by the introduction of large-scale video datasets [16, 17] and massively parallel computing hardware, i.e., GPUs. Carreira and Zisserman [3] recently proposed to inflate 2D convolutional networks pre-trained on images to 3D for video classification. Wang et al. [34] proposed non-local neural networks to capture long-range dependencies in videos. ARTNet [31] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [22] or factorized convolutions as in R(2+1)D [29] or S3D [36]. 3D group convolution was also applied to video classification in ResNeXt [13] and Multi-Fiber Networks [5] (MFNet).

Among previous approaches, our work is most closely related to the following architectures. First, our CSNs are similar to Xception [7] in the idea of using channel-separated convolutions. Xception factorizes 2D convolution in channel and space for object classification, while our CSNs factorize 3D convolution in channel and space-time for action recognition. In addition, Xception uses simple blocks, while our CSNs use bottleneck blocks. The ir-CSN variant of our model shares similarities with ResNeXt [35] and its 3D version [13] in the use of bottleneck blocks with group/depthwise convolution. The main difference is that ResNeXt [35, 13] uses group convolution in its 3×3×3 layers with a fixed group size, while our ir-CSN uses depthwise convolutions in all 3×3×3 layers, which makes our architecture fully channel-separated. As we will show in section 4.2, making our network fully channel-separated helps not only to reduce a significant amount of compute, but also to improve model accuracy through better regularization. We emphasize that our contribution includes not only the design of CSN architectures, but also a systematic empirical study of the role of channel interactions in the accuracy of CSNs.
3 Channel-Separated Convolutional Networks
In this section, we discuss the concept of 3D channel-separated networks. Since channel-separated networks use group convolution as their main building block, we first provide some background about group convolution.
3.1 Background
Group convolution. Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets. Filters in a subset receive signal only from channels within their group (see Figure 1(b)). Depthwise convolution is the extreme version of group convolution where the number of groups is equal to the number of input and output channels (see Figure 1(c)). Xception [7] and MobileNet [15] were among the first networks to use depthwise convolutions. Figure 1 presents an illustration of conventional, group, and depthwise convolutional layers.
Counting FLOPs, parameters, and interactions. Dividing a conventional convolutional filter into $G$ groups reduces compute and parameter count by a corresponding factor of $G$. These reductions occur because each filter in a group receives input from only a fraction $1/G$ of the channels from the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact. If multiple group convolutional layers are stacked directly on top of each other, this feature segregation is further amplified, as each channel becomes a function of small channel subsets in all preceding layers. So, while group convolution saves compute and parameters, it also reduces feature interactions.
We propose to quantify the amount of channel interaction as the number of pairs of input channels that are connected through any output filter. If the convolutional layer has $C$ channels and $G$ groups of filters, then each filter is connected to $C/G$ input channels. Therefore each filter will have $\binom{C/G}{2}$ interacting feature pairs. According to this definition, the number of channel interaction pairs decreases from the conventional convolution in Figure 1(a) to the group convolution in Figure 1(b), and the depthwise convolution in Figure 1(c) has none.
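As a worked illustration of this definition, consider a hypothetical layer with 4 input channels and 4 filters (sizes chosen here for illustration, not read from Figure 1). Counting the pairs seen by each filter and summing over the filters gives:

```latex
\begin{align*}
\text{conventional } (G = 1):&\quad 4 \cdot \binom{4}{2} = 4 \cdot 6 = 24 \\
\text{group } (G = 2):&\quad 4 \cdot \binom{2}{2} = 4 \cdot 1 = 4 \\
\text{depthwise } (G = 4):&\quad 4 \cdot \binom{1}{2} = 4 \cdot 0 = 0
\end{align*}
```

Because the pair count grows quadratically in the number of channels a filter sees, grouping removes interactions faster than it removes parameters.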
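Consider a convolutional layer with a spatiotemporal kernel of size $k \times k \times k$ (e.g., $k = 3$), $G$ groups of filters, $C_{in}$ input channels, and $C_{out}$ output channels, applied to a spatiotemporal tensor of $T \times H \times W$ voxels. Its number of parameters, FLOPs (floating-point operations), and number of channel interactions can be measured as:

$$\#\text{parameters} = C_{out} \cdot \frac{C_{in}}{G} \cdot k^3 \tag{1}$$

$$\#\text{FLOPs} = C_{out} \cdot \frac{C_{in}}{G} \cdot k^3 \cdot THW \tag{2}$$

$$\#\text{interactions} = C_{out} \cdot \binom{C_{in}/G}{2} \tag{3}$$

Recall that $\binom{n}{2} = n(n-1)/2 = O(n^2)$. We note that while FLOPs and parameter count are popularly used to characterize a layer, the “amount” of channel interaction is typically overlooked. Our study will reveal the importance of this factor.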
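The three per-layer quantities can be computed directly from a layer description. The sketch below is a minimal illustration, assuming a cubic kernel, FLOPs counted as multiply-accumulates, no bias terms, and an interaction count obtained by summing the per-filter pair count over all output filters; the class name `Conv3dSpec` and its fields are hypothetical.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Conv3dSpec:
    c_in: int        # input channels
    c_out: int       # output channels (number of filters)
    k: int = 3       # side of the cubic kernel
    groups: int = 1  # 1 = conventional, c_in = depthwise

    def __post_init__(self):
        if self.c_in % self.groups or self.c_out % self.groups:
            raise ValueError("groups must divide both channel counts")

    def params(self) -> int:
        # Each filter holds k^3 weights for each of its c_in / groups inputs.
        return self.c_out * (self.c_in // self.groups) * self.k ** 3

    def flops(self, t: int, h: int, w: int) -> int:
        # One multiply-accumulate per weight at each of the t*h*w output voxels.
        return self.params() * t * h * w

    def interactions(self) -> int:
        # Pairs of input channels joined by a filter, summed over filters.
        return self.c_out * comb(self.c_in // self.groups, 2)


if __name__ == "__main__":
    for groups in (1, 8, 64):
        spec = Conv3dSpec(64, 64, groups=groups)
        print(groups, spec.params(), spec.flops(8, 56, 56), spec.interactions())
```

Going from 1 to 8 groups divides parameters and FLOPs by 8 but divides the interaction count by 72, which is the asymmetry the text draws attention to.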
3.2 Channel Separation
We define channel-separated convolutional networks (CSN) as 3D CNNs in which all convolutional layers (except for conv1) are either 1×1×1 conventional convolutions or $k \times k \times k$ depthwise convolutions (where, typically, $k = 3$). Conventional convolutional networks model channel interactions and local interactions (i.e., spatial or spatiotemporal) jointly in their 3D convolutions. Instead, channel-separated networks decompose these two types of interactions into two distinct layers: 1×1×1 conventional convolutions for channel interaction (but no local interaction) and depthwise convolutions for local spatiotemporal interactions (but no channel interaction). Channel separation may be applied to any traditional convolution by decomposing it into a 1×1×1 convolution and a depthwise convolution.
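The division of labour between the two layer types can be seen in a toy implementation. The sketch below is a simplification: it works on 1D signals rather than 3D volumes, uses plain Python lists, zero padding, and cross-correlation (as is customary in deep learning); all names are illustrative.

```python
def pointwise(x, weights):
    """1x1 convolution: mixes channels at each position, never across positions.

    x is a list of channels, each a list of values; weights[o][c] is the
    weight from input channel c to output channel o.
    """
    positions = range(len(x[0]))
    return [[sum(w * x[c][p] for c, w in enumerate(row)) for p in positions]
            for row in weights]


def depthwise(x, kernels):
    """Depthwise convolution: each channel is filtered by its own odd-length
    kernel (zero padded), so neighbouring positions mix but channels do not."""
    out = []
    for channel, kernel in zip(x, kernels):
        radius = len(kernel) // 2
        padded = [0.0] * radius + list(channel) + [0.0] * radius
        out.append([sum(kernel[j] * padded[p + j] for j in range(len(kernel)))
                    for p in range(len(channel))])
    return out


if __name__ == "__main__":
    x = [[1, 0, 0, 0],   # channel 0: impulse at position 0
         [0, 0, 0, 2]]   # channel 1: impulse at position 3
    local = depthwise(x, [[1, 1, 1], [1, 1, 1]])  # spreads each impulse locally
    mixed = pointwise(local, [[1, 1]])            # sums the two channels
    print(local)
    print(mixed)
```

Applying the two functions in sequence gives a channel-separated replacement for one dense convolution: only `pointwise` lets channels interact, and only `depthwise` looks at neighbouring positions.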
We introduce the term “channel-separated” to highlight the importance of channel interaction; we also point out that the existing term “depth-separable” is only a good description when applied to tensors with two spatial dimensions and one channel dimension. We note that channel-separated networks have been proposed in Xception [7] and MobileNet [15] for image classification. In video classification, separated convolutions have been used in P3D [22], R(2+1)D [29], and S3D [36], but to decompose 3D convolutions into separate temporal and spatial convolutions. The network architectures presented in this work are designed to separate channel interactions from spatiotemporal interactions.
3.3 Example: Channel-Separated Bottleneck Block
Figure 2 presents two ways of factorizing a 3D bottleneck block using channel-separated convolutional networks. Figure 2(a) presents a standard 3D bottleneck block, while Figures 2(b) and 2(c) present interaction-preserved and interaction-reduced channel-separated bottleneck blocks, respectively.
The interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing the 3×3×3 convolution in (a) with a 1×1×1 traditional convolution and a 3×3×3 depthwise convolution (shown in Figure 2(b)). This block significantly reduces the parameters and FLOPs of the traditional 3×3×3 convolution, but preserves all channel interactions via the newly-added 1×1×1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).
The interaction-reduced channel-separated bottleneck block is derived from the interaction-preserved bottleneck block by removing the extra 1×1×1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1×1×1 convolutions (usually interpreted respectively as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared to those shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).
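The difference between the three bottleneck designs can be made concrete by tallying parameters and channel-interaction pairs per block. The sketch below assumes a block that maps 256 channels to a 64-channel bottleneck and back (widths chosen for illustration), ignores biases and normalization layers, and counts interactions by summing per-filter pairs over filters; the helper names are hypothetical.

```python
from math import comb


def layer(c_in, c_out, k, groups=1):
    """Parameters and channel-interaction pairs of one 3D convolution."""
    fan_in = c_in // groups
    return {"params": c_out * fan_in * k ** 3,
            "interactions": c_out * comb(fan_in, 2)}


def totals(layers):
    return {key: sum(l[key] for l in layers) for key in ("params", "interactions")}


def standard_bottleneck(c, m):
    # 1x1x1 reduce, dense 3x3x3, 1x1x1 expand
    return totals([layer(c, m, 1), layer(m, m, 3), layer(m, c, 1)])


def ip_bottleneck(c, m):
    # dense 3x3x3 replaced by a 1x1x1 followed by a depthwise 3x3x3
    return totals([layer(c, m, 1), layer(m, m, 1),
                   layer(m, m, 3, groups=m), layer(m, c, 1)])


def ir_bottleneck(c, m):
    # same as ip, without the extra 1x1x1
    return totals([layer(c, m, 1), layer(m, m, 3, groups=m), layer(m, c, 1)])


if __name__ == "__main__":
    for name, block in (("standard", standard_bottleneck),
                        ("ip", ip_bottleneck), ("ir", ir_bottleneck)):
        print(name, block(256, 64))
```

Under these assumptions the interaction-preserved block keeps exactly the interaction count of the standard block with far fewer parameters, while the interaction-reduced block gives up only the pairs contributed by the middle layer.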
3.4 Channel Interactions in Convolutional Blocks
The interaction-preserving and interaction-reducing blocks in section 3.3 are just two architectures in a large spectrum. In this subsection we present a number of convolutional block designs, obtained by progressively increasing the amount of grouping. The blocks differ in terms of compute cost, parameter count and, more importantly, channel interactions.
Group convolution applied to ResNet blocks. Figure 3(a) presents a ResNet [14] simple block consisting of two 3×3×3 convolutional layers. Figure 3(b) shows the simple-G block, where the 3×3×3 layers now use grouped convolution. Likewise, Figure 3(c) presents simple-D, with two depthwise layers. Because depthwise convolution requires the same number of input and output channels, we optionally add a 1×1×1 convolutional layer (shown in the dashed rectangle) in blocks that change the number of channels.
Figure 4(a) presents a ResNet bottleneck block consisting of two 1×1×1 and one 3×3×3 convolutional layers. Figures 4(b)-(c) present bottleneck-G and bottleneck-D, where the 3×3×3 convolutions are grouped and depthwise, respectively. If we further apply group convolution to the two 1×1×1 convolutional layers, the block becomes a bottleneck-DG, as illustrated in Figure 4(d). In all cases, the 3×3×3 convolutional layers always have the same number of input and output channels.
There are some deliberate analogies to existing architectures here. First, bottleneck-G (Figure 4(b)) is exactly a ResNeXt block [35], and bottleneck-D is its depthwise variant. Bottleneck-DG (Figure 4(d)) resembles the ShuffleNet block [38], without the channel shuffle and without the downsampling projection by average pooling and concatenation. The progression from simple to simple-D is similar to moving from ResNet to Xception (though Xception has many more 1×1×1 convolutions). We omit certain architecture-specific features in order to better understand the role of grouping and channel interactions.
4 Ablation Experiment
This empirical study will allow us to shed light on the important factors in the performance of channel-separated networks and will lead us to two main findings:

1. We will empirically demonstrate that, within the family of architectures we consider, similar depth and similar channel interaction count imply similar performance. In particular, the interaction-preserving blocks reduce compute by significant margins but preserve channel interactions, with only a slight loss in accuracy for shallow networks and an increase in accuracy for deeper networks.
2. In traditional 3×3×3 convolutions all feature maps interact with each other. Particularly for deeper networks, this causes overfitting.
4.1 Experimental setup
Dataset. We use Kinetics-400 [17] for all ablation experiments in this section. Kinetics is a standard benchmark for action recognition in videos. It contains about 260k videos of 400 different human action categories. We use the train split (240k videos) for training and the validation split (20k videos) for evaluating different models.
Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of our ablation experiments in this section. More specifically, our model takes clips with a size of L×224×224, where L is the number of frames and 224 is the height and width of the cropped frame. Two spatial downsampling layers (1×2×2) are applied at conv1 and at pool1, and three spatiotemporal downsampling layers (2×2×2) are applied at conv3_1, conv4_1 and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size L/8×7×7 is applied to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification. We note that the per-stage hyperparameters in Table 1 define the network width and the network depth.

layer name | output size | ResNet3D-simple | ResNet3D-bottleneck
conv1 | L×112×112 | 3×7×7, 64, stride 1×2×2 | 3×7×7, 64, stride 1×2×2
pool1 | L×56×56 | max, 1×3×3, stride 1×2×2 | max, 1×3×3, stride 1×2×2
conv2_x | L×56×56 | |
conv3_x | L/2×28×28 | |
conv4_x | L/4×14×14 | |
conv5_x | L/8×7×7 | |
pool5 | 1×1×1 | spatiotemporal avg pool, fc layer with softmax | spatiotemporal avg pool, fc layer with softmax

Table 1: ResNet3D architectures used in our experiments. The final global spatiotemporal pooling layer yields a feature vector, which is fed to a fully-connected layer that outputs the class probabilities through a softmax.
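The output sizes of the base architecture follow from applying the strides of the downsampling layers in order. The sketch below is an illustration only: it assumes padding such that each stride divides the corresponding dimension (rounding up), and the stage list and function name are written here for this example.

```python
# (stage name, (temporal, height, width) stride) for the downsampling layers
STRIDED_STAGES = [
    ("conv1", (1, 2, 2)),
    ("pool1", (1, 2, 2)),
    ("conv3_1", (2, 2, 2)),
    ("conv4_1", (2, 2, 2)),
    ("conv5_1", (2, 2, 2)),
]


def trace_output_sizes(frames, height, width, stages=STRIDED_STAGES):
    """Return the (frames, height, width) size after each strided stage."""
    sizes = {}
    size = (frames, height, width)
    for name, stride in stages:
        # Ceiling division: -(-a // b) == ceil(a / b) for positive integers.
        size = tuple(-(-dim // s) for dim, s in zip(size, stride))
        sizes[name] = size
    return sizes


if __name__ == "__main__":
    for stage, size in trace_output_sizes(8, 224, 224).items():
        print(stage, size)
```

For an 8-frame 224×224 clip the final tensor is 1×7×7, which is the extent covered by the global average pooling; for a longer clip of L frames the temporal extent is L/8.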
Data augmentation. We use both spatial and temporal jittering for augmentation. More specifically, video frames are scaled such that the shorter edge of each frame becomes $s$ while the original aspect ratio is maintained. During training, $s$ is picked uniformly at random from a fixed range. Each clip is then generated by randomly cropping windows of size 224×224. Temporal jittering is also applied during training by randomly selecting a starting frame and decoding L frames. For the ablation experiments in this section we train and evaluate models with clips of 8 frames (L = 8), skipping every other frame (all videos are pre-processed to 30fps, so the newly-formed clips are effectively at 15fps).
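The temporal and spatial jittering steps can be sketched as index computations, independent of any video decoding library. In the example below the function names, the 300-frame video length, and the 256×340 frame size are hypothetical values used only to exercise the code.

```python
import random


def sample_clip_indices(num_frames, clip_len, frame_step, rng):
    """Temporal jittering: pick a random start frame, then take clip_len
    frames spaced frame_step apart (frame_step=2 skips every other frame)."""
    span = (clip_len - 1) * frame_step + 1
    if num_frames < span:
        raise ValueError("video is too short for the requested clip")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * frame_step for i in range(clip_len)]


def sample_crop_corner(height, width, crop, rng):
    """Spatial jittering: top-left corner of a random crop x crop window."""
    if min(height, width) < crop:
        raise ValueError("frame is smaller than the crop")
    return rng.randrange(height - crop + 1), rng.randrange(width - crop + 1)


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the example is repeatable
    print(sample_clip_indices(300, clip_len=8, frame_step=2, rng=rng))
    print(sample_crop_corner(256, 340, crop=224, rng=rng))
```

With 30fps source video, `frame_step=2` produces clips that are effectively at 15fps, matching the ablation setting.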
Training. We train our models with synchronous distributed SGD on GPU clusters using Caffe2 [2], with 16 multi-GPU machines and a fixed mini-batch of clips per GPU. Following [29], we set the epoch size to 1M clips because of temporal jittering augmentation, even though the number of training examples is only about 240K. We use the half-cosine period learning rate schedule presented in [21], in which the learning rate at the $i$-th iteration is set to $\eta \cdot 0.5\,[\cos(\frac{i}{I}\pi) + 1]$, where $I$ is the maximum number of training iterations and $\eta$ is the initial learning rate. We use model warm-up [12] in the first epochs of training, and the remaining epochs follow the cosine learning rate schedule.

Testing. We report clip top-1 accuracy and video top-1 accuracy. For video top-1, we use center crops of clips uniformly sampled from the video and average these clip predictions to obtain the final video prediction.
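The half-cosine learning rate schedule with warm-up described under Training can be sketched as a single function. This is an illustration under stated assumptions: warm-up is taken to be linear, the cosine phase is taken to start counting once warm-up ends, and the numeric values in the usage example are placeholders rather than the settings used in the experiments.

```python
import math


def learning_rate(iteration, max_iterations, base_lr,
                  warmup_iterations=0, warmup_start_lr=0.0):
    """Linear warm-up to base_lr, then half-period cosine decay towards zero."""
    if iteration < warmup_iterations:
        fraction = iteration / warmup_iterations
        return warmup_start_lr + fraction * (base_lr - warmup_start_lr)
    decay_steps = max(1, max_iterations - warmup_iterations)
    progress = (iteration - warmup_iterations) / decay_steps
    return 0.5 * base_lr * (math.cos(math.pi * progress) + 1.0)


if __name__ == "__main__":
    # Placeholder settings: 100 iterations, 10 of them warm-up, peak rate 0.1.
    for i in (0, 5, 10, 55, 100):
        print(i, round(learning_rate(i, 100, 0.1, warmup_iterations=10), 4))
```

Without warm-up the function starts at the initial rate, reaches half of it at the midpoint, and decays to zero at the last iteration.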
4.2 Reducing FLOPs, preserving interactions
In this ablation, we use CSNs to vary both FLOPs and channel interactions. Within this architectural family, channel interactions are a good predictor of performance, whereas FLOPs are not. In particular, FLOPs can be reduced significantly while preserving interaction count.
Table 2 presents results of our interaction-reduced CSNs (ir-CSNs) and interaction-preserved CSNs (ip-CSNs) and compares them with the ResNet3D baseline using different numbers of layers. In the shallow network setting (with 26 layers), both the ir-CSN and the ip-CSN have lower accuracy than ResNet3D. The ir-CSN provides a computational savings of 7x but causes a 2.9% drop in accuracy. The ip-CSN yields a savings of 6x in FLOPs with a much smaller drop in accuracy (0.7%). We note that all of the shallow models have a very low count of channel interactions: ResNet3D and ip-CSN have about 0.42 gigapairs, while ir-CSN has only 0.27 gigapairs (about 64% of the original). This observation suggests that shallow instances of ResNet3D benefit from their extra parameters, but preserving channel interactions reduces the gap for ip-CSN.
In deeper settings, both ir-CSNs and ip-CSNs actually outperform ResNet3D by about 0.7-1.4%. Furthermore, the gap between ir-CSN and ip-CSN becomes smaller. We attribute this shrinking of the gap to the fact that, in the 50-layer and 101-layer configurations, ir-CSN has nearly the same number of channel interactions as ip-CSN, since most interactions stem from the 1×1×1 layers. One may wonder whether ip-CSNs outperform ResNet3D and ir-CSNs because they have more nonlinearities (ReLU). To answer this question, we trained ip-CSNs without ReLUs between the 1×1×1 and the 3×3×3 layers and observed no notable difference in performance. We can observe that traditional 3×3×3 convolutions contain many parameters which can be removed without an accuracy penalty in the deeper models. We investigate this next.
model | depth | video@1 (%) | FLOPs (×10^9) | params (×10^6) | interactions (×10^9)
ResNet3D | 26 | 65.3 | 14.3 | 20.4 | 0.42
ir-CSN | 26 | 62.4 | 4.0 | 1.7 | 0.27
ip-CSN | 26 | 64.6 | 5.0 | 2.4 | 0.42
ResNet3D | 50 | 69.4 | 29.5 | 46.9 | 5.68
ir-CSN | 50 | 70.3 | 10.6 | 13.1 | 5.42
ip-CSN | 50 | 70.8 | 11.9 | 14.3 | 5.68
ResNet3D | 101 | 70.6 | 44.7 | 85.9 | 8.67
ir-CSN | 101 | 71.3 | 14.1 | 22.1 | 8.27
ip-CSN | 101 | 71.8 | 15.9 | 24.5 | 8.67
4.3 What makes CSNs outperform ResNet3D?
In section 4.2 we found that both ir-CSNs and ip-CSNs consistently outperform the ResNet3D baseline when there are enough channel interactions, while having fewer parameters and greatly reducing FLOPs. It is natural to ask “what helps CSNs in these scenarios?”. Figure 5 helps us answer this question. The plot shows the evolution of the training and validation errors of ip-CSN and ResNet3D in both the 50-layer and the 101-layer configurations. Compared to ResNet3D, ip-CSN has higher training errors but lower testing errors. This suggests that the channel-separated convolutions of CSN regularize the model and prevent overfitting.
4.4 The effects of different blocks in group convolutional networks
In this section we start from our base architecture (shown in Table 1) then ablatively replace the convolutional blocks with the blocks presented in section 3.4. We again find that channel interaction plays an important role in understanding the results.
Naming convention. Since the ablation in this section will be considering several different convolutional blocks, to simplify the presentation, we name each architecture by block type (as presented in section 3.4) and total number of blocks, as shown in the last column of Table 3.
Model | block | config | name
ResNet3D-18 | simple | [2, 2, 2, 2] | simple-8
ResNet3D-26 | bottleneck | [2, 2, 2, 2] | bottleneck-8
ResNet3D-34 | simple | [3, 4, 6, 3] | simple-16
ResNet3D-50 | bottleneck | [3, 4, 6, 3] | bottleneck-16
Figure 6 presents the results of our convolutional block ablation study. It plots the video top-1 accuracy on the Kinetics validation set against the model computational cost (# FLOPs). We note that, in this experiment, we use our base architecture with two different numbers of blocks (8 and 16) and just vary the type of convolutional block and the number of groups to study the tradeoffs. Figure 6(a) presents our ablation experiment with simple-X-8 and bottleneck-X-8 architectures (where X can be none, G, or D, or even DG in the case of the bottleneck block). Similarly, Figure 6(b) presents our ablation experiment with simple-X-16 and bottleneck-X-16 architectures. We can observe the computation/accuracy effects of the group convolution transformation on our base architectures. Reading each curve from right to left (i.e., in decreasing accuracy), we see simple-X transforming from the simple block to simple-G (with an increasing number of groups), then to the simple-D block. For bottleneck-X, reading right to left shows the bottleneck block, which then transforms to bottleneck-G (with increasing groups), bottleneck-D, and finally bottleneck-DG (again with increasing groups).
While the general downward trend is expected as we decrease parameters and FLOPs, the shapes of the simple and bottleneck curves are quite different. The simple-X models degrade smoothly, whereas bottleneck-X stays relatively flat (particularly bottleneck-16, which actually increases slightly as we decrease FLOPs) before dropping sharply.
In order to better understand the different behaviors of the simple-X-Y and bottleneck-X-Y curves (blue vs. red curves) in Figure 6 and the main reason behind the turning points at the bottleneck-D block (green star markers in Figure 6), we further plot all of these models together in another view: accuracy as a function of channel interactions (Figure 7).
As shown in Figure 7, the number of channel interactions in simple-X-Y models (blue squares and red diamonds) drops quadratically when group convolution is applied to their 3×3×3 layers. In contrast, the number of channel interactions in bottleneck-X-Y models (green circles and purple triangles) drops only marginally when group convolution is applied to their 3×3×3 layers, since they still have many 1×1×1 layers (this can be seen in the presence of two marker clusters which are circled in red: the first cluster includes purple triangles near the top-right corner and the other one includes green circles near the center of the figure). The channel interaction in bottleneck-X-Y starts to drop significantly when group convolution is applied to their 1×1×1 layers, which causes a sharp drop in model accuracy. This explains why there is no turning point in the simple-X-Y curves and why there are turning points in the bottleneck-X-Y curves. It also confirms the important role of channel interactions in group convolutional networks.
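The contrast between the two block families can be reproduced numerically by sweeping the number of groups in the 3×3×3 layers. The sketch below assumes a simple block with 64 channels and a bottleneck block that maps 256 channels to 64 and back (illustrative widths), with interactions counted per layer as filters times pairs of inputs seen by each filter; the function names are hypothetical.

```python
from math import comb


def interactions(c_in, c_out, groups=1):
    """Channel-interaction pairs of one convolutional layer."""
    return c_out * comb(c_in // groups, 2)


def simple_block(c, groups):
    """Two 3x3x3 layers, both using the given number of groups."""
    return 2 * interactions(c, c, groups)


def bottleneck_block(c, m, groups):
    """1x1x1 reduce, grouped 3x3x3, 1x1x1 expand; only the middle layer is grouped."""
    return interactions(c, m) + interactions(m, m, groups) + interactions(m, c)


if __name__ == "__main__":
    base_simple = simple_block(64, 1)
    base_bottleneck = bottleneck_block(256, 64, 1)
    for groups in (1, 2, 4, 8, 16, 32, 64):
        print(groups,
              round(simple_block(64, groups) / base_simple, 3),
              round(bottleneck_block(256, 64, groups) / base_bottleneck, 3))
```

Under these assumptions the simple block loses about three quarters of its interactions with just two groups and all of them when depthwise, whereas the bottleneck block retains more than 95% even in the depthwise case, because its 1×1×1 layers are untouched.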
The bottleneck-D block (also known as ir-CSN) provides the best computation/accuracy tradeoff. For simple blocks, increasing the number of groups causes a continuous drop in accuracy. However, in the case of the bottleneck block (i.e., bottleneck-X-Y) the accuracy curve remains almost flat as we increase the number of groups until arriving at the bottleneck-D block, at which point the accuracy degrades dramatically when the block is turned into a bottleneck-DG (group convolution applied to the 1×1×1 layers). We conclude that a bottleneck-D block (or ir-CSN) gives the best computation/accuracy tradeoff in this family of ResNet-style blocks, due to its high channel-interaction count.
5 Comparison with the State-of-the-Art
In this section, we evaluate our proposed architectures, i.e., ir-CSNs and ip-CSNs, and compare them with state-of-the-art methods.
Datasets. We evaluate our CSNs on two public benchmarks: Sports1M [16] and Kinetics [17] (version 1 with 400 action categories). Sports1M is a large-scale action recognition dataset which consists of about 1.1 million videos from 487 classes of fine-grained sports. Kinetics is a medium-size dataset which includes about 300K videos of 400 different human action categories. For Sports1M, we use the public train and test splits provided with the dataset. For Kinetics, we use the train split for training and the validation set for testing.
Training and testing. Unlike in the ablation experiments of the previous section, here we train our CSNs with 32-frame clip inputs (L = 32) with a sampling rate of 2 (skipping every other frame), following the practice described in [29]. All the other training settings, such as data augmentation and optimization parameters, are the same as those described in the previous section. For testing, we uniformly sample 10 clips from each testing video. Each clip is scaled such that its shorter edge becomes 256, then cropped to 256×256 (i.e., each input clip has a size of 32×256×256). Each crop is passed through the network to be evaluated as in a fully-convolutional network (FCN). Since our network was trained with a fully-connected layer, during FCN inference this FC layer is transformed into an equivalent 1×1×1 convolutional layer with weights copied from the FC layer.
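Video-level testing (uniformly sampled clips whose predictions are averaged, as described in section 4.1) can be sketched as follows. The network itself is left out: the per-clip class probabilities are taken as given, and the function names, video length, and probability values are hypothetical.

```python
def uniform_clip_starts(num_frames, clip_span, num_clips):
    """Evenly spaced start frames so the clips cover the whole video."""
    last_start = max(0, num_frames - clip_span)
    if num_clips == 1:
        return [last_start // 2]
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def average_predictions(clip_probs):
    """Average per-clip class probabilities into one video-level prediction."""
    count = len(clip_probs)
    return [sum(column) / count for column in zip(*clip_probs)]


def top1(probs):
    """Index of the most probable class."""
    return max(range(len(probs)), key=probs.__getitem__)


if __name__ == "__main__":
    # A 32-frame clip sampled every other frame spans (32 - 1) * 2 + 1 = 63 frames.
    print(uniform_clip_starts(num_frames=300, clip_span=63, num_clips=10))
    video_probs = average_predictions([[0.6, 0.4], [0.2, 0.8]])
    print(video_probs, top1(video_probs))
```

Averaging probabilities rather than voting on per-clip labels lets confident clips outweigh ambiguous ones, which is the behaviour the averaging step relies on.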
Results on Sports1M. Table 4 reports the results of our ir-CSNs and compares them with current state-of-the-art methods on Sports1M. Our ir-CSN-152 outperforms C3D [28] by 14.4%, P3D [22] by 9.1%, Conv Pooling [37] by 3.8%, and R(2+1)D [29] by at least 2.2% on video top-1 accuracy while being 2-4x faster than R(2+1)D. Our ir-CSN-101, even with a smaller number of FLOPs, still outperforms all previous work by good margins.
Method | input | video@1 | video@5 | GFLOPs×crops
C3D [28] | RGB | 61.1 | 85.2 | NA
P3D [22] | RGB | 66.4 | 87.4 | NA
Conv pool [37] | RGB+OF | 71.7 | 90.4 | NA
R(2+1)D [29] | RGB | 73.0 | 91.5 | 152×dense
R(2+1)D [29] | RGB+OF | 73.3 | 91.9 | 305×dense
ir-CSN-101 | RGB | 74.9 | 91.6 | 56.5×10
ir-CSN-152 | RGB | 75.5 | 92.7 | 74.0×10
Results on Kinetics. We train our proposed CSN models on Kinetics and compare them with current state-of-the-art methods. Besides training from scratch, we also fine-tune our CSNs with weights initialized from models pre-trained on Sports1M. For a fair comparison, we compare our CSNs with the methods that use only RGB as input. Table 5 presents the results of our CSNs and compares them with current methods. Our CSNs, even trained from scratch, already outperform all of the previously published work, except for non-local networks [34]. Our ir-CSN-152, pre-trained on Sports1M, significantly outperforms I3D [3], R(2+1)D [29], and S3D-G [36] by 7.4%, 4.2%, and 3.8%, respectively. It also outperforms recent work: A^2-Net [4] by 3.9% and Global-reasoning networks [6] by 2.4%. Finally, our ir-CSN-152 slightly outperforms non-local networks [34] by 0.8% and SlowFast networks [9] by 0.6% while being 11x and 3.5x faster than non-local and SlowFast networks, respectively. Our ir-CSN-152 is still less accurate than SlowFast networks augmented with non-local networks. We note that our CSNs use only 10 crops per testing video while other methods use dense sampling [3, 36, 29], i.e., they sample all possible overlapping clips, which normally requires running inference on a few hundred clips per testing video.
Method | pretrain | video@1 | video@5 | GFLOPs×crops
ResNeXt [13] | none | 65.1 | 85.7 | NA
ARTNet(d) [31] | none | 69.2 | 88.3 | 24×250
I3D [3] | ImageNet | 71.1 | 89.3 | 108×dense
TSM [20] | ImageNet | 72.5 | 90.7 | 65×NA
MFNet [5] | ImageNet | 72.8 | 90.4 | 11×NA
Inception-ResNet [1] | ImageNet | 73.0 | 90.9 | NA
R(2+1)D [29] | Sports1M | 74.3 | 91.4 | 152×dense
A^2-Net [4] | ImageNet | 74.6 | 91.5 | 41×NA
S3D-G [36] | ImageNet | 74.7 | 93.4 | 71×dense
D3D [26] | ImageNet | 75.9 | NA | NA
GloRe [6] | ImageNet | 76.1 | NA | 55×NA
NL I3D [34] | ImageNet | 77.7 | 93.3 | 359×30
SlowFast [9] | none | 77.9 | 93.2 | 106×30
SlowFast+NL [9] | none | 79.0 | 93.6 | 115×30
ir-CSN-101 | none | 75.7 | 92.0 | 73.8×10
ip-CSN-101 | none | 76.3 | 92.0 | 83.0×10
ir-CSN-152 | none | 76.3 | 92.1 | 96.7×10
ip-CSN-152 | none | 76.5 | 92.4 | 108.8×10
ir-CSN-101 | Sports1M | 77.5 | 93.1 | 73.8×10
ir-CSN-152 | Sports1M | 78.5 | 93.4 | 96.7×10
6 Conclusion
We have presented Channel-Separated Convolutional Networks (CSN) as a way of factorizing 3D convolutions. The proposed CSN-based factorization not only helps to significantly reduce the computational cost, but also improves the accuracy when there are enough channel interactions in the networks. Our proposed architecture, ir-CSN, significantly outperforms existing methods and obtains state-of-the-art accuracy on two major benchmarks: Sports1M and Kinetics. The model is also multiple times faster than current competing networks.
Acknowledgement. The authors would like to thank Kaiming He for insightful discussions about the architectures, and Haoqi Fan for help in improving our training infrastructure.
References
[1] Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. CoRR, abs/1708.03805, 2017.
[2] Caffe2 Team. Caffe2: A new lightweight, modular, and scalable deep learning framework. https://caffe2.ai/.
[3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[4] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. A^2-Nets: Double attention networks. In NeurIPS, pages 350–359, 2018.
[5] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. In ECCV, 2018.
[6] Y. Chen, M. Rohrbach, Z. Yan, S. Yan, J. Feng, and Y. Kalantidis. Graph-based global reasoning networks. In CVPR, 2019.
[7] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In Proc. ICCV VS-PETS, 2005.
[9] C. Feichtenhofer, H. Fan, J. Malik, and K. He. SlowFast networks for video recognition. CoRR, abs/1812.03982, 2018.
[10] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In NIPS, 2016.
[11] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
[12] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[13] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In CVPR, 2018.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[15] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[16] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[17] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[18] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[19] I. Laptev and T. Lindeberg. Space-time interest points. In ICCV, 2003.
[20] J. Lin, C. Gan, and S. Han. Temporal shift module for efficient video understanding. CoRR, abs/1811.08383, 2018.
[21] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with restarts. In ICLR, 2017.
[22] Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, 2017.
[23] S. Sadanand and J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
[24] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
[25] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
[26] J. C. Stroud, D. A. Ross, C. Sun, J. Deng, and R. Sukthankar. D3D: Distilled 3D networks for video action recognition. CoRR, abs/1812.08249, 2018.
[27] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[28] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[29] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, 2018.
[30] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
[31] L. Wang, W. Li, W. Li, and L. V. Gool. Appearance-and-relation networks for video classification. In CVPR, 2018.
[32] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[33] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In CVPR, 2016.
[34] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
[35] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[36] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. In ECCV, 2018.
[37] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, pages 4694–4702, 2015.
[38] X. Zhang, X. Zhou, M. Lin, and J. Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.