Video Classification with Channel-Separated Convolutional Networks

04/04/2019 · by Du Tran et al.

Group convolution has been shown to offer great computational savings in various 2D convolutional architectures for image classification. It is natural to ask: 1) if group convolution can help to alleviate the high computational cost of video classification networks; 2) what factors matter the most in 3D group convolutional networks; and 3) what are good computation/accuracy trade-offs with 3D group convolutional networks. This paper studies different effects of group convolution in 3D convolutional networks for video classification. We empirically demonstrate that the amount of channel interactions plays an important role in the accuracy of group convolutional networks. Our experiments suggest two main findings. First, it is a good practice to factorize 3D convolutions by separating channel interactions and spatiotemporal interactions as this leads to improved accuracy and lower computational cost. Second, 3D channel-separated convolutions provide a form of regularization, yielding lower training accuracy but higher test accuracy compared to 3D convolutions. These two empirical findings lead us to design an architecture -- Channel-Separated Convolutional Network (CSN) -- which is simple, efficient, yet accurate. On Kinetics and Sports1M, our CSNs significantly outperform state-of-the-art models while being 11-times more efficient.


1 Introduction

Video classification has witnessed much progress in the last few years. Most of the accuracy improvements have resulted from the introduction of new powerful architectures [3, 29, 22, 36, 34]. However, many of these architectures are built on relatively expensive 3D spatiotemporal convolutions. Furthermore, these convolutions are typically computed across all the channels in each layer. For C channels, k×k×k kernels, and a T×H×W output, a 3D convolutional layer has complexity O(C²·k³·T·H·W), as opposed to the O(C²·k²·H·W) cost of its 2D counterpart. For both foundational and practical reasons, it is natural to ask which parameters in these large 4D kernels matter the most.

Kernel factorizations have been applied in several settings to reduce compute and improve accuracy. For example, several recent video architectures factor 3D convolution in space and time: examples include P3D [22], R(2+1)D [29], and S3D [36]. In these architectures, a 3D convolution is replaced with a 2D convolution (in space) followed by a 1D convolution (in time). This factorization can be leveraged to increase accuracy and/or to reduce computation. In the still-image domain, separable convolution [7] is used to factor the convolution of 2D filters into a pointwise convolution followed by a depthwise convolution. When the number of channels is large compared to the spatial kernel size k², which is usually the case, this reduces FLOPs by roughly k² for images. For the case of 3D video kernels, the FLOP reduction is even more dramatic: roughly k³.
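To make the size of this reduction concrete, the back-of-the-envelope calculation below (a minimal Python sketch; the channel count of 256 is an arbitrary illustrative choice) compares the per-position cost of a full convolution with that of its pointwise + depthwise factorization, in both 2D and 3D.

```python
# Per-position multiply-adds for a layer with C input and C output channels:
#   full k^d convolution:   C * C * k**d
#   pointwise + depthwise:  C * C  +  C * k**d
def reduction_factor(channels: int, k: int, dims: int) -> float:
    """Ratio of full-convolution cost to channel-separated cost."""
    full = channels * channels * k ** dims
    separated = channels * channels + channels * k ** dims
    return full / separated

# With C = 256 and 3x3 / 3x3x3 kernels, the savings approach k^2 and k^3:
print(round(reduction_factor(256, k=3, dims=2), 1))  # ~8.7 (k^2 = 9)
print(round(reduction_factor(256, k=3, dims=3), 1))  # ~24.4 (k^3 = 27)
```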

Inspired by the accuracy gains and good computational savings demonstrated by 2D separable convolutions in image classification [7, 15, 38], this paper proposes a set of architectures for video classification – 3D Channel-Separated Networks (CSN) – in which all convolutional operations are separated into either pointwise 1×1×1 or depthwise 3×3×3 convolutions. Our experiments reveal the crucial importance of channel interaction in the design of CSNs. In particular, we show that excellent accuracy/cost balances can be obtained with CSNs by leveraging channel separation to reduce FLOPs and parameters, as long as high values of channel interaction are retained. We propose two factorizations, which we call interaction-reduced and interaction-preserved. Compared to 3D CNNs, both our interaction-reduced and interaction-preserved CSNs provide higher accuracy and FLOP savings of about 6-7× when there is enough channel interaction. We experimentally show that the channel factorization in CSNs acts as a regularizer, leading to higher training error but better generalization. Finally, we show that our proposed CSNs significantly outperform current state-of-the-art methods on Sports1M and Kinetics while being 11 times faster.

2 Related Work

Group convolution. Group convolution was adopted in AlexNet [18] as a way to overcome GPU memory limitations. Depthwise convolution was introduced in MobileNet [15] as an attempt to optimize model size and computational cost for mobile applications. Chollet [7] built an extreme version of Inception [27] based on 2D depthwise convolution, named Xception, where the Inception block was redesigned to include multiple separable convolutions. Concurrently, Xie et al. proposed ResNeXt [35] by equipping ResNet [14] bottleneck blocks with groupwise convolution. Further architecture improvements have also been made for mobile applications. ShuffleNet [38] further reduced the computational cost of the bottleneck block with both depthwise and group convolution. MobileNetV2 [24] improved MobileNet [15] by switching from a VGG-style to a ResNet-style network and introducing an “inverted residual” bottleneck block. All of these architectures are based on 2D CNNs and are applied to image classification, while our work focuses on 3D group CNNs for video classification.

Video classification. In the last few years, video classification has seen a major paradigm shift: a move from hand-designed features [19, 8, 23, 30] to deep networks that learn features and classify end-to-end [28, 16, 25, 10, 32, 33, 11]. This transformation was enabled by the introduction of large-scale video datasets [16, 17] and massively parallel computing hardware, i.e., GPUs. Carreira and Zisserman [3] recently proposed to inflate 2D convolutional networks pre-trained on images to 3D for video classification. Wang et al. [34] proposed non-local neural networks to capture long-range dependencies in videos. ARTNet [31] decouples spatial and temporal modeling into two parallel branches. Similarly, 3D convolutions can also be decomposed into a Pseudo-3D convolutional block as in P3D [22] or factorized convolutions as in R(2+1)D [29] or S3D [36]. 3D group convolution was also applied to video classification in ResNeXt [13] and Multi-Fiber Networks [5] (MFNet).

Among previous approaches, our work is most closely related to the following architectures. First, our CSNs are similar to Xception [7] in the idea of using channel-separated convolutions. Xception factorizes 2D convolution in channel and space for object classification, while our CSNs factorize 3D convolution in channel and space-time for action recognition. In addition, Xception uses simple blocks, while our CSNs use bottleneck blocks. The ir-CSN variant of our model shares similarities with ResNeXt [35] and its 3D version [13] in the use of bottleneck blocks with group/depthwise convolution. The main difference is that ResNeXt [35, 13] uses group convolution in its 3×3×3 layers with a fixed number of groups (e.g., 32), while our ir-CSN uses depthwise convolutions in all 3×3×3 layers, which makes our architecture fully channel-separated. As we will show in section 4.2, making our network fully channel-separated not only reduces compute significantly, but also improves model accuracy through better regularization. We emphasize that our contribution includes not only the design of CSN architectures, but also a systematic empirical study of the role of channel interactions in the accuracy of CSNs.

3 Channel-Separated Convolutional Networks

In this section, we discuss the concept of 3D channel-separated networks. Since channel-separated networks use group convolution as their main building block, we first provide some background about group convolution.

3.1 Background

Group convolution. Conventional convolution is implemented with dense connections, i.e., each convolutional filter receives input from all channels of its previous layer, as in Figure 1(a). However, in order to reduce the computational cost and model size, these connections can be sparsified by grouping convolutional filters into subsets. Filters in a subset receive input only from the channels within their group (see Figure 1(b)). Depthwise convolution is the extreme version of group convolution where the number of groups is equal to the number of input and output channels (see Figure 1(c)). Xception [7] and MobileNet [15] were among the first networks to use depthwise convolutions. Figure 1 presents an illustration of conventional, group, and depthwise convolutional layers for an example with a small number of input and output channels.
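As a concrete illustration of these three connectivity patterns, the PyTorch sketch below instantiates a conventional, a grouped, and a depthwise 3D convolution simply by changing the `groups` argument (a minimal sketch, not the implementation used in the paper; the channel count is illustrative).

```python
import torch
import torch.nn as nn

C = 64  # illustrative number of input/output channels
x = torch.randn(1, C, 8, 56, 56)  # (batch, channels, T, H, W)

# (a) conventional convolution: every filter sees all C input channels
conv_full = nn.Conv3d(C, C, kernel_size=3, padding=1, groups=1, bias=False)
# (b) group convolution: each filter sees only the C/groups channels of its group
conv_group = nn.Conv3d(C, C, kernel_size=3, padding=1, groups=4, bias=False)
# (c) depthwise convolution: one group per channel (groups == C)
conv_depth = nn.Conv3d(C, C, kernel_size=3, padding=1, groups=C, bias=False)

for conv in (conv_full, conv_group, conv_depth):
    n_params = sum(p.numel() for p in conv.parameters())
    print(conv.groups, tuple(conv(x).shape), n_params)
# The output shape is identical in all cases, while the parameter count
# shrinks by the group factor: 110592, 27648, 1728.
```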

Figure 1: Group convolution. Convolutional filters can be partitioned into groups with each filter receiving input from channels only within its group. (a) A conventional convolution, which has only one group. (b) A group convolution with 2 groups. (c) A depthwise convolution where the number of groups matches the number of input/output filters, i.e., each group contains only one channel.

Counting FLOPs, parameters, and interactions. Dividing a conventional convolutional layer into G groups reduces its compute and parameter count by a factor of G. These reductions occur because each filter in a group receives input from only a fraction of the channels of the previous layer. In other words, channel grouping restricts feature interaction: only channels within a group can interact. If multiple group convolutional layers are stacked directly on top of each other, this feature segregation is further amplified, as each channel becomes a function of small channel subsets in all preceding layers. So, while group convolution saves compute and parameters, it also reduces feature interactions.

We propose to quantify the amount of channel interaction as the number of pairs of input channels that are connected through some output filter. If the convolutional layer has C_in input channels divided into G groups, then each filter is connected to C_in/G input channels. Therefore each filter will have C(C_in/G, 2) interacting feature pairs, where C(n, 2) denotes the number of unordered pairs among n channels. According to this definition, the example convolutions in Figure 1(a)-(c) will have C_out·C(C_in, 2), C_out·C(C_in/2, 2), and 0 channel interaction pairs, respectively.

Consider a convolutional layer with a k×k×k spatiotemporal kernel (e.g., 3×3×3), G groups of filters, C_in input channels, and C_out output channels, applied to a spatiotemporal tensor of T×H×W voxels. Its number of parameters, FLOPs (floating-point operations), and number of channel interactions can be measured as:

#parameters = C_out · (C_in / G) · k³   (1)
#FLOPs = C_out · (C_in / G) · k³ · T · H · W   (2)
#interactions = C_out · C(C_in / G, 2)   (3)

Recall that C(n, 2) = n(n − 1)/2, so the number of channel interactions grows quadratically in the number of channels per group. We note that while FLOPs and parameter count are popularly used to characterize a layer, the “amount” of channel interaction is typically overlooked. Our study will reveal the importance of this factor.
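To make Equations (1)-(3) concrete, the helper below computes the three quantities for a single grouped 3D convolutional layer (a minimal Python sketch; the example layer sizes are arbitrary illustrative choices).

```python
from math import comb

def layer_stats(c_in, c_out, k, groups, t, h, w):
    """Parameters, FLOPs, and channel-interaction pairs of a k x k x k grouped
    3D convolution applied to a T x H x W tensor (Eqs. 1-3)."""
    params = c_out * (c_in // groups) * k ** 3              # Eq. (1)
    flops = params * t * h * w                              # Eq. (2)
    interactions = c_out * comb(c_in // groups, 2)          # Eq. (3)
    return params, flops, interactions

# Example: a 3x3x3 layer with 64 -> 64 channels on an 8x56x56 tensor,
# as a conventional (G=1), grouped (G=4), and depthwise (G=64) convolution.
for g in (1, 4, 64):
    print(g, layer_stats(64, 64, k=3, groups=g, t=8, h=56, w=56))
```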

3.2 Channel Separation

We define channel-separated convolutional networks (CSNs) as 3D CNNs in which all convolutional layers (except for conv1) are either 1×1×1 conventional convolutions or k×k×k depthwise convolutions (where, typically, k = 3). Conventional convolutional networks model channel interactions and local interactions (i.e., spatial or spatiotemporal) jointly in their 3D convolutions. Instead, channel-separated networks decompose these two types of interactions into two distinct layers: 1×1×1 conventional convolutions for channel interaction (but no local interaction) and k×k×k depthwise convolutions for local spatiotemporal interaction (but no channel interaction). Channel separation may be applied to any traditional convolution by decomposing it into a 1×1×1 convolution and a depthwise convolution.
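A direct sketch of this decomposition in PyTorch is shown below: a single 3×3×3 convolution is replaced by a 1×1×1 convolution (all channel interaction, no local interaction) followed by a 3×3×3 depthwise convolution (all local interaction, no channel interaction). The module and argument names are our own illustrative choices, and batch norm/ReLU are omitted for brevity.

```python
import torch.nn as nn

class ChannelSeparated3x3x3(nn.Module):
    """A 1x1x1 conventional conv (channel mixing only) followed by a 3x3x3
    depthwise conv (spatiotemporal filtering only), replacing one standard
    3x3x3 convolution. Batch norm and ReLU are omitted for brevity."""
    def __init__(self, in_channels: int, out_channels: int, stride=1):
        super().__init__()
        self.pointwise = nn.Conv3d(in_channels, out_channels,
                                   kernel_size=1, bias=False)
        self.depthwise = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                                   stride=stride, padding=1,
                                   groups=out_channels, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
```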

We introduce the term “channel-separated” to highlight the importance of channel interaction; we also point out that the existing term “depthwise-separable” is only a good description when applied to tensors with two spatial dimensions and one channel dimension. We note that channel-separated networks have been proposed in Xception [7] and MobileNet [15] for image classification. In video classification, separated convolutions have been used in P3D [22], R(2+1)D [29], and S3D [36], but to decompose 3D convolutions into separate temporal and spatial convolutions. The network architectures presented in this work are instead designed to separate channel interactions from spatiotemporal interactions.

3.3 Example: Channel-Separated Bottleneck Block

Figure 2 presents two ways of factorizing a 3D bottleneck block via channel separation. Figure 2(a) shows a standard 3D bottleneck block, while Figures 2(b) and 2(c) show interaction-preserved and interaction-reduced channel-separated bottleneck blocks, respectively.

The interaction-preserved channel-separated bottleneck block is obtained from the standard bottleneck block (Figure 2(a)) by replacing its 3×3×3 convolution with a 1×1×1 traditional convolution followed by a 3×3×3 depthwise convolution (shown in Figure 2(b)). This block significantly reduces the parameters and FLOPs of the traditional 3×3×3 convolution, but preserves all channel interactions via the newly-added 1×1×1 convolution. We call this an interaction-preserved channel-separated bottleneck block and the resulting architecture an interaction-preserved channel-separated network (ip-CSN).

The interaction-reduced channel-separated bottleneck block is derived from the interaction-preserved block by removing the extra 1×1×1 convolution. This yields the depthwise bottleneck block shown in Figure 2(c). Note that the initial and final 1×1×1 convolutions (usually interpreted, respectively, as projecting into a lower-dimensional subspace and then projecting back to the original dimensionality) are now the only mechanism left for channel interactions. This implies that the complete block shown in (c) has a reduced number of channel interactions compared to that shown in (a) or (b). We call this design an interaction-reduced channel-separated bottleneck block and the resulting architecture an interaction-reduced channel-separated network (ir-CSN).
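The sketch below contrasts the three bottleneck variants of Figure 2, again as hypothetical PyTorch code rather than the paper's Caffe2 implementation; batch norm, ReLU, and the skip connection are omitted, matching the simplification in the figure, and the 4× expansion ratio is the standard ResNet choice.

```python
import torch.nn as nn

def csn_bottleneck(planes: int, variant: str) -> nn.Sequential:
    """Core of a ResNet bottleneck block (4*planes -> planes -> 4*planes)."""
    if variant == "standard":   # Figure 2(a): full 3x3x3 convolution
        mid = [nn.Conv3d(planes, planes, 3, padding=1, bias=False)]
    elif variant == "ip":       # Figure 2(b): added 1x1x1 keeps channel interactions
        mid = [nn.Conv3d(planes, planes, 1, bias=False),
               nn.Conv3d(planes, planes, 3, padding=1, groups=planes, bias=False)]
    elif variant == "ir":       # Figure 2(c): depthwise only; channel mixing is
        mid = [nn.Conv3d(planes, planes, 3, padding=1,   # left to the outer 1x1x1s
                         groups=planes, bias=False)]
    else:
        raise ValueError(variant)
    return nn.Sequential(
        nn.Conv3d(4 * planes, planes, 1, bias=False),  # 1x1x1 reduce
        *mid,
        nn.Conv3d(planes, 4 * planes, 1, bias=False),  # 1x1x1 expand
    )
```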

Figure 2: Standard vs. channel-separated convolutional blocks. (a) A standard ResNet bottleneck block. (b) An interaction-preserved bottleneck block: the 3×3×3 convolution in (a) is replaced by a 1×1×1 standard convolution and a 3×3×3 depthwise convolution (shown in the dashed box). (c) An interaction-reduced bottleneck block: the 3×3×3 convolution in (a) is replaced with a 3×3×3 depthwise convolution (shown in the dashed box). We note that channel interaction is preserved in (b) by the added 1×1×1 convolution, while (c) loses all channel interaction in its 3×3×3 convolution after factorization. Batch norm and ReLU are used after each convolution layer. For simplicity, we omit the skip connections, batch norm, and ReLU in the figure.

3.4 Channel Interactions in Convolutional Blocks

The interaction-preserving and interaction-reducing blocks in section 3.3 are just two architectures in a large spectrum. In this subsection we present a number of convolutional block designs, obtained by progressively increasing the amount of grouping. The blocks differ in terms of compute cost, parameter count and, more importantly, channel interactions.

Group convolution applied to ResNet blocks. Figure 3(a) presents a ResNet [14] simple block consisting of two 3×3×3 convolutional layers. Figure 3(b) shows the simple-G block, in which the 3×3×3 layers use grouped convolution. Likewise, Figure 3(c) presents simple-D, with two depthwise layers. Because depthwise convolution requires the same number of input and output channels, we optionally add a 1×1×1 convolutional layer (shown in the dashed rectangle) in blocks that change the number of channels.

Figure 3: ResNet simple block transformed by group convolution. (a) Simple block: a standard ResNet simple block with two 3×3×3 convolutional layers. (b) Simple-G block: a ResNet simple block with two 3×3×3 group convolutional layers. (c) Simple-D block: a ResNet simple block with two 3×3×3 depthwise convolutional layers and an optional 1×1×1 convolutional layer (shown in the dashed box), added when an increase in the number of filters is needed. Batch norm and ReLU are used after each convolution layer. For simplicity, we omit the skip connections, batch norm, and ReLU.

Figure 4(a) presents a ResNet bottleneck block consisting of two 1×1×1 and one 3×3×3 convolutional layers. Figures 4(b-c) present bottleneck-G and bottleneck-D, in which the 3×3×3 convolution is grouped and depthwise, respectively. If we further apply group convolution to the two 1×1×1 convolutional layers, the block becomes a bottleneck-DG, as illustrated in Figure 4(d). In all cases, the 3×3×3 convolutional layers always have the same number of input and output channels.

There are some deliberate analogies to existing architectures here. First, bottleneck-G (Figure 4(b)) is exactly a ResNeXt block [35], and bottleneck-D is its depthwise variant. Bottleneck-DG (Figure 4(d)) resembles the ShuffleNet block [38], without the channel shuffle and without the downsampling projection by average pooling and concatenation. The progression from simple to simple-D is similar to moving from ResNet to Xception (though Xception has many more 1×1 convolutions). We omit certain architecture-specific features in order to better understand the role of grouping and channel interactions. A calculation of channel interactions for these bottleneck variants is sketched below.
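To see why grouping the 1×1×1 layers hurts so much more than grouping the 3×3×3 layer, the short calculation below counts channel-interaction pairs (Eq. 3) for one bottleneck block under the four designs of Figure 4. The block width (256 input/output channels, 64 bottleneck channels) and the group count of 4 are illustrative choices, not the exact configurations used in the experiments.

```python
from math import comb

def interactions(c_in, c_out, groups):
    # channel-interaction pairs of a single layer, following Eq. (3)
    return c_out * comb(c_in // groups, 2)

def block_interactions(g_1x1, g_3x3, c=256, width=64):
    return (interactions(c, width, g_1x1)         # first 1x1x1 (reduce)
            + interactions(width, width, g_3x3)   # 3x3x3 (possibly grouped/depthwise)
            + interactions(width, c, g_1x1))      # last 1x1x1 (expand)

print(block_interactions(1, 1))    # bottleneck:    2,734,080
print(block_interactions(1, 4))    # bottleneck-G:  2,612,736
print(block_interactions(1, 64))   # bottleneck-D:  2,605,056  (barely lower)
print(block_interactions(4, 64))   # bottleneck-DG:   159,744  (collapses)
```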

Figure 4: ResNet bottleneck block transformed by group convolution. (a) A standard ResNet bottleneck block. (b) Bottleneck-G: a ResNet bottleneck block with a 3×3×3 group convolutional layer. (c) Bottleneck-D: a bottleneck block with a 3×3×3 depthwise convolution (previously introduced as ir-CSN; the name bottleneck-D is used here for simplicity and for analogy with the other blocks). (d) Bottleneck-DG: a ResNet bottleneck block with a 3×3×3 depthwise convolution and two 1×1×1 group convolutions. We note that from (a) to (d), we gradually apply group convolution to the 3×3×3 convolutional layer and then to the two 1×1×1 convolutional layers. Batch norm and ReLU are used after each convolution layer. For simplicity, in the illustration we omit the skip connections, batch norm, and ReLU.

4 Ablation Experiments

This empirical study will allow us to shed light on the factors that matter most in the performance of channel-separated networks and will lead us to two main findings:

  1. We will empirically demonstrate that, within the family of architectures we consider, similar depth and similar channel interaction count imply similar performance. In particular, the interaction-preserving blocks reduce compute by significant margins while preserving channel interactions, with only a slight loss in accuracy for shallow networks and an increase in accuracy for deeper networks.

  2. In traditional 3×3×3 convolutions, all feature maps interact with each other. Particularly for deeper networks, this causes overfitting.

4.1 Experimental setup

Dataset. We use Kinetics-400 [17] for all ablation experiments in this section. Kinetics is a standard benchmark for action recognition in videos. It contains about 260K videos of 400 different human action categories. We use the train split (240K videos) for training and the validation split (20K videos) for evaluating the different models.

Base architecture. We use ResNet3D, presented in Table 1, as our base architecture for most of the ablation experiments in this section. More specifically, the model takes clips of size 3×L×224×224, where L is the number of frames and 224×224 is the spatial size of the cropped frame. Two spatial downsampling layers (with stride 1×2×2) are applied at conv1 and pool1, and three spatiotemporal downsampling layers (with stride 2×2×2) are applied at conv3_1, conv4_1, and conv5_1 via convolutional striding. A global spatiotemporal average pooling with kernel size (L/8)×7×7 is applied to the final convolutional tensor, followed by a fully-connected (fc) layer performing the final classification. We note that in Table 1, the per-stage filter counts are hyper-parameters that define the network width, while the block counts b1-b4 control the network depth.

layer name | output size | ResNet3D-simple | ResNet3D-bottleneck
conv1 | L×112×112 | 3×7×7, 64, stride 1×2×2 (shared by both)
pool1 | L×56×56 | 1×3×3 max pool, stride 1×2×2 (shared by both)
conv2_x | L×56×56 | [simple block] × b1 | [bottleneck block] × b1
conv3_x | L/2×28×28 | [simple block] × b2 | [bottleneck block] × b2
conv4_x | L/4×14×14 | [simple block] × b3 | [bottleneck block] × b3
conv5_x | L/8×7×7 | [simple block] × b4 | [bottleneck block] × b4
pool5 | 1×1×1 | spatiotemporal avg pool, fc layer with softmax
Table 1: ResNet3D architectures considered in our experiments. Convolutional residual blocks are shown in brackets, next to the number of times each block is repeated in the stack. The dimensions given for filters and outputs are time, height, and width, in that order. b1, b2, b3, b4 denote the number of blocks at conv2_x, conv3_x, conv4_x, and conv5_x, respectively, and the per-stage filter counts are hyper-parameters defining the width of the network (standard ResNet widths by default). The series of convolutions culminates with a global spatiotemporal pooling layer that yields a fixed-dimensional feature vector. This vector is fed to a fully-connected layer that outputs the class probabilities through a softmax.
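The stem of this architecture (conv1 and pool1) and its downsampling schedule can be sketched as follows; the padding values are assumed standard ResNet choices and are not specified in the table.

```python
import torch
import torch.nn as nn

stem = nn.Sequential(
    # conv1: 3x7x7 kernel, 64 filters, spatial stride 1x2x2
    nn.Conv3d(3, 64, kernel_size=(3, 7, 7), stride=(1, 2, 2),
              padding=(1, 3, 3), bias=False),
    nn.BatchNorm3d(64),
    nn.ReLU(inplace=True),
    # pool1: 1x3x3 max pooling, spatial stride 1x2x2
    nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
)

clip = torch.randn(1, 3, 8, 224, 224)  # an 8-frame 224x224 clip
print(tuple(stem(clip).shape))         # (1, 64, 8, 56, 56)
```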

Data augmentation. We use both spatial and temporal jittering for augmentation. More specifically, video frames are scaled such that the shorter edge of the frames becomes s, while maintaining the original frame aspect ratio; during training, s is randomly and uniformly picked within a fixed range. Each clip is then generated by randomly cropping windows of size 224×224. Temporal jittering is also applied during training by randomly selecting a starting frame and decoding L frames. For the ablation experiments in this section we train and evaluate models with clips of 8 frames (L = 8) sampled by skipping every other frame (all videos are pre-processed to 30fps, so the newly-formed clips are effectively at 15fps).
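A sketch of this clip sampling procedure is given below. It is illustrative only: the shorter-edge scale range (smin, smax) is a placeholder for the range used in the paper (not specified here), and the bilinear interpolation choice is ours.

```python
import random
import torch
import torch.nn.functional as F

def sample_training_clip(video, length=8, stride=2, crop=224,
                         smin=256, smax=320):  # placeholder scale range
    """video: float tensor of shape (T, C, H, W); assumes T >= length * stride."""
    # temporal jittering: random start, take `length` frames at the given stride
    start = random.randint(0, video.shape[0] - length * stride)
    clip = video[start:start + length * stride:stride]
    # spatial jittering: scale the shorter edge to a random size s, keep aspect ratio
    _, _, h, w = clip.shape
    s = random.randint(smin, smax)
    clip = F.interpolate(clip, scale_factor=s / min(h, w),
                         mode="bilinear", align_corners=False)
    # random 224x224 crop
    _, _, h, w = clip.shape
    i, j = random.randint(0, h - crop), random.randint(0, w - crop)
    clip = clip[:, :, i:i + crop, j:j + crop]
    return clip.permute(1, 0, 2, 3)  # -> (C, T, H, W), as expected by Conv3d
```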

Training. We train our models with synchronous distributed SGD on GPU clusters using caffe2 [2], with 16 machines of multiple GPUs each. Following [29], we set the epoch size to 1M clips: because of the temporal jittering augmentation, this is meaningful even though the number of training videos is only about 240K. We use the half-period cosine learning rate schedule presented in [21], in which the learning rate at the i-th iteration is set to η · ½ (cos(iπ/I) + 1), where I is the maximum number of training iterations and η is the initial learning rate. We use model warm-up [12] in the first few epochs of training, and the remaining epochs follow the cosine learning rate schedule.
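The half-period cosine schedule with warm-up can be written as a small function of the iteration index, e.g. as in the sketch below; the warm-up length and base learning rate shown are illustrative, not the exact values used for training.

```python
import math

def learning_rate(i, max_iters, base_lr=0.01, warmup_iters=0):
    """Half-period cosine schedule [21] with an optional linear warm-up [12]."""
    if i < warmup_iters:
        return base_lr * (i + 1) / warmup_iters          # linear warm-up
    progress = (i - warmup_iters) / max(max_iters - warmup_iters, 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Example: a few points of a 100k-iteration schedule with 10k warm-up iterations.
for step in (0, 25_000, 50_000, 100_000):
    print(step, learning_rate(step, 100_000, warmup_iters=10_000))
```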

Testing. We report clip top-1 accuracy and video top-1 accuracy. For video top-1, we use center crops of 10 clips uniformly sampled from the video and average these clip predictions to obtain the final video prediction.
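Video-level inference then reduces to averaging clip-level predictions, e.g. (a minimal sketch, where `model` stands for any clip-level classifier):

```python
import torch

@torch.no_grad()
def predict_video(model, clips):
    """clips: tensor of shape (num_clips, C, T, H, W), uniformly sampled from
    one video; returns the video-level class prediction."""
    model.eval()
    logits = model(clips)                     # (num_clips, num_classes)
    probs = torch.softmax(logits, dim=1)
    return probs.mean(dim=0).argmax().item()  # average the clip predictions
```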

4.2 Reducing FLOPs, preserving interactions

In this ablation, we use CSNs to vary both FLOPs and channel interactions. Within this architectural family, channel interactions are a good predictor of performance, whereas FLOPs are not. In particular, FLOPs can be reduced significantly while preserving interaction count.

Table 2 presents the results of our interaction-reduced CSNs (ir-CSNs) and interaction-preserved CSNs (ip-CSNs) and compares them with the ResNet3D baseline at different depths. In the shallow setting (26 layers), both the ir-CSN and the ip-CSN have lower accuracy than ResNet3D. The ir-CSN provides a computational savings of 7× but causes a 2.9% drop in accuracy. The ip-CSN yields a savings of 6× in FLOPs with a much smaller drop in accuracy (0.7%). We note that all of the shallow models have a very low count of channel interactions: ResNet3D and ip-CSN have about 0.42 giga-pairs, while ir-CSN has only 0.27 giga-pairs (about 64% of the original). This observation suggests that shallow instances of ResNet3D benefit from their extra parameters, but preserving channel interactions decreases the gap for ip-CSN.

In the deeper settings, both ir-CSNs and ip-CSNs actually outperform ResNet3D by about 1%. Furthermore, the gap between ir-CSN and ip-CSN becomes smaller. We attribute this shrinking of the gap to the fact that, in the 50-layer and 101-layer configurations, ir-CSN has nearly the same number of channel interactions as ip-CSN, since most interactions stem from the 1×1×1 layers. One may wonder if ip-CSNs outperform ResNet3D and ir-CSNs because of having more nonlinearities (ReLUs). To answer this question, we trained ip-CSNs without ReLUs between the 1×1×1 and the 3×3×3 layers and observed no notable difference in performance. This suggests that traditional 3×3×3 convolutions contain many parameters which can be removed without an accuracy penalty in the deeper models. We investigate this next.

model | depth | video@1 (%) | FLOPs (×10^9) | params (×10^6) | interactions (×10^9 pairs)
ResNet3D | 26 | 65.3 | 14.3 | 20.4 | 0.42
ir-CSN | 26 | 62.4 | 4.0 | 1.7 | 0.27
ip-CSN | 26 | 64.6 | 5.0 | 2.4 | 0.42
ResNet3D | 50 | 69.4 | 29.5 | 46.9 | 5.68
ir-CSN | 50 | 70.3 | 10.6 | 13.1 | 5.42
ip-CSN | 50 | 70.8 | 11.9 | 14.3 | 5.68
ResNet3D | 101 | 70.6 | 44.7 | 85.9 | 8.67
ir-CSN | 101 | 71.3 | 14.1 | 22.1 | 8.27
ip-CSN | 101 | 71.8 | 15.9 | 24.5 | 8.67
Table 2: Channel-Separated Networks vs. the ResNet3D baseline. In the 26-layer configuration, the accuracy of ir-CSN is lower than that of the ResNet3D baseline, but ip-CSN, which preserves channel interactions, is nearly on par with the baseline (the drop is only 0.7%). In the 50- and 101-layer configurations, both ir-CSN and ip-CSN outperform the ResNet3D baseline while reducing parameters and FLOPs. ip-CSN consistently outperforms ir-CSN. All models are trained and evaluated on clips of 8 frames.

4.3 What makes CSNs outperform ResNet3D?

In section 4.2 we found that both ir-CSNs and ip-CSNs consistently outperform the ResNet3D baseline when there are enough channel interactions, while having fewer parameters and greatly reduced FLOPs. It is natural to ask: what helps CSNs in these scenarios? Figure 5 helps us answer this question. The plot shows the evolution of the training and validation errors of ip-CSN and ResNet3D in both the 50-layer and the 101-layer configurations. Compared to ResNet3D, ip-CSN has higher training error but lower testing error. This suggests that the channel-separated convolutions of CSN regularize the model and prevent overfitting.

Figure 5: Training and validation errors as a function of training iterations for CSN and ResNet3D on Kinetics. (a) Networks of 50 layers. (b) Networks of 101 layers. CSNs have higher training error but lower testing error. This suggests that the channel-separated convolutions provide a beneficial regularization, combating overfitting. The accuracies of these models on the Kinetics validation set are reported in Table 2.

4.4 The effects of different blocks in group convolutional networks

In this section we start from our base architecture (shown in Table 1) then ablatively replace the convolutional blocks with the blocks presented in section 3.4. We again find that channel interaction plays an important role in understanding the results.

Naming convention. Since the ablation in this section will be considering several different convolutional blocks, to simplify the presentation, we name each architecture by block type (as presented in section 3.4) and total number of blocks, as shown in the last column of Table 3.

Model | block | config | name
ResNet3D-18 | simple | [2, 2, 2, 2] | simple-8
ResNet3D-26 | bottleneck | [2, 2, 2, 2] | bottleneck-8
ResNet3D-34 | simple | [3, 4, 6, 3] | simple-16
ResNet3D-50 | bottleneck | [3, 4, 6, 3] | bottleneck-16
Table 3: Naming convention. We name architectures by block name followed by the total number of blocks (see last column). Only two block names are given in this table; more block types are presented in section 3.4.
Figure 6: ResNet3D accuracy/computation tradeoff obtained by transforming group convolutional blocks. Video top-1 accuracy on the Kinetics validation set is plotted against computational cost (# FLOPs) for a ResNet3D with different convolutional block designs. (a) Group convolution transformations applied to simple and bottleneck blocks in shallow architectures with 8 blocks. (b) Group convolution transformations applied to simple and bottleneck blocks in deep architectures with 16 blocks. The bottleneck-D block (marked with green stars) gives the best computation/accuracy tradeoff among the tested block designs. Base architectures are marked with black hexagrams. Best viewed in color.

Figure 6 presents the results of our convolutional block ablation study. It plots the video top-1 accuracy on the Kinetics validation set against model computational cost (# FLOPs). We note that, in this experiment, we use our base architecture with two different numbers of blocks (8 and 16) and vary only the type of convolutional block and the number of groups to study the tradeoffs. Figure 6(a) presents our ablation experiment with simple-X-8 and bottleneck-X-8 architectures (where X can be empty, G, or D, or even DG in the case of the bottleneck block). Similarly, Figure 6(b) presents our ablation experiment with simple-X-16 and bottleneck-X-16 architectures. We can observe the computation/accuracy effects of the group convolution transformation on our base architectures. Reading each curve from right to left (i.e., in decreasing FLOPs), we see simple-X transforming from the simple block to simple-G (with an increasing number of groups) and then to the simple-D block. For bottleneck-X, reading right to left shows the bottleneck block, which then transforms to bottleneck-G (with increasing groups), bottleneck-D, and finally bottleneck-DG (again with increasing groups).

While the general downward trend is expected as we decrease parameters and FLOPs, the shape of the simple and bottleneck curves is quite different. The simple-X models degrade smoothly, whereas bottleneck-X stays relatively flat (particularly bottleneck-16, which actually increases slightly as we decrease FLOPs) before dropping sharply.

To better understand the different behaviors of the simple-X-Y and bottleneck-X-Y curves (blue vs. red curves) in Figure 6, and the main reason behind the turning points of the bottleneck-D block (green star markers in Figure 6), we plot all of these models in another view: accuracy as a function of channel interactions (Figure 7).

As shown in Figure 7, the number of channel interactions in simple-X-Y models (blue squares and red diamonds) drops quadratically when group convolution is applied to their 3×3×3 layers. In contrast, the number of channel interactions in bottleneck-X-Y models (green circles and purple triangles) drops only marginally when group convolution is applied to their 3×3×3 layers, since they still have many 1×1×1 layers (this can be seen in the presence of two marker clusters circled in red: one cluster includes purple triangles near the top-right corner and the other includes green circles near the center of the figure). The channel interaction count in bottleneck-X-Y models starts to drop significantly only when group convolution is applied to their 1×1×1 layers, which causes a sharp drop in accuracy. This explains why there is no turning point in the simple-X-Y curves and why there are turning points in the bottleneck-X-Y curves. It also confirms the important role of channel interactions in group convolutional networks.

Figure 7: Accuracy vs. channel interactions. Plotting the Kinetics validation accuracy of different models with respect to their total number of channel interactions. Channel interactions are presented on a log scale for better viewing. Best viewed in color.

The bottleneck-D block (also known as ir-CSN) provides the best computation/accuracy tradeoff. For simple blocks, increasing the number of groups causes a continuous drop in accuracy. However, in the case of the bottleneck block (i.e., bottleneck-X-Y), the accuracy curve remains almost flat as we increase the number of groups until arriving at the bottleneck-D block, at which point the accuracy degrades dramatically when the block is turned into a bottleneck-DG (group convolution applied to the 1×1×1 layers). We conclude that the bottleneck-D block (or ir-CSN) gives the best computation/accuracy tradeoff in this family of ResNet-style blocks, due to its high channel-interaction count.

5 Comparison with the State-of-the-Art

In this section, we evaluate our proposed architectures, i.e., ir-CSNs and ip-CSNs, and compare them with state-of-the-art methods.

Datasets. We evaluate our CSNs on two public benchmarks: Sports-1M [16] and Kinetics [17] (version 1 with 400 action categories). Sports-1M is a large-scale action recognition dataset consisting of about 1.1 million videos from 487 classes of fine-grained sports. Kinetics is a medium-size dataset which includes about 300K videos of 400 different human action categories. For Sports-1M, we use the public train and test splits provided with the dataset. For Kinetics, we use the train split for training and the validation set for testing.

Training and testing. Differently from the ablation experiments in the previous section, here we train our CSNs on 32-frame clip inputs (L = 32) with a sampling rate of 2 (skipping every other frame), following the practice described in [29]. All the other training settings, such as data augmentation and optimization parameters, are the same as those described in the previous section. For testing, we uniformly sample 10 clips from each testing video. Each clip is scaled such that its shorter edge becomes 256, then cropped to 256×256 (i.e., each input clip has a size of 3×32×256×256). Each crop is passed through the network, which is evaluated as a fully-convolutional network (FCN). Since the network was trained with a fully-connected layer, during FCN inference this FC layer is transformed into an equivalent 1×1×1 convolutional layer with weights copied from the FC layer.
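The FC-to-convolution conversion used for fully-convolutional inference can be sketched as follows (assuming the classifier is a single linear layer applied to globally pooled features; function and variable names are ours):

```python
import torch.nn as nn

def fc_to_1x1x1_conv(fc: nn.Linear) -> nn.Conv3d:
    """Copy the weights of a trained fully-connected classifier into an
    equivalent 1x1x1 convolution, so the network can be run fully
    convolutionally on larger inputs."""
    conv = nn.Conv3d(fc.in_features, fc.out_features, kernel_size=1,
                     bias=fc.bias is not None)
    conv.weight.data.copy_(
        fc.weight.data.view(fc.out_features, fc.in_features, 1, 1, 1))
    if fc.bias is not None:
        conv.bias.data.copy_(fc.bias.data)
    return conv
```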

Results on Sports-1M. Table 4 reports the results of our ir-CSNs and compares them with current state-of-the-art methods on Sports-1M. Our ir-CSN-152 outperforms C3D [28] by 14.4%, P3D [22] by 9.1%, Conv Pooling [37] by 3.8%, and R(2+1)D [29] by 2.2% on video top-1 accuracy, while being 2-4x faster than R(2+1)D. Our ir-CSN-101, even with a smaller number of FLOPs, still outperforms all previous work by good margins.

Method | input | video@1 | video@5 | GFLOPs×crops
C3D [28] | RGB | 61.1 | 85.2 | NA
P3D [22] | RGB | 66.4 | 87.4 | NA
Conv pool [37] | RGB+OF | 71.7 | 90.4 | NA
R(2+1)D [29] | RGB | 73.0 | 91.5 | 152×dense
R(2+1)D [29] | RGB+OF | 73.3 | 91.9 | 305×dense
ir-CSN-101 | RGB | 74.9 | 91.6 | 56.5×10
ir-CSN-152 | RGB | 75.5 | 92.7 | 74.0×10
Table 4: Comparisons with state-of-the-art architectures on Sports-1M. Our ir-CSN, with 101 or 152 layers, outperforms all previous methods by large margins while being 2-4x faster. NA denotes not available.

Results on Kinetics. We train our proposed CSN models on Kinetics and compare them with current state-of-the-art methods. Besides training from scratch, we also fine-tune our CSNs with weights initialized from models pre-trained on Sports1M. For a fair comparison, we compare our CSNs with methods that use only RGB as input. Table 5 presents the results. Our CSNs, even when trained from scratch, already outperform all of the previously published work except for non-local networks [34] and SlowFast networks [9]. Our ir-CSN-152, pre-trained on Sports1M, significantly outperforms I3D [3], R(2+1)D [29], and S3D-G [36] by 7.4%, 4.2%, and 3.8%, respectively. It also outperforms recent work: A²-Net [4] by 3.9% and global-reasoning networks (GloRe) [6] by 2.4%. Finally, our ir-CSN-152 slightly outperforms non-local networks [34] by 0.8% and SlowFast networks [9] by 0.6%, while being 11x and 3.5x faster than non-local and SlowFast networks, respectively. Our ir-CSN-152 is still lower than SlowFast when the latter is augmented with non-local networks. We note that our CSNs use only 10 crops per testing video while other methods use dense sampling [3, 36, 29], i.e., they sample all possible overlapping clips, which normally requires running inference on a few hundred clips per testing video.

Method | pretrain | video@1 | video@5 | GFLOPs×crops
ResNeXt [13] | none | 65.1 | 85.7 | NA
ARTNet(d) [31] | none | 69.2 | 88.3 | 24×250
I3D [3] | ImageNet | 71.1 | 89.3 | 108×dense
TSM [20] | ImageNet | 72.5 | 90.7 | 65×NA
MFNet [5] | ImageNet | 72.8 | 90.4 | 11×NA
Inception-ResNet [1] | ImageNet | 73.0 | 90.9 | NA
R(2+1)D [29] | Sports1M | 74.3 | 91.4 | 152×dense
A²-Net [4] | ImageNet | 74.6 | 91.5 | 41×NA
S3D-G [36] | ImageNet | 74.7 | 93.4 | 71×dense
D3D [26] | ImageNet | 75.9 | NA | NA
GloRe [6] | ImageNet | 76.1 | NA | 55×NA
NL I3D [34] | ImageNet | 77.7 | 93.3 | 359×30
SlowFast [9] | none | 77.9 | 93.2 | 106×30
SlowFast+NL [9] | none | 79.0 | 93.6 | 115×30
ir-CSN-101 | none | 75.7 | 92.0 | 73.8×10
ip-CSN-101 | none | 76.3 | 92.0 | 83.0×10
ir-CSN-152 | none | 76.3 | 92.1 | 96.7×10
ip-CSN-152 | none | 76.5 | 92.4 | 108.8×10
ir-CSN-101 | Sports1M | 77.5 | 93.1 | 73.8×10
ir-CSN-152 | Sports1M | 78.5 | 93.4 | 96.7×10
Table 5: Comparisons with state-of-the-art architectures on Kinetics. Results on the Kinetics validation set when models are trained using only RGB input. Our ir-CSN-152 outperforms all previous methods except SlowFast augmented with non-local networks, while being multiple times faster. NA denotes not available.

6 Conclusion

We have presented Channel-Separated Convolutional Networks (CSNs) as a way of factorizing 3D convolutions. The proposed CSN-based factorization not only significantly reduces the computational cost, but also improves accuracy when there are enough channel interactions in the network. Our proposed architecture, ir-CSN, significantly outperforms existing methods and obtains state-of-the-art accuracy on two major benchmarks: Sports1M and Kinetics. The model is also multiple times faster than current competing networks.

Acknowledgement. The authors would like to thank Kaiming He for insightful discussions about the architectures and Haoqi Fan for helping improve our training infrastructure.

References