CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Network

by Qihang Yu et al.

3D Convolutional Neural Networks (CNNs) have been widely applied to 3D scene understanding tasks, such as video analysis and volumetric image recognition. However, 3D networks can easily lead to over-parameterization, which incurs expensive computation costs. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES) to enable efficient 3D learning by shrinking standard 3D convolutions into a set of economical operations (e.g., 1D and 2D convolutions). Unlike previous methods, the proposed CAKES performs channel-wise kernel shrinkage, which enjoys the following benefits: 1) it encourages the operations deployed in each layer to be heterogeneous, so that they can extract diverse and complementary information that benefits the learning process; and 2) it allows for an efficient and flexible replacement design that generalizes to both spatial-temporal and volumetric data. Together with a neural architecture search framework, by applying CAKES to 3D C2FNAS and ResNet50 we achieve state-of-the-art performance with much fewer parameters and computational costs on both 3D medical image segmentation and video action recognition.





Code Repositories

This repository contains the code for our AAAI 2021 paper CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Networks.

1 Introduction

3D learning has attracted more and more research attention with the recent advances of deep neural networks. However, conventional 3D convolution layers are typically computationally expensive and suffer from convergence problems due to over-fitting and the lack of pre-trained weights [4, 37].

To resolve the redundancy in 3D convolutions, many efforts have been devoted to designing efficient alternatives. For instance, [31, 40] propose to factorize the 3D kernel and replace the 3D convolution with P3D and (2+1)D convolutions, where 2D and 1D convolution layers are applied in a structured manner. Xie et al. [45] suggest that replacing 3D convolutions with low-cost 2D convolutions at the bottom of the network significantly improves recognition efficiency.

Despite their effectiveness for spatial-temporal information extraction, there are several limitations of existing 3D alternatives. Firstly, these methods (e.g., P3D) are specifically tailored to video datasets, where data can be explicitly separated into time and space. However, for volumetric data such as CT/MRI where all three dimensions should be treated equally, conventional spatial-temporal operators can lead to biased information extraction. Secondly, existing operations are still insufficient even for spatial-temporal data since they may exhibit certain levels of redundancy either along the temporal or the spatial dimension, as empirically suggested in [45]. Finally, existing replacements are manually designed, which can be time-consuming and may not lead to optimal results.

To address these issues, we introduce Channel-wise Automatic KErnel Shrinking (CAKES), a general framework that automatically determines an efficient replacement for existing 3D operations. Specifically, the proposed method simplifies conventional 3D operations by adopting a combination of diverse and economical operations (e.g., 1D and 2D convolutions), where the different operators extract complementary information within the same layer. Unlike previous methods, by shrinking standard 3D kernels in a channel-wise fashion, our approach is not tailored to any specific type of input (e.g., videos), but generalizes to different types of data and backbone architectures to learn a fine-grained and efficient replacement. Moreover, we provide a new perspective: formulating kernel shrinking as a path-level selection problem, which can then be solved by Neural Architecture Search (NAS). To accelerate the search for an optimal replacement configuration within the tremendous search space, we relax the selection of operations to be differentiable, so that the replacement can be determined in a one-shot manner during end-to-end training.

Figure 1: Performance vs. parameter count (left) and computation cost (right) of different operations on 3D medical image segmentation. Details are in Table 1.

The proposed search algorithm delivers high-performance and efficient models. As shown in Fig. 1, evaluated on both 3D medical image segmentation and video action recognition, our method achieves a 64.15% average Dice score on the MSD pancreas dataset with 9.72M parameters and 87.16G FLOPs, and 47.4% top-1 accuracy on the Something-Something V1 dataset with 37.5M parameters and 50.7G FLOPs. Compared with its 3D baseline, CAKES not only shows superior performance but also significantly reduces the model size (56.80% less on medical and 19.35% less on video) and the computational cost (53.76% less on medical and 19.01% less on video). The proposed method surpasses its 2D/3D/P3D counterparts and achieves state-of-the-art performance.

Our contributions can be summarized as follows:

(1) We propose a novel method to shrink 3D kernels into heterogeneous yet complementary efficient counterparts at a fine-grained level, which leads to more efficient and flexible alternatives to 3D convolution without sacrificing accuracy.

(2) We propose channel-wise kernel shrinkage, which yields a generic and flexible replacement that can be applied to both spatial-temporal and volumetric data with different backbone architectures.

(3) By formulating kernel shrinkage as a path-level selection problem and relaxing the selection of operations to be differentiable, the replacement can be determined in a one-shot manner using NAS as the search platform.

(4) By applying CAKES to different backbone models, we achieve state-of-the-art performance while being much more efficient on both volumetric medical data and video data compared with their 2D/3D/P3D counterparts.

2 Related Work

2.1 Efficient 3D Convolutional Neural Networks

Despite the great advances of 3D CNNs [4, 7, 38], existing 3D networks usually require a heavy computational budget. Besides, 3D CNNs also suffer from unstable training due to the lack of pre-trained weights [4, 21, 37]. These facts have motivated researchers to find efficient alternatives to 3D convolutions. For example, [25, 39] suggest applying group convolution [17] and depth-wise convolution [6] to 3D networks to obtain resource-efficient models. Another line of work replaces each 3D convolution layer with a structured combination of 2D and 1D convolution layers to achieve better performance while being more efficient. For instance, [31, 40] propose to use a 2D spatial convolution layer followed by a 1D temporal convolution layer to replace a standard 3D convolution layer. Besides, Xie et al. [45] demonstrate that 3D convolutions are not needed everywhere and some of them can be replaced by 2D counterparts. Similar attempts also appear in the medical imaging area [22]. Gonda et al. [9] replace consecutive 3D convolution layers with consecutive 2D convolution layers followed by a 1D convolution layer.
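As a concrete illustration of the factorized replacement discussed above, the following PyTorch sketch (channel counts and kernel sizes are illustrative choices, not taken from any of the cited papers) compares a standard 3D convolution against a (2+1)D-style spatial-plus-temporal pair:

```python
import torch
import torch.nn as nn

# A standard 3D convolution: one 3x3x3 kernel per (out, in) channel pair.
conv3d = nn.Conv3d(64, 64, kernel_size=3, padding=1, bias=False)

# A (2+1)D-style replacement (in the spirit of P3D/(2+1)D): a 1x3x3 spatial
# convolution followed by a 3x1x1 temporal convolution.
conv2plus1d = nn.Sequential(
    nn.Conv3d(64, 64, kernel_size=(1, 3, 3), padding=(0, 1, 1), bias=False),
    nn.Conv3d(64, 64, kernel_size=(3, 1, 1), padding=(1, 0, 0), bias=False),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, 64, 8, 16, 16)  # (batch, channels, T/D, H, W)
assert conv3d(x).shape == conv2plus1d(x).shape  # identical output shape
# 64*64*27 = 110592 weights vs 64*64*9 + 64*64*3 = 49152 weights
assert n_params(conv2plus1d) < n_params(conv3d)
```

For these channel counts, the factorized pair uses 49,152 weights versus 110,592 for the full 3×3×3 kernel, roughly a 2.25× reduction at the same output shape.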

Our method differs from these methods in the following aspects: (1) instead of applying homogeneous operations to all channels, we allow complementary heterogeneous operations to be assigned channel-wise; and (2) instead of manual design through trial-and-error, we devise a neural architecture search method that automatically optimizes the replacement configuration.

2.2 Network Pruning

Network pruning methods investigate the redundancy in deep models by selecting the important connections (according to a certain criterion) to obtain more compact models. Based on the granularity level, network pruning methods can be divided into the following categories: 1) unstructured weight pruning, which aims at identifying and removing unimportant weights inside networks; for instance, Han et al. [13, 14] propose to prune network weights with small magnitudes. These methods typically remove weights and connections in an unstructured manner, making it hard to achieve real speedup without dedicated hardware or libraries; 2) structured pruning methods, which prune at the level of channels or even layers [16, 23, 26] and achieve empirical acceleration on modern computing devices.

In this paper, we study structured kernel shrinking: shrinking a 3D convolution kernel to its optimal sub-kernel. This direction has rarely been studied before. Our method is related to weight pruning but, unlike previous methods, performs the pruning in a structured manner, so that deploying the shrunk kernels leads to efficient models.

2.3 Neural Architecture Search

Neural Architecture Search (NAS) aims at automatically discovering better network architectures than human-designed ones. It has proven successful not only in 2D natural image recognition [55], but also in other tasks such as segmentation [19] and detection [8]. Besides the success on natural images, there are also trials on other data formats such as video [34] and 3D medical images [49, 52]. Earlier NAS algorithms are based on either reinforcement learning [1, 55, 56] or evolutionary algorithms [32, 44]. However, these methods often require training each network candidate from scratch, and the intensive computational costs hamper their usage, especially under a limited computational budget. Since the parameter-sharing scheme was proposed in [30], search methods such as differentiable NAS approaches [5, 20, 46] and one-shot NAS approaches [3, 12, 36] have investigated how to effectively reduce the search cost to several GPU days or even several GPU hours.

Moreover, [24] discuss the relationship between network pruning and NAS, and [10, 27] successfully borrow ideas from network pruning to design more efficient search methods. Some NAS methods [27, 36] also incorporate the kernel size into the search space. Nevertheless, most of them only consider simple choices among a small set of square kernel sizes, while we consider much more diverse and general kernel deployment across different channels in 3D settings.

3 Method

3.1 Revisiting Variants of 3D Convolution

We first revisit 3D convolutions and existing alternatives. Without loss of generality, let X of size $C_{in} \times D \times H \times W$ denote the input tensor, where $C_{in}$ stands for the input channel number, and $D$, $H$, $W$ represent the spatial depth (or temporal length), the spatial height, and the spatial width, respectively. The weights of the corresponding 3D kernel are denoted as $W \in \mathbb{R}^{C_{out} \times C_{in} \times K \times K \times K}$, where $C_{out}$ is the output channel number and $K$ denotes the kernel size. Therefore, the output tensor Y of shape $C_{out} \times D \times H \times W$ can be derived as follows:

$$Y_o = W_o * X, \tag{1}$$

where $*$ denotes convolution and $o$ is the output channel index, i.e., $o \in \{1, \dots, C_{out}\}$.
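The per-channel form of Eqn. (1) can be checked directly in PyTorch; the shapes below are small arbitrary values chosen for illustration:

```python
import torch
import torch.nn.functional as F

# Eqn (1): each output channel Y_o is the 3D convolution of the full input X
# with that channel's kernel W_o.
C_in, C_out, K = 4, 6, 3
X = torch.randn(1, C_in, 8, 8, 8)        # input tensor
W = torch.randn(C_out, C_in, K, K, K)    # 3D kernel weights

Y = F.conv3d(X, W, padding=1)            # all output channels at once
Y0 = F.conv3d(X, W[0:1], padding=1)      # just W_0 * X

# Channel 0 of the full convolution equals the single-kernel convolution.
assert torch.allclose(Y[:, 0:1], Y0, atol=1e-5)
```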

As Eqn. (1) suggests, the computation overhead of 3D convolutions is significantly heavier than that of 2D counterparts. As a consequence, the expensive computation and over-parameterization induced by 3D deep networks impede the scalability of network capacity. Recently, many works have sought to alleviate the high demand of 3D convolutions. One common strategy is to decouple the spatial and temporal components [31, 40]. The underlying assumption is that the spatial and temporal kernels are orthogonal to each other, and therefore can effectively extract complementary information from different dimensions. Another option is to discard 3D convolutions and simply use 2D operations instead [45]. Mathematically, these replacements can be written as:

$$Y_o = \hat{W}^{1D}_o * (\hat{W}^{2D}_o * X), \tag{2}$$

$$Y_o = \hat{W}^{2D}_o * X, \tag{3}$$

where $\hat{W}$ indicates the replacement operation. Similar ideas also occur in 3D medical image analysis, where the images are volumetric data. For instance, it is shown in [21] that using 2D convolutions in the encoder and replacing 3D convolutions with P3D operations in the decoder not only largely reduces the computation overhead but also improves performance over traditional 3D networks.

Though these methods have improved model efficiency compared with standard 3D convolutions, several limitations remain. On the one hand, as shown in Eqn. (2), decomposing the kernels into orthogonal 2D and 1D components is specifically designed to extract spatial-temporal information, which may not generalize well to volumetric data. On the other hand, directly replacing 3D kernels with 2D operators (Eqn. (3)) cannot effectively capture information along the third dimension. To address these issues, we propose Channel-wise Automatic KErnel Shrinking (CAKES) to automatically search for an efficient 3D replacement that can deal with any general type of input. Our core idea is to shrink standard 3D kernels into a set of cheaper 1D, 2D, and 3D components. Besides, the shrunk kernels are channel-specific, which we refer to as Channel-wise Shrinkage. Next, we describe how to formulate kernel shrinking as path-level selection [20] in Sec. 3.2. The details of channel-wise shrinkage and the search method are elaborated in Sec. 3.3 and Sec. 3.4.

3.2 Kernel Shrinking as Path-level Selection

Let us consider the case of a single channel and abbreviate $W_o$ to $W$ for simplicity. We aim to find the optimal sub-kernel $\hat{W}$ of size $(\hat{k}_d, \hat{k}_h, \hat{k}_w)$ as the substitute for the 3D kernel $W$. Therefore, the original 3D kernels can be effectively reduced to smaller sub-kernels, leading to a more flexible and efficient design.

Given a 3D kernel $W$, a natural design choice for kernel shrinking is to represent it as the summation of its sub-kernel and the remainder:

$$W = W \odot T_S + W \odot (1 - T_S) = \hat{W} + R, \tag{4}$$

where we denote the index set of $\hat{W}$ within the original 3D kernel as $S$. The remainder after removing the sub-kernel is $R = W \odot (1 - T_S)$. $T_S$ is an index tensor with $T_S[i] = \mathbb{1}[i \in S]$, and $\mathbb{1}[\cdot]$ is the indicator function. If the following inequality holds ($\epsilon$ is a small constant):

$$\frac{\|R\|}{Z} \leq \epsilon, \tag{5}$$

where the left-hand side can be deemed an importance measure with $Z$ as the norm constant, similar to unstructured weight pruning [14] except that here we aim for a structured sub-kernel, then we can remove the remainder and reduce the 3D kernel to its sub-kernel as a replacement:

$$W \approx \hat{W}. \tag{6}$$

However, as shown in Fig. 2(a), even considering only a few kernel sizes, there are numerous sub-kernel options for a 3D kernel, which makes it impractical to find the optimal sub-kernel via manual design.
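A minimal sketch of this decomposition, assuming an illustrative 1×3×3 sub-kernel choice and an arbitrary threshold value:

```python
import torch

# Sketch of Eqns (4)-(6): a 3D kernel decomposed as sub-kernel plus remainder
# via a 0/1 index tensor; when the normalized remainder norm is below a small
# constant eps, the kernel shrinks to the sub-kernel. T and eps are
# illustrative choices, not values from the paper.
torch.manual_seed(0)
W3 = torch.randn(3, 3, 3)                  # a single 3x3x3 kernel

T = torch.zeros(3, 3, 3)                   # index tensor T_S
T[1] = 1.0                                 # select a 1x3x3 sub-kernel (middle slice)

sub = W3 * T                               # sub-kernel  W ⊙ T_S
remainder = W3 * (1 - T)                   # remainder   W ⊙ (1 - T_S)
assert torch.allclose(sub + remainder, W3)  # Eqn (4): exact decomposition

eps = 0.9                                  # arbitrary threshold for illustration
ratio = remainder.norm(p=1) / W3.norm(p=1)  # normalized remainder "importance"
shrunk = sub if ratio <= eps else W3       # Eqn (6): shrink if remainder is small
```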

Figure 2: (a) Various sub-kernels of the same 3D kernel. (b) Representation of a 3D kernel as a weighted summation of sub-kernels. (c) Path-level selection.

Therefore, we provide a new perspective—to formulate this problem as path-level selection [20], i.e., to encode sub-kernels into a multi-path super-network and select the optimal path among them (Fig. 2(c)). Then this problem can be solved by neural architecture search algorithms.

We first represent a 3D kernel as follows (Fig. 2(b)):

$$W = \sum_{i=1}^{N} \alpha_i \hat{W}_i, \tag{7}$$

where $\alpha_i$ is the weight of the $i$-th sub-kernel $\hat{W}_i$, $\alpha_i \in \{0, 1\}$, $\sum_i \alpha_i = 1$, and $N$ is the number of candidate sub-kernels. With this formulation, the problem of finding the optimal sub-kernel of $W$ becomes finding an optimal set of $\{\alpha_i\}$ and then keeping the sub-kernel with the maximum $\alpha_i$. Due to the linearity of convolution, Eqn. (1) can then be derived as below:

$$Y_o = W_o * X = \sum_{i=1}^{N} \alpha_i (\hat{W}_i * X). \tag{8}$$

To solve for the path weights $\alpha_i$, we reformulate Eqn. (8) as an over-parameterized multi-path super-network, where each candidate path consists of a sub-kernel (Fig. 2). By relaxing the selection space, i.e., relaxing the conditions on $\alpha_i$ to be continuous, Eqn. (8) can be formulated as a differentiable NAS problem and optimized via gradient descent [20].
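The relaxed super-network can be sketched as follows; the candidate shapes, channel counts, and the softmax relaxation of the path weights are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn

# Sketch of Eqn (8): one channel's 3D convolution relaxed into a weighted sum
# over candidate sub-kernel paths, with continuous path weights alpha that can
# be learned by gradient descent.
shapes = [(3, 3, 3), (1, 3, 3), (3, 1, 1)]  # 3D, 2D, 1D candidate sub-kernels
paths = nn.ModuleList(
    nn.Conv3d(4, 1, k, padding=tuple(s // 2 for s in k), bias=False)
    for k in shapes
)
alpha = nn.Parameter(torch.ones(len(shapes)))  # relaxed path weights

x = torch.randn(2, 4, 8, 8, 8)
weights = torch.softmax(alpha, dim=0)          # continuous relaxation
y = sum(w * p(x) for w, p in zip(weights, paths))  # super-network output

y.sum().backward()
assert alpha.grad is not None  # path weights receive gradients
assert y.shape == (2, 1, 8, 8, 8)
```

After training, keeping only the path with the largest weight recovers a discrete sub-kernel choice.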

Figure 3: An illustrative comparison between different types of convolution in a residual block [15]. (a) 2D convolution. (b) 3D convolution. (c) P3D convolution. (d) The proposed CAKES. In our case, starting from a 3D convolution, the 3D operation at each channel is replaced with an efficient sub-kernel.

3.3 Channel-wise Shrinkage

While previous replacements [21, 31, 40] consist of homogeneous operations in the same layer, we argue that a more efficient replacement requires customized operations at each channel. As shown in Fig. 3, kernel shrinking in a channel-wise fashion can generate heterogeneous operations which extract diverse and complementary information to be employed in the same layer, and thereby yields a fine-grained and thus more efficient replacement (Fig. 3(d)) than prior methods which use layer-wise replacements (Fig. 3(a) & (b) & (c)).

Contrary to previous layer-wise replacements, our core idea is to replace the 3D kernel at each channel individually; thus the target is to find the optimal sub-kernel $\hat{W}_o$ as the substitute for the $o$-th 3D kernel $W_o$:

$$Y_o = W_o * X \approx \hat{W}_o * X, \tag{9}$$

where the optimal size of the sub-kernel $(\hat{k}_d, \hat{k}_h, \hat{k}_w)$ is subject to $\hat{k}_d \leq K$, $\hat{k}_h \leq K$, $\hat{k}_w \leq K$. Hence the computation incurred by Eqn. (1) can be largely reduced by the replacement above.

With our channel-wise replacement design, the original 3D kernels are substituted by a series of diverse and cheaper operations at different channels as follows (recall that $C_{out}$ is the output channel number):

$$Y = \big[\hat{W}_1 * X;\ \hat{W}_2 * X;\ \dots;\ \hat{W}_{C_{out}} * X\big], \tag{10}$$

where $[\cdot\,; \cdot]$ denotes concatenation along the output channel dimension.
Benefiting from channel-wise shrinkage, our method provides a more general and flexible design for replacing 3D convolutions than previous approaches (Eqn. (2) and Eqn. (3)), and it can also be easily reduced to arbitrary alternatives (e.g., 2D, P3D) by integrating these operations into the set of candidate sub-kernels. An illustrative example can be found in Fig. 3.
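To make the channel-wise replacement concrete, here is a hypothetical CAKES-style layer in PyTorch; the particular channel split (2 channels kept 3D, 3 replaced by 2D, 3 by 1D) is made up for illustration:

```python
import torch
import torch.nn as nn

# Sketch of Eqn (10): different output channels use different sub-kernels, and
# the per-channel outputs are concatenated along the channel dimension.
class CAKESLayer(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(c_in, 2, (3, 3, 3), padding=(1, 1, 1), bias=False),  # 3D
            nn.Conv3d(c_in, 3, (1, 3, 3), padding=(0, 1, 1), bias=False),  # 2D
            nn.Conv3d(c_in, 3, (3, 1, 1), padding=(1, 0, 0), bias=False),  # 1D
        ])

    def forward(self, x):
        # Concatenate the heterogeneous per-channel-group outputs.
        return torch.cat([b(x) for b in self.branches], dim=1)

layer = CAKESLayer(c_in=4)
x = torch.randn(1, 4, 6, 6, 6)
assert layer(x).shape == (1, 8, 6, 6, 6)
```

With 4 input and 8 output channels, this layer holds 360 weights versus 864 for a full 3×3×3 convolution of the same channel counts, while producing the same output shape.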

3.4 Neural Architecture Search for an Efficient Replacement

As mentioned above, given the tremendous search space, it is impractical to manually find the optimal replacement for a 3D kernel through trial and error; it becomes even more intractable as the replacement is conducted in a channel-wise manner. Therefore, we propose to automate the process of learning an efficient replacement to fully exploit the redundancies in 3D convolution operations. Formulating kernel shrinkage as a path-level selection problem, we first construct a super-network where every candidate sub-kernel is encapsulated into a separate trainable branch at each channel (Fig. 2(c)). Once the path weights are learned by differentiable NAS [20], the optimal path (sub-kernel) can be determined.

Search Space. A well-designed search space is crucial for NAS algorithms. We aim to answer the following questions: Should the 3D convolution kernel be kept or replaced per channel? If replaced, which operation should be deployed instead?

To address these questions, for each channel we define a set $\mathcal{S}$ that contains all candidate sub-kernels (replacements) of a 3D kernel $W_o$:

$$\mathcal{S} = \{\hat{W}^{(1)}, \hat{W}^{(2)}, \dots, \hat{W}^{(N)}\}. \tag{11}$$

As the original 3D convolution kernel is a sub-kernel of itself, i.e., $W_o \in \mathcal{S}$, it can be kept in the final configuration. The final optimal operation is chosen from $\mathcal{S}$.

Another critical problem for NAS is how to reduce the search cost. To make the search cost affordable, we adopt a differentiable NAS paradigm where the model structure is discovered in a single pass of super-network training. Drawing inspiration from previous NAS methods, we directly use the scaling parameters in the normalization layers as the path weights of the multi-path super-network (Eqn. (8)) [10, 27]. Our goal is then equivalent to finding the optimal sub-network architecture based on the learned path weights. To this end, we introduce two search algorithms that prioritize either the performance or the computation cost of the sub-network, named performance-priority and cost-priority search, respectively.

Performance-Priority Search. As the name implies, the search is performed in a “performance-priority” manner, i.e., it maximizes performance by finding the optimal sub-kernels given the backbone architecture. During the search procedure, following [2, 3], we randomly pick an operation for each channel at each training iteration. This not only saves memory by activating and updating only one path per iteration, but also propels the path weights in the super-network to be decoupled during training. After the super-network is trained, the operation with the largest path weight is picked as the final choice for each output channel:

$$\hat{W}_o = \hat{W}^{(i^*)}, \quad i^* = \arg\max_i \alpha_i. \tag{12}$$
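The selection step can be sketched as follows, with random tensors standing in for the learned path weights:

```python
import random
import torch

# Sketch of performance-priority search: at each training step one candidate
# path per channel is sampled uniformly (only that path is active/updated);
# after training, the path with the largest learned weight is kept (Eqn (12)).
n_channels, n_ops = 4, 3
alpha = torch.rand(n_channels, n_ops)  # stands in for learned path weights

def sample_paths():
    # One random operation index per output channel per iteration.
    return [random.randrange(n_ops) for _ in range(n_channels)]

active = sample_paths()          # paths activated this iteration
final = alpha.argmax(dim=1)      # final per-channel operation choice

assert len(active) == n_channels and all(0 <= i < n_ops for i in active)
assert final.shape == (n_channels,)
```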
Cost-Priority Search. Performance-priority search regards performance as the search priority and may neglect possible negative effects on the computation cost. To obtain more compact models, we introduce a “cost-priority” search method. Following [27], the outputs of the sub-kernels are concatenated and aggregated by the following convolution. To make the searched architecture more compact, we introduce a “cost-aware” penalty term: a lasso term on the path weights $\alpha$ is used as the penalty loss to push many path weights to near-zero values. The total training loss can be written as:

$$L = L_{task} + \lambda \sum_i \gamma_i |\alpha_i|, \tag{13}$$

where $\gamma_i$ is a “cost-aware” term that balances the penalty and is proportional to the parameter or FLOPs cost of the $i$-th sub-kernel; it gives more penalty to “expensive” operations and leads to a more efficient replacement (in Table 1, we also empirically show that this term yields a more efficient architecture). $\lambda$ is the coefficient of the penalty term, and $L_{task}$ is the conventional training loss (e.g., cross-entropy loss combined with regularization terms such as weight decay).
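A sketch of this cost-aware penalty, with made-up path weights and per-operation costs proportional to kernel volume:

```python
import torch

# Sketch of Eqn (13): the total loss adds a cost-aware lasso penalty on the
# path weights, each scaled by a term proportional to its operation's cost.
# The alpha values, costs, and lambda below are illustrative, not the paper's.
alpha = torch.tensor([[0.9, 0.1, 0.4],
                      [0.2, 0.8, 0.1]], requires_grad=True)  # (channels, ops)
op_cost = torch.tensor([27.0, 9.0, 3.0])  # e.g., 3x3x3 vs 1x3x3 vs 3x1x1 weights
lam = 1e-2

task_loss = torch.tensor(1.0)  # placeholder for cross-entropy + weight decay
penalty = (op_cost * alpha.abs()).sum()   # expensive ops are penalized more
total_loss = task_loss + lam * penalty

total_loss.backward()
# The gradient magnitude on each path weight scales with its operation cost.
assert torch.allclose(alpha.grad.abs(), lam * op_cost.expand_as(alpha))
```

Because the penalty gradient is proportional to each operation's cost, gradient descent pushes the weights of expensive paths toward zero faster than those of cheap paths.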

4 Experiments

4.1 3D Medical Image Segmentation

Dataset. We evaluate the proposed method on two public datasets: 1) the Pancreas Tumours dataset from the Medical Segmentation Decathlon challenge (MSD) [35], which contains 282 cases annotated with both pancreatic tumours and normal pancreas; and 2) the NIH Pancreas Segmentation dataset [33], consisting of 82 abdominal CT volumes. For the MSD dataset, we use 226 cases for training and evaluate segmentation performance on the remaining 56 cases. The resolution along the axial axis of this dataset is extremely low and the number of slices can be as small as 37. For data preprocessing, all images are resampled to an isotropic 1.0 mm resolution. For the NIH dataset, each scan has a 512 × 512 in-plane resolution with a varying number of slices along the axial axis, and the voxel spacing ranges from 0.5 mm to 1.0 mm. We test the model in a 4-fold cross-validation manner following previous methods [51].

| Method | Type | Params (M) | FLOPs (G) | Pancreas DSC (%) | Tumor DSC (%) | Average DSC (%) |
| --- | --- | --- | --- | --- | --- | --- |
| — | manual | 11.29 | 97.77 | 79.16 | 43.02 | 61.09 |
| — | manual | 22.50 | 188.48 | 80.34 | 47.57 | 63.96 |
| — | manual | 13.16 | 112.88 | — | 45.27 | 62.82 |
| — | manual | 7.56 | 67.53 | 79.77 | 42.73 | 61.25 |
| — | manual | 11.29 | 97.77 | 80.09 | 46.17 | 63.13 |
| — | manual | 11.41 | 99.17 | 79.82 | 45.27 | 62.55 |
| — | auto | — | — | 80.32 | 45.57 | 62.95 |
| — | auto | 11.29 | 97.77 | 80.05 | 48.51 | 64.28 |
| — | auto | 11.26 | 99.68 | 80.12 | — | — |
| — | auto | 9.72 | 87.16 | 80.34 | 47.95 | 64.15 |

Table 1: Comparison among different operations and configurations. The subscripts of 1D, 2D, and 3D indicate the dimensions of the operations being used; a superscript marks the “cost-priority” design.

Implementation Details. For all experiments, C2FNAS [49] is used as the backbone architecture and the search is performed in the “performance-priority” manner unless otherwise specified. When replacing the operations, we keep the stem (the first two and the last two convolution layers) unchanged. For 3D medical images, for simplicity, we choose a set of the most representative sub-kernels as candidates: the operation set contains 1D convolutions (3×1×1, 1×3×1, 1×1×3) and 2D convolutions (1×3×3, 3×1×3, 3×3×1) along different directions, and the 3D convolution (3×3×3). For every 3D kernel at each output channel, one sub-kernel from this set is chosen as the replacement. For NAS settings, we include both “performance-priority” and “cost-priority” search for performance comparison. For manual settings, we assign all candidate operations uniformly across the output channels.

Training stage. For the MSD dataset, we use random crops of a fixed patch size, random rotations by multiples of 90°, and flips along all three axes as data augmentation. The batch size is 8 on 4 GPUs. We use the SGD optimizer with a learning rate starting from 0.01 with polynomial decay of power 0.9, momentum of 0.9, and weight decay of 0.00004. The loss function is the summation of the Dice loss [28] and the cross-entropy loss. For the NIH dataset, the patch size is set following [52]. The found architecture is trained from scratch to ensure its effectiveness; both the super-network and the found architecture are trained under the same settings as above. For the search stage with the “cost-priority” setting, a lasso term is applied to the path weights, with its coefficient further re-weighted for 3D, 2D, and 1D operations respectively. After training finishes, the operation with the largest path weight is chosen as the final replacement of the 3D operation for each channel.

Testing stage. We test the network in a sliding-window manner, with dataset-specific patch sizes and strides for the MSD and NIH datasets. The result is measured with the Dice-Sørensen coefficient (DSC) metric, formulated as $DSC(P, G) = \frac{2|P \cap G|}{|P| + |G|}$, where $P$ and $G$ denote the sets of predicted and ground-truth voxels for a foreground class. The DSC has a range of $[0, 1]$, with 1 implying a perfect prediction.
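The DSC metric can be computed as follows (a small NumPy sketch; the convention for empty masks is our assumption):

```python
import numpy as np

# Dice-Sørensen coefficient: DSC(P, G) = 2|P ∩ G| / (|P| + |G|),
# computed on binary foreground masks.
def dsc(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # assumed convention: empty prediction matches empty truth
    return 2.0 * np.logical_and(pred, gt).sum() / denom

p = np.array([[1, 1, 0], [0, 1, 0]])
g = np.array([[1, 0, 0], [0, 1, 1]])
assert dsc(p, p) == 1.0                          # perfect prediction
assert abs(dsc(p, g) - 2 * 2 / (3 + 3)) < 1e-9   # 2|P∩G| = 4, |P| + |G| = 6
```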

| Method | Type | Params | Average DSC | Max DSC | Min DSC |
| --- | --- | --- | --- | --- | --- |
| Zhou et al. [51] | Manual | 268.56M | 82.37% | 90.85% | 62.43% |
| Oktay et al. [29] | Manual | 103.88M | 83.10% | - | - |
| Yu et al. [47] | Manual | 268.56M | 84.50% | 91.02% | 62.81% |
| Zhu et al. [53] | Manual | 20.06M | 84.59% | 91.45% | 69.62% |
| Zhu et al. [52] | Auto | 29.74M | 85.15% | 91.18% | 70.37% |
| CAKES (ours) | Auto | — | 84.85% | 91.61% | 59.32% |
| CAKES (ours) | Auto | 11.26M | — | — | — |

Table 2: Performance comparison with prior arts on the NIH dataset.

NAS Settings vs. Manual Settings. As can be observed from Table 1, even under manual settings CAKES is already more efficient than the 3D baseline with only slightly inferior performance (parameters drop from 22.50M to 11.29M and FLOPs from 188.48G to 97.77G, with a performance gap of 1.0%). Besides, the manual CAKES setting outperforms its counterpart with standard convolution layers at the same model size, which indicates the benefits of our design. In addition, under NAS settings, the proposed search method further reduces the performance gap and even surpasses the original 3D model with much fewer parameters and computations: the model size is reduced from 22.50M to 11.26M and the FLOPs drop from 188.48G to 99.68G, with a performance improvement of 0.46%. The searched model also yields superior performance (+1.60%) with a more compact model (11.26M vs. 13.16M) than its manually designed counterpart, which further indicates the effectiveness of the proposed method.

Influence of the Search Space. From Table 1, we can see that across different search spaces, the proposed CAKES consistently outperforms its counterparts with standard 1D/2D/3D convolutions; for instance, CAKES yields significant performance improvements (+1.70% and +1.15%) over two such counterparts with the same model size and computation cost. Among the different search spaces, the smallest one (7.56M params and 67.53G FLOPs) offers the most efficient model with slightly worse performance, while another (11.29M params and 97.77G FLOPs) already surpasses the 3D baseline (22.50M params and 188.48G FLOPs) with half the parameters and computation cost. After enlarging the search space, CAKES finds a configuration with even higher performance and efficiency (last 2 rows of Table 1).

Generalization to different backbone architectures. We also test our method on different backbone architectures. Applied to another state-of-the-art model, 3D ResDSN [53], our method consistently leads to a more efficient model with much fewer parameters (10.03M to 4.63M) and FLOPs (192.07G to 98.12G) at comparable performance (61.96% to 61.65%).

NIH Results. We compare the proposed CAKES with state-of-the-art methods in Table 2, where it can be observed that CAKES leads to a much more compact model than the alternatives; for instance, our model (11.26M parameters) is more than 20× smaller than those of [51] and [47] (268.56M). It is worth noting that our model, even performed in a single-stage fashion, already outperforms many state-of-the-art methods conducted in a two-stage coarse-to-fine manner [51, 48, 53] on the NIH pancreas dataset, with much fewer model parameters and FLOPs. It is also noteworthy that the applied architecture is searched on another dataset (MSD), where the images are collected under different protocols and have different resolutions; this indicates the generalization ability of our searched model. By directly transferring the architecture searched on the MSD dataset, our method also outperforms [52], which was searched directly on the NIH dataset, with fewer than half the parameters of [52].

4.2 Action Recognition in Videos

Dataset. Something-Something V1 [11] is a large-scale action recognition dataset that requires comprehensive temporal modeling. It contains about 110k videos of 174 classes with diverse objects, backgrounds, and viewpoints.

Implementation Details. We adopt ResNet50 [15] pre-trained on ImageNet [17] as our backbone. The 3D convolution weights are initialized by repeating the 2D kernels 3 times along the temporal dimension following [4], while the 1D convolution weights are initialized by averaging the 2D kernels over the spatial dimensions and then repeating 3 times along the temporal axis. For the temporal dimension, we use the sparse sampling method of TSN [41]. For the spatial dimensions, the short side of the input frames is resized to 256 and then cropped to 224 × 224.
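The two initialization schemes described above can be sketched as follows; the divide-by-3 rescaling is an optional I3D-style choice, not stated in the text:

```python
import torch

# Sketch of the described initialization: a pretrained 2D kernel
# (C_out, C_in, k, k) is inflated to 3D by repeating along the temporal
# dimension, or averaged spatially and then repeated for 1D temporal convs.
w2d = torch.randn(8, 4, 3, 3)  # stands in for an ImageNet-pretrained kernel

# 3D init: repeat 3 times along the temporal axis (division by 3 is an
# optional I3D-style rescaling to preserve activation magnitude).
w3d = w2d.unsqueeze(2).repeat(1, 1, 3, 1, 1) / 3.0
assert w3d.shape == (8, 4, 3, 3, 3)

# 1D (temporal) init: average over spatial dims, then repeat 3 times.
w1d = w2d.mean(dim=(2, 3), keepdim=True).unsqueeze(2).repeat(1, 1, 3, 1, 1)
assert w1d.shape == (8, 4, 3, 1, 1)
```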

Training stage. We use random cropping and flipping as data augmentation. We train the network with a batch size of 96 on 8 GPUs with the SGD optimizer. The learning rate starts from 0.04 for the first 50 epochs and decays by a factor of 10 every 10 epochs afterwards; the total number of training epochs is 70. We also set the dropout ratio to 0.3 following [43]. The training settings remain the same for both the final network and the search stage, except that when searching with “cost-priority”, we use a lasso term with separate coefficients for 3D, 2D, and 1D operations respectively.

Testing stage. We sample the middle frame in each segment and apply a center crop to each frame. We report single-crop results unless otherwise specified.

| Model | Type | Params (M) | FLOPs (G) | top-1 | top-5 |
| --- | --- | --- | --- | --- | --- |
| C2D | manual | 23.9 | 33.0 | 17.2 | 43.1 |
| P3D | manual | 27.6 | 37.9 | 44.8 | 74.6 |
| C3D | manual | 46.5 | 62.6 | 46.8 | 75.3 |
| — | manual | — | — | 46.2 | 75.2 |
| — | manual | 35.2 | 47.7 | 46.8 | 76.0 |
| — | auto | 21.2 | 29.5 | 46.3 | 74.9 |
| — | auto | 37.5 | 50.7 | — | — |
| — | auto | 33.5 | 43.9 | 47.2 | 75.7 |
| — | auto | 20.5 | 29.3 | 46.8 | 76.0 |
| — | auto | 35.7 | 41.4 | 46.9 | 75.6 |
| — | auto | 35.0 | 38.7 | 46.9 | 75.5 |

Table 3: Comparison among operations and configurations for the ResNet50 backbone in terms of parameter number, computation (FLOPs), and performance on the Something-Something V1 dataset. A superscript marks the “cost-priority” design.
Model Backbone #Frame FLOPs #Param. top1 top5
TSN [41] ResNet-50 8 33G 24.3M 19.7 46.6
TRN-2stream [50] BNInception 8+8 - 36.6M 42.0 -
ECO [54] BNIncep+3D Res18 8 32G 47.5M 39.6 -
ECO [54] BNIncep+3D Res18 16 64G 47.5M 41.4 -
 [54] BNIncep+3D Res18 92 267G 150M 46.4 -
I3D [4] 3D ResNet-50 322clip 153G2 28.0M 41.6 72.2
NL I3D [42] 3D ResNet-50 322clip 168G2 35.3M 44.4 76.0
NL I3D+GCN [43] 3D ResNet-50+GCN 322clip 303G2 62.2M 46.1 76.8
TSM [18] ResNet-50 8 33G 24.3M 45.6 74.2
TSM [18] ResNet-50 16 65G 24.3M 47.2 77.1
S3D [45] BNInception 64 66.38G - 47.3 78.1
S3D-G [45] BNInception 64 71.38G - 48.2 78.7
ResNet-50 8 29.3G 20.5M 46.8 76.0
ResNet-50 8 50.7G 37.5M 47.4 76.1
ResNet-50 8 43.9G 33.5M 47.2 75.7

ResNet-50 16 58.6G 20.5M 48.0 78.0
ResNet-50 16 101.4G 37.5M 48.6 78.6
ResNet-50 16 87.8G 33.5M 49.4 78.4
Table 4: Comparison of CAKES against other methods on the Something-Something dataset. For a fair comparison, we mainly consider methods that adopt convolutions in a fully-connected manner and take only RGB as input

Ablation Study. We study the impact of different operation sets and manual/auto configurations. The results are summarized in Table 3. Considering the spatial-temporal property of video data, we study the following operation sets: (1) spatial 2D convolution and temporal 1D convolution; (2) spatial 2D convolution and 3D convolution; (3) spatial 2D, temporal 1D, and 3D convolutions.

Operation set with 1D & 2D sub-kernels. As shown in Table 3, surpasses the 2D baseline by a large margin (46.8% vs. 17.2%) while having a smaller model size (20.5M vs. 23.9M). This suggests that TSN [41] may lack the ability to capture temporal information; replacing some of the 2D operations with temporal 1D operations can therefore significantly increase performance and reduce model size. It also surpasses P3D, where each 2D convolution is followed by a temporal 1D convolution, with a significant advantage in both performance (46.8% vs. 44.8%) and model cost (20.5M params and 29.3G FLOPs compared to 27.6M params and 37.9G FLOPs), indicating that makes better use of the redundancies in the network than P3D. Therefore, an operation set containing 1D and 2D sub-kernels can be an ideal design when looking for efficient video understanding networks.
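The economy of a 1D/2D operation set can be seen from a quick parameter count (a toy calculation: the 256-channel width and the even 50/50 channel split are assumed for illustration only, since the searched models choose per-layer ratios):

```python
def conv_params(c_out, c_in, kernel):
    # parameter count of a convolution with the given kernel shape (bias omitted)
    n = c_out * c_in
    for k in kernel:
        n *= k
    return n

c = 256  # assumed channel width, for illustration
full_3d = conv_params(c, c, (3, 3, 3))          # standard 3x3x3 kernel
# channel-wise shrinkage: half the output channels get a spatial 1x3x3
# sub-kernel, the other half a temporal 3x1x1 sub-kernel
shrunk = (conv_params(c // 2, c, (1, 3, 3))
          + conv_params(c // 2, c, (3, 1, 1)))
```

Under these assumptions the split layer uses 4.5x fewer parameters than the full 3D layer, which is the same mechanism behind the param/FLOPs reductions in Table 3.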

Operation set with 2D & 3D sub-kernels. We aim to see how balances the trade-off between performance and model cost. From Table 3, under the “cost-priority” setting, yields a much more compact model (35.7M params and 41.4G FLOPs) with performance comparable to C3D. Under the “performance-priority” setting, searches a slightly larger model (37.5M params and 50.7G FLOPs), yet its performance rises significantly to 47.4%.

Operation set with 1D & 2D & 3D sub-kernels. Compared to , shows slightly inferior performance (-0.5%) with much fewer FLOPs (38.7G vs. 50.7G). Moreover, under the “performance-priority” setting, produces performance comparable to with much less computation (43.9G vs. 50.7G). This result indicates that with a more general search space (i.e., 1D, 2D, and 3D), the proposed CAKES can find more flexible designs, which lead to better performance and efficiency.

Results. A comparison with other state-of-the-art methods is shown in Table 4. We report model performance under both 8-frame and 16-frame settings. Even when sampling only 8 frames, our model already outperforms most existing methods. With fewer parameters and FLOPs, it surpasses complex models such as non-local networks [42] with graph convolutions [43]. Compared to other efficient video understanding frameworks such as ECO [54] and TSM [18], our model is not only more lightweight (58.6G vs. 64G/65G) but also delivers better performance (48.0% vs. 41.4%/47.2%). Our best model achieves a new state-of-the-art performance of 49.4% top-1 accuracy with a moderate model size. An interesting finding is that although shows performance similar to with 8-frame inputs, it achieves much higher accuracy in the 16-frame scenario, demonstrating that with a more general search space, also shows stronger transferability than its counterparts.

Figure 4: The searched architecture of (a) on medical data and (b) on video data. Each color represents a type of sub-kernel. The heights of these blocks are proportional to their ratios in the corresponding convolution layer. The beginning and ending convolutions at each residual block are not visualized
Figure 5: The searched architecture of (a) on medical data and (b) on video data. Each color represents a type of sub-kernel. The heights of these blocks are proportional to their ratios in the corresponding convolution layer. The beginning and ending convolutions at each residual block are not visualized

4.3 Analysis

Cost-Priority Architectures. We plot the architectures found on medical data () and video data () in Fig. 4. For the architecture searched on the Something-Something dataset, we note that the algorithm prefers efficient 2D operations at the bottom of the network and favors 3D operations at the top. This implies that the search algorithm successfully discovers that temporal information extracted from high-level features is more useful, which coincides with the observation in [45]. For the architecture found on the MSD dataset, we calculated the number of operations computed on each of the three data dimensions, which are (, , ). This suggests that the searched model, unlike the network searched for videos, tends to treat each dimension equally, which aligns with the properties of volumetric medical data. In addition, the numbers of 1D, 2D, and 3D operations are , , and respectively, indicating that the efficient 1D/2D operations are preferred.

Performance-Priority Architectures. As shown in Fig. 5(b), for , the pure temporal sub-kernel () is rarely chosen at the bottom of the network, while it plays a more important role as the network goes deeper. This observation agrees with our previous finding: temporal information extracted from high-level features is more useful. For on medical images, as shown in Fig. 5(a), we calculate the number of each type of sub-kernel and the computation along the three axes. The numbers of 1D, 2D, and 3D sub-kernels are , , and , and the numbers of operations computed on the three data dimensions are , , , which again coincides with our previous finding that tends to treat each dimension equally for the symmetric volumetric data. Compared to the cost-priority , we notice that the performance-priority favors an operation set with more 3D sub-kernels, which provides a larger model capacity.

5 Conclusions

As an important solution for various 3D vision applications, 3D networks still suffer from over-parameterization and heavy computation. How to design efficient alternatives to 3D operations remains an open problem. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES), which shrinks standard 3D convolution kernels into efficient sub-kernels at the channel level to obtain efficient 3D models. Moreover, by formulating kernel shrinkage as a path-level selection problem, our method can automatically explore the redundancies in 3D convolutions and optimize the replacement configuration. Applied to different backbone models, the proposed CAKES significantly outperforms previous 2D/3D/P3D designs and other state-of-the-art methods on both 3D medical image segmentation and video action recognition.


  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2017) Designing neural network architectures using reinforcement learning. ICLR. Cited by: §2.3.
  • [2] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 550–559. Cited by: §3.4.
  • [3] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2018) SMASH: one-shot model architecture search through hypernetworks. ICLR. Cited by: §2.3, §3.4.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: §1, §2.1, §4.2, Table 4.
  • [5] X. Chen, L. Xie, J. Wu, and Q. Tian (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. ICCV. Cited by: §2.3.
  • [6] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1251–1258. Cited by: §2.1.
  • [7] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In MICCAI, Cited by: §2.1.
  • [8] G. Ghiasi, T. Lin, and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045. Cited by: §2.3.
  • [9] F. Gonda, D. Wei, T. Parag, and H. Pfister (2018) Parallel separable 3d convolution for video and volumetric data understanding. BMVC. Cited by: §2.1.
  • [10] A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018) Morphnet: fast & simple resource-constrained structure learning of deep networks. In CVPR, pp. 1586–1595. Cited by: §2.3, §3.4.
  • [11] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. (2017) The “Something Something” video database for learning and evaluating visual common sense. In ICCV, Vol. 1, pp. 5. Cited by: §4.2.
  • [12] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420. Cited by: §2.3.
  • [13] S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. ICLR. Cited by: §2.2.
  • [14] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In NeurIPs, pp. 1135–1143. Cited by: §2.2, §3.2.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: Figure 3, §4.2, CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Network.
  • [16] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397. Cited by: §2.2.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105. Cited by: §2.1, §4.2.
  • [18] J. Lin, C. Gan, and S. Han (2019) Tsm: temporal shift module for efficient video understanding. In ICCV, pp. 7083–7093. Cited by: §4.2, Table 4.
  • [19] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92. Cited by: §2.3.
  • [20] H. Liu, K. Simonyan, and Y. Yang (2019) Darts: differentiable architecture search. ICLR. Cited by: §2.3, §3.1, §3.2, §3.2, §3.4.
  • [21] S. Liu, D. Xu, S. K. Zhou, T. Mertelmeier, J. Wicklein, A. Jerebko, S. Grbic, O. Pauly, W. Cai, and D. Comaniciu (2017) 3D anisotropic hybrid network: transferring convolutional features from 2d images to 3d anisotropic volumes. MICCAI. Cited by: §2.1, §3.1, §3.3.
  • [22] S. Liu, D. Xu, S. K. Zhou, O. Pauly, S. Grbic, T. Mertelmeier, J. Wicklein, A. Jerebko, W. Cai, and D. Comaniciu (2018) 3d anisotropic hybrid network: transferring convolutional features from 2d images to 3d anisotropic volumes. In MICCAI, pp. 851–858. Cited by: §2.1.
  • [23] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, pp. 2736–2744. Cited by: §2.2.
  • [24] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2019) Rethinking the value of network pruning. ICLR. Cited by: §2.3.
  • [25] C. Luo and A. Yuille (2019) Grouped spatial-temporal aggretation for efficient action recognition. In ICCV, Cited by: §2.1.
  • [26] J. Luo, J. Wu, and W. Lin (2017) Thinet: a filter level pruning method for deep neural network compression. In ICCV, pp. 5058–5066. Cited by: §2.2.
  • [27] J. Mei, Y. Li, X. Lian, X. Jin, L. Yang, A. Yuille, and J. Yang (2020) AtomNAS: fine-grained end-to-end neural architecture search. ICLR. Cited by: §2.3, §3.4, §3.4.
  • [28] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571. Cited by: §4.1.
  • [29] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al. (2018) Attention u-net: learning where to look for the pancreas. MIDL. Cited by: Table 2.
  • [30] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean (2018) Efficient neural architecture search via parameter sharing. ICML. Cited by: §2.3.
  • [31] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, pp. 5533–5541. Cited by: §1, §2.1, §3.1, §3.3.
  • [32] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789. Cited by: §2.3.
  • [33] H. R. Roth, L. Lu, A. Farag, H. Shin, J. Liu, E. B. Turkbey, and R. M. Summers (2015) Deeporgan: multi-level deep convolutional networks for automated pancreas segmentation. In MICCAI, Cited by: §4.1.
  • [34] M. S. Ryoo, A. Piergiovanni, M. Tan, and A. Angelova (2019) Assemblenet: searching for multi-stream neural connectivity in video architectures. arXiv preprint arXiv:1905.13209. Cited by: §2.3.
  • [35] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. van Ginneken, A. Kopp-Schneider, B. A. Landman, G. Litjens, B. Menze, et al. (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063. Cited by: §4.1.
  • [36] D. Stamoulis, R. Ding, D. Wang, D. Lymberopoulos, B. Priyantha, J. Liu, and D. Marculescu (2019) Single-path nas: designing hardware-efficient convnets in less than 4 hours. arXiv preprint arXiv:1904.02877. Cited by: §2.3, §2.3.
  • [37] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang (2016) Convolutional neural networks for medical image analysis: full training or fine tuning?. TMI 35 (5), pp. 1299–1312. Cited by: §1, §2.1.
  • [38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: §2.1.
  • [39] D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In ICCV, pp. 5552–5561. Cited by: §2.1.
  • [40] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, pp. 6450–6459. Cited by: §1, §2.1, §3.1, §3.3.
  • [41] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, pp. 20–36. Cited by: §4.2, §4.2, Table 4.
  • [42] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §4.2, Table 4.
  • [43] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, pp. 399–417. Cited by: §4.2, §4.2, Table 4.
  • [44] L. Xie and A. Yuille (2017) Genetic cnn. In ICCV, pp. 1379–1388. Cited by: §2.3.
  • [45] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, pp. 305–321. Cited by: §1, §1, §2.1, §3.1, §4.3, Table 4.
  • [46] Y. Xu, L. Xie, X. Zhang, X. Chen, G. Qi, Q. Tian, and H. Xiong (2020) Pc-darts: partial channel connections for memory-efficient differentiable architecture search. ICLR. Cited by: §2.3.
  • [47] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille (2018) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR, pp. 8280–8289. Cited by: §4.1, Table 2.
  • [48] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille (2018-06) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR, Cited by: §4.1.
  • [49] Q. Yu, D. Yang, H. Roth, Y. Bai, Y. Zhang, A. L. Yuille, and D. Xu (2020) C2FNAS: coarse-to-fine neural architecture search for 3d medical image segmentation. CVPR. Cited by: §2.3, §4.1, CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Network.
  • [50] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In ECCV, pp. 803–818. Cited by: Table 4.
  • [51] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille (2017) A fixed-point model for pancreas segmentation in abdominal ct scans. In MICCAI, pp. 693–701. Cited by: §4.1, §4.1, Table 2.
  • [52] Z. Zhu, C. Liu, D. Yang, A. Yuille, and D. Xu (2019) V-nas: neural architecture search for volumetric medical image segmentation. 3DV. Cited by: §2.3, §4.1, §4.1, Table 2.
  • [53] Z. Zhu, Y. Xia, W. Shen, E. K. Fishman, and A. L. Yuille (2018) A 3d coarse-to-fine framework for automatic pancreas segmentation. 3DV. Cited by: §4.1, §4.1, Table 2.
  • [54] M. Zolfaghari, K. Singh, and T. Brox (2018) Eco: efficient convolutional network for online video understanding. In ECCV, pp. 695–712. Cited by: §4.2, Table 4.
  • [55] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. ICLR. Cited by: §2.3.
  • [56] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710. Cited by: §2.3.