1 Introduction
3D learning has attracted increasing research attention with the recent advances of deep neural networks. However, conventional 3D convolution layers typically incur expensive computation and suffer from convergence problems due to overfitting and the lack of pretrained weights [4, 37].
To reduce the redundancy in 3D convolutions, many efforts have been devoted to designing efficient alternatives. For instance, [31, 40] propose to factorize the 3D kernel and replace the 3D convolution with P3D and (2+1)D convolutions, where 2D and 1D convolution layers are applied in a structured manner. Xie et al. [45] suggest that replacing 3D convolutions with low-cost 2D convolutions at the bottom of the network significantly improves recognition efficiency.
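To make the efficiency gain of such factorizations concrete, the parameter counts of a full 3D kernel versus a (2+1)D factorization can be compared with a short sketch (the helper names and the choice of intermediate width are ours, not from the cited papers):

```python
# Hypothetical parameter-count comparison (sketch): a k x k x k 3D kernel
# versus its (2+1)D factorization into a 1 x k x k spatial kernel followed
# by a k x 1 x 1 temporal kernel, for c_in input and c_out output channels.
def params_3d(c_in, c_out, k=3):
    return c_out * c_in * k * k * k

def params_2plus1d(c_in, c_out, k=3, c_mid=None):
    # (2+1)D variants typically insert an intermediate channel width c_mid;
    # here we default to c_mid = c_out for simplicity.
    c_mid = c_out if c_mid is None else c_mid
    spatial = c_mid * c_in * 1 * k * k    # 1 x k x k spatial kernel
    temporal = c_out * c_mid * k * 1 * 1  # k x 1 x 1 temporal kernel
    return spatial + temporal

print(params_3d(64, 64))       # 110592
print(params_2plus1d(64, 64))  # 36864 + 12288 = 49152
```

Even this naive accounting shows the factorized form using less than half the parameters of the full 3D kernel.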
Despite their effectiveness for spatial-temporal information extraction, existing 3D alternatives have several limitations. Firstly, these methods (e.g., P3D) are specifically tailored to video datasets, where data can be explicitly separated into time and space. However, for volumetric data such as CT/MRI, where all three dimensions should be treated equally, conventional spatial-temporal operators can lead to biased information extraction. Secondly, existing operations are still insufficient even for spatial-temporal data, since they may exhibit certain levels of redundancy along either the temporal or the spatial dimension, as empirically suggested in [45]. Finally, existing replacements are manually designed, which can be time-consuming and may not lead to optimal results.
To address these issues, we introduce Channel-wise Automatic KErnel Shrinking (CAKES), a general framework to automatically determine an efficient replacement for existing 3D operations. Specifically, the proposed method simplifies conventional 3D operations by adopting a combination of diverse and economical operations (e.g., 1D and 2D convolutions), where these different operators extract complementary information within the same layer. Unlike previous methods, by shrinking standard 3D kernels in a channel-wise fashion, our approach is not tailored to any specific type of input (e.g., videos), but generalizes to different types of data and backbone architectures to learn a fine-grained and efficient replacement. Moreover, we provide a new perspective: we formulate kernel shrinking as a path-level selection problem, which can then be solved by Neural Architecture Search (NAS). To accelerate the search for an optimal replacement configuration over the tremendous search space, we relax the selection of operations to be differentiable, so that the replacement can be determined in a one-shot manner during end-to-end training.
The proposed search algorithm delivers high-performance and efficient models. As shown in Fig. 1, evaluated on both 3D medical image segmentation and video action recognition tasks, our method achieves a 64.15% average Dice score on the MSD pancreas dataset with 9.72M parameters and 87.16G FLOPs, and 47.4% top-1 accuracy on the Something-Something V1 dataset with 37.5M parameters and 50.7G FLOPs. Compared with its 3D baseline, CAKES not only shows superior performance but also significantly reduces the model size (56.80% less on medical and 19.35% less on video) and computational cost (53.76% less on medical and 19.01% less on video). CAKES surpasses its 2D/3D/P3D counterparts and achieves state-of-the-art performance.
Our contributions can be summarized as follows:
(1) We propose a novel method to shrink 3D kernels into heterogeneous yet complementary efficient counterparts at a fine-grained level, which leads to more efficient and flexible alternatives to 3D convolution without sacrificing accuracy.
(2) We propose channel-wise kernel shrinkage, which yields a generic and flexible replacement that can be applied to both spatial-temporal and volumetric data with different backbone architectures.
(3) By formulating kernel shrinkage as a path-level selection problem and relaxing the selection of operations to be differentiable, the replacement can be determined in a one-shot manner using NAS as the search platform.
(4) By applying CAKES to different backbone models, we achieve state-of-the-art performance while being much more efficient on both volumetric medical data and video data compared with their 2D/3D/P3D counterparts.
2 Related Work
2.1 Efficient 3D Convolutional Neural Networks
Despite the great advances of 3D CNNs [4, 7, 38], existing 3D networks usually require a heavy computational budget. Besides, 3D CNNs also suffer from unstable training due to the lack of pretrained weights [4, 21, 37]. These facts have motivated researchers to find efficient alternatives to 3D convolutions. For example, [25, 39] suggest applying group convolution [17] and depthwise convolution [6] to 3D networks to obtain resource-efficient models. Another type of approach replaces each 3D convolution layer with a structured combination of 2D and 1D convolution layers to achieve better performance while being more efficient. For instance, [31, 40] propose to use a 2D spatial convolution layer followed by a 1D temporal convolution layer to replace a standard 3D convolution layer. Besides, Xie et al. [45] demonstrate that 3D convolutions are not needed everywhere and some of them can be replaced by 2D counterparts. Similar attempts also occur in the medical imaging area [22]. Gonda et al. [9] replace consecutive 3D convolution layers with consecutive 2D convolution layers followed by a 1D convolution layer.
Our method differs from these methods in the following respects: (1) instead of applying homogeneous operations to all channels, we allow assigning complementary, heterogeneous operations channel-wise; and (2) instead of manual design through trial-and-error, we design a neural architecture search method to automatically optimize the replacement configuration.
2.2 Network Pruning
Network pruning methods investigate the redundancy in deep models by selecting the important connections (according to a certain criterion) to obtain more compact models. Based on the granularity level, network pruning methods can be divided into the following categories: 1) unstructured weight pruning, which aims at identifying and removing unimportant weights inside networks; for instance, Han et al. [13, 14] propose to prune network weights with small magnitudes. These methods typically remove weights and connections in an unstructured manner, making it hard to achieve real speedup without dedicated hardware or libraries; 2) structured pruning methods, which prune at the level of channels or even layers [16, 23, 26] and achieve empirical acceleration on modern computing devices.
In this paper, we study structured kernel shrinking: shrinking a 3D convolution kernel to its optimal sub-kernel. This direction has rarely been studied before. Our method is similar to weight pruning, but unlike previous methods, we perform it in a structured manner, so that deploying the shrunk kernels leads to efficient models.
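The distinction can be illustrated with a toy numpy sketch (purely illustrative, not the paper's algorithm): unstructured pruning leaves scattered zeros inside a kernel, whereas structured shrinking keeps a contiguous sub-kernel that deploys directly as a cheaper operation.

```python
import numpy as np

# Sketch contrasting unstructured weight pruning with structured kernel
# shrinking on a toy 3x3x3 kernel (illustrative only).
rng = np.random.default_rng(0)
K = rng.standard_normal((3, 3, 3))

# Unstructured: zero out the weights with smallest magnitude (scattered zeros).
flat = np.abs(K).ravel()
thresh = np.sort(flat)[flat.size // 2]  # drop roughly half the weights
K_unstructured = np.where(np.abs(K) >= thresh, K, 0.0)

# Structured: keep only a contiguous 1x3x3 sub-kernel (the center slice),
# which deploys directly as a cheap 2D convolution.
K_structured = K[1:2, :, :]

assert K_unstructured.shape == K.shape  # same shape, scattered zeros inside
assert K_structured.shape == (1, 3, 3)  # a genuinely smaller kernel
```

The unstructured result still requires the full 3D kernel shape at inference time, while the structured one changes the operator itself.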
2.3 Neural Architecture Search
Neural Architecture Search (NAS) aims at automatically discovering better network architectures than human-designed ones. It has proved successful not only in 2D natural image recognition [55], but also on other tasks such as segmentation [19] and detection [8]. Besides the success on natural images, there have also been trials on other data formats such as video [34] and 3D medical images [49, 52]. Earlier NAS algorithms are based on either reinforcement learning [1, 55, 56] or evolutionary algorithms [32, 44]. However, these methods often require training each network candidate from scratch, and the intensive computational cost hampers their usage, especially under limited computational budgets. Since the parameter-sharing scheme was proposed in [30], more and more search methods, such as differentiable NAS approaches [5, 20, 46] and one-shot NAS approaches [3, 12, 36], have investigated how to effectively reduce the search cost to several GPU days or even several GPU hours. Moreover, [24] discuss the relationship between network pruning and NAS, and [10, 27] successfully borrow ideas from network pruning to design more efficient search methods. Some NAS methods [27, 36] also incorporate the kernel size into the search space. Nevertheless, most of them only consider simple cases with choices among a few 2D kernel sizes, while we consider much more diverse and general kernel deployment across different channels in 3D settings.
3 Method
3.1 Revisiting Variants of 3D Convolution
We first revisit 3D convolutions and existing alternatives. Without loss of generality, let $X$ of size $C_1 \times D \times H \times W$ denote the input tensor, where $C_1$ stands for the input channel number, and $D$, $H$, $W$ represent the spatial depth (or temporal length), the spatial height, and the spatial width, respectively. The weights of the corresponding 3D kernel are denoted as $K \in \mathbb{R}^{C_2 \times C_1 \times d \times h \times w}$, where $C_2$ is the output channel number and $d \times h \times w$ denotes the kernel size. Therefore, the output tensor $Y$ of shape $C_2 \times D' \times H' \times W'$ can be derived as follows:

$Y_n = X * K_n$,  (1)

where $*$ denotes convolution and $n$ is the output channel index, i.e., $1 \le n \le C_2$.
As Eqn. (1) suggests, the computation overhead of 3D convolutions is significantly heavier than that of their 2D counterparts. As a consequence, the expensive computation and over-parameterization induced by 3D deep networks impede the scalability of network capacity. Recently, many works have sought to alleviate the high demand of 3D convolutions. One common strategy is to decouple the spatial and temporal components [31, 40]. The underlying assumption is that the spatial and temporal kernels are orthogonal to each other, and therefore can effectively extract complementary information from different dimensions. Another option is to discard 3D convolutions and simply use 2D operations instead [45]. Mathematically, these replacements can be written as:

$R(K_n) = K_n^{1 \times h \times w} * K_n^{d \times 1 \times 1}$,  (2)

$R(K_n) = K_n^{1 \times h \times w}$,  (3)

where $R(\cdot)$ indicates the replacement operation. Similar ideas also occur in 3D medical image analysis, where the images are volumetric data. For instance, it is shown in [21] that using 2D convolutions in the encoder and replacing 3D convolutions with P3D operations in the decoder not only largely reduces the computation overhead but also improves the performance over traditional 3D networks.
Though these methods have improved model efficiency compared with standard 3D convolutions, several limitations remain. On the one hand, as shown in Eqn. (2), decomposing the kernels into orthogonal 2D and 1D components is specifically designed to extract spatial-temporal information, which may not generalize well to volumetric data. On the other hand, directly replacing 3D kernels with 2D operators (Eqn. (3)) cannot effectively capture information along the third dimension. To address these issues, we propose Channel-wise Automatic KErnel Shrinking (CAKES) to automatically search for an efficient 3D replacement that can deal with any general type of input. Our core idea is to shrink standard 3D kernels into a set of cheaper 1D, 2D, and 3D components. Besides, the shrunk kernels are channel-specific, which we refer to as channel-wise shrinkage. Next, we describe how to formulate kernel shrinking as path-level selection [20] in Sec. 3.2. The details of channel-wise shrinkage and the search method are elaborated in Sec. 3.3 and Sec. 3.4.
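As a reference point for the notation above, a naive numpy sketch of one output channel of a valid-mode 3D convolution, $Y_n = X * K_n$, can be written as follows (our own illustrative implementation, not the paper's code):

```python
import numpy as np

# Minimal sketch of Eqn (1): one output channel of a (valid-mode) 3D
# convolution, implemented naively with numpy.
def conv3d_single(X, Kn):
    # X: (C1, D, H, W) input; Kn: (C1, d, h, w) kernel for one output channel
    C1, D, H, W = X.shape
    _, d, h, w = Kn.shape
    out = np.zeros((D - d + 1, H - h + 1, W - w + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # Sum over all input channels and the local 3D window.
                out[z, y, x] = np.sum(X[:, z:z+d, y:y+h, x:x+w] * Kn)
    return out

X = np.random.rand(2, 5, 5, 5)
Kn = np.random.rand(2, 3, 3, 3)
print(conv3d_single(X, Kn).shape)  # (3, 3, 3)
```

The triple loop makes the $d \cdot h \cdot w$ cost per output voxel explicit, which is the factor that kernel shrinking attacks.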
3.2 Kernel Shrinking as Path-level Selection
Let us consider the case of a single output channel, and abbreviate $K_n$ to $K$ for simplicity. We aim to find the optimal sub-kernel $\hat{K}$ of size $(\hat{d}, \hat{h}, \hat{w})$ as the substitute for the 3D kernel $K$. The original 3D kernels can thereby be effectively reduced to smaller sub-kernels, leading to a more flexible and efficient design.
Given a 3D kernel $K$, a natural design choice for kernel shrinking is to represent it as the summation of its sub-kernel and the remainder:

$K = K \odot \mathbb{1}_S + K \odot (1 - \mathbb{1}_S)$,  (4)

where we denote the index set of the sub-kernel $\hat{K}$ within the original 3D convolution as $S$. The remainder after removing the sub-kernel is $K \odot (1 - \mathbb{1}_S)$, where $\mathbb{1}_S$ is an index tensor whose entries are given by the indicator function $\mathbb{1}[\cdot \in S]$ and $\odot$ denotes element-wise multiplication. If the following inequality holds ($\epsilon$ is a small constant):

$\lVert K \odot (1 - \mathbb{1}_S) \rVert \, / \, Z < \epsilon$,  (5)

where the left-hand side can be deemed an importance measure with $Z$ as the norm constant, which is similar to unstructured weight pruning [14], yet here we aim for a structured sub-kernel, then we can remove the remainder and reduce the 3D kernel to its sub-kernel as a replacement:

$R(K) = K \odot \mathbb{1}_S$.  (6)
However, as in Fig. 2(a), even considering only kernel sizes in $\{1, 3\}$ along each axis, there are $2^3 = 8$ sub-kernel options for a $3 \times 3 \times 3$ kernel, which makes it impractical to find the optimal sub-kernel via manual design.
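Enumerating this option count is trivial (a sketch under the assumption that each axis independently keeps size 3 or shrinks to size 1):

```python
from itertools import product

# Sketch: enumerate candidate sub-kernel shapes of a 3x3x3 kernel when each
# axis may shrink to size 1 or stay at 3, as illustrated in Fig. 2(a).
shapes = list(product([1, 3], repeat=3))
print(len(shapes))  # 8 options per kernel
print(shapes)
```

With more allowed sizes or positions per axis, the per-kernel option count grows multiplicatively, and it compounds across channels and layers.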
Therefore, we provide a new perspective: we formulate this problem as path-level selection [20], i.e., we encode sub-kernels into a multi-path super-network and select the optimal path among them (Fig. 2(c)). The problem can then be solved by neural architecture search algorithms.
We first represent a 3D kernel as follows (Fig. 2(b)):

$K = \sum_{i=1}^{M} \lambda_i \hat{K}_i$, with $\hat{K}_i = K \odot \mathbb{1}_{S_i}$,  (7)

where $\lambda_i$ is the weight of the $i$-th sub-kernel $\hat{K}_i$, $\lambda_i \in \{0, 1\}$, $\sum_{i=1}^{M} \lambda_i = 1$, and $M$ is the number of candidate sub-kernels. With this formulation, the problem of finding the optimal sub-kernel of $K$ becomes finding an optimal set of $\{\lambda_i\}$ and then keeping the sub-kernel with the maximum $\lambda_i$. Due to the linearity of convolution, Eqn. (1) can then be derived as below:

$Y_n = X * K_n = \sum_{i=1}^{M} \lambda_{n,i} \, (X * \hat{K}_{n,i})$.  (8)

To solve for the path weights $\lambda$, we reformulate Eqn. (8) as an over-parameterized multi-path super-network, where each candidate path consists of a sub-kernel (Fig. 2). By relaxing the selection space, i.e., relaxing the conditions on $\lambda$ to be continuous, Eqn. (8) can be formulated as a differentiable NAS problem and optimized via gradient descent [20].
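The linearity argument behind Eqn. (8) can be checked numerically in one dimension (an illustrative numpy sketch; `lam` plays the role of the relaxed path weights):

```python
import numpy as np

# Sketch of the relaxation behind Eqn (8): by linearity of convolution, a
# soft mixture of candidate sub-kernels equals the mixture of their outputs.
rng = np.random.default_rng(1)
x = rng.standard_normal(10)
k1 = rng.standard_normal(3)        # candidate sub-kernel 1 (full size)
k2 = np.array([0.0, k1[1], 0.0])   # "shrunk" candidate (size-1 core, zero-padded)
lam = np.array([0.7, 0.3])         # relaxed (continuous) path weights

# Convolving with the mixed kernel ...
mixed_kernel_out = np.convolve(x, lam[0] * k1 + lam[1] * k2, mode="valid")
# ... equals mixing the per-candidate outputs.
mixed_output = lam[0] * np.convolve(x, k1, mode="valid") + \
               lam[1] * np.convolve(x, k2, mode="valid")
assert np.allclose(mixed_kernel_out, mixed_output)
```

This equivalence is what lets the super-network evaluate all candidates jointly and optimize the path weights by gradient descent.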
3.3 Channel-wise Shrinkage
While previous replacements [21, 31, 40] consist of homogeneous operations in the same layer, we argue that a more efficient replacement requires customized operations at each channel. As shown in Fig. 3, kernel shrinking in a channel-wise fashion can generate heterogeneous operations which extract diverse and complementary information within the same layer, thereby yielding a fine-grained and thus more efficient replacement (Fig. 3(d)) than prior methods that use layer-wise replacements (Fig. 3(a), (b) & (c)).
Contrary to previous layer-wise replacements, our core idea is to replace the 3D kernel at each channel individually; thus the target is to find the optimal sub-kernel $\hat{K}_n$ as the substitute for the $n$-th 3D kernel $K_n$:

$R(K_n) = \hat{K}_n$,  (9)

where the optimal size of the sub-kernel $(\hat{d}_n, \hat{h}_n, \hat{w}_n)$ is subject to $\hat{d}_n \le d$, $\hat{h}_n \le h$, $\hat{w}_n \le w$. Hence the computation incurred by Eqn. (1) can be largely reduced by the replacement above.

With our channel-wise replacement design, the original 3D kernels are substituted by a series of diverse and cheaper operations at different channels as follows (recall that $C_2$ is the output channel number):

$Y = \left[ X * \hat{K}_1;\; X * \hat{K}_2;\; \ldots;\; X * \hat{K}_{C_2} \right]$.  (10)
Benefiting from channel-wise shrinkage, our method provides a more general and flexible design for replacing 3D convolutions than previous approaches (Eqn. (2) and Eqn. (3)), and it can also be easily reduced to arbitrary alternatives (e.g., 2D, P3D) by integrating these operations into the set of candidate sub-kernels. An illustrative example can be found in Fig. 3.
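A minimal sketch of what channel-wise shrinkage produces (the mask helper and the example shapes are our own illustration): each output channel keeps a different sub-kernel shape, emulated here by zero-masking a full 3×3×3 kernel.

```python
import numpy as np

# Sketch of channel-wise shrinkage (Eqn (10)): each output channel keeps its
# own sub-kernel shape, emulated by masking a full 3x3x3 kernel.
def mask_for(shape, full=3):
    m = np.zeros((full, full, full))
    d, h, w = shape
    c = full // 2
    # An axis of size `full` is kept whole; an axis of size 1 collapses
    # to the center slice.
    sl = lambda s: slice(None) if s == full else slice(c, c + 1)
    m[sl(d), sl(h), sl(w)] = 1.0
    return m

# Heterogeneous per-channel choices: a 2D, a 1D, and a full 3D sub-kernel.
per_channel = [(1, 3, 3), (3, 1, 1), (3, 3, 3)]
masks = [mask_for(s) for s in per_channel]
print([int(m.sum()) for m in masks])  # [9, 3, 27]
```

The active-weight counts (9, 3, 27) show how heterogeneous choices let cheap channels subsidize the few channels that keep the full 3D kernel.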
3.4 Neural Architecture Search for an Efficient Replacement
As aforementioned, given the tremendous search space, it is impractical to manually find the optimal replacement for a 3D kernel through a trial-and-error process. It becomes even more intractable when the replacement is conducted in a channel-wise manner. Therefore, we propose to automate the process of learning an efficient replacement to fully exploit the redundancies in 3D convolution operations. Formulating kernel shrinkage as a path-level selection problem, we first construct a super-network where every candidate sub-kernel is encapsulated into a separate trainable branch (Fig. 2(c)) at each channel. Once the path weights are learned by differentiable NAS [20], the optimal path (sub-kernel) can be determined.
Search Space. A well-designed search space is crucial for NAS algorithms. We aim to answer the following questions: should the 3D convolution kernel be kept or replaced for each channel? If replaced, which operation should be deployed instead?
To address these questions, for each channel, we define a set $\mathcal{O}$ that contains all candidate sub-kernels (replacements) given a 3D kernel $K$:

$\mathcal{O} = \{ K \odot \mathbb{1}_{S_1},\; K \odot \mathbb{1}_{S_2},\; \ldots,\; K \odot \mathbb{1}_{S_M} \}$.  (11)

As the original 3D convolution kernel is a sub-kernel of itself, i.e., $K \in \mathcal{O}$, it can be kept in the final configuration. The final optimal operation is chosen from $\mathcal{O}$.
Another critical problem for NAS is how to reduce the search cost. To make the search cost affordable, we adopt a differentiable NAS paradigm where the model structure is discovered in a single pass of super-network training. Drawing inspiration from previous NAS methods, we directly use the scaling parameters in the normalization layers as the path weights of the multi-path super-network (Eqn. (8)) [10, 27]. Our goal is then equivalent to finding the optimal sub-network architecture based on the learned path weights. To this end, we introduce two search algorithms that prioritize either the performance or the computation cost of the sub-network, named performance-priority and cost-priority search, respectively.
Performance-Priority Search. As the name implies, the search is performed in a “performance-priority” manner, i.e., it maximizes performance by finding the optimal sub-kernels given the backbone architecture. During the search procedure, following [2, 3], we randomly pick an operation for each channel at each training iteration. This not only saves memory by activating and updating only one path per iteration, but also propels the path weights in the super-network to be decoupled. After the super-network is trained, the operation with the largest path weight is picked as the final choice for the given output channel:

$\hat{K}_n = \operatorname{argmax}_{\hat{K}_{n,i} \in \mathcal{O}} \; \lambda_{n,i}$.  (12)
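Operationally, the selection step in Eqn. (12) is just a per-channel argmax over the learned path weights (the toy values below are illustrative):

```python
import numpy as np

# Sketch of Eqn (12): after super-network training, pick the candidate
# sub-kernel with the largest path weight, independently per output channel.
lam = np.array([[0.1, 0.7, 0.2],    # channel 0 -> candidate 1
                [0.5, 0.1, 0.4]])   # channel 1 -> candidate 0
choices = lam.argmax(axis=1)
print(choices.tolist())  # [1, 0]
```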
Cost-Priority Search. Performance-priority search regards performance as the search priority and may neglect possible negative effects on the computation cost. To obtain more compact models, we introduce a “cost-priority” search method. Following [27], the outputs of the sub-kernels are concatenated and aggregated by the following convolution. To make the searched architecture more compact, we introduce a “cost-aware” penalty term: a lasso term on $\lambda$ is used as the penalty loss to push many path weights to near-zero values. The total training loss can then be written as:

$\mathcal{L} = \mathcal{L}_0 + \gamma \sum_{n=1}^{C_2} \sum_{i=1}^{M} c_i \, |\lambda_{n,i}|$,  (13)

where $c_i$ is a “cost-aware” term to balance the penalty, proportional to the parameter or FLOPs cost of the $i$-th sub-kernel. In Table 1, we also empirically show that this term leads to a more efficient architecture. The introduction of $c_i$ aims at penalizing “expensive” operations more, leading to a more efficient replacement. $\gamma$ is the coefficient of the penalty term, and $\mathcal{L}_0$ is the conventional training loss (e.g., cross-entropy loss combined with regularization terms such as weight decay).
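A sketch of the cost-aware penalty term (the relative costs and `gamma` below are illustrative placeholders, not the paper's settings):

```python
import numpy as np

# Sketch of the cost-aware lasso penalty in Eqn (13): path weights are
# penalized in proportion to each candidate's cost (FLOPs or parameters).
def cost_penalty(lam, costs, gamma=1e-3):
    # lam: (channels, candidates) path weights; costs: (candidates,)
    return gamma * np.sum(costs * np.abs(lam))

lam = np.array([[0.2, 0.5, 0.3]])
costs = np.array([1.0, 3.0, 9.0])  # e.g., 1D < 2D < 3D relative cost
print(cost_penalty(lam, costs))    # 1e-3 * (0.2 + 1.5 + 2.7) = 0.0044
```

Because expensive candidates accrue a larger penalty per unit of path weight, gradient descent is nudged toward cheap sub-kernels unless an expensive one clearly pays off in training loss.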
4 Experiments
4.1 3D Medical Image Segmentation
Dataset. We evaluate the proposed method on two public datasets: 1) the Pancreas Tumours dataset from the Medical Segmentation Decathlon Challenge (MSD) [35], which contains 282 cases with annotations of both pancreatic tumours and normal pancreas; and 2) the NIH Pancreas Segmentation dataset [33], consisting of 82 abdominal CT volumes. For the MSD dataset, we use 226 cases for training and evaluate segmentation performance on the remaining 56 cases. The resolution along the axial axis of this dataset is extremely low, and the number of slices can be as small as 37. For data preprocessing, all images are resampled to an isotropic resolution of 1.0 mm. For the NIH dataset, the resolution of each scan is $512 \times 512 \times L$, where $L$ is the number of slices along the axial axis, and the voxel spacing ranges from 0.5 mm to 1.0 mm. We test the model in a 4-fold cross-validation manner following previous methods [51].
Methods  Type  Params (M)  FLOPs (G)  Pancreas DSC (%)  Tumor DSC (%)  Average DSC (%) 
manual  11.29  97.77  79.16  43.02  61.09  
manual  22.50  188.48  80.34  47.57  63.96  
manual  13.16  112.88  45.27  62.82  
manual  7.56  67.53  79.77  42.73  61.25  
manual  11.29  97.77  80.09  46.17  63.13  
manual  11.41  99.17  79.82  45.27  62.55  
auto  80.32  45.57  62.95  
auto  11.29  97.77  80.05  48.51  64.28  
auto  11.26  99.68  80.12  
auto  9.72  87.16  80.34  47.95  64.15 
Implementation Details. For all experiments, C2FNAS [49] is used as the backbone architecture and the search is performed in the “performance-priority” manner unless otherwise specified. When replacing the operations, we keep the stem (the first two and the last two convolution layers) the same. For 3D medical images, for simplicity, we choose a set of the most representative sub-kernels as $\mathcal{O}$: the operation set contains conv1D ($3\times1\times1$, $1\times3\times1$, $1\times1\times3$) and conv2D ($1\times3\times3$, $3\times1\times3$, $3\times3\times1$) along different directions, and conv3D ($3\times3\times3$). For every 3D kernel at each output channel, a sub-kernel from $\mathcal{O}$ is chosen as the replacement. For NAS settings, we include both “performance-priority” and “cost-priority” search for performance comparison. For manual settings, we assign all candidate operations uniformly across the output channels.
Training stage. For the MSD dataset, we use random crop, random rotation by multiples of 90°, and flip along all three axes as data augmentation. The batch size is 8 on 4 GPUs. We use an SGD optimizer with a learning rate starting from 0.01 with polynomial decay of power 0.9, momentum of 0.9, and weight decay of 0.00004. The loss function is the summation of the Dice loss [28] and the cross-entropy loss. For the NIH dataset, the patch size is set following [52]. The found architecture is trained from scratch to ensure its effectiveness. Both the super-network and the found architecture are trained under the same settings as aforementioned. For the search stage with the “cost-priority” setting, a lasso term with coefficient $\gamma$ is applied to the path weights, further reweighted by the cost-aware term $c_i$ for 3D, 2D, and 1D operations respectively. After training finishes, the operation with the largest $\lambda$ is chosen as the final replacement of the 3D operation for each channel.

Testing stage. We test the network in a sliding-window manner, with dataset-specific patch sizes and strides for the MSD and NIH datasets. The result is measured with the Dice-Sørensen coefficient (DSC) metric, formulated as $\mathrm{DSC}(P, G) = \frac{2 |P \cap G|}{|P| + |G|}$, where $P$ and $G$ denote the prediction and ground-truth voxel sets for a foreground class. The DSC has a range of $[0, 1]$, with 1 implying a perfect prediction.

Method  Type  Params  Average DSC  Max DSC  Min DSC
Zhou et al. [51]  Manual  268.56M  82.37%  90.85%  62.43% 
Oktay et al. [29]  Manual  103.88M  83.10%     
Yu et al. [47]  Manual  268.56M  84.50%  91.02%  62.81% 
Zhu et al. [53]  Manual  20.06M  84.59%  91.45%  69.62% 
Zhu et al. [52]  Auto  29.74M  85.15%  91.18%  70.37% 
Auto  84.85%  91.61%  59.32%  
Auto  11.26M 
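The DSC metric reported in Tables 1 and 2 can be computed from binary voxel masks as follows (a minimal numpy sketch of the standard formula):

```python
import numpy as np

# Sketch of the Dice-Sørensen coefficient used for evaluation:
# DSC(P, G) = 2|P ∩ G| / (|P| + |G|) over binary voxel masks.
def dsc(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    # Convention: both masks empty counts as a perfect match.
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

p = np.array([[1, 1, 0], [0, 1, 0]])
g = np.array([[1, 0, 0], [0, 1, 1]])
print(dsc(p, g))  # 2*2/(3+3) ≈ 0.667
```

The empty-mask convention is an assumption on our part; implementations differ on that edge case.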
NAS Settings vs. Manual Settings. As can be observed from Table 1, even under manual settings, CAKES is already more efficient with only slightly inferior performance (e.g., compared with the 3D baseline, parameters drop from 22.50M to 11.29M and FLOPs drop from 188.48G to 97.77G, with a performance gap of 1.0%). Besides, CAKES under the manual settings outperforms its counterpart with standard convolution layers at the same model size, which indicates the benefits of our design. In addition, under NAS settings, the proposed search method further reduces the performance gap and even surpasses the original 3D model with much fewer parameters and computations, e.g., the model size is reduced from 22.50M to 11.26M and FLOPs drop from 188.48G to 99.68G, with a performance improvement of 0.46%. CAKES also yields superior performance (+1.60%) with a more compact model (11.26M vs. 13.16M), which further indicates the effectiveness of the proposed method.
Influence of the Search Space. From Table 1, we can see that with different search spaces, our proposed CAKES consistently outperforms its counterparts with standard 1D/2D/3D convolutions, leading to significant performance improvements (+1.70% and +1.15%) at the same model size and computation cost. Among different search spaces, the smallest configuration (7.56M params and 67.53G FLOPs) offers the most efficient model with slightly worse performance, while a larger one (11.29M params and 97.77G FLOPs) already surpasses the 3D baseline (22.50M params and 188.48G FLOPs) with half the parameters and computation cost. After enlarging the search space further, CAKES finds a configuration with even higher performance and efficiency (last two rows of Table 1).
Generalization to different backbone architectures. We also test our method on different backbone architectures. Applied to another state-of-the-art model, 3D ResDSN [53], our method consistently leads to a more efficient model with much fewer parameters (10.03M to 4.63M) and FLOPs (192.07G to 98.12G) and comparable performance (61.96% to 61.65%).
NIH Results. We compare the proposed CAKES with state-of-the-art methods in Table 2, where it can be observed that CAKES leads to a much more compact model than other alternatives; for instance, our model is over 20× smaller than [51] and [47]. It is worth noting that our model, even performed in a single-stage fashion, already outperforms many state-of-the-art methods conducted in a two-stage coarse-to-fine manner [51, 48, 53] on the NIH pancreas dataset with much fewer model parameters and FLOPs. It is also noteworthy that the applied architecture is searched on another dataset (MSD), where the images are collected under different protocols and have different resolutions. This result indicates the generalizability of our searched model. By directly applying the architecture searched on the MSD dataset, our method also outperforms [52], which was searched directly on the NIH dataset, with fewer than half the parameters of [52].
4.2 Action Recognition in Videos
Dataset. Something-Something V1 [11] is a large-scale action recognition dataset which requires comprehensive temporal modeling. It contains about 110k videos of 174 classes with diverse objects, backgrounds, and viewpoints.
Implementation Details. We adopt ResNet-50 [15] with weights pretrained on ImageNet [17] as our backbone. The 3D convolution weights are initialized by repeating the 2D kernel 3 times along the temporal dimension following [4], while the 1D convolution weights are initialized by averaging the 2D kernel over the spatial dimensions and then repeating it 3 times along the temporal axis. For the temporal dimension, we use the sparse sampling method as in TSN [41]. For the spatial dimensions, the short side of the input frames is resized to 256 and then cropped to $224 \times 224$.

Training stage. We use random cropping and flipping as data augmentation. We train the network with a batch size of 96 on 8 GPUs with an SGD optimizer. The learning rate starts from 0.04 for the first 50 epochs and decays by a factor of 10 every 10 epochs afterwards. The total number of training epochs is 70. We also set the dropout ratio to 0.3 following [43]. The training settings remain the same for both the final network and the search stage, except that when searching with “cost-priority”, we use a lasso term whose coefficient is reweighted for 3D, 2D, and 1D operations respectively.

Testing stage. We sample the middle frame in each segment and apply a center crop to each frame. We report single-crop results unless otherwise specified.
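The 2D-to-3D weight inflation described above can be sketched as follows (following the scheme of [4]; dividing by the number of temporal repeats to preserve the response scale is an assumption on our part):

```python
import numpy as np

# Sketch of inflating a pretrained 2D kernel into a 3D kernel: repeat it
# along the temporal axis and divide by the number of repeats, so a
# temporally constant input produces the same response as the 2D kernel.
def inflate_2d_to_3d(k2d, t=3):
    # k2d: (C_out, C_in, h, w) -> inflated kernel (C_out, C_in, t, h, w)
    return np.repeat(k2d[:, :, None, :, :], t, axis=2) / t

k2d = np.random.rand(4, 2, 3, 3)
k3d = inflate_2d_to_3d(k2d)
print(k3d.shape)  # (4, 2, 3, 3, 3)
# Summing over the temporal axis recovers the original 2D kernel.
assert np.allclose(k3d.sum(axis=2), k2d)
```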
Model  Type  Params (M)  FLOPs (G)  top1  top5 
C2D  manual  23.9  33.0  17.2  43.1 
P3D  manual  27.6  37.9  44.8  74.6 
C3D  manual  46.5  62.6  46.8  75.3 
manual  46.2  75.2  
manual  35.2  47.7  46.8  76.0  
auto  21.2  29.5  46.3  74.9  
auto  37.5  50.7  
auto  33.5  43.9  47.2  75.7  
auto  20.5  29.3  46.8  76.0  
auto  35.7  41.4  46.9  75.6  
auto  35.0  38.7  46.9  75.5 
Model  Backbone  #Frame  FLOPs  #Param.  top1  top5 

TSN [41]  ResNet50  8  33G  24.3M  19.7  46.6 
TRN2stream [50]  BNInception  8+8    36.6M  42.0   
ECO [54]  BNIncep+3D Res18  8  32G  47.5M  39.6   
ECO [54]  BNIncep+3D Res18  16  64G  47.5M  41.4   
[54]  BNIncep+3D Res18  92  267G  150M  46.4   
I3D [4]  3D ResNet50  32×2 clips  153G×2  28.0M  41.6  72.2
NL I3D [42]  3D ResNet50  32×2 clips  168G×2  35.3M  44.4  76.0
NL I3D+GCN [43]  3D ResNet50+GCN  32×2 clips  303G×2  62.2M  46.1  76.8
TSM [18]  ResNet50  8  33G  24.3M  45.6  74.2 
TSM [18]  ResNet50  16  65G  24.3M  47.2  77.1 
S3D [45]  BNInception  64  66.38G    47.3  78.1 
S3DG [45]  BNInception  64  71.38G    48.2  78.7 
ResNet50  8  29.3G  20.5M  46.8  76.0  
ResNet50  8  50.7G  37.5M  47.4  76.1  
ResNet50  8  43.9G  33.5M  47.2  75.7  

ResNet50  16  58.6G  20.5M  48.0  78.0 
ResNet50  16  101.4G  37.5M  48.6  78.6  
ResNet50  16  87.8G  33.5M  49.4  78.4 
Ablation Study. We study the impact of different operation sets and manual/auto configurations. The results are summarized in Table 3. Considering the spatial-temporal property of video data, we study the following operation sets: (1) spatial 2D convolution and temporal 1D convolution; (2) spatial 2D convolution and 3D convolution; (3) spatial 2D, temporal 1D, and 3D convolutions.
Operation set with 1D & 2D sub-kernels. As shown in Table 3, the searched model surpasses the 2D baseline by a large margin (46.8% vs. 17.2%) while having a smaller model size (20.5M vs. 23.9M). This suggests that TSN [41] lacks the ability to capture temporal information, so replacing some of the 2D operations with temporal 1D operations significantly increases performance while reducing model size. Besides, it also surpasses P3D, where each 2D convolution is followed by a temporal 1D convolution, with a significant advantage in both performance (46.8% vs. 44.8%) and model cost (20.5M params and 29.3G FLOPs vs. 27.6M params and 37.9G FLOPs), indicating that it makes better use of redundancies in the network than P3D. Therefore, an operation set containing 1D and 2D sub-kernels can be an ideal design when looking for efficient video understanding networks.
Operation set with 2D & 3D sub-kernels. We examine how this operation set balances the trade-off between performance and model cost. From Table 3, under the “cost-priority” setting, CAKES yields a much more compact model (35.7M params and 41.4G FLOPs) with performance comparable to C3D. Under the “performance-priority” setting, CAKES finds a slightly larger model (37.5M params and 50.7G FLOPs), yet its performance improves significantly to 47.4%.
Operation set with 1D & 2D & 3D sub-kernels. Compared to the 2D & 3D setting, the cost-priority model shows slightly inferior performance (-0.5%) with much fewer FLOPs (38.7G vs. 50.7G). Besides, under the “performance-priority” setting, CAKES produces comparable performance with much less computation cost (43.9G vs. 50.7G). This result indicates that with a more general search space (i.e., 1D, 2D, and 3D), the proposed CAKES can find more flexible designs, which lead to a better performance/efficiency trade-off.
Results. A comparison with other state-of-the-art methods is shown in Table 4. We report model performance under both 8-frame and 16-frame settings. Sampling only 8 frames, our model already outperforms most current methods. With fewer parameters and FLOPs, it surpasses complex models such as non-local networks [42] with graph convolutions [43]. Compared with other efficient video understanding frameworks such as ECO [54] and TSM [18], our model is not only more lightweight (58.6G vs. 64G/65G FLOPs) but also delivers better performance (48.0% vs. 41.4%/47.2%). Our best model achieves a new state-of-the-art performance of 49.4% top-1 accuracy with a moderate model size. An interesting finding is that although the variant searched from the full 1D/2D/3D space shows performance similar to the other variants with 8-frame inputs, it achieves much higher accuracy in the 16-frame scenario, demonstrating that a more general search space also yields stronger transferability.
4.3 Analysis
Cost-Priority Architectures. We plot the architectures found on medical data and video data in Fig. 4. For the architecture searched on the Something-Something dataset, we note that the algorithm prefers efficient 2D operations at the bottom of the network and favors 3D operations at the top. This implies that the search algorithm successfully finds that temporal information extracted from high-level features is more useful, which coincides with the observation in [45]. For the architecture found on the MSD dataset, we calculated the number of operations computed along each of the three data dimensions and found them to be roughly balanced. This suggests that the searched model, unlike the network searched for videos, tends to treat each dimension equally, which aligns with the property of volumetric medical data. In addition, counting the numbers of 1D, 2D, and 3D operations shows that the efficient 1D/2D operations are preferred.
Performance-Priority Architectures. As shown in Fig. 4(b), for , the pure temporal sub-kernel () is rarely chosen in the shallow layers of the network, while it plays a more important role as the network goes deeper. This observation agrees with our previous finding: temporal information extracted from high-level features is more useful. For on medical images, shown in Fig. 4(a), we count each type of sub-kernel and the computation along the three axes. The numbers of 1D, 2D, and 3D sub-kernels are , , and , and the numbers of operations computed along the three data dimensions are , , and , which again coincides with our earlier finding that tends to treat each dimension equally for symmetric volumetric data. Compared to the cost-priority , we notice that the performance-priority favors an operation set with more 3D sub-kernels, which provides larger model capacity.
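Channel-wise shrinking, the mechanism these searched architectures instantiate, can be viewed as zeroing out parts of a standard 3D kernel so that each output channel keeps only a 1D, 2D, or 3D sub-kernel. A minimal numpy sketch (the per-channel assignment and sub-kernel placements here are arbitrary illustrations, not the searched configurations):

```python
import numpy as np

def shrink_kernel(weight, assignment):
    """Zero out entries of a 3D kernel so each output channel retains
    only its assigned sub-kernel shape.
    weight: (out_ch, in_ch, kd, kh, kw)
    assignment: per output channel, one of '3d', '2d' (central spatial
    slice), or '1d' (central axial column)."""
    out = weight.copy()
    kd, kh, kw = weight.shape[2:]
    for c, kind in enumerate(assignment):
        mask = np.zeros((kd, kh, kw))
        if kind == '3d':
            mask[:] = 1.0
        elif kind == '2d':                  # keep the central kh x kw slice
            mask[kd // 2, :, :] = 1.0
        elif kind == '1d':                  # keep the central axial column
            mask[:, kh // 2, kw // 2] = 1.0
        out[c] *= mask
    return out

w = np.ones((4, 8, 3, 3, 3))               # toy 3x3x3 kernel, 8 -> 4 channels
shrunk = shrink_kernel(w, ['3d', '2d', '2d', '1d'])
# Remaining weights per channel: 8*27, 8*9, 8*9, 8*3.
print([int(np.count_nonzero(shrunk[c])) for c in range(4)])
```

Each sub-kernel remains a special case of the full 3D kernel, so a layer mixing all three types still computes a single (sparse) 3D convolution, which is what allows the diverse operations to coexist within one layer.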
5 Conclusions
As an important solution to various 3D vision applications, 3D networks still suffer from over-parameterization and heavy computation. How to design efficient alternatives to 3D operations remains an open problem. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES), where standard 3D convolution kernels are shrunk into efficient sub-kernels at the channel level to obtain efficient 3D models. Moreover, by formulating kernel shrinkage as a path-level selection problem, our method can automatically explore the redundancy in 3D convolutions and optimize the replacement configuration. Applied to different backbone models, the proposed CAKES significantly outperforms previous 2D/3D/P3D counterparts and other state-of-the-art methods on both 3D medical image segmentation and action recognition from videos.
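The path-level selection referred to above is made differentiable in the DARTS style [20]: each candidate sub-kernel's output is weighted by a softmax over architecture parameters, so the selection can be trained end-to-end. A toy numpy sketch of this mixing step, with scalar stand-in "operations" rather than real convolutions:

```python
import numpy as np

def mixed_op(x, ops, alpha):
    """Differentiable selection over candidate operations: the output is
    a softmax(alpha)-weighted sum, so the architecture parameters alpha
    can be optimized jointly with the network weights."""
    w = np.exp(alpha - alpha.max())          # numerically stable softmax
    w = w / w.sum()
    return sum(wi * op(x) for wi, op in zip(w, ops))

# Toy candidates standing in for 1D/2D/3D sub-kernel operations.
ops = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x]
alpha = np.array([0.0, 0.0, 10.0])           # after training, one path dominates
y = mixed_op(np.ones(2), ops, alpha)
print(np.round(y, 3))                        # close to 3.0: the third op is selected
```

When the architecture parameters are peaked, the mixture collapses onto a single path, which is then kept as the final replacement; this is what allows the configuration to be determined in one shot during end-to-end training.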
References
 [1] (2017) Designing neural network architectures using reinforcement learning. ICLR.
 [2] (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 550–559.
 [3] (2018) SMASH: one-shot model architecture search through hypernetworks. ICLR.
 [4] (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pp. 6299–6308.
 [5] (2019) Progressive differentiable architecture search: bridging the depth gap between search and evaluation. ICCV.
 [6] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1251–1258.
 [7] (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In MICCAI.
 [8] (2019) NAS-FPN: learning scalable feature pyramid architecture for object detection. In CVPR, pp. 7036–7045.
 [9] (2018) Parallel separable 3D convolution for video and volumetric data understanding. BMVC.
 [10] (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In CVPR, pp. 1586–1595.
 [11] (2017) The "something something" video database for learning and evaluating visual common sense. In ICCV, Vol. 1, pp. 5.
 [12] (2019) Single path one-shot neural architecture search with uniform sampling. arXiv preprint arXiv:1904.00420.
 [13] (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR.
 [14] (2015) Learning both weights and connections for efficient neural networks. In NeurIPS, pp. 1135–1143.
 [15] (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
 [16] (2017) Channel pruning for accelerating very deep neural networks. In ICCV, pp. 1389–1397.
 [17] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105.
 [18] (2019) TSM: temporal shift module for efficient video understanding. In ICCV, pp. 7083–7093.
 [19] (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In CVPR, pp. 82–92.
 [20] (2019) DARTS: differentiable architecture search. ICLR.
 [21] (2017) 3D anisotropic hybrid network: transferring convolutional features from 2D images to 3D anisotropic volumes. MICCAI.
 [22] (2018) 3D anisotropic hybrid network: transferring convolutional features from 2D images to 3D anisotropic volumes. In MICCAI, pp. 851–858.
 [23] (2017) Learning efficient convolutional networks through network slimming. In ICCV, pp. 2736–2744.
 [24] (2019) Rethinking the value of network pruning. ICLR.
 [25] (2019) Grouped spatial-temporal aggregation for efficient action recognition. In ICCV.
 [26] (2017) ThiNet: a filter level pruning method for deep neural network compression. In ICCV, pp. 5058–5066.
 [27] (2020) AtomNAS: fine-grained end-to-end neural architecture search. ICLR.
 [28] (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pp. 565–571.
 [29] (2018) Attention U-Net: learning where to look for the pancreas. MIDL.
 [30] (2018) Efficient neural architecture search via parameter sharing. ICML.
 [31] (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV, pp. 5533–5541.
 [32] (2019) Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789.
 [33] (2015) DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation. In MICCAI.
 [34] (2019) AssembleNet: searching for multi-stream neural connectivity in video architectures. arXiv preprint arXiv:1905.13209.
 [35] (2019) A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063.
 [36] (2019) Single-Path NAS: designing hardware-efficient ConvNets in less than 4 hours. arXiv preprint arXiv:1904.02877.
 [37] (2016) Convolutional neural networks for medical image analysis: full training or fine tuning? TMI 35 (5), pp. 1299–1312.
 [38] (2015) Learning spatiotemporal features with 3D convolutional networks. In ICCV, pp. 4489–4497.
 [39] (2019) Video classification with channel-separated convolutional networks. In ICCV, pp. 5552–5561.
 [40] (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, pp. 6450–6459.
 [41] (2016) Temporal segment networks: towards good practices for deep action recognition. In ECCV, pp. 20–36.
 [42] (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
 [43] (2018) Videos as space-time region graphs. In ECCV, pp. 399–417.
 [44] (2017) Genetic CNN. In ICCV, pp. 1379–1388.
 [45] (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, pp. 305–321.
 [46] (2020) PC-DARTS: partial channel connections for memory-efficient differentiable architecture search. ICLR.
 [47] (2018) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR, pp. 8280–8289.
 [48] (2018) Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR.
 [49] (2020) C2FNAS: coarse-to-fine neural architecture search for 3D medical image segmentation. CVPR.
 [50] (2018) Temporal relational reasoning in videos. In ECCV, pp. 803–818.
 [51] (2017) A fixed-point model for pancreas segmentation in abdominal CT scans. In MICCAI, pp. 693–701.
 [52] (2019) V-NAS: neural architecture search for volumetric medical image segmentation. 3DV.
 [53] (2018) A 3D coarse-to-fine framework for automatic pancreas segmentation. 3DV.
 [54] (2018) ECO: efficient convolutional network for online video understanding. In ECCV, pp. 695–712.
 [55] (2017) Neural architecture search with reinforcement learning. ICLR.
 [56] (2018) Learning transferable architectures for scalable image recognition. In CVPR, pp. 8697–8710.