This repository contains the code for our AAAI2021 paper CAKES: Channel-wise Automatic KErnel Shrinking for Efficient 3D Networks.
3D Convolutional Neural Networks (CNNs) have been widely applied to 3D scene understanding, such as video analysis and volumetric image recognition. However, 3D networks can easily lead to over-parameterization, which incurs expensive computation cost. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES) to enable efficient 3D learning by shrinking standard 3D convolutions into a set of economic operations (e.g., 1D, 2D convolutions). Unlike previous methods, CAKES performs channel-wise kernel shrinkage, which enjoys the following benefits: 1) encouraging operations deployed in every layer to be heterogeneous, so that they can extract diverse and complementary information to benefit the learning process; and 2) allowing for an efficient and flexible replacement design, which can be generalized to both spatial-temporal and volumetric data. Together with a neural architecture search framework, by applying CAKES to 3D C2FNAS and ResNet50, we achieve state-of-the-art performance with far fewer parameters and lower computational cost on both 3D medical imaging segmentation and video action recognition.
3D learning has attracted more and more research attention with the recent advance of deep neural networks. However, conventional 3D convolution layers typically result in expensive computation and suffer from convergence problems due to over-fitting issues and lack of pre-trained weights [4, 37].
To resolve the redundancy in 3D convolutions, much effort has gone into designing efficient alternatives. For instance, [31, 40] propose to factorize the 3D kernel and replace the 3D convolution with P3D and (2+1)D convolution, where 2D and 1D convolution layers are applied in a structured manner. Xie et al. suggest that replacing 3D convolutions with low-cost 2D convolutions at the bottom of the network significantly improves recognition efficiency.
Despite their effectiveness for spatial-temporal information extraction, there are several limitations of existing 3D alternatives. Firstly, these methods (e.g., P3D) are specifically tailored to video datasets, where data can be explicitly separated into time and space. However, for volumetric data such as CT/MRI where all three dimensions should be treated equally, conventional spatial-temporal operators can lead to biased information extraction. Secondly, existing operations are still insufficient even for spatial-temporal data since they may exhibit certain levels of redundancy either along the temporal or the spatial dimension, as empirically suggested in . Finally, existing replacements are manually designed, which can be time-consuming and may not lead to optimal results.
To address these issues, we introduce Channel-wise Automatic KErnel Shrinking (CAKES) as a general framework to automatically determine an efficient replacement for existing 3D operations. Specifically, the proposed method simplifies conventional 3D operations by adopting a combination of diverse and economic operations (e.g., 1D, 2D convolutions), where these different operators can extract complementary information to be utilized in the same layer. Unlike previous methods, by shrinking standard 3D kernels in a channel-wise fashion, our approach is not tailored to any specific type of input (e.g., videos), but can be generalized to different types of data and backbone architectures to learn a fine-grained and efficient replacement. Moreover, we provide a new perspective: formulating kernel shrinking as a path-level selection problem, which can then be solved by Neural Architecture Search (NAS). To accelerate the search for an optimal replacement configuration given the tremendous search space, we relax the selection of operations to be differentiable, so that the replacement can be determined in a one-shot manner during end-to-end training.
The proposed search algorithm delivers high-performance and efficient models. As shown in Fig. 1, evaluated on both 3D medical image segmentation and video action recognition tasks, our method achieves a 64.15% average Dice score on the MSD pancreas dataset with 9.72M parameters and 87.16G FLOPs, and 47.4% top-1 accuracy on the Something-Something V1 dataset with 37.5M parameters and 50.7G FLOPs. Compared with its 3D baseline, CAKES not only shows superior performance but also significantly reduces the model size (56.80% less on medical and 19.35% less on video) and the computational cost (53.76% less on medical and 19.01% less on video). The proposed method surpasses its 2D/3D/P3D counterparts and achieves state-of-the-art performance.
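The medical-imaging reduction figures can be checked with simple arithmetic. The short sketch below assumes the 3D baseline values of 22.50M parameters and 188.48G FLOPs that the text reports for the MSD experiments; the helper name `relative_reduction` is ours and purely illustrative.

```python
def relative_reduction(baseline: float, reduced: float) -> float:
    """Return the percentage by which `reduced` is smaller than `baseline`."""
    return (baseline - reduced) / baseline * 100.0

# Figures quoted in the text for the MSD pancreas experiments.
baseline_params_m, cakes_params_m = 22.50, 9.72
baseline_flops_g, cakes_flops_g = 188.48, 87.16

param_saving = relative_reduction(baseline_params_m, cakes_params_m)
flop_saving = relative_reduction(baseline_flops_g, cakes_flops_g)
print(f"params: {param_saving:.2f}% fewer, FLOPs: {flop_saving:.2f}% fewer")
```

Both values agree with the 56.80% and 53.76% reductions stated above.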
Our contributions are fourfold:
(1) We propose a novel method to shrink 3D kernels into heterogeneous yet complementary efficient counterparts at a fine-grained level, which leads to more efficient and flexible alternatives to 3D convolution without sacrificing accuracy.
(2) We propose channel-wise kernel shrinkage, which yields a generic and flexible replacement that can be applied to both spatial-temporal and volumetric data with different backbone architectures.
(3) By formulating kernel shrinkage as a path-level selection problem and relaxing the selection of operations to be differentiable, the replacement can be determined in a one-shot manner using NAS as the search platform.
(4) By applying CAKES to different backbone models, we achieve state-of-the-art performance while being much more efficient on both volumetric medical data and video data compared with their 2D/3D/P3D counterparts.
Despite the great advances of 3D CNNs [4, 7, 38], existing 3D networks usually require heavy computational budget. Besides, 3D CNNs also suffer from unstable training due to lack of pre-trained weights [4, 21, 37]. These facts have motivated researchers to find efficient alternatives to 3D convolutions. For example, it is suggested in [25, 39] to apply group convolution  and depth-wise convolution  to 3D networks to obtain resource-efficient models. Another type of approach suggests replacing each 3D convolution layer with a structured combination of 2D and 1D convolution layers to achieve better performance while being more efficient. For instance, [31, 40] propose to use a 2D spatial convolution layer followed by a 1D temporal convolution layer to replace a standard 3D convolution layer. Besides, Xie et al.  demonstrate that 3D convolutions are not needed everywhere and some of them can be replaced by 2D counterparts. Similar attempts also occur in the medical imaging area . Gonda et al.  try to replace consecutive 3D convolution layers through consecutive 2D convolution layers followed by a 1D convolution layer.
Our method differs from these methods in two ways: (1) instead of applying homogeneous operations to all channels, we assign complementary heterogeneous operations at the channel level; and (2) instead of manual design through trial and error, we design a neural architecture search method to automatically optimize the replacement configuration.
Network pruning methods investigate the redundancy in deep models by selecting the important connections (according to a certain criterion) to obtain more compact models. Based on the granularity level, network pruning methods can be divided into the following categories: 1) unstructured weight pruning, which aims at identifying and removing unimportant weights inside networks. For instance, Han et al. [13, 14] propose to prune network weights with small magnitude. These methods typically remove weights and connections in an unstructured manner, making it hard to have real speedup without dedicated hardware or libraries; 2) structured pruning methods, which prune at the level of channel or even layers [16, 23, 26], and achieve empirical acceleration on modern computing devices.
In this paper, we study structured kernel shrinking—to shrink a 3D convolution kernel to its optimal sub-kernel. This direction has been rarely studied before. We note our method is similar to weight pruning but unlike previous methods, we perform it in a structured manner, so that the deployment of the shrunk kernels can lead to efficient models.
Neural Architecture Search (NAS) aims at automatically discovering better network architectures than human-designed ones. It has proved successful not only in 2D natural image recognition, but also in other tasks such as segmentation and detection. Beyond natural images, there have also been attempts on other data formats such as video and 3D medical images [49, 52]. Earlier NAS algorithms are based on either reinforcement learning [1, 55, 56] or evolutionary algorithms [32, 44]. However, these methods often require training each network candidate from scratch, and the intensive computational cost hampers their use, especially under a limited computational budget. Since the parameter sharing scheme was proposed, more and more search methods, such as differentiable NAS approaches [5, 20, 46] and one-shot NAS approaches [3, 12, 36], have investigated how to reduce the search cost to several GPU days or even several GPU hours.
Moreover,  have discussed the relationship between network pruning and NAS, and [10, 27] successfully borrow ideas from network pruning and design more efficient search methods. Some NAS methods [27, 36] also incorporate the kernel size into the search space. Nevertheless, most of them only consider simple cases with choices among , , , while we consider much more diverse and general kernel deployment across different channels in 3D settings.
We first revisit 3D convolutions and existing alternatives. Without loss of generality, let $X$ of size $C_i \times D \times H \times W$ denote the input tensor, where $C_i$ stands for the input channel number, and $D$, $H$, $W$ represent the spatial depth (or temporal length), the spatial height, and the spatial width, respectively. The weights of the corresponding 3D kernel are denoted as $\mathcal{W} \in \mathbb{R}^{C_o \times C_i \times K_D \times K_H \times K_W}$, where $C_o$ is the output channel number and $K_D$, $K_H$, $K_W$ denote the kernel size. Therefore, the output tensor $Y$ of shape $C_o \times D \times H \times W$ can be derived as follows:
$$Y_c = \mathcal{W}_c * X, \qquad (1)$$
where $*$ denotes convolution and $c$ is the output channel index, i.e., $c \in \{1, \dots, C_o\}$.
As Eqn. (1) suggests, the computation overhead of 3D convolutions is significantly heavier than 2D counterparts. As a consequence, the expensive computation and over-parameterization induced by 3D deep networks impede the scalability of network capacity. Recently, there are many works seeking to alleviate the high demand of 3D convolutions. One common strategy is to decouple the spatial and temporal components [31, 40]. The underlying assumption here is that the spatial and temporal kernels are orthogonal to each other, and therefore can effectively extract complementary information from different dimensions. Another option is to discard 3D convolutions and simply use 2D operations instead . Mathematically speaking, these replacements can be written as:
where indicates the replacement operation. Similar ideas also occur in 3D medical image analysis, where the images are volumetric data. For instance, it is shown in  that using 2D convolutions in encoder and replacing 3D convolutions with P3D operations in decoder not only largely reduce the computation overhead but also improve the performance over the traditional 3D networks.
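To make the cost gap between these operators concrete, the following sketch counts weights and multiply-accumulates for a full 3D convolution, a spatial 2D convolution, and a 2D-then-1D factorization. The channel widths, output size, and the choice of keeping the intermediate channel count equal to the output channels are illustrative assumptions, not values from the paper.

```python
def conv_cost(c_in, c_out, kernel, out_size):
    """Weights and multiply-accumulates of one bias-free convolution layer."""
    kd, kh, kw = kernel
    d, h, w = out_size
    params = c_out * c_in * kd * kh * kw
    macs = params * d * h * w
    return params, macs

c_in = c_out = 64            # hypothetical channel widths
out_size = (8, 56, 56)       # hypothetical output depth, height, width

full_3d = conv_cost(c_in, c_out, (3, 3, 3), out_size)
spatial_2d = conv_cost(c_in, c_out, (1, 3, 3), out_size)
temporal_1d = conv_cost(c_out, c_out, (3, 1, 1), out_size)
# 2D spatial convolution followed by a 1D temporal convolution.
factorized = tuple(a + b for a, b in zip(spatial_2d, temporal_1d))

for name, (params, macs) in [("3D", full_3d), ("2D", spatial_2d), ("2D+1D", factorized)]:
    print(f"{name:6s} params={params:7d} MACs={macs / 1e9:.2f}G")
```

Under these assumptions the 2D layer needs one third of the 3D weights and the factorized pair needs four ninths, which is the kind of saving the replacements above aim for.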
Though these methods have furthered the model efficiency compared with standard 3D convolutions, there are several limitations to be tackled. On the one hand, as shown in Eqn. (2), decomposing the kernels into orthogonal 2D and 1D components is specifically designed to extract spatial-temporal information, which may not well generalize to volumetric data. On the other hand, directly replacing 3D kernels with 2D operators (Eqn. (3)) cannot effectively capture information along the third dimension. To address these issues, we propose Channel-wise Automatic KErnel Shrinking (CAKES), to automatically search for an efficient 3D replacement which can deal with any general type of input. Our core idea is to shrink standard 3D kernels into a set of cheaper 1D, 2D, and 3D components. Besides, the shrunk kernels are channel-specific, which we refer to as Channel-wise Shrinkage. Next we will describe how to formulate kernel shrinking to path-level selection  in Sec. 3.2. The details of channel-wise shrinkage and search method will be elaborated in Sec. 3.3 and Sec. 3.4.
Let’s consider the case for only one channel, and abbreviate to for simplicity. We aim to find the optimal sub-kernel (,,) as the substitute for 3D kernel . Therefore, the original 3D kernels can be effectively reduced to smaller sub-kernels, leading to a more flexible and efficient design.
Given a 3D kernel , a natural design choice for kernel shrinking is to represent it as the summation of its sub-kernel and the remainder:
where we denote the index set of within the original 3D convolution as . The remainder after removing sub-kernel is . is a index tensor, and is the indicator function. If the following inequality holds ( is a small constant):
where the left-hand side can be deemed as an importance measure with as the norm constant, which is similar to unstructured weight pruning , yet here we aim for a structured sub-kernel, then we can remove the remainder and reduce the 3D kernel to its sub-kernel as a replacement:
However, as in Fig. 2(a), even considering only kernel sizes, there are sub-kernel options for a 3D kernel, which makes it impractical to find the optimal sub-kernel via manual designs.
Therefore, we provide a new perspective—to formulate this problem as path-level selection , i.e., to encode sub-kernels into a multi-path super-network and select the optimal path among them (Fig. 2(c)). Then this problem can be solved by neural architecture search algorithms.
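As a rough illustration of the shrinking criterion, the sketch below enumerates a small family of candidate sub-kernels of a 3×3×3 kernel and measures how much of the kernel's norm is left in the remainder. Restricting candidates to centered sub-kernels whose axes are either kept at 3 or collapsed to 1, and using the L1 norm ratio as the importance measure, are our assumptions for this example; the names `centered_mask` and `remainder_ratio` are hypothetical.

```python
import itertools
import numpy as np

def centered_mask(full_size, sub_size):
    """Binary mask selecting a centered sub-kernel inside a full 3D kernel."""
    mask = np.zeros(full_size, dtype=bool)
    slices = tuple(
        slice((f - s) // 2, (f - s) // 2 + s) for f, s in zip(full_size, sub_size)
    )
    mask[slices] = True
    return mask

def remainder_ratio(kernel, mask, p=1):
    """Share of the kernel's L_p norm that lies outside the sub-kernel."""
    remainder = np.where(mask, 0.0, kernel)
    return np.linalg.norm(remainder.ravel(), p) / np.linalg.norm(kernel.ravel(), p)

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3, 3))          # one hypothetical 3x3x3 kernel slice

# Candidate sizes: every axis is either collapsed to 1 or kept at 3.
candidates = list(itertools.product((1, 3), repeat=3))
ratios = {size: remainder_ratio(kernel, centered_mask(kernel.shape, size))
          for size in candidates}

for size, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(size, f"{ratio:.3f}")
```

A sub-kernel with a small ratio discards little of the kernel, so it is a plausible replacement. Checking every candidate by hand for every kernel quickly becomes infeasible, which motivates the search formulation.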
We first represent a 3D kernel as follows (Fig. 2(b)):
where is the weight of -th sub-kernel , , , . With this formulation, the problem of finding the optimal sub-kernel of becomes finding an optimal set of and then keeping the sub-kernel with maximum . Due to the linearity of convolution, Eqn. (1) can then be derived as below:
To solve for the path weights, we reformulate Eqn. (8) as an over-parameterized multi-path super-network, where each candidate path consists of a sub-kernel (Fig. 2). By relaxing the selection space, i.e., relaxing the conditions on the path weights to be continuous, Eqn. (8) can then be formulated as a differentiable NAS problem and optimized via gradient descent.
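The multi-path view relies on the linearity of convolution: convolving with a weighted sum of sub-kernels gives the same result as summing the weighted outputs of separate paths. The NumPy sketch below checks this numerically for a single channel. The two zero-padded sub-kernels and the path weights in `alphas` are arbitrary illustrative choices, and `correlate3d` is a helper written for this example (it needs NumPy 1.20 or later for `sliding_window_view`).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def correlate3d(volume, kernel):
    """Valid-mode 3D cross-correlation of a single-channel volume."""
    windows = sliding_window_view(volume, kernel.shape)   # (d', h', w', kd, kh, kw)
    return np.einsum("dhwijk,ijk->dhw", windows, kernel)

rng = np.random.default_rng(0)
volume = rng.normal(size=(6, 6, 6))
kernel = rng.normal(size=(3, 3, 3))

# Two hypothetical zero-padded sub-kernels of the same 3x3x3 kernel:
# the centre depth slice (a 2D kernel) and the centre column along depth (a 1D kernel).
plane = np.zeros_like(kernel)
plane[1] = kernel[1]
column = np.zeros_like(kernel)
column[:, 1, 1] = kernel[:, 1, 1]
alphas = np.array([0.7, 0.3])        # illustrative path weights

# One convolution with the merged kernel versus one convolution per path.
merged = correlate3d(volume, alphas[0] * plane + alphas[1] * column)
per_path = alphas[0] * correlate3d(volume, plane) + alphas[1] * correlate3d(volume, column)
print(np.allclose(merged, per_path))
```

Because the two computations agree, each sub-kernel can be placed in its own branch of a super-network and the weights on the branches can be learned jointly.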
While previous replacements [21, 31, 40] consist of homogeneous operations in the same layer, we argue that a more efficient replacement requires customized operations at each channel. As shown in Fig. 3, kernel shrinking in a channel-wise fashion can generate heterogeneous operations which extract diverse and complementary information to be employed in the same layer, and thereby yields a fine-grained and thus more efficient replacement (Fig. 3(d)) than prior methods which use layer-wise replacements (Fig. 3(a) & (b) & (c)).
Contrary to previous layer-wise replacement, our core idea is to replace 3D kernel at each channel individually, thus the target is to find the optimal sub-kernel as the substitute for the -th 3D kernel :
where the optimal size of the sub-kernel () is subjected to , , . Hence the computation incurred by Eqn. (1) can be largely reduced by our replacement as above.
With our channel-wise replacement design, the original 3D kernels are substituted by a series of diverse and cheaper operations at different channels as following (recall that is the output channel number):
Benefiting from channel-wise shrinkage, our method provides a more general and flexible design for replacing 3D convolution than previous approaches (Eqn. (2) and Eqn. (3)), and it can easily be reduced to arbitrary alternatives (e.g., 2D, P3D) by integrating these operations into the set of candidate sub-kernels. An illustrative example can be found in Fig. 3.
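A small bookkeeping example may help to show what a channel-wise assignment looks like and why it saves parameters. The candidate names, kernel shapes, and the particular assignment below are hypothetical; in the method itself the assignment is found by search rather than written by hand.

```python
from collections import Counter

# Hypothetical candidate sub-kernel shapes for a 3x3x3 kernel.
CANDIDATES = {
    "1d_depth": (3, 1, 1),
    "2d_plane": (1, 3, 3),
    "3d_full": (3, 3, 3),
}

def layer_params(c_in, assignment):
    """Weights of a layer whose output channels each use their own sub-kernel."""
    total = 0
    for op_name in assignment:
        kd, kh, kw = CANDIDATES[op_name]
        total += c_in * kd * kh * kw
    return total

c_in, c_out = 32, 6
uniform_3d = ["3d_full"] * c_out
mixed = ["1d_depth", "2d_plane", "3d_full"] * 2    # one operation per output channel

print(Counter(mixed))
print(layer_params(c_in, uniform_3d), layer_params(c_in, mixed))
```

With these toy numbers the heterogeneous layer uses 2496 weights instead of 5184, while still keeping two full 3D kernels in the layer.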
As aforementioned, given the tremendous search space, it is impractical to manually find the optimal replacement for a 3D kernel through a trial-and-error process. Especially, it becomes even more intractable as the replacement procedure is conducted in a channel-wise manner. Therefore, we propose to automate the process of learning an efficient replacement to fully exploit the redundancies in 3D convolution operations. By formulating kernel shrinkage as a path-level selection problem, we first construct a super-network where every candidate sub-kernel is encapsulated into a separate trainable branch (Fig. 2(c)) at each channel. Once the path weights are learned by differentiable NAS , the optimal path (sub-kernel) can be determined.
Search Space. A well-designed search space is crucial for NAS algorithms. We aim to answer the following questions: Should the 3D convolution kernel be kept or replaced per channel? If replaced, which operation should be deployed instead?
To address these questions, for each channel, we define a set , which contains all candidates of sub-kernels (replacement) given a 3D kernel :
As the original 3D convolution kernel is a sub-kernel of itself, i.e., , it can be kept in the final configuration. The final optimal operation is chosen among .
Another critical problem for NAS is how to reduce the search cost. To make the search cost affordable, we adopt a differentiable NAS paradigm where the model structure is discovered in single-pass super-network training. Drawing inspiration from previous NAS methods, we directly use the scaling parameters in the normalization layers as the path weights of the multi-path super-network (Eqn. (8)) [10, 27]. Our goal is then equivalent to finding the optimal sub-network architecture based on the learned path weights. To this end, we introduce two search algorithms that prioritize either maximizing the performance or minimizing the computation cost of the sub-network, named performance-priority and cost-priority search, respectively.
Performance-Priority Search. As the title implies, the search is performed in a “performance-priority” manner, which means to maximize the performance by finding the optimal sub-kernels given the backbone architecture. During the search procedure, following [2, 3], we randomly pick an operation for each channel at each training iteration. This not only allows for memory saving by only activating and updating one path per iteration but also propels the weights of the paths in the super-network training to be decoupled. After the super-network is trained, the operation with the largest path weight will be picked as the final choice for the given output channel:
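The two steps described above, random path sampling during super-network training and the final selection by largest path weight, can be sketched as follows. This is a minimal illustration in NumPy rather than the released training code: the function names, the uniform sampling distribution, and the toy path weights are assumptions.

```python
import numpy as np

def sample_paths(num_channels, num_ops, rng):
    """Pick one candidate operation uniformly at random for every channel."""
    return rng.integers(0, num_ops, size=num_channels)

def select_operations(path_weights):
    """Keep, for each channel, the operation with the largest path weight."""
    return np.argmax(path_weights, axis=1)

rng = np.random.default_rng(0)
num_channels, num_ops = 4, 3
active = sample_paths(num_channels, num_ops, rng)     # paths updated this iteration

# Hypothetical learned path weights, one row per output channel.
path_weights = np.array([[0.1, 0.8, 0.3],
                         [0.9, 0.2, 0.4],
                         [0.3, 0.1, 0.7],
                         [0.2, 0.6, 0.5]])
chosen = select_operations(path_weights)
print(active, chosen)
```

Only the sampled path of each channel would be activated and updated in a given iteration, which is what keeps the memory footprint close to that of a single network.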
Cost-Priority Search. Performance-priority search regards performance as the search priority and may neglect possible negative effects on the computation cost. To obtain more compact models, we introduce a “cost-priority” search method. Following prior work, the outputs of the sub-kernels are concatenated and aggregated by the following convolution. To make the searched architecture more compact, we introduce a “cost-aware” penalty term: a lasso term on the path weights, used as the penalty loss to push many path weights to near-zero values. Therefore, the total training loss can be written as:
Here, each path weight is scaled by a “cost-aware” term that balances the penalty and is proportional to the parameter or FLOPs cost of the corresponding sub-kernel. In Table 1, we also empirically show that this term can lead to a more efficient architecture. The cost-aware term is introduced to penalize “expensive” operations more heavily, leading to a more efficient replacement. The penalty term is further multiplied by a coefficient, and the remaining term is the conventional training loss (e.g., cross-entropy loss combined with a regularization term such as weight decay).
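A minimal sketch of such a loss is given below, assuming the penalty is the sum of absolute path weights, each scaled by the cost of its operation, and then multiplied by a single coefficient. The argument names (`op_costs`, `penalty_coeff`) and the numeric values are ours, chosen only to make the example concrete.

```python
import numpy as np

def cost_priority_loss(task_loss, path_weights, op_costs, penalty_coeff):
    """Task loss plus a cost-weighted L1 (lasso) penalty on the path weights."""
    penalty = np.sum(op_costs * np.abs(path_weights))
    return task_loss + penalty_coeff * penalty

# Hypothetical path weights: two channels, three candidate operations each.
path_weights = np.array([[0.5, -0.2, 0.1],
                         [0.0, 0.4, -0.3]])
# Assumed costs proportional to the weights of a 1D, 2D, and 3D kernel of size 3.
op_costs = np.array([3.0, 9.0, 27.0])
loss = cost_priority_loss(task_loss=1.0, path_weights=path_weights,
                          op_costs=op_costs, penalty_coeff=1e-3)
print(round(loss, 4))
```

Because the 3D path carries the largest cost factor, its weight is pushed toward zero fastest unless it contributes enough to the task loss to justify its expense.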
Dataset. We evaluate the proposed method on two public datasets: 1) Pancreas Tumours dataset from the Medical Segmentation Decathlon Challenge (MSD) , which contains 282 cases with both pancreatic tumours and normal pancreas annotations; and 2) NIH Pancreas Segmentation dataset , consisting of 82 abdominal CT volumes. For the MSD dataset, we use 226 cases for training and evaluate the segmentation performance on the rest 56 cases. The resolution along the axial axis of this dataset is extremely low and the number of slices can be as small as 37. For data preprocessing, all images are resampled to an isotropic 1.0 resolution. For the NIH dataset, the resolution of each scan is , where is the number of slices along the axial axis and the voxel spacing ranges from 0.5 to 1.0 . We test the model in a 4-fold cross-validation manner following previous methods .
|Methods||Type||Params (M)||FLOPs (G)||Pancreas DSC (%)||Tumor DSC (%)||Average DSC (%)|
Implementation Details. For all experiments, C2FNAS  is used as the backbone architecture and the search is performed in “performance-priority” manner unless otherwise specified. When replacing the operations, we keep the stem (the first two and the last two convolution layers) as the same. For 3D medical image, for simplicity, we choose a set of most representative sub-kernels as . The operations set contains conv1D (, , ), conv2D (, , ) from different directions, and conv3D (). For every 3D kernel at each output channel, a sub-kernel from will be chosen as the replacement. For NAS settings, we include both “performance-priority” and “cost-priority” search for performance comparison. For manual settings, we assign all candidates operations uniformly across the output channels.
Training stage. For the MSD dataset, we use random crop with patch size of , random rotation (, , , and ) and flip in all three axes as data augmentation. The batch size is 8 with 4 GPUs. We use the SGD optimizer with a learning rate starting from 0.01 with polynomial decay of power 0.9, momentum of 0.9, and weight decay of 0.00004. The loss function is the sum of Dice loss and cross-entropy loss. For the NIH dataset, the patch size is set as following . The found architecture is trained from scratch to ensure its effectiveness. Both the super-network and the found architecture are trained under the same settings described above. For the search stage with the “cost-priority” setting, a lasso term with coefficient is applied to the path weights, which is further re-weighted by for 3D, 2D, 1D operations respectively. After training finishes, the operation with the largest path weight is chosen as the final replacement for the 3D operation at each channel.
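For reference, a polynomial learning-rate decay with the stated starting value and power can be written as below. The text does not spell out the formula, so the common "poly" form, the per-step granularity, and the `total_steps` value used in the example are assumptions.

```python
def poly_lr(step, total_steps, base_lr=0.01, power=0.9):
    """Polynomially decayed learning rate at `step` (0 <= step <= total_steps)."""
    return base_lr * (1.0 - step / total_steps) ** power

total_steps = 1000   # hypothetical number of training iterations
for step in (0, 250, 500, 750, 1000):
    print(step, f"{poly_lr(step, total_steps):.5f}")
```

The rate starts at 0.01 and falls smoothly to zero at the final step.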
Testing stage. We test the network in a sliding-window manner, where the patch size is and stride is for the MSD dataset and patch size is and stride is for NIH dataset. The result is measured with the Dice-Sørensen coefficient (DSC) metric, which is formulated as $\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$, where $P$ and $G$ denote the prediction and ground-truth voxel sets for a foreground class. The DSC has a range of $[0, 1]$, with 1 implying a perfect prediction.
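The Dice-Sørensen coefficient is twice the overlap between prediction and ground truth divided by the sum of their sizes. A small NumPy version is sketched below; the handling of two empty masks (returning 1.0) is a convention chosen for this example and is not specified in the text.

```python
import numpy as np

def dice_score(pred, target):
    """Dice-Sorensen coefficient between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0   # convention chosen here: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, target).sum() / denom

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_score(pred, target)
print(f"{score:.4f}")
```

In this toy case two of the three predicted voxels overlap the three ground-truth voxels, giving 4/6.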
|Method||Type||Params||Average DSC||Max DSC||Min DSC|
|Zhou et al. ||Manual||268.56M||82.37%||90.85%||62.43%|
|Oktay et al. ||Manual||103.88M||83.10%||-||-|
|Yu et al. ||Manual||268.56M||84.50%||91.02%||62.81%|
|Zhu et al. ||Manual||20.06M||84.59%||91.45%||69.62%|
|Zhu et al. ||Auto||29.74M||85.15%||91.18%||70.37%|
NAS Settings vs. Manual Settings. As can be observed from Table 1, even under manual settings, CAKES is already more efficient with slightly inferior performance (, from to manual , parameters drop from 22.50M to 11.29M, and FLOPs drop from 188.48G to 97.77G, with performance gap of 1.0%). Besides, under the manual settings outperforms its counterpart with standard convolution layers () by more than with the same model size, which indicates the benefits of our design. In addition, under NAS settings, the proposed search method can further reduce the performance gap and even surpasses original 3D model with much fewer parameters and computations, e.g., model size is reduced from 22.50M () to 11.26M (), and FLOPs drop from 188.48G () to 99.68G (), with a performance improvement of 0.46%. Compared with , also yields superior performance (+1.60%) with a more compact model (11.26M vs. 13.16M), which further indicates the effectiveness of the proposed method.
Influence of the Search Space. From Table 1, we can see that using different search space, our proposed CAKES consistently outperforms its counterpart with standard 1D/2D/3D convolutions. For instance, compared with and , our CAKES leads to a significant performance improvement (+1.70% for , +1.15% for ) with same model size and computation cost. Out of different search spaces, we find that (7.56M params and 67.53G FLOPs) offers the most efficient model with slightly worse performance, while (11.29M params and 97.77G FLOPs) can already surpass the 3D baseline (22.50M params and 188.48G FLOPs) with half parameters and computation cost. After we enlarge the search space, successfully finds a configuration with even higher performance/efficiency (last 2 rows of Table 1).
Generalization to different backbone architectures. We also test our method on different backbone architectures. Applying to another state-of-the-art model 3D ResDSN , our method consistently leads to a more efficient model with much fewer parameters (10.03M to 4.63M) and FLOPs (192.07G to 98.12G) with comparable performance (61.96% to 61.65%).
NIH Results. We compare the proposed CAKES with state-of-the-art methods in Table 2, where it can be observed that the proposed CAKES leads to a much more compact model size compared to other alternatives. for instance, our model size is more than smaller than  and . It is well worth noting that our model even performed in a single-stage fashion already outperforms many state-of-the-art methods conducted in a two-stage coarse-to-fine manner [51, 48, 53] on the NIH pancreas dataset with much fewer model parameters and FLOPS. It is also noteworthy to mention that the applied architecture is searched from another dataset (MSD), where the images are collected under different protocols and have different resolutions. This result indicates the generalization of our searched model. By directly applying the architecture searched on the MSD dataset, our method also outperforms  which was directly searched on the NIH dataset with less than parameters of .
Dataset. Something-Something V1 is a large-scale action recognition dataset that requires comprehensive temporal modeling. It contains about 110k videos across 174 classes with diverse objects, backgrounds, and viewpoints.
Implementation Details. We adopt ResNet50 with weights pre-trained on ImageNet as our backbone. The 3D convolution weights are initialized by repeating the 2D kernel 3 times along the temporal dimension, while the 1D convolution weights are initialized by averaging the 2D kernel over the spatial dimensions and then repeating it 3 times along the temporal axis. For the temporal dimension, we use the sparse sampling method as in TSN. For the spatial dimension, the short side of the input frames is resized to 256 and then cropped to .
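The two initialization rules can be sketched with plain array operations. The example below assumes kernels stored as `(c_out, c_in, kh, kw)` arrays and applies no rescaling after repetition, since the text does not mention one; the random array merely stands in for a pre-trained 2D kernel and the function names are hypothetical.

```python
import numpy as np

def inflate_to_3d(kernel_2d, t=3):
    """Repeat a (c_out, c_in, kh, kw) kernel t times along a new temporal axis."""
    return np.repeat(kernel_2d[:, :, None, :, :], t, axis=2)

def init_temporal_1d(kernel_2d, t=3):
    """Average over the spatial axes, then repeat t times along the temporal axis."""
    mean = kernel_2d.mean(axis=(2, 3), keepdims=True)        # (c_out, c_in, 1, 1)
    return np.repeat(mean[:, :, None, :, :], t, axis=2)      # (c_out, c_in, t, 1, 1)

rng = np.random.default_rng(0)
kernel_2d = rng.normal(size=(8, 4, 3, 3))     # stand-in for a pre-trained 2D kernel
kernel_3d = inflate_to_3d(kernel_2d)
kernel_1d = init_temporal_1d(kernel_2d)
print(kernel_3d.shape, kernel_1d.shape)
```

Every temporal slice of the inflated kernel equals the original 2D kernel, and every tap of the 1D kernel equals the spatial mean of that 2D kernel.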
We use random cropping and flipping as data augmentation. We train the network with a batch size of 96 on 8 GPUs with SGD optimizer. The learning rate starts from 0.04 for the first 50 epochs and decays by a factor of 10 for every 10 epochs afterwards. The total training epochs are 70. We also set dropout ratio to 0.3 following. The training settings remain the same for both final network and search stage, except that when searching with “cost-priority”, we use a lasso term with and for 3D, 2D, 1D operations respectively.
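The learning-rate schedule described above can be written as a small function. We read "decays by a factor of 10 for every 10 epochs afterwards" as a first decay at epoch 50 and a second at epoch 60, with zero-indexed epochs; this reading and the parameter names are assumptions.

```python
def step_lr(epoch, base_lr=0.04, constant_epochs=50, step=10, gamma=0.1):
    """Learning rate for a zero-indexed epoch under the schedule described above."""
    if epoch < constant_epochs:
        return base_lr
    num_decays = (epoch - constant_epochs) // step + 1
    return base_lr * gamma ** num_decays

schedule = [step_lr(epoch) for epoch in range(70)]
print(schedule[0], schedule[50], schedule[60])
```

Under this reading the rate is 0.04 for epochs 0 to 49, 0.004 for epochs 50 to 59, and 0.0004 for the last ten epochs.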
Testing stage. We sample the middle frame in each segment and apply a center crop to each frame. We report single-crop results unless otherwise specified.
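Sampling the middle frame of each segment amounts to a simple index computation. The sketch below assumes the video is split into equal-length segments and that fractional positions are rounded down; the frame and segment counts are illustrative.

```python
def middle_frame_indices(num_frames, num_segments):
    """Index of the middle frame of each of `num_segments` equal-length segments."""
    seg_len = num_frames / num_segments
    return [int(seg_len * i + seg_len / 2) for i in range(num_segments)]

indices = middle_frame_indices(num_frames=40, num_segments=8)
print(indices)
```

For a 40-frame clip and 8 segments this picks one frame from the centre of every block of 5 frames.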
|Model||Type||Params (M)||FLOPs (G)||top1||top5|
|Model||Backbone||#Frames||FLOPs||#Params||top1||top5|
|ECO ||BNIncep+3D Res18||8||32G||47.5M||39.6||-|
|ECO ||BNIncep+3D Res18||16||64G||47.5M||41.4||-|
|I3D ||3D ResNet-50||32×2 clips||153G×2||28.0M||41.6||72.2|
|NL I3D ||3D ResNet-50||32×2 clips||168G×2||35.3M||44.4||76.0|
|NL I3D+GCN ||3D ResNet-50+GCN||32×2 clips||303G×2||62.2M||46.1||76.8|
Ablation Study. We study the impact of both different operation sets and manual/auto configurations. The results are summarized in Table 3. Considering the spatial-temporal property of video data, we study the following operation sets: (1) spatial 2D convolution and temporal 1D convolution; (2) spatial 2D convolution and 3D convolution; (3) spatial 2D, temporal 1D, and 3D convolutions.
Operation set with 1D & 2D sub-kernels. As shown in Table 3, surpass the 2D baseline by a large margin (46.8% vs. 17.2%), while having a smaller model size (20.5M vs. 23.9M). This suggests that TSN  may lack the ability to capture temporal information, therefore replacing some of the 2D operations to temporal 1D operations can significantly increase the performance and reduce the model size. Besides, it also surpasses P3D, where each 2D convolution is followed by a temporal 1D convolution, with a significant advantage on both performance (46.8% vs. 44.8%) and model cost (20.5M params and 29.3G FLOPs compared to 27.6M params and 62.6G FLOPs), indicating makes better use of redundancies in the network than P3D. Therefore, using operation set containing 1D and 2D sub-kernels can be an ideal design when looking for efficient video understanding networks.
Operation set with 2D & 3D sub-kernels. We aim to see how does balance the trade-off between performance and model cost. From Table 3, under the “cost-priority” setting, yields a much more compact model (35.7M and 41.4G FLOPs) with a comparable performance to C3D. When it comes to the “performance-priority” setting, searches a slightly larger model (37.5M params and 50.7G FLOPs), yet its performance boosts significantly to 47.4%.
Operation set with 1D & 2D & 3D sub-kernels. Compared to , shows a slightly inferior performance (-0.5%) with much fewer FLOPs (38.7G vs. 50.7G). Besides, under the “performance-priority” setting, produces a comparable performance to with much less computation cost (43.9G vs. 50.7G). This result indicates that with a more general search space (, 1D, 2D, and 3D), the proposed CAKES can find more flexible designs, which lead to better performance/efficiency.
Results. A comparison with other state-of-the-art methods is shown in Table 4. We report the model performance under both 8-frame and 16-frame settings. Compared with other state-of-the-art methods, sampling only 8 frames can already outperform most current methods. With smaller parameters and FLOPs, surpasses those complex models such as non-local networks  with graph convolution . Comparing to other efficient video understanding framework such as ECO  and TSM , our model is not only more light-weight (58.6G vs. 64G/65G), but also delivers a better performance (48.0% vs. 41.4%/47.2%). And our best model achieves a new state-of-the-art performance of 49.4% top-1 accuracy with a moderate model size. An interesting finding is that although shows similar performances to with 8-frame inputs, it achieves a much higher accuracy when it comes to the 16-frame scenario, which demonstrates that with a more general search space, also shows a stronger transfer-ability than other counterparts.
Cost-Priority Architectures. We plot the found architecture on both medical data () and video data () respectively in Fig 4. For the architecture searched on something-something dataset, we note that the algorithm prefers efficient 2D operations at the bottom of the network, and favors 3D operation at the top of the network. This implies that the search algorithm successfully find that temporal information extracted from high-level features is more useful, which coincides with the observation in . For the architecture found on the MSD dataset, we calculated the number of operations computed on all three data dimensions, and numbers are (, , ). This suggests that the searched model, unlike the searched network for videos, tends to treat each dimension equally, which aligns with the property of volumetric medical data. In addition, the number of 1D, 2D, 3D operations are , , and respectively, indicating that the efficient 1D/2D operations are more preferred.
Performance-Priority Architectures. As shown in Fig. 4(b), for , the pure temporal sub-kernel () is rarely chosen at the top of the network, while it plays a more important role as the network goes deeper. This observation agrees with our previous finding: temporal information extracted from high-level features is more useful. For on medical image as shown in Fig. 4(a), we calculate the numbers of each type of sub-kernels and computation on three axes. The numbers of 1D, 2D, and 3D sub-kernel are , , and the number of operations computed on all three data dimensions are , , , which again coincides with our previous finding that tends to treat each dimension equally for the symmetric volumetric data. Compared to cost-priority , we notice that performance-priority favors the operation set with more 3D sub-kernels, which can provide a larger model capacity.
As an important solution to various 3D vision applications, 3D networks still suffer from over-parameterization and heavy computation. How to design efficient alternatives to 3D operations remains an open problem. In this paper, we propose Channel-wise Automatic KErnel Shrinking (CAKES), where standard 3D convolution kernels are shrunk into efficient sub-kernels at the channel level to obtain efficient 3D models. Besides, by formulating kernel shrinkage as a path-level selection problem, our method can automatically explore the redundancies in 3D convolutions and optimize the replacement configuration. Applied to different backbone models, CAKES significantly outperforms previous 2D/3D/P3D and other state-of-the-art methods on both 3D medical image segmentation and action recognition from videos.
Xception: deep learning with depthwise separable convolutions. In CVPR, pp. 1251–1258.
Regularized evolution for image classifier architecture search. In AAAI, Vol. 33, pp. 4780–4789.
Recurrent saliency transformation network: incorporating multi-stage visual cues for small organ segmentation. In CVPR, pp. 8280–8289.