Reinventing 2D Convolutions for 3D Medical Images

11/24/2019 ∙ by Jiancheng Yang, et al. ∙ Shanghai Jiao Tong University

There has been considerable debate over 2D and 3D representation learning on 3D medical images. 2D approaches could benefit from large-scale 2D pretraining, whereas they are generally weak in capturing large 3D contexts. 3D approaches are natively strong in 3D contexts, however few publicly available 3D medical datasets are large and diverse enough for universal 3D pretraining. Even for hybrid (2D + 3D) approaches, the intrinsic disadvantages within the 2D / 3D parts still exist. In this study, we bridge the gap between 2D and 3D convolutions by reinventing the 2D convolutions. We propose ACS (axial-coronal-sagittal) convolutions to perform natively 3D representation learning, while utilizing the pretrained weights from 2D counterparts. In ACS convolutions, 2D convolution kernels are split by channel into three parts, and convolved separately on the three views (axial, coronal and sagittal) of 3D representations. Theoretically, ANY 2D CNN (ResNet, DenseNet, or DeepLab) is able to be converted into a 3D ACS CNN, with pretrained weights of the same parameter size. Extensive experiments on a proof-of-concept dataset and several medical benchmarks validate the consistent superiority of the pretrained ACS CNNs over the 2D / 3D CNN counterparts with / without pretraining. Even without pretraining, the ACS convolution can be used as a plug-and-play replacement of standard 3D convolution, with a smaller model size.




1 Introduction

Figure 1: A comparison between the proposed ACS convolutions and prior art in modeling 3D medical images: pure 2D / 2.5D approaches with 2D convolution kernels, pure 3D approaches with 3D convolution kernels, and hybrid approaches with both 2D and 3D convolution kernels. The ACS convolutions run multiple 2D convolution kernels on the three views (axial, coronal and sagittal).

Emerging deep learning technology has been dominating the medical image analysis research

[litjens2017survey, shen2017deep], in a wide range of data modalities (e.g., ultrasound [chen2016iterative, droste2019ultrasound], CT [roth2015deeporgan, yan2018deep, bilic2019liver], MRI [menze2014multimodal, dou2016automatic, bien2018deep], X-Ray [wang2017chestx, irvin2019chexpert, johnson2019mimic]) and tasks (e.g., classification [gulshan2016development, esteva2017dermatologist], segmentation [isensee2018nnu, tang2019clinically], detection [yan2019mulan, tang2019nodulenet], registration [balakrishnan2019voxelmorph, dalca2018unsupervised]). Thanks to contributions from dedicated researchers in academia and industry, there are now much larger medical image datasets than ever before. With large-scale datasets, strong infrastructure and powerful algorithms, numerous challenging problems in medical imaging seem solvable. However, the data-hungry nature of deep learning limits its applicability in various real-world scenarios with limited annotations. Compared to the millions (or even billions) of annotations in natural image datasets, medical image datasets are never large enough. Especially for 3D medical images, datasets with thousands of supervised training annotations [setio2017validation, zbontar2018fastmri, simpson2019large] are already regarded as "large", due to several difficulties in medical annotation: hard-to-access, high-dimensional medical data, expensive expert annotators (radiologists / clinicians), and severe class-imbalance issues [yan2019holistic].

Transfer learning, with pretrained weights from large-scale datasets (e.g., ImageNet [deng2009imagenet], MS-COCO [lin2014microsoft]), is a de-facto paradigm for tasks with insufficient data. Unfortunately, the widely-used pretrained CNNs are developed on 2D datasets and are non-trivial to transfer to 3D medical images. Prior art on 3D medical images follows either 2D-based or 3D-based approaches (compared in Fig. 1). 2D-based approaches [10.1007/978-3-319-10404-1_65, yu2018recurrent, ni2019elastic] benefit from large-scale pretraining on 2D natural images, but 2D representation learning is fundamentally weak in large 3D contexts. 3D-based approaches [cciccek20163d, milletari2016v, zhao20183d] learn natively 3D representations; however, few publicly available 3D medical datasets are large and diverse enough for universal 3D pretraining. Therefore, compact network design and sufficient training data are essential for training 3D networks from scratch. Hybrid (2D + 3D) approaches [li2018h, xia2018bridging, zheng2019new] seem to get the best of both worlds; nevertheless, these ensemble-like approaches do not fundamentally overcome the intrinsic issues of 2D-based and 3D-based approaches. Please refer to Sec. 2 for an in-depth discussion of these related methods.

There has been considerable debate over 2D and 3D representation learning on 3D medical images: prior studies choose either large-scale 2D pretraining or natively 3D representation learning. This paper presents an alternative that bridges the gap between the 2D and 3D approaches. To overcome the intrinsic disadvantages of 2D and 3D convolutions in modeling 3D images, we argue that an ideal method should adhere to the following principles:

1) Natively 3D representation: it learns natively 3D representations for 3D medical images;

2) 2D weight transferable: it benefits from the large-scale pretraining on the 2D images [deng2009imagenet, lin2014microsoft, wang2017chestx, irvin2019chexpert, johnson2019mimic];

3) ANY model convertible: it enables any 2D model, including classification [he2016deep], detection [lin2017focal] and segmentation [chen2018encoder] backbones, to be converted to a 3D model.

These principles cannot be achieved simultaneously with standard 2D or 3D convolutions, which directs us to develop a novel convolution operator. Inspired by the widely-used tri-planar representations of 3D medical images [10.1007/978-3-319-10404-1_65], we propose ACS convolutions satisfying all three principles. Instead of explicitly treating the input 3D volumes as three orthogonal 2D planar images [10.1007/978-3-319-10404-1_65] (axial, coronal and sagittal), we operate on the convolution kernels to perform view-based 3D convolutions, by splitting the 2D convolution kernels into three parts by channel. Notably, no additional 3D fusion layer is required to fuse the three-view representations from the 3D convolutions, since they are seamlessly fused by the subsequent ACS convolution layers (see details in Sec. 3).

The ACS convolution aims at a generic and plug-and-play replacement of standard 3D convolutions for 3D medical images. Our experiments empirically prove that, even without pretraining, the ACS convolution is comparable to 3D convolution with a smaller model size. When pretrained on large 2D datasets, it consistently outperforms 2D / 3D convolutions by a large margin. To improve research reproducibility, a PyTorch [paszke2017automatic] reference implementation of the ACS convolution is provided in the supplementary materials. Using the provided function, 2D CNNs can be converted into ACS CNNs for 3D images with a single line of code.

2 Related Work on 3D Medical Images

2.1 2D / 2.5D Approaches

Transfer learning from 2D CNNs trained on large-scale datasets (e.g., ImageNet [deng2009imagenet]) is a widely-used approach in 3D medical image analysis. To mimic the 3-channel image representation (i.e., RGB), prior studies represent 3D images as 2D inputs in either a multi-planar or a multi-slice fashion. In these studies, pretrained 2D CNNs are usually fine-tuned on the target medical dataset.

An early study [10.1007/978-3-319-10404-1_65] proposes a tri-planar representation of 3D medical images, where three views (axial, coronal and sagittal) of a voxel are regarded as the three channels of a 2D input. Although this method is empirically effective, it has a fundamental flaw: the channels are not spatially aligned. More studies follow tri-slice representations [Ding2017AccuratePN, yu2018recurrent, ni2019elastic], where a center slice and its two neighboring slices are treated as the three channels. In these representations, the channels are spatially aligned, which conforms to the inductive biases of convolution. There are also studies [yu2018recurrent, perslev2019one] combining the multi-slice and multi-planar approaches, using multi-slice 2D representations in multiple views. The multi-view representations are generally averaged [yu2018recurrent] or fused by additional networks [perslev2019one].
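The tri-slice idea above can be sketched in a few lines of numpy (an illustrative sketch, not the code of any cited study): a center slice and its two neighbors become the three input channels, so the channels stay spatially aligned.

```python
import numpy as np

def tri_slice_input(volume, d):
    """Stack a center slice with its two neighbors as a 3-channel 2D input.
    volume: (D, H, W) array; d: index of the center slice."""
    depth = volume.shape[0]
    # Clamp neighbor indices at the volume borders.
    lo, hi = max(d - 1, 0), min(d + 1, depth - 1)
    return np.stack([volume[lo], volume[d], volume[hi]], axis=0)  # (3, H, W)

volume = np.random.rand(8, 32, 32)
x = tri_slice_input(volume, 4)
print(x.shape)  # (3, 32, 32)
```

Each 3-channel slice stack can then be fed to an ImageNet-pretrained 2D CNN in place of an RGB image.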

Even though these approaches benefit from large-scale 2D pretraining, which is empirically effective in numerous studies [Long2015FullyCN, esteva2017dermatologist, lin2017focal, chen2018encoder], both multi-slice and multi-planar representations with 2D convolutions are fundamentally weak in capturing large 3D contexts.

2.2 3D Approaches

Instead of regarding the 3D spatial information as input channels as in 2D approaches, a number of studies use pure 3D convolutions for 3D medical image analysis [cciccek20163d, milletari2016v, kamnitsas2017efficient, dou20173d, zhao20183d, zhao2019toward, yang2019probabilistic]. Compared to the limited 3D contexts along a certain axis in 2D approaches, the 3D approaches are theoretically capable of capturing arbitrarily large 3D contexts along any axis. Therefore, the 3D approaches are generally better at tasks requiring large 3D contexts, e.g., distinguishing small organs, vessels and lesions.

However, there are also drawbacks to pure 3D approaches. One of the most important is the lack of large-scale universal 3D pretraining. For this reason, efficient training of 3D networks remains a pain point for 3D approaches. Several techniques have been introduced to (partially) address this issue, e.g., deep supervision [dou20173d] and compact network design [zhou2018unet++, zhao20183d]. Nevertheless, these techniques do not directly target the issue of 3D pretraining.

2.3 Hybrid Approaches

Hybrid approaches are proposed to combine the advantages of both 2D and 3D approaches [li2018h, xia2018bridging, zheng2019new, perslev2019one]. In these studies, 2D pretrained networks with multi-slice inputs, and 3D randomly-initialized networks with volumetric inputs are (jointly or separately) trained for the target tasks.

The hybrid approaches can be mainly categorized into multi-stream and multi-stage approaches. In multi-stream approaches [li2018h, zheng2019new], 2D networks and 3D networks are designed to perform the same task (e.g., segmentation) in parallel. In multi-stage (i.e., cascade) approaches [xia2018bridging, zheng2019new, perslev2019one], several 2D networks (and 3D networks) are developed to extract representations from multiple views, and a 3D fusion network is then used to fuse the multi-view representations into 3D representations to perform the target tasks.

Although empirically effective, the hybrid approaches do not solve the intrinsic disadvantages of the 2D and 3D approaches: the 2D parts are still not able to capture large 3D contexts, and the 3D parts still lack large-scale pretraining. Besides, these ensemble-like methods are generally redundant to deploy in practice.

2.4 Transfer Learning & Self-Supervised Learning

Medical annotations require expertise in medicine and radiology, and are thereby too expensive to scale. For certain rare diseases or novel applications (e.g., predicting response to a novel treatment [sun2018radiomics]), the data scale is naturally very small. Transfer learning from large-scale datasets to small-scale datasets is a de-facto paradigm in this case.

Humans without any radiological experience can recognize basic anatomy and lesions in 2D and 3D images with limited demonstration. Based on this observation, we believe that transfer learning from universal vision datasets (e.g., ImageNet [deng2009imagenet], MS-COCO [lin2014microsoft]) should be beneficial for 3D medical image analysis. Although there is literature reporting that universal pretraining is useless for certain target tasks [he2019rethinking, raghu2019transfusion], this phenomenon is usually observed when the target datasets are large enough. Apart from boosting target task performance, universal pretraining is able to improve model robustness and uncertainty quantification [pmlr-v97-hendrycks19a, huang2019evaluating].

Unfortunately, 2D-to-3D transfer learning has not been adequately studied. Research efforts [kamnitsas2017efficient, gibson2018niftynet] have been made to pretrain natively 3D CNNs on 3D datasets; however, few publicly available 3D medical datasets are large and diverse enough for universal pretraining. Prior research explores the transfer learning of 3D CNNs trained on spatio-temporal video datasets [hussein2017risk]. However, there are two kinds of domain shift between videos and 3D medical images: 1) natural images vs. medical images, and 2) spatio-temporal data vs. 3D spatial data. This domain shift makes video pretraining [hara2018can] less applicable to 3D medical images. To reduce the domain shift, there is research (Med3D [chen2019med3d]) building pretrained 3D models on a number of 3D medical image datasets. Despite the tremendous effort of collecting data from multiple sources, the resulting scale of 1,000+ training samples is still far too small compared to the 1,000,000+ training samples in natural image datasets.

Source | Data Scale | Data Diversity | Supervised | Medical
2D Image | Very Large | Very Diverse | Y | N
Video [hara2018can] | Large | Diverse | Y | N
Med3D [chen2019med3d] | Moderate | Moderate | Y | Y
MG [zhou2019models] | Large | Moderate | N | Y
Table 1: A comparison of transfer learning for 3D medical images from various sources, in terms of source data scale, source data diversity, whether the pretraining is supervised, and whether the source is medical data.

In addition to supervised pretraining, Models Genesis [zhou2019models] explores unsupervised (self-supervised) learning to obtain pretrained 3D models. Though very impressive, the performance of up-to-date unsupervised learning is generally not comparable to that of fully supervised learning; even strong unsupervised / semi-supervised learning techniques [berthelot2019mixmatch, henaff2019data] could not reproduce the model performance obtained with fully supervised training data.

Table 1 compares the sources of transfer learning for 3D medical images. Compared to transfer learning from video [hara2018can] / Med3D [chen2019med3d] / Models Genesis [zhou2019models], the key advantage of 2D image pretraining is the overwhelming data scale and diversity of datasets. With the ACS convolutions proposed in this study, we are able to develop natively 3D CNNs using 2D pretrained weights. We compare these pretraining approaches in our experiments, and empirically prove the superiority of the proposed ACS convolutions.

Note that the contribution of the ACS convolutions is orthogonal to the pretraining data. It is possible to pretrain ACS CNNs on 2D images, videos and 3D medical images with supervised / self-supervised learning. This paper uses the ACS convolution with supervised pretraining on 2D natural images to demonstrate its effectiveness, flexibility and versatility.

3 ACS Convolutional Neural Networks

We introduce the ACS (axial-coronal-sagittal) convolutions, describe how to convert a 2D CNN into an ACS CNN, and discuss the counterparts and variants of the proposed method.

Figure 2: Illustration of ACS convolutions and 2D-to-ACS model conversion. With a kernel-splitting design, a 2D convolution kernel could be seamlessly transferred into ACS convolution kernels to perform natively 3D representation learning. The ACS convolutions enable ANY 2D model (ResNet [he2016deep], DenseNet [huang2017densely], or DeepLab [chen2018encoder]) to be converted into a 3D model.

3.1 ACS Convolutions

Input: input X (C_i × D × H × W), 2D kernel K (C_o × C_i × K × K), stride, dilation,
views V = {a, c, s},
kernel split sizes (C_o^(a), C_o^(c), C_o^(s)) with C_o^(a) + C_o^(c) + C_o^(s) = C_o;
pad: compute the padded tensor along a given axis, so that the final output shape matches that of Conv3D;
unsqueeze: expand the tensor with a size-1 dimension at a given axis.
1 Compute ACS kernels K^(a), K^(c), K^(s) by splitting K along the output channel:
    K^(a) ← unsqueeze(K^(a), axis of D);
    K^(c) ← unsqueeze(K^(c), axis of H);
    K^(s) ← unsqueeze(K^(s), axis of W);
2 Compute view-based 3D features from the three views:
    for v in V do
        Y^(v) ← Conv3D(pad(X, axis of v), K^(v), stride, dilation);
3 Y ← concatenate(Y^(a), Y^(c), Y^(s)) along the output channel.
Algorithm 1 ACS Convolution

Convolution layers capture spatial correlation. Intuitively, the formal difference between 2D and 3D convolutions is the kernel size: the 2D convolutions use 2D kernels (C_o × C_i × K × K) for 2D inputs (C_i × H × W), whereas the 3D convolutions use 3D kernels (C_o × C_i × K × K × K) for 3D inputs (C_i × D × H × W), where C_i, C_o denote the channels of inputs and outputs, K denotes the kernel size, and (D,) H, W denote the input size. To transfer the 2D kernels to 3D kernels, there are basically two prior approaches: 1) "inflate" the pretrained 2D kernels into 3D kernels of size C_o × C_i × K × K × K, i.e., Inflated 3D (I3D [carreira2017quo]), where the 2D kernels are repeated along an axis and then normalized; 2) unsqueeze the 2D kernels into pseudo-3D kernels of size C_o × C_i × 1 × K × K along an axis, i.e., AH-Net-like [liu20183d], which could not effectively capture 3D contexts. Note that in both cases, the existing methods assume a specific axis along which to transfer the 2D kernels. It is meaningful to assign a special axis for spatio-temporal videos, while it is controversial for 3D medical images: even for anisotropic medical images, any view of the 3D image is still a 2D spatial image.
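The two prior kernel-transfer strategies can be sketched with numpy (a hedged illustration: the shapes follow the conventions above, and the I3D normalization is simplified here to a division by the inflated depth so that activations keep the same scale):

```python
import numpy as np

def inflate_i3d(k2d, depth):
    """I3D-style inflation: repeat a 2D kernel along a new depth axis and
    divide by the depth, so a constant input produces the same response.
    k2d: (C_out, C_in, K, K) -> (C_out, C_in, depth, K, K)."""
    return np.repeat(k2d[:, :, None, :, :], depth, axis=2) / depth

def unsqueeze_pseudo3d(k2d):
    """AH-Net-like pseudo-3D kernel: the 2D kernel viewed as 1 x K x K,
    which convolves slice by slice and captures no depth context."""
    return k2d[:, :, None, :, :]  # (C_out, C_in, 1, K, K)

k2d = np.random.rand(4, 3, 3, 3)
print(inflate_i3d(k2d, 3).shape)      # (4, 3, 3, 3, 3)
print(unsqueeze_pseudo3d(k2d).shape)  # (4, 3, 1, 3, 3)
```

Summing the inflated kernel over its depth axis recovers the original 2D kernel, which is the sense in which the 2D weights are "transferred".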

Based on this observation, we develop ACS convolutions to learn spatial representations from the axial, coronal and sagittal views. Instead of treating the channels of the 2D kernels equally [carreira2017quo, liu20183d], we split the kernels into three parts for extracting 3D spatial information from the axial, coronal and sagittal views. The detailed calculation of the ACS convolution is shown in Algorithm 1. For simplicity, we introduce the ACS convolutions with "same" padding (Fig. 2).

Given a 3D input X ∈ R^(C_i × D × H × W), we would like to obtain a 3D output Y ∈ R^(C_o × D' × H' × W'), with pretrained / non-pretrained 2D kernels K ∈ R^(C_o × C_i × K × K). Here, C_i and C_o denote the input and output channels, (D, H, W) and (D', H', W') denote the input and output sizes, and K denotes the kernel size. Instead of presenting 3D images as tri-planar 2D images [10.1007/978-3-319-10404-1_65], we split and reshape the kernels into three parts (named ACS kernels) by the output channel, to obtain view-based 3D representations for each volume: K^(a) ∈ R^(C_o^(a) × C_i × 1 × K × K), K^(c) ∈ R^(C_o^(c) × C_i × K × 1 × K), K^(s) ∈ R^(C_o^(s) × C_i × K × K × 1), where C_o^(a) + C_o^(c) + C_o^(s) = C_o. It is theoretically possible to assign an "optimal axis" for a 2D kernel; however, considering the feature redundancy in CNNs [han2015deep], in practice we simply set C_o^(a) ≈ C_o^(c) ≈ C_o^(s) ≈ C_o / 3. We then compute the view-based 3D features Y^(a), Y^(c), Y^(s) from the axial, coronal and sagittal views via 3D convolutions with the corresponding ACS kernels.

Conv2D (K × K) → ACSConv (view-based)
Conv2D (1 × 1) → Conv3D (1 × 1 × 1)
{Batch,Group}Norm2D → {Batch,Group}Norm3D
{Max,Avg}Pool2D (K × K) → {Max,Avg}Pool3D (K × K × K)
Table 2: Main operator conversion from 2D CNNs into ACS CNNs, where K denotes the kernel size.

The output feature Y is obtained by concatenating Y^(a), Y^(c) and Y^(s) along the channel axis. It is noteworthy that no additional 3D fusion layer is required: the view-based output features are automatically fused by the subsequent convolution layers, without any additional operation, since the convolution kernels are not split by input channel. Thanks to the linearity of convolution, the numerical scale of the ACS convolution kernels is the same as that of the 2D convolution kernels, so no weight rescaling [carreira2017quo] is needed.

Apart from the convolutions, the remaining layers are trivial to convert. The proposed method enables ANY 2D model to be converted into a 3D model. Table 2 lists how operators in 2D CNNs are converted to those in ACS CNNs.
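As a concrete illustration of Algorithm 1, the following is a minimal numpy sketch of an ACS convolution forward pass (stride 1, "same" padding, no bias). It is not the paper's PyTorch implementation; the naive loop-based `conv3d_same` is only for clarity.

```python
import numpy as np

def conv3d_same(x, k):
    """Minimal stride-1, zero-padded ('same') 3D convolution.
    x: (C_in, D, H, W); k: (C_out, C_in, kd, kh, kw)."""
    c_out, c_in, kd, kh, kw = k.shape
    pd, ph, pw = kd // 2, kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (pd, kd - 1 - pd), (ph, kh - 1 - ph), (pw, kw - 1 - pw)))
    _, D, H, W = x.shape
    y = np.empty((c_out, D, H, W))
    for o in range(c_out):
        for d in range(D):
            for h in range(H):
                for w in range(W):
                    y[o, d, h, w] = np.sum(xp[:, d:d+kd, h:h+kh, w:w+kw] * k[o])
    return y

def acs_conv(x, k2d):
    """ACS convolution sketch: split a 2D kernel (C_out, C_in, K, K) into
    three parts along the output channel, unsqueeze each part on a different
    spatial axis (axial / coronal / sagittal), run three view-based 3D
    convolutions, and concatenate the results by channel."""
    c_out = k2d.shape[0]
    split = [c_out // 3, 2 * (c_out // 3)]  # roughly C_out / 3 channels each
    ka, kc, ks = np.split(k2d, split, axis=0)
    ya = conv3d_same(x, ka[:, :, None, :, :])  # 1 x K x K kernel: axial view
    yc = conv3d_same(x, kc[:, :, :, None, :])  # K x 1 x K kernel: coronal view
    ys = conv3d_same(x, ks[:, :, :, :, None])  # K x K x 1 kernel: sagittal view
    return np.concatenate([ya, yc, ys], axis=0)  # (C_out, D, H, W)

x = np.random.rand(2, 4, 5, 6)
k2d = np.random.rand(6, 2, 3, 3)  # e.g., a pretrained 2D kernel
print(acs_conv(x, k2d).shape)  # (6, 4, 5, 6)
```

Note that the output has the full C_out channels and the same spatial size as the input, so the operator is a drop-in replacement for a same-padded Conv3D while reusing the 2D weights unchanged.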

3.2 Counterparts and Related Methods

2D Convolutions. We include a simple AH-Net-like [liu20183d] 2D counterpart, by replacing all ACS convolutions in ACS CNNs with pseudo-3D convolutions (Conv3D with 1 × K × K kernels). We name this pseudo-3D counterpart "2.5D" in our experiments; it enables 2D pretrained weight transfer with ease.

3D Convolutions. For the 3D counterparts, we replace all convolutions in ACS CNNs with standard 3D convolutions. Various pretraining sources (I3D [carreira2017quo] with 2D images, Med3D [chen2019med3d], Video [hara2018can]) are included for fair comparison. If there is any difference between the converted 3D models and the pretrained 3D models, we keep the pretrained 3D network architectures to load the pretrained weights. Models Genesis [zhou2019models] uses a 3D UNet-based [cciccek20163d, milletari2016v] network architecture; we train the same network from scratch / with its self-supervised pretraining to compare with our models.

Table 3 compares the time and space complexity of the 2D (2.5D), 3D and ACS convolutions. The proposed ACS convolution can be used as a generic and plug-and-play replacement of 3D convolution, with less computation and a smaller model size. Besides, the ACS convolution enables 2D pretraining. We demonstrate its superiority over the counterparts with extensive experiments (Sec. 4).

3.3 ACS Convolution Variants

Apart from the kernel-splitting approach used in the proposed ACS convolutions, there are several possible variants that implement 2D-transferable, ACS-like convolutions.

Kernels FLOPs Memory Parameters
Table 3: Theoretical analysis of space and time complexity for 2D (2.5D), 3D, ACS, Mean-ACS, and Soft-ACS convolutions. Bias terms are not counted in the parameter size.

Mean-ACS convolutions. Instead of splitting the 2D convolution kernels, we replicate and unsqueeze the full kernel K into K^(a), K^(c), K^(s) (each with all C_o output channels), and obtain the 3D feature maps Y^(a), Y^(c), Y^(s) by three view-based convolutions. The output feature is

Y = (Y^(a) + Y^(c) + Y^(s)) / 3.

Soft-ACS convolutions. Note that the Mean-ACS convolution uses a symmetric aggregation, so it cannot distinguish any view-based information. To this end, we introduce a weighted sum instead of the mean, i.e., Soft-ACS,

Y = w_a · Y^(a) + w_c · Y^(c) + w_s · Y^(s),

where w_a, w_c and w_s are learnable weights.

In Table 3, we compare the time and space complexity. The two variants are more computationally intensive in terms of FLOPs and memory. Unfortunately, they do not provide a significant performance boost empirically. Therefore, we only report the model performance of ACS convolutions in Sec. 4, and analyze these variants in Sec. 5.1.
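The two aggregation rules can be sketched as follows, assuming the three full-channel view features Y^(a), Y^(c), Y^(s) have already been computed (a hedged numpy illustration; how the learnable weights enter the sum is our reading of the formulas above):

```python
import numpy as np

def mean_acs_aggregate(ya, yc, ys):
    """Mean-ACS: symmetric average of the three full-channel view features."""
    return (ya + yc + ys) / 3.0

def soft_acs_aggregate(ya, yc, ys, w):
    """Soft-ACS: weighted sum with a length-3 weight vector w,
    which would be a learnable parameter in practice."""
    return w[0] * ya + w[1] * yc + w[2] * ys

ya, yc, ys = (np.random.rand(4, 2, 2, 2) for _ in range(3))
# With equal weights of 1/3, Soft-ACS reduces to Mean-ACS.
same = np.allclose(soft_acs_aggregate(ya, yc, ys, np.full(3, 1 / 3)),
                   mean_acs_aggregate(ya, yc, ys))
print(same)  # True
```

This also makes the extra cost visible: both variants compute all C_o channels three times before aggregating, whereas the kernel-splitting ACS convolution computes each output channel only once.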

4 Experiments

We experiment with the proposed method on a proof-of-concept dataset and several medical benchmarks. To fairly compare model performance, we include several counterparts (2.5D / 3D / ACS {Network} r. / p.) under the same experimental setting, where r. denotes random initialization and p. denotes pretraining on various sources. We use separate network architectures in the different experiments to demonstrate the flexibility and versatility of the proposed method.

4.1 Proof-of-Concept


We first validate our method on a proof-of-concept dataset with a semantic segmentation task. As illustrated in Fig. 3, the synthetic dataset consists of sufficient 2D samples and limited 3D samples (each split into training and evaluation sets), in order to validate the usefulness of the proposed method in 2D-to-3D transfer learning. The 2D dataset is for pretraining and covers 2 foreground classes (circle and square), while the 3D dataset contains 5 foreground classes (sphere, cube, cylinder, cone and pyramid). Note that the shapes in the 2D dataset are exactly the projected single views of the 3D volumes (except for the triangle), so the 2D pretraining should be useful for the 3D segmentation. For both datasets, the object size, location and direction are randomly assigned, and Gaussian noise is added to each pixel. Details of the synthetic dataset are provided in the supplementary materials.

Experiment Setting.

We compare our ACS model with the 2.5D and 3D counterparts (Sec. 3.2) under random-initialization and pretraining settings. All models share the same UNet [ronneberger2015u, cciccek20163d] architecture with two down-sampling stages, except for the convolution modules. Dice loss is used for training both the 2D and 3D UNets. We first train a 2D UNet on the 2D dataset until convergence; its weights can then be transferred to 3D models. Note that only the 2.5D and ACS UNets are capable of loading the 2D pretrained weights without additional processing. For training on the 3D dataset, we apply an Adam optimizer [kingma2014adam] with a 0.001 learning rate and train the models for 50 epochs with a batch size of 4. We report the Dice and mIoU averaged over the 5 classes of the 3D dataset.

Result Analysis.

Models Dice mIoU Model Size
2.5D UNet r. 82.24 72.48 1.6 Mb
2.5D UNet p. 82.71 73.28 1.6 Mb
3D UNet r. 94.63 90.78 4.7 Mb
ACS UNet r. 94.68 90.71 1.6 Mb
ACS UNet p. 95.44 91.99 1.6 Mb
Table 4: Segmentation performance of 2.5D, 3D and ACS convolution models w/ and w/o pretraining on the proof-of-concept dataset. r. denotes randomly initialized. The 2.5D and ACS UNet p. are pretrained on synthetic 2D images.

As shown in Table 4, the performance of ACS UNet without pretraining is comparable to that of 3D UNet without pretraining, and the ACS UNet with pretraining achieves the best performance. The results indicate that the ACS convolution is an alternative to 3D convolution with comparable or even better performance and a smaller model size. The ACS convolution, as a compact 3D convolution operator, does not lose 3D spatial information. Furthermore, based on the results of 2.5D / ACS UNet r. / p., pretraining is useful for boosting the target task performance, especially when the data scale of the target task is limited, which is very common in medical image datasets. Thanks to its intrinsic structural advantages, the ACS convolution enables 2D-to-3D transfer learning, which is non-trivial for standard 3D convolutions.

4.2 Lung Nodule Classification and Segmentation

Figure 3: Illustration of the proof-of-concept dataset in this study to perform 3D segmentation with 2D pretraining.


We then validate the effectiveness of the proposed method on LIDC-IDRI [armato2011lung], the largest public lung nodule dataset, for both lung nodule segmentation and malignancy classification. The nodules in the CT scans are annotated by up to 4 experts. The annotations include pixel-level labelling of the nodules and 5-level classification of the malignancy, from "1" (highly benign) to "5" (highly malignant). For segmentation, we choose one of the up to 4 annotations for each case. For classification, we take the mode of the annotations as the category. In order to reduce ambiguity, we ignore nodules with level "3" (uncertain labelling) and perform binary classification by categorizing the cases with levels "1 / 2" and "4 / 5" into classes 0 and 1, respectively. This results in a total of 1,633 nodules for classification. We randomly divide the dataset into training and evaluation sets. At the training stage, we perform data augmentation including random-center cropping, random-axis rotation and flipping.
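The label construction described above can be sketched as follows (an illustrative sketch; the tie-breaking rule when two ratings are equally frequent is our assumption, not specified in the text):

```python
import numpy as np

def binarize_malignancy(ratings):
    """Map a nodule's malignancy ratings (1-5, one per annotator) to a
    binary label: take the mode, drop the uncertain level 3, then map
    levels 1/2 -> class 0 and 4/5 -> class 1. Returns None for dropped
    nodules. Ties break toward the lower level (an assumption)."""
    values, counts = np.unique(ratings, return_counts=True)
    mode = values[np.argmax(counts)]
    if mode == 3:
        return None
    return 0 if mode <= 2 else 1

print(binarize_malignancy([4, 5, 4]))  # 1
print(binarize_malignancy([1, 2, 2]))  # 0
print(binarize_malignancy([3, 3, 4]))  # None
```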

Experiment Setting.

We compare the ACS models with the 2.5D and 3D counterparts, with or without pretraining. The pretrained 2.5D / ACS models are adopted from the models in PyTorch's torchvision package [paszke2017automatic], trained on ImageNet [deng2009imagenet]. For 3D pretraining, we use the official pretrained models by Med3D [chen2019med3d] and Video [hara2018can], while the I3D [carreira2017quo] weights are transformed from the 2D ImageNet-pretrained weights, as for the 2.5D / ACS models. To take advantage of the pretrained weights from Med3D [chen2019med3d] and Video [hara2018can] for comparison, all models adopt a ResNet-18 [he2016deep] architecture, except for Models Genesis [zhou2019models], whose official pretrained model is based on a 3D UNet [cciccek20163d] architecture. All models are trained with an Adam optimizer [kingma2014adam], with the learning rate decayed in stages during training. For the ResNet-18 backbone, in order to keep a higher resolution for the output feature maps, we modify the stride of the first layer (a stride-2 convolution) and remove the first max-pooling. Note that this modification still enables pretraining. An FCN-like [Long2015FullyCN] decoder with two progressive upsampling stages is applied. Dice loss is used for segmentation, and binary cross-entropy for classification. Dice global and AUC are reported for the two tasks, respectively. To demonstrate the flexibility and versatility of ACS convolutions, we also report the results of VGG [Simonyan15] and DenseNet [huang2017densely] under a similar experimental setting in the supplementary materials, which are consistent with the ResNet-18 performance.

Result Analysis.

Experiment results are depicted in Table 5. The ACS models consistently outperform all the counterparts by a large margin, including the 2.5D and 3D models under both random-initialization and pretraining settings. We observe that the 3D models (both ACS and 3D) generally outperform the 2.5D models, indicating the usefulness of 3D contexts in 3D medical image modeling. The exception is the pretrained 2.5D model on the classification task, whose superior performance over several 3D counterparts may explain the prior art [xie2017transferable, liu2019multi] using 2D networks on this dataset. As for pretraining, ImageNet [deng2009imagenet] provides a significant performance boost (see 2.5D p., 3D p. I3D [carreira2017quo] and ACS p.), while Med3D [chen2019med3d] brings a limited performance boost. We conjecture that this is owing to the overwhelming data scale and diversity of 2D image datasets. We provide visualization of the lung nodule segmentation for qualitative evaluation in the supplementary materials. Moreover, to investigate the training speed of ACS vs. 3D convolutions, we plot the training curves on the two tasks in Fig. 4. It is observed that ACS p. converges the fastest and best among the 4 models.

Due to the difference in network architecture (ResNet-based FCN vs. UNet), we experiment with the official code of the self-supervised pretrained Models Genesis [zhou2019models] under exactly the same setting. Even without pretraining, the segmentation and classification performance of the UNet-based models is strong on this dataset. Despite this, the pretrained ACS model still performs better. Besides, negative transfer is observed for classification with the MG [zhou2019models] encoder-only transfer, whereas the ImageNet pretraining consistently improves the model performance.

Apart from the superior model performance, the ACS model achieves the best parameter efficiency in our experiments. Taking the segmentation task as an example, the size of the ACS model is 49.8 Mb, compared to 49.8 Mb (2.5D), 142.5 Mb (3D) and 65.4 Mb (MG [zhou2019models]).

Models Segmentation Classification
Models Genesis [zhou2019models] r. 75.5 94.3
Models Genesis [zhou2019models] p. 75.9 94.1
2.5D Res-18 r. 68.8 89.4
2.5D Res-18 p. 69.8 92.0
3D Res-18 r. 74.7 90.3
3D Res-18 p. I3D [carreira2017quo] 75.7 91.5
3D Res-18 p. Med3D [chen2019med3d] 74.9 90.6
3D Res-18 p. Video [hara2018can] 75.7 91.0
ACS Res-18 r. 75.1 92.5
ACS Res-18 p. 76.5 94.9
Table 5: LIDC lung nodule segmentation (Dice global) and classification (AUC) performance. The 2.5D, I3D and ACS ResNet-18 p. are pretrained on ImageNet [deng2009imagenet].

4.3 Liver Tumor Segmentation (LiTS) Benchmark

Figure 4: 3D vs. ACS r. / p. training curves of segmentation and classification on LIDC-IDRI dataset. The curves are smoothed with moving average for better visualization.


We further experiment with our approach on LiTS [bilic2019liver], a challenging 3D medical image segmentation dataset. It consists of contrast-enhanced abdominal CT scans, split into training and test sets, for segmenting the liver and liver tumors. The training annotations are open to the public, while the test ones are only accessible via online evaluation. The in-plane (x, y) sizes of the scans are fixed, while the sizes along the z axis vary from scan to scan. We transpose the axes to keep the notation consistent with the previous sections. For pre-processing, we clip the Hounsfield Units to a fixed window and then normalize the intensities, without spatial normalization. Training data augmentation includes random-center cropping, random-axis flipping and rotation, and random-scale resampling.
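The intensity pre-processing can be sketched as below; note that the clipping window and the target range [0, 1] are hypothetical values for illustration, not the paper's exact setting:

```python
import numpy as np

# Hypothetical abdominal CT window for illustration; the paper's exact
# HU clipping range is not reproduced here.
HU_MIN, HU_MAX = -200.0, 250.0

def preprocess_ct(volume):
    """Clip Hounsfield Units to a fixed window, then rescale to [0, 1]."""
    v = np.clip(volume, HU_MIN, HU_MAX)
    return (v - HU_MIN) / (HU_MAX - HU_MIN)

vol = np.array([-1000.0, 0.0, 400.0])
out = preprocess_ct(vol)  # extremes clip to 0 and 1; 0 HU maps to 200/450
print(out)
```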

Experiment Setting.

A DeepLabv3+ [chen2018encoder] with a ResNet-101 [he2016deep] backbone is used in this experiment. The pretrained 2D model is obtained directly from PyTorch's torchvision package [paszke2017automatic]. The compared baselines are similar to those in the LIDC experiment above (Sec. 4.2). All models are trained with an Adam optimizer [kingma2014adam], with the learning rate decayed in steps over training. At the training stage we crop the volumes to a fixed size; at the test stage we crop the volumes and adopt sliding-window inference along the axis. Dice global and Dice per case for lesion and liver are reported, as is standard on this dataset.
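The sliding-window test scheme along one axis can be sketched as below; the window and step values are symbolic stand-ins for the crop sizes of the setting above, and the overlap-averaging aggregation is our assumption:

```python
import numpy as np

def sliding_window_1d(depth: int, window: int, step: int):
    """Start indices of windows covering [0, depth) along one axis,
    with a final window aligned to the end so no slice is missed."""
    starts = list(range(0, max(depth - window, 0) + 1, step))
    if starts[-1] + window < depth:
        starts.append(depth - window)
    return starts

def predict_volume(volume, predict_fn, window: int, step: int):
    """Average overlapping window predictions along axis 0.

    `predict_fn` maps a (window, H, W) crop to per-voxel scores of the
    same shape; it stands in for the trained segmentation network."""
    out = np.zeros_like(volume, dtype=np.float32)
    count = np.zeros_like(volume, dtype=np.float32)
    for s in sliding_window_1d(volume.shape[0], window, step):
        out[s:s + window] += predict_fn(volume[s:s + window])
        count[s:s + window] += 1
    return out / count
```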

Models Lesion (DG / DPC) Liver (DG / DPC)
H-DenseUNet [li2018h] 82.4 72.2 96.5 96.1
Models Genesis [zhou2019models]* - - - 91.13
(* The authors only release the model pretrained on chest CTs; we therefore report the evaluation metric provided by the paper.)
2.5D DeepLab r. 72.6 56.7 92.1 91.7
2.5D DeepLab p. 73.3 59.8 91.9 91.0
3D DeepLab r. 75.3 62.2 93.8 93.8
3D DeepLab p. I3D [carreira2017quo] 76.4 57.7 93.1 92.4
3D DeepLab p. Med3D [chen2019med3d] 66.8 53.9 91.0 92.6
3D DeepLab p. Video [hara2018can] 65.2 55.8 91.5 92.2
ACS DeepLab r. 75.2 62.1 94.0 93.9
ACS DeepLab p. 78.0 65.3 94.8 94.8
Table 6: LiTS segmentation performance. DG: Dice global. DPC: Dice per case. “DeepLab” denotes 3D / ACS ResNet-101 followed by 3D / ACS ASPP block [chen2018encoder]. The 2.5D, I3D and ACS DeepLab p. are pretrained on MS-COCO [lin2014microsoft].
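The two metrics abbreviated in the caption differ only in how cases are pooled; a minimal sketch of the usual definitions (Dice global pools all voxels before computing one Dice, Dice per case averages the per-scan Dice):

```python
import numpy as np

def dice(pred, target, eps: float = 1e-8) -> float:
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

def dice_global(preds, targets) -> float:
    """Pool the voxels of all cases, then compute a single Dice."""
    return dice(np.concatenate([p.ravel() for p in preds]),
                np.concatenate([t.ravel() for t in targets]))

def dice_per_case(preds, targets) -> float:
    """Compute Dice for each case, then average over cases."""
    return float(np.mean([dice(p, t) for p, t in zip(preds, targets)]))
```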

Result Analysis.

As shown in Table 6, model behavior similar to the LIDC experiment (Sec. 4.2) can be observed. The pretrained ACS DeepLab outperforms the 2D and 3D counterparts (including self-supervised pretraining [zhou2019models]) by a large margin; without pretraining, ACS DeepLab achieves comparable or better performance than 3D DeepLab. For 3D DeepLab pretrained with I3D [carreira2017quo], Med3D [chen2019med3d] and Video [hara2018can], negative transfer is observed, probably due to severe domain shift and anisotropy on the LiTS dataset. We also report the state-of-the-art performance of H-DenseUNet [li2018h] on LiTS as a reference. Note that it adopts a completely different training strategy and network architecture (a heavy cascade of 2D and 3D DenseNet-based [huang2017densely] models), so it is not directly comparable to the other models. In future work, it is feasible to integrate these orthogonal contributions into our models to further improve performance.

5 Ablation Study

5.1 Analysis of ACS Convolution Variants

We analyze two variants of ACS convolutions: Mean-ACS and Soft-ACS. We test all three methods on the LIDC-IDRI dataset, using the experiment settings and training strategy specified in Sec. 4.2. As depicted in Table 7, the vanilla ACS outperforms its variants in most situations, and pretraining is useful in all cases. Specifically, Mean-ACS is the worst under the pretraining setting, since its symmetric aggregation cannot distinguish view-based differences. Soft-ACS outperforms the others in one case (classification with pretraining), though it consumes more GPU memory and time at the training stage. This suggests the potential of combining these ACS variants or training strategies (e.g., automatic kernel axis assignment) in future work.
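For concreteness, the aggregation step of the two variants can be sketched as follows, assuming each view branch yields a same-shape feature map. This is our reading of the variants (symmetric averaging for Mean-ACS, learnable softmax weights for Soft-ACS), not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_aggregate(f_axial, f_coronal, f_sagittal):
    """Mean-ACS-style aggregation: symmetric, view-agnostic averaging."""
    return (f_axial + f_coronal + f_sagittal) / 3.0

class SoftAggregate(nn.Module):
    """Soft-ACS-style aggregation with one learnable weight per view."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(3))  # uniform weights at init

    def forward(self, f_axial, f_coronal, f_sagittal):
        w = F.softmax(self.logits, dim=0)
        return w[0] * f_axial + w[1] * f_coronal + w[2] * f_sagittal
```

Unlike the symmetric mean, the soft weights can learn to favor one view, at the cost of extra parameters and memory.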

Models Seg Cls Memory (Seg) Time (Seg)
ACS r. 75.1 92.5 6.6 Gb 0.95 s
M-ACS r. 74.4 89.9 7.8 Gb 1.49 s
S-ACS r. 75.0 89.3 9.9 Gb 1.58 s
ACS p. 76.5 94.9 6.6 Gb 0.95 s
M-ACS p. 75.1 92.7 7.8 Gb 1.49 s
S-ACS p. 75.9 95.1 9.9 Gb 1.58 s
Table 7: A comparison of ACS convolutions and the Mean-ACS and Soft-ACS variants, with / without pretraining, in terms of LIDC segmentation Dice, classification AUC, actual memory and runtime speed per iteration. Memory and time are measured with a batch size of 2, on a single Titan Xp GPU without gradient checkpointing [chen2016training]. The memory consumption differs from the theoretical analysis (Table 3) due to PyTorch's internal implementation.

5.2 Whole-Network vs. Encoder-Only Pretraining

A key advantage of the proposed ACS convolution is that it enables flexible whole-network conversion together with the pretrained weights. We thereby validate the superiority of whole-network weight transfer (WN) over encoder-only weight transfer (EO). We train models under different pretraining settings: entirely randomly initialized (ACS r.), with only the pretrained ResNet-101 backbone (ACS p.EO) on ImageNet (IMN) [deng2009imagenet] or MS-COCO (MSC) [lin2014microsoft], and with the whole pretrained model (ACS p.WN). The results are shown in Table 8. We observe that the more pretrained weights are loaded, the better the model performs (p.WN > p.EO > r.), with whole-network pretraining the best. Note that although methods like I3D [carreira2017quo], Med3D [chen2019med3d] and Video [hara2018can] provide natively 3D pretrained models, apart from their inferior performance, these pretraining methods are less flexible and versatile than ours. Generally, only the encoders (backbones) are transferred by previous pretraining methods, yet the decoders of state-of-the-art models are also large in parameter size; e.g., the DeepLabv3+ [chen2018encoder] decoder (ASPP) accounts for the remaining 27.5% of the parameters in Table 8. Previous pretraining methods hardly cover such scenarios.
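The difference between p.EO and p.WN transfer amounts to which keys of the pretrained state dict are copied; a sketch in PyTorch, where the tiny model and the `backbone.` naming prefix are hypothetical:

```python
import torch
import torch.nn as nn

class TinySeg(nn.Module):
    """Hypothetical stand-in for an encoder-decoder segmentation model."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(4, 4)  # "encoder"
        self.head = nn.Linear(4, 2)      # "decoder"

def load_encoder_only(model: nn.Module, pretrained_state: dict, prefix: str = "backbone."):
    """Encoder-only (EO) transfer: copy only parameters under `prefix`,
    leaving the decoder randomly initialized. Whole-network (WN)
    transfer would load the full state dict instead."""
    encoder_state = {k: v for k, v in pretrained_state.items() if k.startswith(prefix)}
    model.load_state_dict(encoder_state, strict=False)  # tolerate missing decoder keys
    return encoder_state
```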

Models Size of Lesion Liver
Pretrained Weights DG DPC DG DPC
ACS r. 0 Mb (0%) 75.2 62.1 94.0 93.9
ACS p.EO-IMN 170.0 Mb (72.5%) 75.3 64.3 92.8 92.6
ACS p.EO-MSC 170.0 Mb (72.5%) 76.1 61.6 94.1 93.8
ACS p.WN 234.5 Mb (100%) 78.0 65.3 94.8 94.8
Table 8: LiTS segmentation performance of ACS DeepLab “r.” (initialized randomly), “p.EO-IMN” (encoder-only pretraining on ImageNet [deng2009imagenet]), “p.EO-MSC” (encoder-only pretraining on MS-COCO [lin2014microsoft]), and “p.WN” (whole-network pretraining). The sizes of the pretrained weights, out of the whole models, are also depicted.

6 Conclusion

We propose the ACS convolution for 3D medical images, as a generic and plug-and-play replacement of standard 3D convolution. It enables pretraining from 2D images, which consistently provides a significant performance boost in our experiments. Even without pretraining, the ACS convolution is comparable to or better than 3D convolution, with a smaller model size. In future work, we will focus on automatic ACS kernel axis assignment.



Appendix A Details of Proof-of-Concept Dataset

To generate the 2D dataset, we first equally divide a blank 2D image into four pieces. We randomly select some of the pieces, and in each selected piece we generate a circle or square (with equal probability) of random size at a random center. The size is limited within the piece, so the generated shapes are guaranteed not to overlap. Similarly, to generate the 3D dataset, we equally divide a blank 3D volume into eight pieces. We randomly select some of the pieces, and in each selected piece we generate a cone, pyramid, cube, cylinder or sphere (with equal probability) of random size at a random center, again limited within the piece. For both 2D and 3D datasets, we add Gaussian noise to each pixel / voxel. See Fig. A1 for samples of the proof-of-concept 2D and 3D datasets.
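The 2D generator described above can be sketched as follows; the number of selected pieces, the size bounds and the noise level are not fixed in this appendix, so the values below are assumptions:

```python
import numpy as np

def make_2d_sample(size: int = 64, rng=None):
    """Generate one (noisy input, segmentation target) pair: split the
    image into four quadrants, draw a random-size circle or square
    (equal probability) at a random center inside each selected piece,
    then add Gaussian noise to the input."""
    if rng is None:
        rng = np.random.default_rng()
    target = np.zeros((size, size), np.float32)
    half = size // 2
    quads = [(0, 0), (0, half), (half, 0), (half, half)]
    yy, xx = np.ogrid[:size, :size]
    k = int(rng.integers(1, 5))  # assumed: select 1..4 pieces
    for qi in rng.choice(4, size=k, replace=False):
        r0, c0 = quads[qi]
        radius = int(rng.integers(size // 16, half // 2))   # size limited to the piece
        cy = r0 + int(rng.integers(radius, half - radius))  # keep the shape inside
        cx = c0 + int(rng.integers(radius, half - radius))
        if rng.random() < 0.5:  # circle
            mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2
        else:                   # square
            mask = (np.abs(yy - cy) <= radius) & (np.abs(xx - cx) <= radius)
        target[mask] = 1.0
    noisy = target + rng.normal(0.0, 0.1, target.shape)  # assumed noise level
    return noisy.astype(np.float32), target
```

The 3D generator follows the same pattern with eight octants and five solid shapes.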

Figure A1: Samples of the proof-of-concept 2D and 3D datasets. Images in the first row are two 2D samples, while those in the next three rows are three 3D samples. Images with blue figures are input (before adding noise), while images with red figures are target segmentations.

Appendix B More Results on LIDC-IDRI

Apart from ResNet [he2016deep], we further experiment with the proposed ACS convolutions on the LIDC-IDRI lung nodule classification and segmentation tasks, using VGG [Simonyan15] and DenseNet [huang2017densely]. The experiment settings are exactly the same as those for ResNet-18, stated in Sec. 4.2. As depicted in Tables A1 and A2, the results are consistent with the ResNet-18 performance: the 3D models (3D and ACS) outperform the 2D (2.5D) ones; the randomly-initialized ACS models are comparable to or better than the 3D models; and when pretrained on 2D datasets (e.g., ImageNet [deng2009imagenet]), the ACS models consistently outperform the 3D ones.

Models Segmentation Classification
2.5D VGG-16 r. 71.0 89.7
2.5D VGG-16 p. 71.6 93.9
3D VGG-16 r. 75.0 91.7
3D VGG-16 p. I3D [carreira2017quo] 75.5 94.0
ACS VGG-16 r. 75.2 94.2
ACS VGG-16 p. 75.8 94.3
Table A1: VGG-16 [Simonyan15] results on LIDC lung nodule segmentation (Dice global) and classification (AUC) performance. The 2.5D, I3D and ACS VGG-16 p. are pretrained on ImageNet [deng2009imagenet].
Models Segmentation Classification
2.5D Dense-121 r. 67.4 87.4
2.5D Dense-121 p. 71.8 92.6
3D Dense-121 r. 73.6 90.0
3D Dense-121 p. I3D [carreira2017quo] 73.6 90.0
ACS Dense-121 r. 73.4 89.2
ACS Dense-121 p. 74.7 92.9
Table A2: DenseNet-121 [huang2017densely] results on LIDC lung nodule segmentation (Dice global) and classification (AUC) performance. The 2.5D, I3D and ACS DenseNet-121 p. are pretrained on ImageNet [deng2009imagenet].

Appendix C Qualitative Results on Nodule Segmentation

We visualize the segmentation masks generated by the 2.5D, 3D and ACS ResNet-18, with and without pretraining, in Fig. A2. Combining the visualization with the overall performance (Table 5), the ACS p. model generally segments the target nodules more precisely than its counterparts.

Figure A2: Visualization of the segmentation masks generated by the 2.5D, 3D and ACS ResNet-18, with or without pretraining. The number on top of each image indicates the Dice per case of the sample.

Appendix D Implementation of ACS Convolutions

We provide an open-source PyTorch implementation of ACS convolutions (see the linked code repository). Actual memory consumption and runtime speed are reported in Table A3. Using the provided converter, 2D CNNs can be converted into ACS CNNs for 3D images with a single line of code.

import torch
from torchvision.models import resnet18
from acsconv.converters import ACSConverter
# model_2d is a standard PyTorch 2D model
model_2d = resnet18(pretrained=True)
B, C_in, H, W = (1, 3, 64, 64)
input_2d = torch.rand(B, C_in, H, W)
output_2d = model_2d(input_2d)
# model_3d is dealing with 3D data
model_3d = ACSConverter(model_2d)
B, C_in, D, H, W = (1, 3, 64, 64, 64)
input_3d = torch.rand(B, C_in, D, H, W)
output_3d = model_3d(input_3d)
Models Seg Cls Memory (Seg) Time (Seg)
2D r. 68.8 89.4 5.0 Gb 0.57 s
3D r. 74.7 90.3 5.0 Gb 1.01 s
ACS r. 75.1 92.5 6.6 Gb 0.95 s
Table A3: Model performance, memory consumption and runtime speed of 2D (2.5D), 3D and ACS convolutions. Due to engineering issues, the memory consumption of ACS convolutions is larger than that of 2D (2.5D) and 3D convolutions, though theoretically identical (see Table 3); it is expected to be reduced (from 6.6 Gb to 5.0 Gb) in a future implementation via custom memory checkpointing. Even though the time complexity of ACS and 2D convolutions is the same, the ACS convolutions parallelize less well, so their actual runtime is slower than that of 2D convolutions.