Deep learning has dominated medical image analysis research [16, 20] across a wide range of tasks, e.g., classification [6, 4, 34, 33], segmentation [11, 22], detection [28, 23, 30], and registration [1, 3]. However, deploying medical image AI systems remains challenging due to numerous difficulties, e.g., open-set scenarios, calibration and uncertainty quantification [7, 10] under real-world distributions, and label ambiguity in clinical annotations [12, 31]. In this study, we focus on a fundamental issue in 3D medical image analysis: how to deal with imaging thickness, i.e., the physical distance between axial slices. In practice, both thin-slice (mostly 1mm) and thick-slice (mostly 5mm) volumes exist for the same task, e.g., lesion detection, organ and tumor segmentation. The standard procedure treats this issue as pre-processing: spatial normalization is commonly applied to resample the dataset to the same reference thickness (e.g., 2mm). However, spatial normalization may amplify unwanted noise in medical images. Fig. 1 depicts spatially normalized thin- and thick-slice computed tomography (CT) scans of the same subject. As illustrated, spatial normalization introduces significant artifacts into the thick-slice scan (note the sagittal and coronal views). If the spatially normalized thin- and thick-slice volumes are processed by the same CNN with standard convolutions, domain shift arises. We conjecture that this is why 3D approaches are preferred for thin-slice volumes, while 2D approaches tend to be superior for thick-slice/anisotropic volumes. Spatial normalization of thick-slice data causes larger information loss than that of thin-slice data. For this reason, we challenge spatial normalization as a standard pre-processing procedure for 3D medical image processing.
To address the thickness issue, we propose a novel parameter-free operator, AlignShift, to convert theoretically any 2D pretrained network into a thickness-aware 3D network. The proposed AlignShift operator is inspired by the Temporal Shift Module (TSM) in video analysis, which enables temporal (or 3D) information fusion by shifting adjacent slices (details in Sec. 2.1). Notably, TSM enables 2D-to-3D transfer learning, i.e., 3D networks pretrained on 2D datasets, which is also highly related to our previous study. Although superior to 2.5D approaches [28, 35, 13], TSM does not bridge the performance gap between thin- and thick-slice volumes (see Sec. 3.3). In comparison, the AlignShift operator shifts and fuses aligned “virtual slices” as per the input imaging thickness, which results in unified thickness-aware representation learning (details in Sec. 2.2). Remarkably, AlignShift-converted networks adaptively behave like 3D for thin slices and like 2D for thick slices.
We validate the effectiveness of the proposed method on the large-scale DeepLesion benchmark, a universal lesion detection dataset with 3D inputs and key-slice annotations of 32K lesions. Without bells and whistles, the proposed methods outperform the previous state of the art [28, 35, 13] by considerable margins. More importantly, our method closes the performance gap between thin- and thick-slice volumes compared to both 2.5D and TSM approaches; to our knowledge, we are the first to achieve this within a unified framework.
2.1 Preliminary: Temporal Shift Module (TSM)
Prior arts in 3D image processing utilize pure 2D networks to leverage 2D pretrained weights, while randomly-initialized 3D networks must be adopted to fuse features into 3D representations. 2.5D representations, i.e., several slices stacked as channels of a 2D network, are insufficient to capture 3D contexts. It is thus meaningful to directly convert a 2D pretrained network into a 3D counterpart. To this end, we introduce the Temporal Shift Module (TSM) from the field of video understanding, which enables 2D-to-3D network conversion. To our knowledge, this paper is the first to introduce TSM into medical images, with proven effectiveness on the DeepLesion benchmark. TSM leverages the data shift operation to capture 3D semantics within a 2D CNN framework, where data shift means shifting data along a dimension by a certain number of slices. In practice, it is inserted as an operator before a 3D convolution. Given a 3D tensor of shape (C, D, H, W) with channel C, depth D, height H and width W, TSM shifts the slices in the depth dimension by +1 slice in one part of the channels and by -1 slice in another part, while the rest of the channels remain static (see Figure 2 Middle). Information between slices is thereby fused through the channel dimension. To some extent, TSM imitates 3D approaches by slice shifting, and is capable of processing 3D data efficiently. However, for medical images, TSM faces the issue of varying imaging thickness, which is widespread in many medical image datasets such as DeepLesion. TSM itself does not deal with this issue; hence all volumes are supposed to be normalized to the same thickness via interpolation before being fed into a TSM model, no matter how large the imaging thickness is. For thick-slice volumes, this interpolation can cause large distortion of information, as the highlighted artifacts show in Figure 1.
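The slice-shift mechanism described above can be sketched in PyTorch as follows; the channel split ratio (`fold_div`) and the zero padding at the boundary slices are assumptions for illustration, not necessarily the exact configuration used in the paper.

```python
import torch


def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """TSM-style data shift along the depth dimension of an (N, C, D, H, W) tensor.

    The first C // fold_div channels are shifted by +1 slice, the next
    C // fold_div channels by -1 slice, and the remaining channels stay
    static. Vacated boundary slices are zero-padded (an assumption here).
    """
    n, c, d, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :fold, 1:] = x[:, :fold, :-1]                  # shift by +1 slice
    out[:, fold:2 * fold, :-1] = x[:, fold:2 * fold, 1:]  # shift by -1 slice
    out[:, 2 * fold:] = x[:, 2 * fold:]                   # static channels
    return out
```

Inserted before a convolution whose kernel does not span the depth axis, the shifted channels let that convolution mix information from adjacent slices at zero extra parameter cost.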
According to the extensive experiments in Section 3, TSM performs well on thin-slice volumes, while its performance declines drastically on thick-slice volumes.
We believe that spatial normalization by interpolation induces the performance gap between thin- and thick-slice volumes (see results in Table 3): domain shift takes place when normalizing thick-slice volumes, which damages performance. To address the thickness issue, we introduce virtual slices and propose AlignShift, which enables an adaptive data shift operation based on the given imaging thickness. AlignShift avoids spatial normalization of thick-slice volumes by treating thin- and thick-slice volumes separately. Without loss of generality, we define thick volumes as those whose thickness is larger than a reference thickness, and vice versa. For thin-slice volumes, we normalize the thickness to the reference thickness by interpolation as usual. For thick-slice volumes, which could easily be skewed by interpolation, the original thicknesses are kept. Given a 3D feature tensor of shape (C, D, H, W), with channel C, depth D, height H, width W and a physical thickness on the depth dimension, AlignShift shifts part of the channels up and another part of the channels down along the depth dimension, while the rest of the channels remain static. In order to maintain a consistent “receptive field” in the physical sense along the depth dimension, it shifts the data by a continuous step, whose step size (align factor) depends on the reference thickness and the volume’s actual thickness. As illustrated in Figure 2, the shifted slice, called a “virtual slice”, is obtained by interpolation between the two adjacent slices. See Algorithm 1 for the mathematical formulation. In comparison, TSM discretely shifts the data by one full slice; given non-unified thickness, this strategy results in an inconsistent “receptive field” along the depth dimension in convolution, i.e., the data shift operation of TSM is “unaligned” in this situation. In contrast, our method aligns the 3D features of various thicknesses, allowing the network to learn thickness-aware representations with the same kernels: AlignShift guarantees that the physical distance between shifted and un-shifted channels is always consistent among volumes with different thicknesses. AlignShift thus bridges the performance gap between thin-slice and thick-slice CTs both theoretically and empirically.
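Under the definitions above, a minimal PyTorch sketch of AlignShift could look like the following; the channel split (`fold_div`), the zero-padded boundary handling, and the exact form `s = ref_thickness / thickness` of the align factor are our assumptions based on the description, not the paper's verbatim Algorithm 1.

```python
import torch


def align_shift(x: torch.Tensor, thickness: float,
                ref_thickness: float = 2.0, fold_div: int = 8) -> torch.Tensor:
    """Sketch of AlignShift on an (N, C, D, H, W) feature tensor.

    `thickness` is the volume's physical slice spacing (mm). The align
    factor s = ref_thickness / thickness in (0, 1] defines a fractional
    shift; the shifted "virtual slice" is a linear interpolation of the
    two adjacent real slices. As thickness grows, s -> 0 and the operator
    degenerates to 2D (virtual slices coincide with the original slices).
    """
    n, c, d, h, w = x.shape
    s = min(ref_thickness / thickness, 1.0)  # align factor
    fold = c // fold_div
    out = x.clone()
    up, down = x[:, :fold], x[:, fold:2 * fold]
    # shift one channel group by +s slices (virtual slice between i-1 and i)
    out[:, :fold, 1:] = (1 - s) * up[:, :, 1:] + s * up[:, :, :-1]
    out[:, :fold, 0] = (1 - s) * up[:, :, 0]
    # shift another group by -s slices (virtual slice between i and i+1)
    out[:, fold:2 * fold, :-1] = (1 - s) * down[:, :, :-1] + s * down[:, :, 1:]
    out[:, fold:2 * fold, -1] = (1 - s) * down[:, :, -1]
    return out
```

For a thin-slice volume normalized to the reference thickness, s = 1 and the operator reduces exactly to the TSM full-slice shift; for a very thick volume it approaches the identity, i.e., 2D behavior.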
In practice, AlignShift is simple to use and implement. Similar to TSM, it is inserted as an operator before a 3D convolution. No additional spatial normalization is needed; the original thickness is passed to the network to enable the adaptive data shift operation. Compared to TSM, only a modest modification is needed to gain a great performance boost. Note that AlignShift captures 3D semantics like TSM, while it degenerates to 2D for data with extremely large thickness, as the align factor approaches zero.
AlignShift serves as a parameter-free operator that converts theoretically any 2D pretrained network into a thickness-aware 3D network. The conversion process is straightforward. Table 1 lists how the main operators in 2D CNNs are converted to their 3D counterparts.
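As a concrete illustration of such a conversion, the snippet below shows one plausible way to turn a pretrained `nn.Conv2d` into an `nn.Conv3d` with a 1×k×k kernel that reuses the 2D weights; the exact conversion rules of Table 1 may differ in detail.

```python
import torch
import torch.nn as nn


def convert_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Convert a pretrained 2D convolution into a 3D convolution with a
    1 x k x k kernel, reusing the 2D weights along the extra depth axis.

    Paired with a preceding (Align)Shift operator, the 1 x k x k kernel
    fuses inter-slice information through the shifted channels.
    Assumes integer stride/padding (no 'same' string padding).
    """
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(1, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (O, I, kh, kw) -> (O, I, 1, kh, kw)
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

By construction, applying the converted layer slice by slice reproduces the original 2D convolution, which is what allows 2D pretrained weights to be reused without modification.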
2.3 3D Network for Universal Lesion Detection on Key Slices
We experiment with the proposed method on the DeepLesion benchmark, a large-scale dataset for universal lesion detection. The inputs are 3D slices, whereas only 2D key-slice annotations are available. We develop a 3D network with 2D detection heads, based on Mask R-CNN. As illustrated in Figure 3, the 3D backbone is converted from a truncated DenseNet-121 with AlignShift, serving as a 3D feature encoder. All 2D convolutions in the dense blocks are converted to AlignShift 3D convolutions. The encoder takes a grey-scale 3D tensor as input, whose depth equals the number of key slices, and extracts 3D features through three dense blocks. Each dense block increases the number of feature channels and downsamples the features in the height and width dimensions, while maintaining the scale in the depth dimension. The feature output of each dense block is processed by a 3D convolution and then squeezed into a 2D shape. A 2D decoder combines these features at different resolutions and upsamples them step by step. The final feature map is fed into the RPN head, BBox head and Mask head for detection and instance segmentation, supervised by weak RECIST labels.
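The “squeezed into a 2D shape” step might be sketched as follows; the depth-spanning Conv3d kernel is our assumption for illustration, since the exact kernel size of the 3D convolution is not specified in the text.

```python
import torch
import torch.nn as nn


class Squeeze3DTo2D(nn.Module):
    """Hedged sketch: collapse the depth axis of a 3D feature map so it
    can feed 2D detection heads (RPN/BBox/Mask).

    A Conv3d whose kernel spans the full depth reduces
    (N, C, D, H, W) -> (N, C, 1, H, W), which is then squeezed to 2D.
    """

    def __init__(self, channels: int, depth: int):
        super().__init__()
        self.reduce = nn.Conv3d(channels, channels, kernel_size=(depth, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.reduce(x).squeeze(2)  # -> (N, C, H, W)
```

In the architecture described above, one such module per dense-block output would hand 2D feature maps to the 2D decoder and Mask R-CNN heads.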
We implement Mask R-CNN with PyTorch and MMDetection. As counterparts of the proposed AlignShift, we also implement 1) a 2.5D Mask R-CNN, which stacks slices as input channels of a standard 2D network, and 2) a TSM-converted Mask R-CNN, which uses TSM instead of AlignShift. Note that the model sizes of the TSM models are strictly the same as those of the AlignShift models, and the 2.5D models are slightly smaller because of the Conv3D layer.
3.1 Dataset and Experiment Setting
The DeepLesion dataset consists of 32,120 axial CT slices from 10,594 studies of unique patients. The slice thickness of the dataset is mostly 1mm (48.66%) or 5mm (50.26%), which makes it appropriate for developing and evaluating the proposed method. There are 1 to 3 lesions in each slice, with 32,735 lesions in total from several organs, whose sizes vary from 0.21 to 342.5mm. RECIST diameter coordinates and bounding boxes were labeled on the key slices, with adjacent slices (within 30mm above and below) provided as contextual information. We use GrabCut to generate weak segmentation “ground truth” from the weak RECIST labels, as in Improved RetinaNet. Hounsfield units of the input are clipped and normalized. In the AlignShift experiments, 2mm is regarded as the reference thickness: we normalize the thickness to 2mm for thin-slice data (thickness ≤ 2mm); for thick-slice data (thickness > 2mm), we keep the original thickness, since spatial normalization of thick-slice data leads to larger information loss than that of thin-slice data. For the 2.5D and TSM counterparts, we adopt the standard strategy of normalizing the thickness of all images to 2mm. Data augmentation including horizontal flip, shift, rescaling and rotation is applied during training; no test-time augmentation (TTA) is applied. Each input slice is resized to a fixed resolution before being fed into the networks. We use the official data split (training/validation/test: 70%/15%/15%); following prior studies [28, 35, 13], sensitivity at various false-positive levels (i.e., FROC analysis) is evaluated on the test set.
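The thickness-dependent part of this pre-processing can be sketched as follows; the HU clipping window `hu_window` is a placeholder (the exact range is not given here), and the depth-wise linear resampling is a simplification of the interpolation used in practice.

```python
import numpy as np


def preprocess(volume: np.ndarray, thickness: float,
               ref_thickness: float = 2.0,
               hu_window: tuple = (-1024.0, 2048.0)):
    """Sketch of thickness-aware CT pre-processing.

    `volume` is (D, H, W) with depth first; `hu_window` is an assumed HU
    clipping range, not the paper's exact values. Thin volumes
    (thickness <= ref) are resampled along depth to the reference
    thickness; thick volumes keep their native spacing, and the thickness
    is returned so AlignShift can adapt its align factor.
    """
    lo, hi = hu_window
    vol = np.clip(volume.astype(np.float32), lo, hi)
    vol = (vol - lo) / (hi - lo)  # normalize HU to [0, 1]
    if thickness <= ref_thickness:
        new_d = max(int(round(vol.shape[0] * thickness / ref_thickness)), 1)
        idx = np.linspace(0, vol.shape[0] - 1, new_d)
        lo_i = np.floor(idx).astype(int)
        hi_i = np.minimum(lo_i + 1, vol.shape[0] - 1)
        frac = (idx - lo_i)[:, None, None]
        vol = (1 - frac) * vol[lo_i] + frac * vol[hi_i]  # linear resample
        thickness = ref_thickness
    return vol, thickness
```

The returned thickness is what the network receives alongside the volume, so that thin data arrives normalized to 2mm while thick data keeps its native spacing.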
| Methods | FPs@0.5 | FPs@1 | FPs@2 | FPs@4 | FPs@8 | FPs@16 | Avg.[0.5,1,2,4] |
|---|---|---|---|---|---|---|---|
| 3DCE MICCAI'18 | 62.48 | 73.37 | 80.70 | 85.65 | 89.09 | 91.06 | 75.55 |
| ULDor ISBI'19 | 52.86 | 64.80 | 74.84 | 84.38 | 87.17 | 91.80 | 69.22 |
| V.Attn MICCAI'19 | 69.10 | 77.90 | 83.80 | - | - | - | - |
| Retina. MICCAI'19 | 72.15 | 80.07 | 86.40 | 90.77 | 94.09 | 96.32 | 82.35 |
| MVP MICCAI'19 | 70.01 | 78.77 | 84.71 | 89.03 | - | - | 80.63 |
| MVP MICCAI'19 | 73.83 | 81.82 | 87.60 | 91.30 | - | - | 83.64 |
| MULAN MICCAI'19 | 76.12 | 83.69 | 88.76 | 92.30 | 94.71 | 95.64 | 85.22 |
| w/o SRL MICCAI'19 | - | - | - | - | - | - | 84.22 |
3.2 Performance Compared with State of the Art
In Table 2, we report sensitivity at various false positives per image (FPs), which shows that our proposed methods significantly outperform the previous state-of-the-art MULAN. Notably, this is achieved without additional information beyond the CT images, such as tags from medical reports or demographic information. The performance of our 2.5D counterparts is comparable to Improved RetinaNet and MVP-Net. The proposed TSM-converted networks outperform these studies [35, 13], and even MULAN without tag supervision, which validates the superiority of pretrained 3D backbones over 2.5D ones. AlignShift further boosts the performance and surpasses both our TSM models and the previous state-of-the-art MULAN. Due to memory constraints, we report performance with at most 7 slices, whereas better performance is expected with more slices.
3.3 Performance Analysis on Thin-Slice and Thick-Slice Data
To demonstrate the benefit of AlignShift in bridging the performance gap between thin- and thick-slice data, we conducted a performance analysis of our 2.5D, TSM and AlignShift models on thin- and thick-slice data separately. As there are no openly available trained models for the prior state of the art [28, 35, 13], we take our 2.5D models as representative of these studies' performance, considering that they all follow the 2.5D fashion. As shown in Table 3, there is a significant performance gap between thin- and thick-slice CTs for the 2.5D and TSM approaches, which validates our argument (Sec. 1) that significant domain shift arises if thin- and thick-slice data are processed by the same CNN with standard convolutions. Note that the gap becomes larger when more slices are used, since 3D information is less valuable for thick-slice data than for thin-slice data. In comparison, by learning unified thickness-aware representations, the proposed AlignShift reduces the performance gap to a negligible level in the 3-slice setting; even in the 7-slice setting, the gap is largely closed compared to the 2.5D and TSM approaches.
In this study, we challenge spatial normalization as a standard pre-processing approach for the thickness issue in 3D medical images. Our experimental results indicate that both 2.5D and 3D (i.e., TSM in this study) approaches do not fundamentally address the domain shift introduced by spatial normalization, which results in a significant performance gap between thin- and thick-slice data. In this regard, we propose a novel parameter-free operator, AlignShift, which converts theoretically any 2D pretrained network into a thickness-aware 3D network. Extensive experiments on the DeepLesion benchmark empirically validate that our method bridges the gap between thin- and thick-slice data. Without bells and whistles, we establish a new state of the art on DeepLesion, surpassing prior arts by considerable margins.
-  Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: Voxelmorph: a learning framework for deformable medical image registration. IEEE transactions on medical imaging (2019)
-  Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
-  Dalca, A.V., Balakrishnan, G., Guttag, J., Sabuncu, M.R.: Unsupervised learning for fast probabilistic diffeomorphic registration. In: MICCAI. pp. 729–738. Springer (2018)
-  Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115 (2017)
-  Glocker, B., Robinson, R., de Castro, D.C., Dou, Q., Konukoglu, E.: Machine learning with multi-site imaging data: An empirical study on the impact of scanner effects. In: Medical Imaging meets NeurIPS Workshop (2019)
-  Gulshan, V., Peng, L., Coram, M., Stumpe, M.C., Wu, D., Narayanaswamy, A., Venugopalan, S., Widner, K., Madams, T., Cuadros, J., et al.: Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. Jama 316(22), 2402–2410 (2016)
-  Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q.: On calibration of modern neural networks. In: ICML (2017)
-  He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask r-cnn. ICCV pp. 2980–2988 (2017)
-  Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. vol. 1, p. 3 (2017)
-  Huang, X., Yang, J., Li, L., Deng, H., Ni, B., Xu, Y.: Evaluating and boosting uncertainty quantification in classification. arXiv preprint arXiv:1909.06030 (2019)
-  Isensee, F., Petersen, J., Klein, A., Zimmerer, D., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., et al.: nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486 (2018)
-  Kohl, S., Romera-Paredes, B., Meyer, C., De Fauw, J., Ledsam, J.R., Maier-Hein, K., Eslami, S.A., Rezende, D.J., Ronneberger, O.: A probabilistic u-net for segmentation of ambiguous images. In: NIPS. pp. 6965–6975 (2018)
-  Li, Z., Zhang, S., Zhang, J., Huang, K., Wang, Y., Yu, Y.: Mvp-net: Multi-view fpn with position-aware attention for deep universal lesion detection. In: MICCAI. pp. 13–21. Springer (2019)
-  Lin, J., Gan, C., Han, S.: Tsm: Temporal shift module for efficient video understanding. ICCV pp. 7082–7092 (2019)
-  Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. CVPR pp. 936–944 (2017)
-  Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., Van Der Laak, J.A., Van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical image analysis 42, 60–88 (2017)
-  Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., Devito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
-  Rother, C., Kolmogorov, V., Blake, A.: "GrabCut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG) 23(3), 309–314 (2004)
-  Scheirer, W.J., Rocha, A., Sapkota, A., Boult, T.E.: Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 1757–1772 (2013)
-  Shen, D., Wu, G., Suk, H.I.: Deep learning in medical image analysis. Annual review of biomedical engineering 19, 221–248 (2017)
-  Simpson, A.L., Antonelli, M., Bakas, S., Bilello, M., Farahani, K., van Ginneken, B., Kopp-Schneider, A., Landman, B.A., Litjens, G.J.S., Menze, B.H., Ronneberger, O., Summers, R.M., Bilic, P., Christ, P.F., Do, R.K.G., Gollub, M., Golia-Pernicka, J., Heckers, S., Jarnagin, W.R., McHugo, M., Napel, S., Vorontsov, E., Maier-Hein, L., Cardoso, M.J.: A large annotated medical image dataset for the development and evaluation of segmentation algorithms. ArXiv abs/1902.09063 (2019)
-  Tang, H., Chen, X., Liu, Y., Lu, Z., You, J., Yang, M., Yao, S., Zhao, G., Xu, Y., Chen, T., et al.: Clinically applicable deep learning framework for organs at risk delineation in ct images. Nature Machine Intelligence pp. 1–12 (2019)
-  Tang, H., Zhang, C., Xie, X.: Nodulenet: Decoupled false positive reduction for pulmonary nodule detection and segmentation. In: MICCAI. pp. 266–274. Springer (2019)
-  Tang, Y.B., Yan, K., Tang, Y.X., Liu, J., Xiao, J., Summers, R.M.: Uldor: a universal lesion detector for ct scans with pseudo masks and hard negative example mining. In: ISBI. pp. 833–836. IEEE (2019)
-  Wang, X., Han, S., Chen, Y., Gao, D., Vasconcelos, N.: Volumetric attention for 3d medical image segmentation and detection. In: MICCAI. pp. 175–184. Springer (2019)
-  Wu, B., Wan, A., Yue, X., Jin, P.H., Zhao, S., Golmant, N., Gholaminejad, A., Gonzalez, J., Keutzer, K.: Shift: A zero flop, zero parameter alternative to spatial convolutions. CVPR pp. 9127–9135 (2018)
-  Yan, K., Bagheri, M., Summers, R.M.: 3d context enhanced region-based convolutional neural network for end-to-end lesion detection. In: MICCAI. pp. 511–519. Springer (2018)
-  Yan, K., Tang, Y., Peng, Y., Sandfort, V., Bagheri, M., Lu, Z., Summers, R.M.: Mulan: Multitask universal lesion analysis network for joint lesion detection, tagging, and segmentation. In: MICCAI (2019)
-  Yan, K., Wang, X., Lu, L., Zhang, L., Harrison, A.P., Bagheri, M., Summers, R.M.: Deep lesion graphs in the wild: relationship learning and organization of significant radiology image findings in a diverse large-scale lesion database. In: CVPR. pp. 9261–9270 (2018)
-  Yang, J., Deng, H., Huang, X., Ni, B., Xu, Y.: Relational learning between multiple pulmonary nodules via deep set attention transformers. In: ISBI (2020)
-  Yang, J., Fang, R., Ni, B., Li, Y., Xu, Y., Li, L.: Probabilistic radiomics: Ambiguous diagnosis with controllable shape analysis. In: MICCAI. pp. 658–666. Springer (2019)
-  Yang, J., Huang, X., Ni, B., Xu, J., Yang, C., Xu, G.: Reinventing 2d convolutions for 3d medical images. arXiv preprint arXiv:1911.10477 (2019)
-  Zhao, W., Yang, J., Ni, B., Bi, D., Sun, Y., Xu, M., Zhu, X., Li, C., Jin, L., Gao, P., Wang, P., Hua, Y., Li, M.: Toward automatic prediction of egfr mutation status in pulmonary adenocarcinoma with 3d deep learning. Cancer Medicine (2019)
-  Zhao, W., Yang, J., Sun, Y., Li, C., Wu, W., Jin, L., Yang, Z., Ni, B., Gao, P., Wang, P., Hua, Y., Li, M.: 3d deep learning from ct scans predicts tumor invasiveness of subcentimeter pulmonary adenocarcinomas. Cancer Research 78(24), 6881–6889 (2018)
-  Zlocha, M., Dou, Q., Glocker, B.: Improving retinanet for ct lesion detection with dense masks from weak recist labels. In: MICCAI. pp. 402–410. Springer (2019)