AlignShift: Bridging the Gap of Imaging Thickness in 3D Anisotropic Volumes

05/05/2020 ∙ by Jiancheng Yang, et al. ∙ Shanghai Jiao Tong University

This paper addresses a fundamental challenge in 3D medical image processing: how to deal with imaging thickness. For anisotropic medical volumes, there is a significant performance gap between thin-slice (mostly 1mm) and thick-slice (mostly 5mm) volumes. Prior art tends to use 3D approaches for thin-slice volumes and 2D approaches for thick-slice volumes. We aim at a unified approach for both. Inspired by recent advances in video analysis, we propose AlignShift, a novel parameter-free operator that converts theoretically any 2D pretrained network into a thickness-aware 3D network. Remarkably, the converted networks behave like 3D networks for thin-slice volumes, yet adaptively degenerate to 2D for thick-slice volumes. The unified thickness-aware representation learning is achieved by shifting and fusing aligned "virtual slices" according to the input imaging thickness. Extensive experiments on the public large-scale DeepLesion benchmark, consisting of 32K lesions for universal lesion detection, validate the effectiveness of our method, which outperforms the previous state of the art by considerable margins, without bells and whistles. More importantly, to our knowledge, this is the first method that bridges the performance gap between thin- and thick-slice volumes with a unified framework. To improve research reproducibility, our code in PyTorch is open source.




1 Introduction

Deep learning has been dominating medical image analysis research [16, 20] in a wide range of tasks (e.g., classification [6, 4, 34, 33], segmentation [11, 22], detection [28, 23, 30], registration [1, 3]). However, deploying medical image AI systems remains challenging due to numerous difficulties, e.g., open-set scenarios [19], calibration and uncertainty quantification [7, 10] under real-world distributions, and label ambiguity in clinical annotations [12, 31].

In this study, we focus on a fundamental issue in 3D medical image analysis: how to deal with the imaging thickness, which denotes the physical distance between axial slices. In practice, both thin-slice (mostly 1mm) and thick-slice (mostly 5mm) volumes exist for the same task, e.g., lesion detection [29], organ and tumor segmentation [21]. The standard procedure treats this issue as pre-processing: spatial normalization is commonly applied to resample the dataset to a common reference thickness (e.g., 2mm). However, spatial normalization may amplify unwanted noise in medical images [5]. Fig. 1 depicts spatially normalized thin- and thick-slice computed tomography (CT) scans of the same subject. As illustrated, spatial normalization introduces significant artifacts into the thick-slice volume (note the sagittal and coronal views). If the spatially normalized thin- and thick-slice volumes are processed by the same CNN with standard convolutions, domain shift results. We conjecture that this is why 3D approaches are preferred for thin-slice volumes, while 2D approaches tend to be superior for thick-slice/anisotropic volumes [11]. Spatial normalization causes a larger information loss for thick-slice data than for thin-slice data. For this reason, we challenge spatial normalization as a standard pre-processing procedure for 3D medical image processing.

Figure 1: Illustration of spatially normalized thin- and thick-slice computed tomography (CT) scans of the same subject. Data is from our custom dataset. Note that even though there is no significant difference between the axial slices, artifacts (the highlights on the plots) are not negligible in the sagittal and coronal views of the thick-slice data, which results in domain shift for 3D approaches.

To address the thickness issue, we propose a novel parameter-free operator, AlignShift, to convert theoretically any 2D pretrained network into a thickness-aware 3D network. The proposed AlignShift operator is inspired by the Temporal Shift Module (TSM) in video analysis [14], which enables temporal (or 3D) information fusion by shifting adjacent slices (details in Sec. 2.1). Notably, TSM enables 2D-to-3D transfer learning, i.e., pretraining 3D networks on 2D datasets, which is also highly related to our previous study [32]. Although superior to 2.5D approaches [28, 35, 13], TSM does not bridge the performance gap between thin- and thick-slice volumes (see Sec. 3.3). In comparison, the AlignShift operator shifts and fuses aligned “virtual slices” according to the input imaging thickness, which results in unified thickness-aware representation learning (details in Sec. 2.2). Remarkably, the AlignShift-converted networks adaptively behave like 3D networks for thin-slice volumes, and like 2D networks for thick-slice volumes.

We validate the effectiveness of the proposed method on the large-scale DeepLesion benchmark [29], a universal lesion detection dataset with 3D inputs and key-slice annotations of 32K lesions. Without bells and whistles, the proposed methods outperform the previous state of the art [28, 35, 13] by considerable margins. More importantly, our method narrows the performance gap between thin- and thick-slice volumes compared to both 2.5D and TSM approaches; to our knowledge, we are the first to achieve this with a unified framework.

2 Methods

Figure 2: Illustration of Temporal Shift Module (TSM) [14] and the proposed AlignShift. Left: A 3D tensor ($C \times D \times H \times W$) with channel $C$, depth $D$, height $H$, width $W$; $H$ and $W$ are not depicted for simplicity. Middle: Temporal Shift Module (TSM). The channels are split into three parts for shifting up, shifting down and keeping original. Border slices are padded with zeros. Right: AlignShift. Instead of shifting physical slices as in TSM, we shift “virtual” slices (solid lines in different colors) in AlignShift. The virtual slices are interpolated from shifted slices (dashed lines) according to a reference thickness.


2.1 Preliminary: Temporal Shift Module (TSM)

Prior art in 3D image processing uses pure 2D networks to leverage 2D pretrained weights, whereas a randomly initialized 3D network must be adopted to fuse the features into 3D representations. The 2.5D representation, i.e., several slices as channels for 2D networks, is insufficient to capture 3D context. It is thus meaningful to directly convert a 2D pretrained network into a 3D counterpart. To this end, we introduce the Temporal Shift Module (TSM) [14] from the field of video understanding, which enables 2D-to-3D network conversion. To our knowledge, this paper is the first to introduce TSM into medical images, with proven effectiveness on the DeepLesion [29] benchmark.

TSM leverages the data shift operation [26] to capture 3D semantics under a 2D CNN framework, where data shift means shifting data along a dimension by a certain number of slices. In practice, it is inserted as an operator before a 3D convolution. Given a 3D tensor ($C \times D \times H \times W$) with channel $C$, depth $D$, height $H$, width $W$, TSM shifts the slices along the depth dimension by $+1$ slice in one part of the channels and by $-1$ slice in another part, while the remaining channels stay static (see Figure 2, Middle). In this way, information from adjacent slices is fused across channels. To some extent, TSM imitates 3D approaches by shifting slices, and it processes 3D data efficiently.

However, for medical images, TSM faces the issue of varying imaging thickness, which is widespread in medical image datasets such as DeepLesion [29]. TSM itself does not deal with this issue. Hence all volumes must be normalized to the same thickness via interpolation before being fed into a TSM model, no matter how large the imaging thickness is. For thick-slice volumes, the interpolation can severely distort the information, as shown by the highlighted artifacts in Figure 1. According to the experiments in Section 3, TSM performs well on thin-slice volumes, while its performance drops markedly on thick-slice volumes.
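
The shift described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `temporal_shift`, the parameter `n_shift` (the number of channels in each shifted group), and the choice of which direction counts as "up" are assumptions made here.

```python
import numpy as np


def temporal_shift(x, n_shift):
    """TSM-style shift of a (C, D, H, W) array along the depth axis.

    The first `n_shift` channels move by +1 slice, the next `n_shift`
    channels move by -1 slice, and the remaining channels stay in place.
    Slices shifted in from outside the volume are zero (zero padding).
    """
    if 2 * n_shift > x.shape[0]:
        raise ValueError("n_shift must be at most half the channel count")
    out = np.zeros_like(x)
    # Group 1: slice d of the output holds slice d-1 of the input.
    out[:n_shift, 1:] = x[:n_shift, :-1]
    # Group 2: slice d of the output holds slice d+1 of the input.
    out[n_shift:2 * n_shift, :-1] = x[n_shift:2 * n_shift, 1:]
    # Remaining channels are copied unchanged.
    out[2 * n_shift:] = x[2 * n_shift:]
    return out


# Toy volume: 3 channels, 3 slices, 1x1 pixels, so each slice is one number.
x = np.arange(9, dtype=float).reshape(3, 3, 1, 1)
y = temporal_shift(x, n_shift=1)
print(y[:, :, 0, 0])
```

After the shift, every depth position carries features from its two neighbouring slices in the shifted channels, so a convolution that does not mix slices itself can still fuse 3D context. The shift is always one full slice, regardless of the physical thickness of that slice.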

2.2 AlignShift

Algorithm 1: In-place AlignShift for 3D volumes (zero padding)
Input: input 3D feature, input actual thickness, shift channels, reference thickness.
Output: the input 3D feature (in-place assignment).
1. Compute align factor.
2. Shift up.
3. Obtain virtual slices.
4. Shift down.
5. Obtain virtual slices.

We believe that spatial normalization by interpolation induces the performance gap between thin- and thick-slice volumes (see results in Table 3). Domain shift takes place when normalizing thick-slice volumes, which damages performance. To address the thickness issue, we introduce virtual slices and propose AlignShift, which enables an adaptive data shift operation based on the given imaging thickness. AlignShift avoids spatial normalization of thick-slice volumes by treating thin- and thick-slice volumes separately. Without loss of generality, we define thick volumes as volumes whose thickness is larger than a reference thickness, and thin volumes as the rest. For thin-slice volumes, we normalize the thickness to the reference thickness by interpolation as usual. For thick-slice volumes, which could easily be distorted by interpolation, the original thicknesses are kept.

Given a 3D feature tensor of shape $C \times D \times H \times W$ with channel $C$, depth $D$, height $H$, width $W$, and a physical thickness along the depth dimension, AlignShift shifts one part of the channels up and another part of the channels down along the depth dimension, while the remaining channels stay static. In order to maintain a consistent “receptive field” in the physical sense along the depth dimension, it shifts the data by a continuous step, whose step size (the align factor) depends on the reference thickness and the volume’s actual thickness. As illustrated in Figure 2, the shifted slice, called a “virtual slice”, is obtained by interpolation between the two adjacent slices. See Algorithm 1 for the mathematical formulation.

In contrast, TSM shifts the data discretely by one full slice. Given non-uniform thickness, this data shift strategy results in an inconsistent “receptive field” along the depth dimension in convolution; in this sense, the data shift operation of TSM is “unaligned”. Our method instead aligns the 3D features of various thicknesses, allowing the network to learn thickness-aware representations with the same kernels. AlignShift guarantees that the physical distance between shifted and un-shifted channels is consistent among volumes with different thicknesses. AlignShift thus bridges the performance gap between thin-slice and thick-slice CTs both theoretically and empirically.

In practice, AlignShift is simple to use and implement. Similar to TSM, it is inserted as an operator before a 3D convolution. No additional spatial normalization is needed. The original thickness is passed to the network to allow the adaptive data shift operation. Compared to TSM, only a modest modification is needed to gain a large performance boost. Note that AlignShift is able to capture 3D semantics like TSM, while it degenerates to 2D for data with extremely large thickness, because the align factor is then close to zero.
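
The following NumPy sketch illustrates one way the thickness-aware shift could work. It is not the authors' code, and two points are assumptions made here because the formulas of Algorithm 1 are not reproduced in this text: the align factor is taken to be `min(1, ref_thickness / thickness)`, and a virtual slice is taken to be the linear blend of a slice with its fully shifted neighbour. The names `align_shift`, `n_shift` and `ref_thickness` are hypothetical, and the function returns a copy instead of assigning in place.

```python
import numpy as np


def align_shift(x, thickness, n_shift, ref_thickness=2.0):
    """Thickness-aware shift of a (C, D, H, W) array along the depth axis.

    Assumed formulation (not taken from the paper's Algorithm 1):
      alpha = min(1, ref_thickness / thickness)
      virtual slice = alpha * shifted slice + (1 - alpha) * original slice
    """
    if 2 * n_shift > x.shape[0]:
        raise ValueError("n_shift must be at most half the channel count")
    alpha = min(1.0, ref_thickness / thickness)
    out = x.astype(float)  # astype returns a copy, so x is left untouched

    # Shift up: full one-slice shift with zero padding, then interpolate.
    up = np.zeros_like(out[:n_shift])
    up[:, 1:] = x[:n_shift, :-1]
    out[:n_shift] = alpha * up + (1.0 - alpha) * x[:n_shift]

    # Shift down: the same procedure in the opposite direction.
    lo, hi = n_shift, 2 * n_shift
    down = np.zeros_like(out[lo:hi])
    down[:, :-1] = x[lo:hi, 1:]
    out[lo:hi] = alpha * down + (1.0 - alpha) * x[lo:hi]

    # Channels from index `hi` onwards are left unchanged.
    return out


x = np.arange(9, dtype=float).reshape(3, 3, 1, 1)
thin = align_shift(x, thickness=2.0, n_shift=1)    # alpha = 1.0
thick = align_shift(x, thickness=5.0, n_shift=1)   # alpha = 0.4
print(thin[:, :, 0, 0])
print(thick[:, :, 0, 0])
```

Under these assumptions, a volume at the reference thickness gets a full one-slice shift, which matches the TSM behaviour, while a very thick volume gets an align factor near zero and is left almost unshifted, which matches the degeneration to 2D described above. For any volume thicker than the reference, the physical shift `alpha * thickness` equals the reference thickness.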

AlignShift serves as a parameter-free operator that can convert theoretically any 2D pretrained network into a thickness-aware 3D network. The conversion process is straightforward. Table 1 lists how the main operators in 2D CNNs are converted to their counterparts in 3D CNNs.
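
As a rough illustration of why such a conversion adds no parameters, the sketch below reuses a 2D convolution weight as a 3D weight with a depth extent of one. This is an assumption about how the conversion could be done, not a description of the released code; the helper name `inflate_conv2d_weight` and the weight layout (out channels, in channels, k, k) are chosen here for illustration.

```python
import numpy as np


def inflate_conv2d_weight(w2d):
    """Reuse a 2D conv weight (out, in, k, k) as a 3D weight (out, in, 1, k, k).

    Assumption: the 3D kernel spans a single slice, so all fusion across
    slices comes from the shift operator placed before the convolution.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4D weight of shape (out, in, k, k)")
    return w2d[:, :, np.newaxis, :, :]


# Stand-in for pretrained weights: 8 output channels, 4 input channels, 3x3.
w2d = np.random.default_rng(0).normal(size=(8, 4, 3, 3))
w3d = inflate_conv2d_weight(w2d)
print(w2d.shape, "->", w3d.shape)
```

The number of weights is unchanged, so pretrained 2D weights can be loaded directly into the converted layer.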

2.3 3D Network for Universal Lesion Detection on Key Slices

2D Backbone    Conv2D                 Pool2D    Norm2D
3D Backbone    (AlignShift+)Conv3D    Pool3D    Norm3D

Table 1: Converting a pretrained 2D backbone into 3D. We use DenseNet-121 [9]; AlignShift is only applied in the dense blocks.

Figure 3: The 3D network for universal lesion detection on DeepLesion [29]. The 3D backbone, converted from a truncated DenseNet-121 [9, 28], processes a grey-scale 3D input of 3 or 7 slices in this study. 2D features of key slices are extracted, and then upsampled in a feature pyramid [15]. Detection is based on instance segmentation by Mask-RCNN [8]. Weak segmentation “ground truth” is generated from weak RECIST labels [35].

We experiment with the proposed method on the DeepLesion benchmark [29], a large-scale dataset for universal lesion detection. The inputs are 3D stacks of slices, whereas only 2D key-slice annotations are available. We develop a 3D network with 2D detection heads, based on Mask-RCNN [8]. As illustrated in Figure 3, the 3D backbone is converted from the truncated DenseNet-121 [9] with AlignShift, serving as a 3D feature encoder. All the 2D convolutions in the dense blocks are converted to AlignShift 3D convolutions. The encoder takes a grey-scale 3D tensor as input, whose depth is the number of slices (3 or 7 in this study), and extracts 3D features through three dense blocks. Each dense block increases the feature channels and downsamples the features in the height and width dimensions while maintaining the scale in the depth dimension. The feature output of each dense block is processed by a 3D convolution and then squeezed into a 2D shape. A 2D decoder combines these features at different resolutions and upsamples them step by step. The final feature map is fed into the RPN head, BBox head and Mask head for detection and instance segmentation supervised by weak RECIST labels.

We implement the Mask-RCNN with PyTorch [17] and MMDetection [2]. As counterparts of the proposed AlignShift, we also implement 1) a 2.5D Mask-RCNN, which stacks the slices as input channels for standard 2D networks, and 2) a TSM-converted Mask-RCNN, which uses TSM instead of AlignShift. Note that the model sizes of the TSM models are strictly the same as those of the AlignShift models, and the 2.5D models are slightly smaller because of the Conv3D layer.

3 Experiments

3.1 Dataset and Experiment Setting

The DeepLesion dataset [29] consists of 32,120 axial CT slices from 10,594 studies of unique patients. The slice thickness in the dataset is almost always 1mm (48.66%) or 5mm (50.26%), which makes it appropriate for developing and evaluating the proposed method. There are 1 to 3 lesions in each slice, with 32,735 lesions in total from several organs, whose sizes vary from 0.21 to 342.5mm. RECIST diameter coordinates and bounding boxes were labeled on the key slices, with adjacent slices (30mm above and below) provided as contextual information. We use GrabCut [18] to generate weak segmentation “ground truth” from weak RECIST labels as in Improved RetinaNet [35].

Hounsfield units of the input are clipped and normalized. In AlignShift experiments, 2mm is regarded as the reference thickness: we normalize the thickness to 2mm for thin-slice data with thickness ≤ 2mm; for thick-slice data with thickness > 2mm, we keep the original thickness, since spatial normalization causes a larger information loss for thick-slice data than for thin-slice data. For the 2.5D and TSM counterparts, we adopt the standard strategy and normalize the thickness of all images to 2mm. Data augmentation, including horizontal flip, shift, rescaling and rotation, is applied during training; no test-time augmentation (TTA) is applied. We resize each input slice before feeding it into the networks. We use the official data split (training/validation/test: 70%/15%/15%); following prior studies [28, 35, 13], sensitivity at various false positive levels (i.e., FROC analysis) is evaluated on the test set.
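
The thickness handling described above amounts to a small decision rule, sketched below. The function name `target_thickness` and the method labels are hypothetical, and the rule assumes that "thin" means a thickness of at most the 2mm reference and "thick" means anything larger.

```python
def target_thickness(thickness_mm, method, ref_mm=2.0):
    """Slice thickness (mm) a volume is resampled to before entering the network."""
    if method in ("2.5D", "TSM"):
        # Standard strategy: every volume is normalized to the reference.
        return ref_mm
    if method == "AlignShift":
        # Thin-slice data is normalized; thick-slice data keeps its thickness.
        return ref_mm if thickness_mm <= ref_mm else thickness_mm
    raise ValueError(f"unknown method: {method}")


for t in (1.0, 5.0):
    targets = {m: target_thickness(t, m) for m in ("2.5D", "TSM", "AlignShift")}
    print(f"{t} mm ->", targets)
```

Only the AlignShift branch leaves 5mm volumes untouched, which is the case where interpolation would otherwise introduce the artifacts shown in Figure 1.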

Methods                 Slices  0.5    1      2      4      8      16     Avg.[0.5,1,2,4]
3DCE [27] MICCAI’18             62.48  73.37  80.70  85.65  89.09  91.06  75.55
ULDor [24] ISBI’19              52.86  64.8   74.84  84.38  87.17  91.8   69.22
V.Attn [25] MICCAI’19           69.10  77.90  83.80  -      -      -      -
Retina. [35] MICCAI’19          72.15  80.07  86.40  90.77  94.09  96.32  82.35
MVP [13] MICCAI’19              70.01  78.77  84.71  89.03  -      -      80.63
MVP [13] MICCAI’19              73.83  81.82  87.60  91.30  -      -      83.64
MULAN [28] MICCAI’19            76.12  83.69  88.76  92.30  94.71  95.64  85.22
w/o SRL [28] MICCAI’19          -      -      -      -      -      -      84.22
Ours 2.5D               3       71.27  79.82  86.30  90.61  93.75  95.70  82.00
Ours 2.5D               7       72.66  81.45  87.07  90.98  93.40  95.30  83.04
Ours TSM                3       71.80  80.11  86.97  91.10  93.75  95.56  82.50
Ours TSM                7       75.98  83.65  88.44  92.14  94.89  96.50  85.05
Ours AlignShift         3       73.00  81.17  87.05  91.78  94.63  95.48  83.25
Ours AlignShift         7       78.68  85.69  90.37  93.49  95.48  97.05  87.06
Table 2: Sensitivity (%) at various false positives (FPs) per image of the previous state of the art and the proposed methods, on the large-scale DeepLesion benchmark [29]. Note that MULAN [28] uses extra tag supervision and an additional Score Refinement Layer (SRL) with tag inputs; we report the performance of MULAN under 171-tag supervision as well as that without SRL.
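
The Avg.[0.5,1,2,4] column can be reproduced from the per-FP sensitivities, assuming it is the plain arithmetic mean over the four FP levels (an assumption, although it matches the tabulated values to two decimals for the rows checked here). The helper name `average_sensitivity` is hypothetical; the numbers are copied from Table 2.

```python
def average_sensitivity(sens_by_fp, fp_levels=(0.5, 1, 2, 4)):
    """Mean sensitivity (%) over the chosen false-positives-per-image levels."""
    return sum(sens_by_fp[fp] for fp in fp_levels) / len(fp_levels)


# Sensitivities (%) keyed by FPs per image, copied from Table 2.
rows = {
    "3DCE [27]": {0.5: 62.48, 1: 73.37, 2: 80.70, 4: 85.65, 8: 89.09, 16: 91.06},
    "MULAN [28]": {0.5: 76.12, 1: 83.69, 2: 88.76, 4: 92.30, 8: 94.71, 16: 95.64},
    "Ours AlignShift (last row)": {0.5: 78.68, 1: 85.69, 2: 90.37, 4: 93.49, 8: 95.48, 16: 97.05},
}

for name, sens in rows.items():
    print(f"{name}: {average_sensitivity(sens):.2f}")
```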

3.2 Performance Compared with State of the Art

In Table 2, we report sensitivity at various false positives per image (FPs), which shows that our proposed methods significantly outperform the previous state-of-the-art MULAN [28]. Notably, this is achieved without additional information beyond the CT images, such as tags from medical reports and demographic information. The performance of our 2.5D counterparts is comparable to Improved RetinaNet [35] and MVP-Net [13]. The proposed TSM-converted networks outperform these studies [35, 13], and even MULAN without tag supervision, which validates the superiority of pretrained 3D backbones over 2.5D. AlignShift further boosts performance and surpasses both our TSM models and the previous state-of-the-art MULAN [28]. Due to memory constraints, we report results for at most 7 slices, although more slices are expected to yield better performance.

Methods               Thickness  0.5    1      2      4      8      16     Avg.[0.5,1,2,4]  diff.
2.5D, 3 slices        All        71.27  79.82  86.30  90.61  93.75  95.70  82.00            -
                      Thin       72.78  80.65  87.21  90.94  93.97  95.86  82.89
                      Thick      69.88  79.16  85.51  90.48  93.65  95.65  81.26
2.5D, 7 slices        All        72.66  81.45  87.07  90.98  93.40  95.30  83.04            -
                      Thin       75.77  83.93  88.85  92.37  94.26  95.78  85.23
                      Thick      69.76  78.96  85.75  90.03  92.67  94.99  81.13
TSM, 3 slices         All        71.80  80.11  86.97  91.10  93.75  95.56  82.50            -
                      Thin       74.19  81.75  88.11  92.17  94.42  96.15  84.06
                      Thick      69.72  78.55  86.08  90.27  93.08  94.91  81.16
TSM, 7 slices         All        75.98  83.65  88.44  92.14  94.89  96.50  85.05            -
                      Thin       78.76  85.53  89.67  93.48  95.61  96.68  86.86
                      Thick      73.26  81.97  87.10  90.96  94.14  96.42  83.32
AlignShift, 3 slices  All        73.00  81.17  87.05  91.78  94.63  95.48  83.25            -
                      Thin       73.43  81.26  87.08  91.92  94.87  95.65  83.43
                      Thick      72.85  81.20  87.14  91.78  94.38  95.28  83.24
AlignShift, 7 slices  All        78.68  85.69  90.37  93.49  95.48  97.05  87.06            -
                      Thin       79.64  86.55  90.73  94.18  95.82  97.17  87.78
                      Thick      78.06  85.10  89.99  92.76  95.08  96.91  86.48
Table 3: Detection performance analysis of our methods on all, thin-slice (mostly 1mm) and thick-slice (mostly 5mm) data. “diff.” denotes the average sensitivity (Avg.[0.5,1,2,4]) difference between thin/thick-slice data and that of all data.

3.3 Performance Analysis on Thin-Slice and Thick-Slice Data

To demonstrate the benefits of AlignShift in bridging the performance gap between thin- and thick-slice data, we analyze the performance of our 2.5D, TSM and AlignShift models on thin- and thick-slice data separately. As no trained models are publicly available for the prior state of the art [28, 35, 13], we take our 2.5D models as representative of these studies, since they all follow the 2.5D fashion. As shown in Table 3, there is a significant performance gap between thin- and thick-slice CTs for the 2.5D and TSM approaches, which supports our argument (Sec. 1) that there is a significant domain shift if thin- and thick-slice data are processed by the same CNN with standard convolutions. Note that the gap becomes larger when more slices are used, since 3D information is less valuable for thick-slice data than for thin-slice data. In comparison, by learning a unified thickness-aware representation, the proposed AlignShift reduces the performance gap to a negligible level in the 3-slice setting; even in the 7-slice setting, the gap is considerably smaller than that of the 2.5D and TSM approaches.
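
The gaps discussed above can be recomputed from the Avg.[0.5,1,2,4] column of Table 3. The sketch below derives, for each model, the difference of the thin and thick subsets from the all-data average (the quantity the "diff." column is defined as in the caption) and the thin-versus-thick gap. The slice counts 3 and 7 attached to the first and second row of each method are an inference from the text, and the names `TABLE3_AVG` and `thickness_gaps` are hypothetical.

```python
# Avg.[0.5,1,2,4] sensitivities copied from Table 3 as (all, thin, thick).
# The slice counts (3, 7) are inferred from the text, not read from the table.
TABLE3_AVG = {
    ("2.5D", 3): (82.00, 82.89, 81.26),
    ("2.5D", 7): (83.04, 85.23, 81.13),
    ("TSM", 3): (82.50, 84.06, 81.16),
    ("TSM", 7): (85.05, 86.86, 83.32),
    ("AlignShift", 3): (83.25, 83.43, 83.24),
    ("AlignShift", 7): (87.06, 87.78, 86.48),
}


def thickness_gaps(all_avg, thin_avg, thick_avg):
    """Differences in average sensitivity (percentage points)."""
    return {
        "thin_minus_all": round(thin_avg - all_avg, 2),
        "thick_minus_all": round(thick_avg - all_avg, 2),
        "thin_minus_thick": round(thin_avg - thick_avg, 2),
    }


gaps = {key: thickness_gaps(*vals) for key, vals in TABLE3_AVG.items()}
for (method, n_slices), g in gaps.items():
    print(f"{method:<10} {n_slices} slices  {g}")
```

With these numbers, the thin-versus-thick gap of AlignShift is the smallest of the three approaches in both slice settings.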

4 Conclusion

In this study, we challenge spatial normalization as a standard pre-processing approach to the thickness issue in 3D medical images. Our experimental results indicate that neither 2.5D nor 3D (i.e., TSM in this study) approaches fundamentally address the domain shift introduced by spatial normalization, which results in a significant performance gap between thin- and thick-slice data. In this regard, we propose a novel parameter-free operator, AlignShift, which enables us to convert theoretically any 2D pretrained network into a thickness-aware 3D network. Extensive experiments on the DeepLesion benchmark empirically validate that our methods bridge the gap between thin- and thick-slice data. Without bells and whistles, we establish a new state of the art on DeepLesion, which surpasses prior art by considerable margins.