Joint Unsupervised Learning for the Vertebra Segmentation, Artifact Reduction and Modality Translation of CBCT Images

01/02/2020, by Yuanyuan Lyu, et al.

We investigate the unsupervised learning of vertebra segmentation, artifact reduction and modality translation of CBCT images. To this end, we formulate the problem under a unified framework that jointly addresses these three tasks and intensively leverages knowledge sharing. The unsupervised learning of this framework is enabled by 1) a novel shape-aware artifact disentanglement network that supports different forms of image synthesis and vertebra segmentation and 2) a deliberate fusion of knowledge from an independent CT dataset. Specifically, the proposed framework takes a random pair of CBCT and CT images as the input, and manipulates the synthesis and segmentation via different combinations of the decodings of the disentangled latent codes. Then, by discovering various forms of consistency between the synthesized images and segmented vertebrae, the learning is achieved via self-learning from the given CBCT and CT images, obviating the need for paired (i.e., anatomically identical) ground-truth data. Extensive experiments on clinical CBCT and CT datasets show that the proposed approach performs significantly better than other state-of-the-art unsupervised methods trained independently for each task and, remarkably, achieves a Dice coefficient of 0.879 for unsupervised CBCT vertebra segmentation.


1 Introduction

Cone-beam computed tomography (CBCT) has been widely used in spinal surgery as an intraoperative 3D imaging modality to guide the intervention. However, compared with conventional computed tomography (CT), intraoperative CBCT images have more pronounced noise and poorer tissue contrast [20, 21]. Moreover, it is also common to have metallic objects present during imaging, which introduce metal artifacts and further degrade the quality of CBCT images [28, 18]. On the other hand, identifying the vertebrae is of great importance to spinal surgery. The poor CBCT image quality makes it challenging to delineate the vertebra shape and, thus, compromises the intervention.

Figure 1: Sample images from the CBCT domain, from the CT domain, and the corresponding vertebra ground truth of the CT image. Due to the lack of ground-truth high-quality CBCT images and vertebra annotations, learning to improve the CBCT image quality and segment the CBCT vertebrae is challenging. We propose to leverage the knowledge from CT (both image and shape) to address this challenge.

To address the problem, this study aims to design a computational method to improve the quality of CBCT images and facilitate the delineation of vertebrae. At the same time, this study also aims to design a vertebra segmentation method to automate the delineation and leverages this task to, in turn, improve the CBCT image quality. For either task, we find it clinically impractical to train the model in a supervised manner. First, it is difficult to find a pair of anatomically identical CBCT images, one of low quality and the other of high quality. A common practice [27, 25] is to synthesize artifacts in high-quality images to create anatomically paired data. However, models trained with synthesized data usually generalize poorly to clinical images. Second, it is also challenging to manually annotate the CBCT images due to their low quality. As shown in Fig. 1, the vertebrae in CBCT images may be corrupted, and the annotation requires a great deal of expertise and labeling work.

Therefore, this study resorts to developing an unsupervised method to improve the image quality and segment the vertebrae of CBCT images. To enable the unsupervised learning, we notice that it might be possible to learn from an independent CT image dataset. On the one hand, CT images are generally of high quality (e.g., less pronounced artifacts, high contrast, and high signal-to-noise ratio) and the anatomical details can be well observed. On the other hand, annotating CT images is relatively cheap and there are several public spinal CT datasets with segmentation labels available (http://spineweb.digitalimaginggroup.ca/). For the former, we may design a model that learns from the CT image domain what high-quality images should look like and apply the knowledge to improve the quality of CBCT images. For the latter, we may design a model that learns from the vertebra segmentation of CT images and apply the knowledge to segment CBCT vertebrae. Learning these two models together essentially defines an unsupervised vertebra segmentation, artifact reduction, and modality translation problem.

Instead of developing models to independently address each task, we formulate this problem under a unified framework that jointly addresses these three tasks and intensively leverages the knowledge sharing. In particular, we propose a novel shape-aware artifact disentanglement network that 1) supports different forms of image synthesis and vertebra segmentation, 2) shares the knowledge learned from different synthesis and segmentation tasks, and 3) discovers different forms of consistencies between the inputs and outputs. Specifically, given a random pair of CBCT and CT images, the proposed framework encodes disentangled representations and manipulates the synthesis and segmentation via different combinations of the decodings. The encoders, decoders and segmentors are reapplied during the synthesis and segmentation to share the knowledge. Then, by discovering various forms of consistencies between the synthesized images and segmented vertebrae, self-learning from the given CBCT and CT images is achieved, and no paired data for the CBCT is necessary.

In summary, the contributions of this work are threefold:

  • By utilizing disentangled representations and anatomical knowledge from the target domain, we introduce a unified framework for unsupervised vertebra segmentation, artifact reduction, and modality translation. These tasks benefit from each other through the joint unsupervised learning.

  • We propose a novel shape-aware artifact disentanglement network that supports different forms of image synthesis and vertebra segmentation and discovers different forms of consistencies between the inputs and outputs to enable the unsupervised learning.

  • We propose to use a shape-aware normalization layer to explicitly fuse the anatomical information, learned through shape consistency, into the decoder and boost the image synthesis performance.

2 Related Works

Our approach is related to multiple lines of research. Below we briefly introduce related literature on unpaired image-to-image translation, the use of normalization in style transfer, and unsupervised segmentation.

2.1 Unpaired Image-to-Image Translation

Isola et al. [10] introduce a simple framework with conditional GANs as a structured loss for image-to-image translation with paired data. CycleGAN [30] and UNIT [14] extend the work to unpaired images with a cycle-consistency mechanism. MUNIT [8] and DRIT [12] further embed different images into a common content space and manipulate the domain-specific attributes to generate diverse images. Recently, Liao et al. [13] propose a novel artifact disentanglement network (ADN) with specialized encoders and decoders handling metal artifacts and achieve strong metal artifact reduction (MAR) performance on clinical data including CBCT images. However, image synthesis without an explicit anatomy constraint may lead to inconsistent anatomical structures, which is dangerous in the medical area. Zhang et al. [29] use the anatomical information of both domains to ensure structural invariance in modality translation. Their framework cannot be applied in our setting because only segmentations from the CT domain are available.

2.2 Normalization in Style Transfer

Ioffe et al. [9] first introduce batch normalization (BN), which eases the training of DNNs. In style transfer, replacing BN with instance normalization (IN) can significantly improve the image generation performance [23], and IN is suggested to perform a kind of style normalization [7]. Later on, conditional normalization layers were designed [3, 7, 16]. These layers first normalize activations to zero mean and unit variance and then denormalize them with affine transformations learned from external data. In this way, conditional BN [3] and Adaptive IN (AdaIN) [7] control the style globally, while SPatially Adaptive DEnormalization (SPADE) [16] modulates the activations spatially. SPADE shows advantages over other normalization layers in image synthesis. As the semantic mask used to learn the spatial transformation is not available here, we propose to adapt SPADE with shape information extracted from the disentangled feature representation.

2.3 Unsupervised Segmentation

Annotating medical images is time-consuming even with trained experts, and such data often come from different domains (i.e., different modalities, protocols, sites). A model trained on a source domain rarely works well on a target domain. To solve this problem, recent works learn domain-invariant features and anatomical priors with adversarial networks [11, 2] or apply segmentation after image translation as an ad-hoc approach [24]. However, when domain adaptation or synthesis fails, the segmentation algorithm cannot output correct results.

3 Methodology

Figure 2: Architecture of the proposed shape-aware artifact disentanglement network.

We consider three domains: the domain of low-quality CBCT images, the domain of high-quality CT images, and the domain of vertebra shapes. A CBCT image is usually noisy and may contain streak artifacts, while a CT image is artifact-free and provides more anatomical details. A vertebra shape can be represented as a binary segmentation mask, in which each value indicates whether the pixel at that location belongs to a vertebra. Figure 1 shows samples from these three domains.

The proposed approach aims to improve the quality of the CBCT images as well as learn their vertebra shapes under an unsupervised scenario. That is, given a set of low-quality images, we aim to learn a translator that maps each low-quality image to its corresponding high-quality image and vertebra shape without having paired (i.e., anatomically identical) ground-truth data available for supervision. To facilitate this unsupervised learning, we also assume the availability of a high-quality image dataset in which each CT image comes with its corresponding vertebra shape. Note that the CBCT and CT datasets are independent, i.e., they are collected from different patients. Next, we introduce how to leverage the knowledge from this independent CT dataset for the unsupervised learning on the CBCT data.

Given these two independent datasets, learning the translation between the two image domains is essentially an unsupervised cross-modality artifact reduction problem. Similar to the idea by Liao et al. [13], we propose to address this task via artifact disentanglement. As this work also aims to learn the vertebra shapes of the CBCT images, we propose a shape-aware artifact disentanglement approach that takes the vertebra shape learning into consideration and jointly improves artifact reduction and vertebra segmentation.

In artifact disentanglement, we assume that the content (i.e., bones, soft tissues, etc.) and the artifacts (i.e., noise, streaks, etc.) of low-quality images can be encoded separately in the latent space. For high-quality images, there are no artifacts and therefore only the content will be encoded. This disentanglement allows decodings between the different combinations of the artifact and content components of the two inputs, which enables different forms of generation (Section 3.1). The unsupervised learning is achieved by designing losses that encourage the generations enabled by artifact disentanglement (Section 3.2).

More importantly, we fuse the shape representation learning into the artifact disentanglement. On the one hand, we leverage the learned vertebra shapes as attentions to guide the generations for better artifact reduction (Section 3.1). On the other hand, we leverage the generated images to discover the consistency in vertebra shapes and achieve unsupervised vertebra segmentation (Section 3.2).

3.1 Network Architecture

An overview of the network architecture of the proposed approach is shown in Figure 2. Our network takes two unpaired images and as inputs. For , we use a content encoder and an artifact encoder to encode its content and artifact components, respectively. As does not contain artifacts, we only use a content encoder to encode its content. The latent codes are written as:

(1)

With four different combinations of the decodings, our network has four outputs , , , and , where means the output is encoded with the content of and intended to look like a sample from . We use two different decoders and to generate the four outputs. The low-quality image decoder takes a content code and an artifact code as inputs and outputs a low-quality image :

(2)

Note that is encoded and decoded entirely from and thus should look identical to . is encoded with the content of and the artifact of , and thus should match the content of but contain artifacts transferred from .

The high-quality image decoder takes a content code and a vertebra shape attention map as inputs and outputs a high-quality image :

(3)

where the shape attention map is generated by a shape decoder . We use it to explicitly fuse the vertebra shape information into the decoding such that the decoder better generates the vertebra region, which is critical in clinical practice. We will also show later (Section 3.2) that this learning can be achieved using the vertebra shapes from the CT dataset. Also, note that is generated from the content of and thus should be identical to . is generated only using the content of and thus anatomically looks like but with the artifacts removed and the quality improved, which is exactly what we aim to achieve in this work.

Another goal of this work is to segment vertebra shapes from . To this end, we use a low-quality image segmentor to map images from domain to space :

(4)

Likewise, we also use a high-quality image segmentor to map images from domain to space :

(5)
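To make the decoding combinations above concrete, the following is a minimal PyTorch sketch of the forward pass. The module names (E_c_low, E_art, G_low, G_shape, S_low, etc.), channel sizes, and the toy convolutional blocks are illustrative assumptions only; the actual encoders, decoders, and segmentors follow the architectures described in Section 3.3.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, sigmoid=False):
    # Toy stand-in for the real encoder/decoder/segmentor sub-networks.
    layers = [nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
              nn.Conv2d(16, out_ch, 3, padding=1)]
    if sigmoid:
        layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

# 1-channel images and 8-channel latent codes; sizes are arbitrary for this sketch.
E_c_low, E_art, E_c_high = conv_block(1, 8), conv_block(1, 8), conv_block(1, 8)
G_low   = conv_block(8 + 8, 1)              # content code + artifact code -> low-quality image
G_shape = conv_block(8, 1, sigmoid=True)    # content code -> soft vertebra-shape attention map
G_high  = conv_block(8 + 1, 1)              # content code + shape attention -> high-quality image
S_low, S_high = conv_block(1, 1, sigmoid=True), conv_block(1, 1, sigmoid=True)

def forward_pass(x_low, x_high):
    """Four synthesis outputs and two segmentations from one unpaired (CBCT, CT) pair."""
    c_low, art = E_c_low(x_low), E_art(x_low)             # Eq. (1): content and artifact codes
    c_high = E_c_high(x_high)
    rec_low  = G_low(torch.cat([c_low, art], 1))          # Eq. (2): low -> low reconstruction
    fake_low = G_low(torch.cat([c_high, art], 1))         # CT content with transferred artifacts
    att_low, att_high = G_shape(c_low), G_shape(c_high)   # shape attention maps
    fake_high = G_high(torch.cat([c_low, att_low], 1))    # Eq. (3): artifact-reduced CBCT
    rec_high  = G_high(torch.cat([c_high, att_high], 1))  # high -> high reconstruction
    seg_low, seg_high = S_low(x_low), S_high(x_high)      # Eqs. (4)-(5): segmentations
    return rec_low, fake_low, fake_high, rec_high, seg_low, seg_high

outputs = forward_pass(torch.randn(1, 1, 256, 256), torch.randn(1, 1, 256, 256))
```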

3.2 Network Learning

To learn the proposed network, we design image domain losses and shape domain losses that leverage adversarial learning as well as various forms of consistency between the inputs and outputs to obviate the need for ground-truth data in the CBCT domain.

Image domain losses. Image domain losses encourage the network to generate the four outputs as intended, i.e., should match the content of and look like a sample from .

Adversarial loss. For the two outputs and , there are no groundtruth data available. Therefore, we apply adversarial learning to encourage that looks like a sample from :

(6)

where is a discriminator that decides if the input is a sample from or synthetic data.
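A hedged sketch of this term, assuming a standard binary cross-entropy GAN objective (the text does not state the exact GAN formulation used); `disc` stands for one of the PatchGAN discriminators described in Section 3.3.

```python
import torch
import torch.nn.functional as F

def adversarial_loss(disc, real, fake):
    """Discriminator pushes real samples toward 1 and synthetic ones toward 0;
    the generator receives the opposite signal on the synthetic output."""
    pred_real, pred_fake = disc(real), disc(fake.detach())
    d_loss = F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real)) \
           + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake))
    pred_gen = disc(fake)
    g_loss = F.binary_cross_entropy_with_logits(pred_gen, torch.ones_like(pred_gen))
    return d_loss, g_loss
```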

Artifact consistency loss. The adversarial loss only encourages to look like a sample from but does not encourage their content to match . To enforce content preservation, we employ an artifact consistency loss [13]:

(7)

The loss ensures that the same artifact is removed from and added to . Meanwhile, minimizing the loss keeps the synthetic images anatomically close to the original ones.
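A minimal sketch of this term as described above (the exact reduction, e.g. mean versus sum, is an assumption): the artifact removed from the CBCT input should equal the artifact added to the CT input.

```python
def artifact_consistency_loss(x_low, x_high, fake_high, fake_low):
    """L1 between the artifact removed from the CBCT image and the artifact added to the CT image."""
    removed = x_low - fake_high    # artifact taken out of the low-quality input
    added   = fake_low - x_high    # artifact injected into the high-quality input
    return (removed - added).abs().mean()
```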

Image reconstruction loss. The two outputs and are encoded and decoded entirely using the latent codes of and , respectively, which are essentially autoencodings. Thus, we use an L1 loss to regularize the pixel-wise distance between the inputs and the reconstructed images in each image domain,

(8)

The L1 loss is known to better retain sharpness compared with the L2 loss [10].

Self-reduction loss. We also apply the cycle-consistency mechanism [30, 8, 14] for . That is, the artifacts generated in should also be removable by our model by applying , which recovers the input :

(9)
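A hedged sketch of the self-reduction term, reusing the stand-in modules and concatenation convention from the Section 3.1 sketch: re-encode the artifact-corrupted CT image, remove the artifact again, and penalize the L1 distance to the original CT input.

```python
import torch

def self_reduction_loss(x_high, fake_low, E_c_low, G_shape, G_high):
    """Cycle-style consistency: artifacts synthesized onto the CT image should be removable."""
    c = E_c_low(fake_low)                              # re-encode the corrupted CT image
    recovered = G_high(torch.cat([c, G_shape(c)], 1))  # remove the artifact again
    return (recovered - x_high).abs().mean()
```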

Shape domain losses. Shape domain losses leverage the vertebra shapes from and discover the shape consistencies for the learning of the two segmentors and .

Shape loss. The segmentor takes an image from and outputs the corresponding vertebra shape prediction. We supervise the segmentation of and with using a Dice loss:

(10)

where denotes the pixel index. The Dice loss was first introduced by Milletari et al. [15]; it alleviates the imbalance between foreground and background pixels and greatly improves the segmentation performance.
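A minimal sketch of the soft Dice loss over probability maps; the smoothing constant `eps` is an illustrative choice rather than a value from the paper.

```python
def dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice over (N, 1, H, W) probability maps; eps avoids division by zero."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()
```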

Shape loss related to the content code. is decoded from the content code by . We compute the Dice loss between the prediction and the ground truth :

(11)

Shape reconstruction loss. As the anatomical information is supposed to be retained during reconstruction, the segmentation prediction should be close to , and should be close to . We use an L1 loss to minimize the distance,

(12)

Shape translation loss. Similarly, we can apply an L1 loss to minimize the distance between the segmentation pairs (, ) and (, ) during translation,

(13)

Total Loss. The overall objective function is formulated as the weighted sum of all the above losses,

(14)

where is the weight for the loss of type . In this paper, the weights are set empirically.
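A small sketch of how the weighted sum can be assembled; the actual weight values are not recoverable from this text, so the numbers below are placeholders only.

```python
# Hypothetical weights for illustration; the paper's empirical values are not shown here.
LOSS_WEIGHTS = {"adv": 1.0, "art": 1.0, "recon": 1.0, "cycle": 1.0,
                "shape": 1.0, "shape_recon": 1.0, "shape_trans": 1.0}

def total_loss(losses, weights=LOSS_WEIGHTS):
    """losses: dict mapping a loss-type name to its scalar tensor value."""
    return sum(weights[name] * value for name, value in losses.items())
```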

3.3 Implementation Details

Most of the network components of the proposed shape-aware artifact disentanglement network are developed based on state-of-the-art works for image-to-image translation. For the encoders , , and , we adopt the structure proposed by Huang et al. [8]. For the low-quality image decoder , we employ a structure similar to that of Liao et al. [13] to fuse the artifact and content codes. In addition, we use the PatchGAN designed by Isola et al. [10] as our discriminators and . For the segmentors and , we simply use the U-Net architecture [19] with a depth of five. We use a Sigmoid activation for the last convolutional layer to output the probability of a pixel belonging to vertebrae. We refer readers to the supplementary material for more detailed network structures.

To better retain the anatomical structure in the synthetic CT image, we adopt the idea of SPADE [16] and design a shape-aware normalization layer (see Figure 3a). The shape decoder extracts the shape representation from the content code, which is then interpolated to match the spatial dimension of the input feature and used as a soft semantic mask for SPADE. There are three Conv layers and a parameter-free batch-normalization (Norm) layer. The first Conv layer encodes the interpolated mask into a hidden space, and the other two Conv layers learn the spatially varying parameters and . Meanwhile, the input feature is only normalized, without scale and shift, by the Norm layer. The normalized feature is then denormalized with the and learned from the mask, resulting in the output. The detailed structure of the decoder is shown in Figure 3b. All the Norm layers in the residual, upsampling and final blocks of the decoder are replaced by the shape-aware normalization layer (i.e., SPADE). Our model benefits from the new structure in two aspects. First, the learned shape representation guides the synthesis, which prevents washing away the anatomical information. Second, the soft mask allows the gradients to be back-propagated through the disentanglement learning, which encourages the encoding of the content code to be more accurate.
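A minimal PyTorch sketch of the shape-aware normalization layer described above, adapted from SPADE; the hidden width, kernel sizes, and the `(1 + gamma)` modulation follow the public SPADE design and are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShapeAwareNorm(nn.Module):
    def __init__(self, feat_channels, shape_channels=1, hidden=64):
        super().__init__()
        self.norm = nn.BatchNorm2d(feat_channels, affine=False)   # parameter-free Norm (no scale/shift)
        self.shared = nn.Sequential(nn.Conv2d(shape_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, feat_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feat_channels, 3, padding=1)

    def forward(self, feat, shape_att):
        # Interpolate the soft vertebra mask to the spatial size of the incoming feature map.
        shape_att = F.interpolate(shape_att, size=feat.shape[2:], mode="nearest")
        h = self.shared(shape_att)                       # first Conv: encode the mask to a hidden space
        gamma, beta = self.to_gamma(h), self.to_beta(h)  # other two Convs: spatial modulation parameters
        return self.norm(feat) * (1 + gamma) + beta      # normalize, then denormalize spatially
```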

Figure 3: Detailed architecture. (a) Shape-aware normalization. (b) Decoder with shape-aware normalization.

4 Experiment

4.1 Dataset

CBCT data. The CBCT data are collected by a Siemens Arcadis Orbic 3D system during spinal interventions. The dataset contains 26 scans. The size of the CBCT volumes is 256 × 256 × 256 with an isotropic voxel size of 0.5 mm. Due to the severe cone-beam geometry distortion at the two ends of the sagittal axis, we only keep the 196 middle slices of each volume. We use 21 volumes for training and 5 volumes for testing, resulting in 4,116 slices in the training set and 980 slices in the testing set. To evaluate the segmentation performance, the vertebra masks for the testing set are manually labeled by an expert.

Dice/HD(mm)   CBCT segmentations (four predictions)                CT segmentations (four predictions)
M1            0.683/30.54  n.a./n.a.    n.a./n.a.    n.a./n.a.     0.938/9.74  n.a./n.a.   n.a./n.a.   n.a./n.a.
M2            0.737/30.57  n.a./n.a.    n.a./n.a.    n.a./n.a.     0.946/9.51  n.a./n.a.   n.a./n.a.   n.a./n.a.
M3            0.765/21.20  0.727/32.98  0.680/35.83  0.834/13.18   0.941/8.70  0.942/7.13  0.939/7.06  0.942/7.71
M4            0.815/22.59  0.861/12.04  0.861/12.06  0.818/19.33   0.950/8.37  0.952/7.35  0.955/6.97  0.952/7.22
M5            0.847/16.16  0.879/10.54  0.879/10.56  0.879/10.33   0.945/8.62  0.948/6.85  0.950/6.73  0.946/7.25
Table 1: Quantitative evaluation of segmentation performance for different models (Dice/HD in mm; the left four columns correspond to segmentations of CBCT images and the right four to segmentations of CT images).

CT data. Datasets 13 (http://lit.fe.uni-lj.si/xVertSeg/) and 15 [26] of SpineWeb [6] are used as high-quality CT images. We include fifteen CT scans from Dataset 13 and twenty CT scans from Dataset 15. The scans from Datasets 13 and 15 cover the lumbar spine and the entire thoracic/lumbar spine, respectively. All the scans have corresponding segmentation masks for the vertebrae. The in-plane resolution is between 0.31 mm and 0.8 mm and the slice thickness is between 0.7 mm and 2 mm. We only include the CT slices with vertebrae in the experiment. To match the resolution and spatial dimension of the CBCT images, all the CT slices are resampled to a spacing of 0.5 mm and randomly cropped to a size of 256 × 256 with vertebrae present in the field of view. We withhold two scans from Dataset 13 and three scans from Dataset 15 for testing, resulting in 7,420 CT images for training and 1,937 for testing. Because there are more CT images than CBCT images, we randomly select 4,116 CT images for each epoch during training. For testing, we fix a randomly selected subset of 980 CT images.
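A hedged sketch of this preprocessing, assuming scipy for the resampling; the retry-based cropping strategy and the function names are illustrative, not the paper's actual pipeline.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_ct_slice(ct, mask, spacing, target=0.5, size=256, rng=np.random):
    """Resample an axial CT slice (and its vertebra mask) to 0.5 mm and crop 256x256 with vertebrae in view."""
    factor = (spacing[0] / target, spacing[1] / target)
    ct_r = zoom(ct, factor, order=1)      # bilinear interpolation for the image
    mask_r = zoom(mask, factor, order=0)  # nearest neighbour for the labels
    for _ in range(100):                  # retry random crops until vertebrae are present
        y = rng.randint(0, max(ct_r.shape[0] - size, 0) + 1)
        x = rng.randint(0, max(ct_r.shape[1] - size, 0) + 1)
        crop_img, crop_mask = ct_r[y:y + size, x:x + size], mask_r[y:y + size, x:x + size]
        if crop_mask.any():
            return crop_img, crop_mask
    return ct_r[:size, :size], mask_r[:size, :size]   # fallback: corner crop
```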

4.2 Training Details

We implement our model using the PyTorch framework [17] and train it for 48 epochs using the Adam optimizer with a learning rate of and a batch size of 1. For data augmentation, the images and segmentation masks are randomly flipped horizontally during training.

4.3 Evaluation Methods

We evaluate the performance of vertebra segmentation using the Dice score and the Hausdorff distance (HD) [4, 5, 1]. The Dice score, ranging in [0, 1], measures the relative volumetric overlap between segmentations, and a perfect segmentation yields a Dice score of 1. HD is also known as the maximum surface distance, which reflects the agreement between segmentation boundaries [22]. We obtain the segmentation mask from the shape prediction by applying a threshold of 0.5.
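A hedged evaluation sketch: threshold the shape prediction at 0.5, then compute the Dice score and the symmetric Hausdorff distance. Using `scipy.spatial.distance.directed_hausdorff` on foreground coordinates scaled by the voxel spacing is our implementation choice, not necessarily the paper's.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred_prob, gt, thr=0.5):
    """Volumetric overlap between the thresholded prediction and the ground-truth mask."""
    pred, gt = pred_prob >= thr, gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + 1e-8)

def hausdorff_mm(pred_prob, gt, spacing=0.5, thr=0.5):
    """Symmetric Hausdorff distance (in mm) between foreground point sets."""
    p = np.argwhere(pred_prob >= thr) * spacing
    g = np.argwhere(gt > 0) * spacing
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])
```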

Figure 4: Qualitative comparison for image translation and segmentation of different variants of our model (M1–M5).

4.4 Ablation Study

In this section, we investigate the effectiveness of different modules and objectives of the proposed architecture. Our backbone is ADN [13] using , , , , and learning with , , , . Note that the decoder in ADN uses instance normalization. We use the following configurations for this ablation study:

  • M1: ADN with shape decoder and shape loss related to content code .

  • M2: M1 with using shape-aware normalization.

  • M3: M2 with segmentors , and shape loss .

  • M4: M3 with shape reconstruction loss .

  • M5 (full): M4 with shape translation loss .

Table 1 summarizes the quantitative evaluation of the vertebra segmentation performance of the different models. Figure 4 shows the reconstructions and translations of a sample CBCT image as well as the vertebra segmentation results and ground truth. In the input CBCT image, we can see a small metallic object above the vertebra, which is used for spinal surgery guidance. The metallic object introduces strong artifacts, such as bright and dark bands around the metal and streak artifacts nearly everywhere.

M1 can roughly disentangle artifacts and anatomical information. Figure 4 shows that most of the artifacts are suppressed in the synthetic CT image of M1, but streak artifacts are still visible in the lower part. The anatomical structure is not well retained, as the boundaries of the vertebrae are smoothed out. M1 learns to extract the vertebra shape from the content code with the supervision of the target domain. achieves a Dice of 0.938, but only reaches a Dice of 0.683. The segmentation performance is not satisfactory.

The effect of shape-aware normalization. The proposed normalization layer improves the segmentation performance as well as the quality of the synthetic image. From M1 to M2, the Dice of is increased by 0.054, and captures more bony structure. The translated image of M2 is sharper and has better contrast compared with M1. With this structure, is also penalized during the image translation and reconstruction processes, and the other encoders and decoders receive more guidance.

The effect of shape loss. M3 substantially improves the segmentation performance for and with two additional segmentors. Among all the CBCT segmentations, achieves the best performance with a Dice of 0.834 and an HD of 13.14 mm. With the fully supervised , M3 learns to predict accurate vertebra shapes from the translated CT image. The segmentation performance of is also significantly improved: compared with M2, its Dice is increased by 0.028 and the HD is decreased by 9.37 mm. However, the performance of on or is relatively low even though is able to output precise segmentations for with a Dice of 0.939. As shown in Figure 4, there are many false positive and false negative errors in and . This is because is confused by the bright bands around the metal, which have the same intensity as bone; moreover, the streak artifacts distort the lower bone structure.

The effect of shape reconstruction and translation constraints. When the shape representation is retained during reconstruction for both domains, M4 achieves better segmentation performance for all the CT and CBCT images except , compared with M3. As shown in Figure 4, and are greatly improved with , but the segmentation of becomes too sensitive and false positive predictions show up. After adding , all the segmentations of CBCT images in M5 are further improved. From Figure 4, we can observe that the false positive predictions of in M4 are suppressed and becomes more realistic. in M5 yields the best segmentation performance for CBCT images with a Dice of 0.879 and an HD of 10.33 mm. For the synthetic CT image, M4 recovers the right rib bone, which is smoothed out by M3. Overall, M5 generates the synthetic CT with the best image quality among all the models, with the best metal artifact reduction performance and bone contrast.

The effect of explicit segmentors. As shown in Figure 2, the segmentor can also be replaced by a combination of and . We train a model with such replacements and the same losses as M5, where is replaced by . The best segmentation performance on CBCT images is a Dice of 0.850 and an HD of 15.00 mm, which is worse than M5. Large errors in as well as false bone structures in indicate that the model cannot capture the anatomical priors correctly when the attention map and the segmentation are forced to be identical.

Figure 5: Qualitative comparison for segmentation performance.
Dice/HD(mm)                        CT            CBCT          Synthetic CT
CT-trained segmentor               0.952/8.91    0.734/42.17   0.825/20.80
Domain-adapted segmentor           0.923/27.84   0.787/23.17   n.a./n.a.
Jointly learned segmentor (ours)   0.950/6.85    n.a./n.a.     0.879/10.33
Table 2: Quantitative evaluation of segmentation performance for joint learning and domain adaptation (columns: test images from the CT, CBCT, and synthetic CT domains).
Figure 6: Qualitative comparison for synthesis quality (CBCT input, CycleGAN [30], DRIT [12], ADN [13], ours).

4.5 Evaluation

Image synthesis and segmentation benefit from each other in joint learning, so we evaluate our model from three aspects. First, we show that joint learning boosts the performance of the segmentors for both domains. Second, we use a vertebra segmentor trained only in the CT domain to compare the quality of the synthetic CT images of our model and the state-of-the-art methods. Third, we compare the MAR performance qualitatively.

Method          Dice/HD(mm) on synthetic CT
CycleGAN [30]   0.727/26.01
DRIT [12]       0.718/20.21
ADN [13]        0.764/21.86
Our model       0.825/20.80
Table 3: Quantitative evaluation for synthesis quality.

Segmentation. The conventional approach to unsupervised segmentation is through domain adaptation or image synthesis. To segment the translated CBCT images [24], we train another segmentor with the same architecture as on paired CT images and masks for 50 epochs. Following [11], we fine-tune it with an adversarial network to learn domain-invariant features from multiple layers on CBCT images for another 50 epochs, and refer to the result as the domain-adapted model. The results are summarized in Table 2. Applying the CT-trained segmentor directly on CBCT images serves as a lower-bound performance (Dice: 0.743) and the predictions are discrete. It benefits from image translation when applied on the translated images, where the performance is acceptable (Dice: 0.825). With domain adaptation, the model achieves a better performance on CBCT images (Dice: 0.787) but a worse performance on CT images (Dice: 0.923). The segmentors with domain adaptation and image translation output plausible predictions, but the performance is affected by metal artifacts and unrelated bones, as shown in Figure 5. The jointly learned segmentor greatly improves the segmentation performance (Dice: 0.879). Moreover, the segmentation performance in the source domain can also be boosted, as the HD of the CT segmentation results is decreased by 2.06 mm.

Modality translation. We use the performance of on synthetic CT images as an anatomy-invariant evaluator to compare the image synthesis performance of our model with the state-of-the-art methods: CycleGAN [30], DRIT [12], and ADN [13]. All the models are trained on our data using their officially released code. As shown in Table 3, our model achieves the best performance with a much larger Dice compared with all the other methods. From Figure 6, we can observe that our model performs significantly better than the other methods in image translation. CycleGAN and DRIT tend to output plausible and realistic CT images but are not able to preserve the anatomical information precisely. As shown in the segmentation results in Figure 6, the bony structures are distorted and scratched. With the help of , ADN can retain most of the anatomical information but not the bone pixels with high intensity, which it may classify as metal artifacts. With the anatomical knowledge from the CT domain, our model learns to output high-quality synthetic CT images while keeping the anatomy consistent.

Metal artifact reduction. Here we only present the MAR performance of ADN and our model, since the distorted anatomical structures make CycleGAN and DRIT less valuable. As shown in Figure 7, our model outperforms ADN. The shadows of streak artifacts and bright bands remain in the synthetic CT images of ADN, while our model suppresses all the artifacts and keeps the bone edges sharp.

Figure 7: Qualitative comparison for MAR performance (CBCT input, ADN [13], ours).

5 Conclusions

We present a unified framework to jointly address three seemingly different tasks: unpaired modality translation, unsupervised segmentation and artifact reduction. In particular, we propose to jointly train encoders and decoders with segmentors and a shape-aware normalization layer to utilize the vertebra-shape knowledge across domains. Extensive experimental results demonstrate that the segmentation accuracy, image synthesis quality and MAR performance of our model are better than those obtained by the state-of-the-art methods and the conventional single-task unsupervised segmentation framework.

References

  • [1] P. F. Christ, F. Ettlinger, F. Grün, M. E. A. Elshaera, J. Lipkova, S. Schlecht, F. Ahmaddy, S. Tatavarty, M. Bickel, P. Bilic, et al. (2017) Automatic liver and tumor segmentation of ct and mri volumes using cascaded fully convolutional neural networks. arXiv preprint arXiv:1702.05970. Cited by: §4.3.
  • [2] Q. Dou, C. Ouyang, C. Chen, H. Chen, and P. Heng (2018) Unsupervised cross-modality domain adaptation of convnets for biomedical image segmentations with adversarial loss. arXiv preprint arXiv:1804.10916. Cited by: §2.3.
  • [3] V. Dumoulin, J. Shlens, and M. Kudlur (2016) A learned representation for artistic style. arXiv preprint arXiv:1610.07629. Cited by: §2.2.
  • [4] E. Gibson, F. Giganti, Y. Hu, E. Bonmati, S. Bandula, K. Gurusamy, B. Davidson, S. P. Pereira, M. J. Clarkson, and D. C. Barratt (2018) Automatic multi-organ segmentation on abdominal ct with dense v-networks. IEEE transactions on medical imaging 37 (8), pp. 1822–1834. Cited by: §4.3.
  • [5] E. Gibson, W. Li, C. Sudre, L. Fidon, D. I. Shakir, G. Wang, Z. Eaton-Rosen, R. Gray, T. Doel, Y. Hu, et al. (2018) NiftyNet: a deep-learning platform for medical imaging. Computer methods and programs in biomedicine 158, pp. 113–122. Cited by: §4.3.
  • [6] B. Glocker, D. Zikic, E. Konukoglu, D. R. Haynor, and A. Criminisi (2013) Vertebrae localization in pathological spine ct via dense classification from sparse annotations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 262–270. Cited by: §4.1.
  • [7] X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2.2.
  • [8] X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 172–189. Cited by: §2.1, §3.2, §3.3.
  • [9] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §2.2.
  • [10] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.1, §3.2, §3.3.
  • [11] K. Kamnitsas, C. Baumgartner, C. Ledig, V. Newcombe, J. Simpson, A. Kane, D. Menon, A. Nori, A. Criminisi, D. Rueckert, et al. (2017) Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International conference on information processing in medical imaging, pp. 597–609. Cited by: §2.3, §4.5.
  • [12] H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51. Cited by: §2.1, Figure 6, §4.5, Table 3.
  • [13] H. Liao, W. Lin, S. K. Zhou, and J. Luo (2019) Artifact disentanglement network for unsupervised metal artifact reduction. arXiv preprint arXiv:1906.01806. Cited by: §2.1, §3.2, §3.3, §3, Figure 6, Figure 7, §4.4, §4.5, Table 3.
  • [14] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. In Advances in neural information processing systems, pp. 700–708. Cited by: §2.1, §3.2.
  • [15] F. Milletari, N. Navab, and S. Ahmadi (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §3.2.
  • [16] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2337–2346. Cited by: §2.2, §3.3.
  • [17] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.2.
  • [18] R. Pauwels, H. Stamatakis, H. Bosmans, R. Bogaerts, R. Jacobs, K. Horner, K. Tsiklakis, and S. P. Consortium (2013) Quantification of metal artifacts on cone beam computed tomography images. Clinical oral implants research 24, pp. 94–99. Cited by: §1.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §3.3.
  • [20] S. Schafer, S. Nithiananthan, D. Mirota, A. Uneri, J. W. Stayman, W. Zbijewski, C. Schmidgunst, G. Kleinszig, A. J. Khanna, and J. Siewerdsen (2011) Mobile c-arm cone-beam ct for guidance of spine surgery: image quality, radiation dose, and integration with interventional guidance. Medical physics 38 (8), pp. 4563–4574. Cited by: §1.
  • [21] J. H. Siewerdsen (2011) Cone-beam ct with a flat-panel detector: from image science to image-guided surgery. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment 648, pp. S241–S250. Cited by: §1.
  • [22] A. A. Taha and A. Hanbury (2015) An efficient algorithm for calculating the exact hausdorff distance. IEEE transactions on pattern analysis and machine intelligence 37 (11), pp. 2153–2163. Cited by: §4.3.
  • [23] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2017) Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6924–6932. Cited by: §2.2.
  • [24] J. Yang, N. C. Dvornek, F. Zhang, J. Chapiro, M. Lin, and J. S. Duncan (2019) Unsupervised domain adaptation via disentangled representations: application to cross-modality liver segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 255–263. Cited by: §2.3, §4.5.
  • [25] Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang (2018) Low-dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss. IEEE transactions on medical imaging 37 (6), pp. 1348–1357. Cited by: §1.
  • [26] J. Yao, J. E. Burns, D. Forsberg, A. Seitel, A. Rasoulian, P. Abolmaesumi, K. Hammernik, M. Urschler, B. Ibragimov, R. Korez, et al. (2016) A multi-center milestone study of clinical vertebral ct segmentation. Computerized Medical Imaging and Graphics 49, pp. 16–28. Cited by: §4.1.
  • [27] Y. Zhang and H. Yu (2018) Convolutional neural network based metal artifact reduction in x-ray computed tomography. IEEE Trans. Med. Imaging 37 (6), pp. 1370–1381. Cited by: §1.
  • [28] Y. Zhang, L. Zhang, X. R. Zhu, A. K. Lee, M. Chambers, and L. Dong (2007) Reducing metal artifacts in cone-beam ct images by preprocessing projection data. International Journal of Radiation Oncology* Biology* Physics 67 (3), pp. 924–932. Cited by: §1.
  • [29] Z. Zhang, L. Yang, and Y. Zheng (2018) Translating and segmenting multimodal medical volumes with cycle-and shape-consistency generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9242–9251. Cited by: §2.1.
  • [30] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.1, §3.2, Figure 6, §4.5, Table 3.