More Knowledge is Better: Cross-Modality Volume Completion and 3D+2D Segmentation for Intracardiac Echocardiography Contouring

by   Haofu Liao, et al.
University of Rochester

Using catheter ablation to treat atrial fibrillation increasingly relies on intracardiac echocardiography (ICE) for an anatomical delineation of the left atrium and the pulmonary veins that enter the atrium. However, it is a challenge to build an automatic contouring algorithm because ICE is noisy and provides only a limited 2D view of the 3D anatomy. This work provides the first automatic solution to segment the left atrium and the pulmonary veins from ICE. In this solution, we demonstrate the benefit of building a cross-modality framework that can leverage a database of diagnostic images to supplement the less available interventional images. To this end, we develop a novel deep neural network approach that uses the (i) 3D geometrical information provided by a position sensor embedded in the ICE catheter and the (ii) 3D image appearance information from a set of computed tomography cardiac volumes. We evaluate the proposed approach over 11,000 ICE images collected from 150 clinical patients. Experimental results show that our model is significantly better than a direct 2D image-to-image deep neural network segmentation, especially for less-observed structures.





1 Introduction

Atrial fibrillation (AF) affects about 2% to 3% of the population in Europe and North America as of 2014 [13]. One of its treatments is to perform catheter ablation to destroy the cardiac tissues causing the abnormal electrical signal. During catheter ablation, intracardiac echocardiography (ICE) is often used to guide the intervention. Compared with other imaging modalities such as transoesophageal echocardiography, ICE provides better patient tolerance, requiring no general anesthesia [2]. The junction (ostia) of the pulmonary veins with the left atrium (LA) (see Fig. 1 (a)) is usually where the catheter ablation is performed [8]. However, due to the limitations of 2D ICE, these 3D anatomical structures may only be partially viewed. This can introduce difficulties for the electrophysiologists as well as for automated analysis algorithms. Fortunately, modern ICE devices are equipped with an embedded position sensor that measures the 3D location of the ICE transducer. The spatial geometry information associated with the ICE image is key to this study.

Existing approaches to 2D echocardiogram segmentation focus only on a single cardiac chamber such as the left ventricle (LV) [5, 9, 11] or the LA [1]. They are designed to distinguish between the blood tissues and the endocardial structures, which is relatively easy due to the significant difference in appearance. When it comes to multiple cardiac components (chambers and their surrounding structures), where the boundaries cannot be clearly recognized, these methods may fail. To the best of our knowledge, this paper is the first to handle multi-component echocardiogram segmentation from 2D ICE images.

Figure 1: (a) Graphical illustration of LA and its surrounding structures: blue-LA, green-left atrial appendage (LAA), red-left inferior pulmonary vein (LIPV), purple-left superior pulmonary vein (LSPV), white-right inferior pulmonary vein (RIPV), yellow-right superior pulmonary vein (RSPV). (b) 3D sparse ICE volume generation using the location information associated with each ICE image.

Recently, deep convolutional neural networks (CNNs) have achieved unprecedented success in medical image analysis, including segmentation [12]. However, our baseline method of training a CNN to directly generate segmentation masks from 2D ICE images does not demonstrate satisfactory performance, especially for the less-observed pulmonary veins. Such a baseline relies solely on the brute force of big data to cover all possible variations, which is difficult to achieve. To go beyond brute force, we further integrate knowledge to boost contouring performance. Such knowledge stems from two sources: (i) 3D geometry information provided by a position sensor embedded inside an ICE catheter, and (ii) 3D image appearance information exemplified by cross-modality computed tomography (CT) volumes that contain the same anatomical structures.

2 Method

The proposed method consists of three parts. Using the 3D geometry knowledge, we first form a 3D sparse volume based on the 2D ICE images. Then, to tap into the 3D image appearance knowledge, we design a multi-task 3D network with an adversarial formulation. The network performs cross-modality volume completion and sparse volume segmentation simultaneously for collaborative structural understanding and consistency. Finally, taking as inputs both the original 2D ICE image and the 2D mask projected from the generated 3D mask, we design a network to refine the 2D segmentation results.

We form a 3D sparse ICE volume from a set of 2D ICE images, each of which includes part of the heart in its field of view and carries a 3D position from a magnetic localization system. As shown in Fig. 1(b), we use the location information to map all ICE images (left) to 3D space (middle), thus forming a sparse ICE volume (right). The generated sparse ICE volume keeps the spatial relationships among individual ICE views. A segmentation method based on the sparse volume can take advantage of this for better anatomical understanding and consistency.
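The mapping from 2D frames to a sparse volume can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_sparse_volume`, the pixel convention (column, row, 0, 1), a volume whose origin coincides with the world origin, nearest-voxel rounding, and averaging of overlapping pixels are all hypothetical choices.

```python
import numpy as np

def build_sparse_volume(images, poses, vol_shape, voxel_size_mm):
    """Scatter 2D frames into a 3D grid (hypothetical helper).

    images: list of 2D float arrays of shape (H, W)
    poses:  list of 4x4 homogeneous matrices; each is ASSUMED to map a
            homogeneous pixel coordinate (col, row, 0, 1) to world
            coordinates in millimetres
    """
    volume = np.zeros(vol_shape, dtype=np.float32)
    counts = np.zeros(vol_shape, dtype=np.int32)
    bounds = np.array(vol_shape)[:, None]
    for img, pose in zip(images, poses):
        h, w = img.shape
        rows, cols = np.mgrid[0:h, 0:w]
        # Homogeneous pixel coordinates, shape (4, H*W), in row-major order.
        pix = np.stack([cols.ravel(), rows.ravel(),
                        np.zeros(h * w), np.ones(h * w)])
        world = (pose @ pix)[:3]
        # Nearest voxel index for every pixel (assumed volume origin at 0).
        idx = np.rint(world / voxel_size_mm).astype(int)
        inside = np.all((idx >= 0) & (idx < bounds), axis=0)
        x, y, z = idx[:, inside]
        # Accumulate intensities; several pixels may hit the same voxel.
        np.add.at(volume, (x, y, z), img.ravel()[inside])
        np.add.at(counts, (x, y, z), 1)
    filled = counts > 0
    volume[filled] /= counts[filled]   # average overlapping contributions
    return volume, filled
```

The returned boolean array `filled` marks which voxels were observed by at least one frame, which is the information that distinguishes a sparse volume from a dense one.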

2.1 3D Sparse Volume Segmentation and Completion

The architecture of the proposed 3D segmentation and completion network (3D-SCNet) is illustrated in Fig. 2(a). The network consists of a generator $G_{3d}$ and two discriminators $D^{s}_{3d}$ and $D^{c}_{3d}$. Taking the sparse ICE volume $x$ as input, $G_{3d}$ performs 3D segmentation and completion simultaneously, and outputs a segmentation map $\hat{y}^{s}$ as well as a dense volume $\hat{y}^{c}$. During training, the ground truth of $\hat{y}^{c}$ is a CT volume instead of a dense ICE volume, as we lack training data of the latter. The ICE images and the CT volumes are from completely different patients. This inherently creates a challenging cross-modality volume completion problem with unpaired data. We target this problem through adversarial learning and mesh pairing (see Sec. 3). The two discriminators judge the realness of the outputs from the generator. When trained adversarially together with the generator, they make sure the generator’s outputs are more perceptually realistic. Following conditional GAN [4], we also allow the discriminators to take $x$ as the input to further improve adversarial training.

(a) 3D-SCNet
(b) 2D-RefineNet
Figure 2: The network architectures of the proposed method.

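Adversarial loss. The segmentation task and completion task are trained jointly in a multi-task learning (MTL) fashion [3]. The adversarial loss for a task $t \in \{s, c\}$ (segmentation or completion) can be written as

$$\mathcal{L}^{t}_{adv} = \mathbb{E}_{(x, y^{t}) \sim p(x, y^{t})}\left[\log D^{t}_{3d}(x, y^{t})\right] + \mathbb{E}_{x \sim p(x)}\left[\log\left(1 - D^{t}_{3d}(x, \hat{y}^{t})\right)\right], \qquad (1)$$

where $p(x, y^{t})$ and $p(x)$ denote the data distributions, $x$ is the sparse ICE volume, $D^{t}_{3d}$ is the discriminator of task $t$, and $\hat{y}^{t}$ is the output of the generator $G_{3d}$ for task $t$. For real data $y^{t}$, i.e., the ground truth segmentation map or CT volume, $D^{t}_{3d}$ is trained to predict a “real” label. For the generated data $\hat{y}^{t}$, $D^{t}_{3d}$ learns to give a “fake” label. On the other hand, the generator $G_{3d}$ is trained to deceive $D^{t}_{3d}$ by making $\hat{y}^{t}$ as “real” as possible.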
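To make the adversarial objective concrete, the sketch below evaluates it from discriminator outputs. It assumes the standard conditional GAN form of [4], E[log D(x, y)] + E[log(1 - D(x, G(x)))], with sigmoid outputs in (0, 1); the function name and the clipping constant `eps` are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def adversarial_loss(d_real, d_fake, eps=1e-7):
    """Value of the (assumed) conditional GAN objective.

    d_real: discriminator outputs D(x, y) on ground-truth pairs
    d_fake: discriminator outputs D(x, G(x)) on generated pairs
    Both are arrays of probabilities in [0, 1].
    """
    # Clip to avoid log(0); eps is a hypothetical numerical safeguard.
    d_real = np.clip(np.asarray(d_real, dtype=np.float64), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=np.float64), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# The discriminator tries to increase this value (real -> 1, fake -> 0),
# while the generator tries to decrease it by pushing d_fake towards 1.
undecided = adversarial_loss([0.5, 0.5], [0.5, 0.5])
confident = adversarial_loss([1.0, 1.0], [0.0, 0.0])
```

An undecided discriminator (all outputs 0.5) gives 2 log 0.5, whereas a discriminator that separates real from generated data perfectly drives the value towards its maximum of 0.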

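Reconstruction loss. Adversarial loss alone, however, does not give a strong structural regularization to the training [6]. Hence, we use a reconstruction loss to measure the pixel-level error between the generator outputs and the ground truths. For the segmentation task, we first convert the score map to a multi-channel map, with each channel denoting the binary segmentation map of a target anatomy, and then apply an L2 loss between the predicted segmentation map $\hat{y}^{s}$ and the ground truth $y^{s}$. For the completion task, the L1 loss between the completed volume $\hat{y}^{c}$ and the ground truth CT volume $y^{c}$ is measured. We use the L1 loss instead of the L2 loss for this task because outputs trained with L2 losses are usually overly smoothed. The total loss of the sparse volume segmentation and completion network is given by

$$\mathcal{L}_{3d} = \sum_{t \in \{s, c\}} \left( \lambda^{t}_{adv} \mathcal{L}^{t}_{adv} + \lambda^{t}_{rec} \mathcal{L}^{t}_{rec} \right), \qquad (2)$$

where $\mathcal{L}^{t}_{adv}$ and $\mathcal{L}^{t}_{rec}$ are the adversarial and reconstruction losses of task $t$ (segmentation $s$ or completion $c$), and $\lambda^{t}_{adv}$ and $\lambda^{t}_{rec}$ balance the importance of the adversarial loss and reconstruction loss, respectively.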
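The reconstruction terms and their weighted combination can be illustrated as below. This is a sketch under assumptions: the ground truth is taken to be an integer label map expanded into one binary channel per anatomy, the losses are averaged over voxels, and the weights used in the example are placeholders, not the values used in the experiments.

```python
import numpy as np

def one_hot(labels, num_classes):
    """(D, H, W) integer labels -> (C, D, H, W) binary channel maps."""
    classes = np.arange(num_classes).reshape(-1, 1, 1, 1)
    return (classes == labels[None]).astype(np.float32)

def l2_loss(pred, target):
    """Mean squared error, used here for the segmentation task."""
    return float(np.mean((pred - target) ** 2))

def l1_loss(pred, target):
    """Mean absolute error, used here for the completion task."""
    return float(np.mean(np.abs(pred - target)))

def total_loss(losses, weights):
    """Weighted sum over tasks.

    losses, weights: dicts keyed by task name ('s' or 'c'), each value a
    tuple (adversarial, reconstruction).
    """
    return sum(weights[t][0] * losses[t][0] + weights[t][1] * losses[t][1]
               for t in losses)

# Placeholder numbers purely for illustration.
example = total_loss({"s": (1.0, 2.0), "c": (3.0, 4.0)},
                     {"s": (1.0, 10.0), "c": (0.5, 2.0)})
```

Keeping one (adversarial, reconstruction) weight pair per task mirrors the description above, in which each task contributes both kinds of loss to a single training objective.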

Architecture details. We use a 3D UNet-like network [7] as the generator. There are 8 consecutive downsampling blocks followed by 8 consecutive upsampling blocks in the network. We use skip connections to shuttle feature maps between two symmetric blocks. Each downsampling block contains a 3D convolutional layer, a batch normalization layer and a leaky ReLU layer. Similarly, each upsampling block contains a 3D deconvolutional layer, a batch normalization layer and a ReLU layer. The convolutional and deconvolutional layers have the same parameter settings: kernel size, stride size and padding size. Finally, a function is attached at the end of the generator to bound the network outputs. The two discriminators have identical network architectures, each having 3 downsampling blocks followed by a 3D convolutional layer and a sigmoid layer. The downsampling blocks for the discriminators are the same as the ones used in the generator. The final 3D convolutional layer (kernel size, stride size and padding size) and sigmoid layer are used for realness classification.
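The symmetry between the downsampling and upsampling paths can be checked with simple shape arithmetic. The sketch below uses the usual output-size formulas for strided convolution and transposed convolution; the kernel size 4, stride 2, padding 1 and the input side length 256 are assumptions chosen for illustration only, since the actual settings are not reproduced in the text above.

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Output length of a strided convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis."""
    return (n - 1) * stride - 2 * padding + kernel

def encoder_sizes(n, blocks):
    """Side length after each downsampling block."""
    sizes = []
    for _ in range(blocks):
        n = conv_out(n)
        sizes.append(n)
    return sizes

def decoder_sizes(n, blocks):
    """Side length after each upsampling block."""
    sizes = []
    for _ in range(blocks):
        n = deconv_out(n)
        sizes.append(n)
    return sizes

# With the ASSUMED settings, each block halves or doubles the side length,
# so block i of the decoder matches block (8 - i) of the encoder, which is
# what allows skip connections between symmetric blocks.
down = encoder_sizes(256, 8)
up = decoder_sizes(down[-1], 8)
```

Under these assumed settings a side length of 256 is reduced to 1 after 8 downsampling blocks and restored to 256 after 8 upsampling blocks.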

2.2 2D Contour Refinement

As shown in Fig. 2(b), the 2D refinement network (2D-RefineNet) has a similar structure to the 3D-SCNet. In fact, its generator $G_{2d}$ and discriminator $D_{2d}$ have almost the same structure as their 3D counterparts, except that the convolutional and deconvolutional layers are now in 2D. The inputs to the 2D-RefineNet are a 2D ICE image $x_{2d}$ together with its corresponding 2D segmentation map $\hat{y}^{s}_{2d}$, where $\hat{y}^{s}_{2d}$ is obtained by projecting the 3D segmentation map predicted by the 3D-SCNet onto $x_{2d}$. The training of the 2D-RefineNet is also performed in an adversarial fashion, and conditional GAN is used to allow the discriminator to observe the generator inputs. We compute the adversarial loss $\mathcal{L}^{2d}_{adv}$ the same way as Eq. (1) and use the L2 distance between the refinement network output and the ground truth 2D segmentation map as the reconstruction loss $\mathcal{L}^{2d}_{rec}$. The total loss is

$$\mathcal{L}_{2d} = \lambda^{2d}_{adv} \mathcal{L}^{2d}_{adv} + \lambda^{2d}_{rec} \mathcal{L}^{2d}_{rec}, \qquad (3)$$

where $\lambda^{2d}_{adv}$ and $\lambda^{2d}_{rec}$ are the corresponding balancing coefficients.
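The projection of a 3D segmentation mask back onto a 2D ICE frame, which produces the second input of the refinement network, can be sketched as a nearest-neighbour lookup. The function name, the pixel convention (column, row, 0, 1), the volume origin at the world origin and the nearest-voxel sampling are assumptions made for this illustration.

```python
import numpy as np

def project_mask_to_image(mask3d, pose, image_shape, voxel_size_mm):
    """Sample a 3D label volume on the plane of one 2D frame.

    mask3d:      integer label volume of shape (X, Y, Z)
    pose:        4x4 homogeneous matrix, ASSUMED to map a homogeneous
                 pixel coordinate (col, row, 0, 1) to world millimetres
    image_shape: (H, W) of the 2D frame
    """
    h, w = image_shape
    rows, cols = np.mgrid[0:h, 0:w]
    pix = np.stack([cols.ravel(), rows.ravel(),
                    np.zeros(h * w), np.ones(h * w)])
    # Nearest voxel for every pixel of the frame.
    idx = np.rint((pose @ pix)[:3] / voxel_size_mm).astype(int)
    bounds = np.array(mask3d.shape)[:, None]
    inside = np.all((idx >= 0) & (idx < bounds), axis=0)
    # Pixels that fall outside the volume keep the background label 0.
    out = np.zeros(h * w, dtype=mask3d.dtype)
    x, y, z = idx[:, inside]
    out[inside] = mask3d[x, y, z]
    return out.reshape(h, w)
```

Because the lookup uses the same per-frame matrix that places the frame in 3D, the projected mask is aligned pixel by pixel with the ICE image it accompanies.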

3 Experiments

Figure 3: Sparse volume segmentation and completion results for 2 cases. (a) Sparse ICE volume; (b) Completed CT volume; (c) the paired “ground truth” CT volume; (d) Predicted and (e) Ground truth 3D segmentation map.

Dataset and preprocessing. The left atrial ICE images used in this study are collected using a clinical system, with each image associated with a homogeneous matrix that projects the ICE image to a common coordinate system. We perform both 2D and 3D annotations on the ICE images for the cardiac components of interest, i.e., LA, LAA, LIPV, LSPV, RIPV and RSPV. For the 2D annotations, contours of all the plausible components in the current view are annotated. For the 3D annotations, ICE images from the same patient and at the same cardiac phase are first projected to 3D, and 3D mesh models of the target components are then manually annotated. (While in clinical practice multiple 2D ICE clips are acquired to dynamically image a patient’s LA anatomy, here we focus on a stack of 2D ICE images, with often one gated frame per clip, and leave dynamic modeling for future study.) 3D segmentation masks are generated using these mesh models. In total, the whole database has 150 patients. For each patient, there are 20-80 gated frames for use. We have 3D annotations for all 150 patients. For 2D annotations, we annotated 100 patients, resulting in a total of 11,782 annotated ICE images. By anatomical component, we have in 2D 4669 LA, 1104 LAA, 1799 LIPV, 1603 LSPV, 1309 RIPV, and 1298 RSPV annotations. Thus, the LA is the most frequently observed structure, while the LAA and PVs are observed less often. For a subset of 1568 2D ICE images, we have 2-3 expert annotations per image to compute the inter-rater reliability (IRR).

As we do not have dense ICE volumes available for training, we use CT volumes instead as the ground truth for the completion task. Each CT volume has an annotated LA mesh model. To pair with a sparse ICE volume, we pick the CT volume whose LA mesh model is closest to that of the target sparse ICE volume using Procrustes analysis [10]. In total, 414 CT volumes are available, which gives enough anatomical variability for the mesh pairing. All the data used for 3D training are augmented with random perturbations in scale, rotation and translation to increase the generalizability of the model.
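The mesh pairing step can be sketched with a small Procrustes routine. This is an illustration under assumptions: the two LA meshes are taken to have the same number of vertices in point-to-point correspondence, the alignment removes translation, scale and an orthogonal transform (which, as in the orthogonal Procrustes problem [10], also admits reflections), and the function names are hypothetical.

```python
import numpy as np

def procrustes_distance(a, b):
    """Residual between two (N, 3) point sets after optimal alignment.

    Both sets are centred and scaled to unit Frobenius norm; the best
    orthogonal transform and scale then leave a residual of
    1 - (sum of singular values of a0^T b0)^2.
    """
    a0 = a - a.mean(axis=0)
    b0 = b - b.mean(axis=0)
    a0 = a0 / np.linalg.norm(a0)
    b0 = b0 / np.linalg.norm(b0)
    singular_values = np.linalg.svd(a0.T @ b0, compute_uv=False)
    return float(1.0 - singular_values.sum() ** 2)

def pick_closest(ice_mesh, ct_meshes):
    """Index of the CT mesh with the smallest Procrustes residual."""
    distances = [procrustes_distance(ice_mesh, m) for m in ct_meshes]
    return int(np.argmin(distances))
```

A residual near 0 means the two shapes differ only by a similarity transform, so the CT volume attached to the selected mesh is the most plausible stand-in for the missing dense ICE volume.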

Training and evaluation. We train the 3D-SCNet and 2D-RefineNet using Adam optimization with , , . The 3D-SCNet is trained for about epochs with , , , . The 2D-RefineNet is also trained for about epochs with , . All balancing coefficients are chosen empirically and we train the models using 5-fold cross-validation. The segmentation results are evaluated using the Dice metric and the average symmetric surface distance (ASSD).
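For reference, the two evaluation metrics can be sketched for binary 2D masks as follows. The boundary definition (4-connected), the brute-force distance computation and the isotropic `spacing_mm` argument are assumptions of this illustration; the evaluation code used in the experiments may differ.

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary masks (1.0 if both are empty)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0
    return float(2.0 * np.logical_and(pred, gt).sum() / denom)

def boundary(mask):
    """Foreground pixels with at least one 4-connected background neighbour."""
    mask = mask.astype(bool)
    m = np.pad(mask, 1)  # pad with background
    interior = (m[1:-1, 1:-1] & m[:-2, 1:-1] & m[2:, 1:-1]
                & m[1:-1, :-2] & m[1:-1, 2:])
    return mask & ~interior

def assd(pred, gt, spacing_mm=1.0):
    """Average symmetric surface distance; assumes both masks are non-empty."""
    p = np.argwhere(boundary(pred)) * spacing_mm
    g = np.argwhere(boundary(gt)) * spacing_mm
    # Pairwise Euclidean distances between the two boundary point sets.
    d = np.linalg.norm(p[:, None, :] - g[None, :, :], axis=-1)
    return float((d.min(axis=1).sum() + d.min(axis=0).sum())
                 / (len(p) + len(g)))
```

Dice rewards overlap of the segmented regions, while ASSD measures how far apart the contours are on average, which is why the two are reported together.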

(a) Ground truth
(b) 2D only
(c) 3D only
(d) 2D + 3D
Figure 4: Samples of 2D ICE contouring results from different models.

Results. The outputs from the 3D network model are shown in Fig. 3. We can observe that the model not only gives satisfying segmentation outputs, Fig. 3(d), but also gives a good estimation of the CT volume, Fig. 3(b). In particular, we note that the estimated completion outputs do not structurally match the “ground truth” exactly but instead try to match the content from the sparse volume. Since the “ground truth” CT volume is paired based on mesh models, this difference is expected. It demonstrates that the completion outputs are based on the sparse volume and the system only tries to complete the missing region such that it looks like a “real” CT volume. We also quantitatively evaluate the performance of the 3D sparse volume segmentation (before projecting to 2D) and obtain the following Dice scores: LA (89.5%), LAA (50.0%), LIPV (52.9%), LSPV (43.4%), RIPV (62.43%), RSPV (57.6%) and overall (86.1%). This shows that, using the limited information from sparse volumes, our model can still achieve a satisfactory 3D segmentation performance. As we will show in later experiments, the segmentation accuracy is in fact even higher in the regions where 2D ICE images are present. We also notice that it is vital to use the 3D appearance information: in our experiment of learning the 3D network without the 3D appearance information from CT, the training fails to converge.

Fig. 4 shows the 2D ICE contouring results using different models: the “2D only” model that is trained directly with the 2D ICE images, the “3D only” model that projects the predicted 3D segmentation results onto the corresponding ICE image, and the “2D + 3D” model that refines the outputs from 3D-SCNet using 2D-RefineNet. We observe from the first row that the “3D only” outputs give a better estimation of the PVs (red and orange) than the “2D only” outputs. This is because the PVs in the current 2D ICE view are not clearly visible, which is challenging for the “2D only” model. The “3D only” model, in contrast, makes use of the information from other views and hence better predicts the PV locations. Finally, we see that the outputs from the “2D + 3D” model combine the knowledge from both the 2D and 3D models and are generally superior to the outputs of these two models. Similar results can also be found in the second row, where the “2D + 3D” model not only predicts the location of the PVs (purple and brown) better by making use of the 3D information but also refines the output according to the 2D view.

         2D only          3D only          2D + 3D          IRR
         Dice    ASSD     Dice    ASSD     Dice    ASSD     Dice    ASSD
LA       94.3    0.623    93.5    0.693    95.4    0.537    89.6    1.340
LAA      68.2    1.172    66.5    1.206    71.2    1.106    68.8    1.786
LIPV     70.1    0.918    71.7    0.904    72.4    0.856    69.9    1.459
LSPV     65.9    1.275    67.8    0.916    71.1    1.197    62.9    1.582
RIPV     69.6    0.927    71.7    0.889    73.8    0.786    71.4    1.378
RSPV     63.3    0.872    70.4    0.824    70.5    0.862    57.8    1.633
Total    91.0    0.839    89.8    0.834    92.1    0.791    88.6    1.432

Table 1: 2D segmentation accuracy of different models. The results are evaluated in terms of Dice metric (%) and ASSD (mm).

The quantitative results of these models are given in Table 1. The “3D only” model in general has better performance on the PVs and worse performance on the LA and LAA than the “2D only” model. This is because the LA and LAA usually have a clear view in 2D ICE images, unlike the PVs. The “2D + 3D” model combines the advantages of the “2D only” and “3D only” models and in general yields the best performance. The IRR scores from human experts are relatively low, especially for the LSPV and RSPV. This is expected, as these two structures are difficult to view with ICE and there is variability in how far distally experts do their annotation. The IRR scores are generally lower than those from our models, which demonstrates a benefit of using an automatic segmentation model: better consistency.

4 Conclusions and Future Work

We present a knowledge fusion + deep learning approach to ICE contouring of multiple LA components. It uses 3D geometry and cross-modality appearance knowledge for better anatomical understanding and structural consistency. Then, it refines the contours in 2D by exploiting the detailed 2D appearance information. We show that the proposed model indeed benefits from the integrated knowledge and gives superior performance to the models trained individually. In the future, we will investigate the use of temporal information for better modeling and the clinical utility of the generated dense 3D cross-modality views.


  • [1] Allan, G., Nouranian, S., Tsang, T., Seitel, A., Mirian, M., Jue, J., Hawley, D., Fleming, S., Gin, K., Swift, J., et al.: Simultaneous analysis of 2D echo views for left atrial segmentation and disease detection. IEEE TMI 36(1), 40–50 (2017)
  • [2] Bartel, T., Müller, S., Biviano, A., Hahn, R.T.: Why is intracardiac echocardiography helpful? Benefits, costs, and how to learn. European Heart Journal 35(2), 69–76 (2013)
  • [3] Caruana, R.: Multitask learning. In: Learning to learn, pp. 95–133. Springer (1998)
  • [4] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proc. CVPR. pp. 1125–1134 (2017)

  • [5] Lin, N., Yu, W., Duncan, J.S.: Combinative multi-scale level set framework for echo image segmentation. Medical Image Analysis 7(4), 529–537 (2003)
  • [6] Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., Efros, A.A.: Context encoders: Feature learning by inpainting. In: Proc. CVPR. pp. 2536–2544 (2016)
  • [7] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Proc. MICCAI. pp. 234–241. Springer (2015)
  • [8] Sánchez-Quintana, D., López-Mínguez, J.R., Macías, Y., Cabrera, J.A., Saremi, F.: Left atrial anatomy relevant to catheter ablation. Cardiology Research and Practice (2014)
  • [9] Sarti, A., Corsi, C., Mazzini, E., Lamberti, C.: Maximum likelihood segmentation of ultrasound images with Rayleigh distribution. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 52(6), 947–960 (2005)
  • [10] Schönemann, P.H.: A generalized solution of the orthogonal Procrustes problem. Psychometrika 31(1), 1–10 (1966)
  • [11] Zhou, S.K.: Shape regression machine and efficient segmentation of left ventricle endocardium from 2D B-mode echocardiogram. Medical Image Analysis 14(4), 563–581 (2010)
  • [12] Zhou, S., Shen, D., Greenspan, H. (eds.): Deep learning for medical image analysis. Academic Press (2017)
  • [13] Zoni-Berisso, M., Lercari, F., Carazza, T., Domenicucci, S.: Epidemiology of atrial fibrillation: European perspective. Clinical Epidemiology 6, 213–20 (2014)