1 Introduction
In medical imaging, observing realistic organ shape is a critical step in enabling health care professionals to gain better insight into a patient's body. Accurate depiction of internal organs, such as the liver, often allows for more accurate health screening and early diagnosis, as well as planning of procedures such as radiation therapy that target specific locations in the human body. Delineating 3D organ shape from 2D X-ray images is an extremely difficult and unsolved problem in biomedical engineering today due to visual ambiguities and information loss caused by projection. The goal of this problem is to accurately predict the shape of the observed 3D organ given a single image.
Existing liver delineation techniques typically produce organ shape from computed tomography (CT) scans. The procedures to obtain these scans involve long patient-doctor interaction time, costly machinery, and exposure to a high dose of radiation. The practical challenges in obtaining these scans may preclude obtaining accurate organ depictions. In addition, existing delineation tools [30] delineate (either automatically or semi-automatically) the two-dimensional shape in each slice of the three-dimensional CT volume and combine the set of predictions into a three-dimensional shape. This intermediate processing may introduce an additional source of error to the overall shape prediction quality due to the lack of spatial context.
The key idea of this paper is to reconstruct 3D organ shape from topograms, which are projected 2D images from tomographic devices, such as X-ray [26]. These types of images can be much more easily obtained and are often used by medical professionals for planning purposes [18, 24]. Motivated by the significant advances in deep learning techniques for organ segmentation [31] and representation learning on 3D data [8, 25, 21], we pose the problem of organ reconstruction as the task of predicting 3D shape from a single image. Further, we describe an automatic delineation procedure that outputs the shape from the topogram image only, as well as a semi-automatic extension, where we allow the user to outline the approximate two-dimensional mask and use it (in conjunction with the topogram) to obtain a more accurate 3D shape prediction.
Our system has two components: a generative shape model, composed of a shape encoder and decoder, and an encoder from 2D observations (topogram only, or topogram and mask). The shape encoder and decoder form a variational autoencoder (VAE) [14] generative model that represents each shape observation using a compact low-dimensional representation. The topogram and optional mask encoders (whose architectures are similar to [29]) map the partial observations from images (and masks, when provided) to the coordinates of the corresponding shape observations. The entire system is optimized end-to-end, simultaneously learning a generative shape space that covers complex shape variations from the 3D supervision and inferring shapes from input 2D observations. To validate our approach, we collected a new medical dataset of abdominal CT scans and topogram images, and evaluated the proposed approach on the challenging tasks of 3D liver reconstruction and volume prediction. The contributions of our work are:

An automatic and a semi-automatic approach to perform 3D organ reconstruction from 2D topograms, allowing automatic 3D shape prediction from the topogram only and a more refined prediction where a 2D mask annotation is available.

An evaluation of our method on accurate 3D organ volume estimation and reconstruction applications.
2 Related Work
In the medical imaging domain, extraction and visualization of 3D organs is a key step in clinical applications such as surgical planning and post-surgical assessment, as well as pathology detection and disease diagnosis. Of particular interest is the liver, which can exhibit highly heterogeneous shape variation that makes it even more difficult to segment. Previously, liver volume was segmented semi-automatically [9] or automatically using statistical shape models [10], sigmoid-edge modelling [6], graph-cut [15], and others (see [19] for an overview). Recently, automatic deep learning based methods [4, 5, 17] have been shown to provide impressive results on this task. However, these methods need a CT scan procedure, which is costly and requires a high radiation exposure. On the other hand, X-ray and topogram images are easier to obtain, require less radiation, and are often used by medical professionals for planning purposes [18, 24].
Shape extraction from X-ray is particularly complex as its projective nature can produce complex or fuzzy textures, boundaries, and anatomical part overlap [31]. To mitigate these challenges, traditional methods use prior knowledge, such as motion patterns [32] or intensity and background analysis [22], in order to perform X-ray segmentation. More recent methods [23] focus on learning to segment using deep neural networks. For example, [1] decomposes X-ray into non-overlapping components, [30] uses a generative adversarial network (GAN) [29] to improve segmentation quality, and [31] applies unpaired image-to-image translation techniques to learn to segment X-ray by observing CT scan segmentation. These methods achieve remarkable results on 2D shape delineation and segmentation tasks.

In parallel, in the computer vision domain, deep generative 3D shape models based on variational autoencoder networks (VAE) [8, 25] and generative adversarial networks (GAN) [29] have shown superior performance in learning to generate complex shape topologies. Combined with a mapping from image space, these methods are able to infer 3D shape predictions from 2D observations. To obtain more detailed and accurate predictions, input annotations, such as landmarks or masks, are often used to guide the synthesis process. [3] incorporates 2D landmarks for alignment optimization of a skinned vertex-based human shape model to image observations. [12] and [27] apply landmark annotations to guide synthesis of the observed 3D shape in input images. [2] uses landmarks and [7] incorporates silhouettes to formulate additional objective terms that improve performance in 3D shape reconstruction and synthesis problems. To the best of our knowledge, we are the first to propose both automatic and semi-automatic approaches to 3D organ shape reconstruction from topograms.
3 Overview
An overview of our training pipeline can be seen in Figure 1. Our system consists of several key components: a generative shape model and a set of encoders from 2D observations. The generative model is composed of an encoder and a decoder, where the encoder maps the 3D shapes of organs to their coordinates in the latent space and the decoder reconstructs the shapes back from their coordinates. The first observation encoder is the topogram encoder, which maps two-dimensional observations to the coordinates of the corresponding shapes. The second observation encoder is the joint topogram and mask encoder, which predicts the latent coordinate of the organ shape given the 2D mask and topogram. The mask information, when provided, helps generate a more accurate prediction.
The organ shape prediction approach is very general and can be used for organs other than the human liver. The technique requires access to a database of shape and X-ray (two-dimensional observation) pairs. We demonstrate an accuracy improvement using user input in the form of 2D masks. Other types of input that can be encoded using a neural network can also be applied in place of masks to improve prediction accuracy.
3.0.1 Generative Model
As input, our system receives a set of examples $\{(X_i, I_i)\}_{i=1}^{N}$, where $X_i$ is the example shape and $I_i$ is the corresponding topogram image observation. The generative model consists of an encoding component $E$ and a decoding component $D$. Here, $E$ maps a shape $X$ to its latent coordinate $z = E(X)$ in a stochastic low-dimensional space distributed according to a prior distribution $p(z)$, and $D$ maps the latent coordinate back to the shape space, $\hat{X} = D(z)$. The loss function of the generative model is composed of a reconstruction loss $L_{rec}$ and a distribution loss $L_{dist}$, as is typical for variational autoencoder training. $L_{rec}$ is the binary cross entropy (BCE) error that measures the difference between the ground truth shape $X$ and the predicted shape $\hat{X}$:

$$L_{rec} = -\frac{1}{n} \sum_{j=1}^{n} \left[ X^{(j)} \log \hat{X}^{(j)} + \left(1 - X^{(j)}\right) \log \left(1 - \hat{X}^{(j)}\right) \right], \quad (1)$$

where $\hat{X} = D(E(X))$ and $n$ is the number of voxels. $L_{dist}$ is the distribution loss that enforces the latent distribution of $z = E(X)$ to match its prior distribution $p(z) = \mathcal{N}(0, I)$. The two terms are combined as $w_{rec} L_{rec} + w_{dist} L_{dist}$, where $w_{rec}$ and $w_{dist}$ are the weights applied to each type of loss.
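The two loss terms above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the weight names `w_rec` and `w_kl` are placeholders for the loss weights, and a standard-normal prior is assumed.

```python
import numpy as np

def bce_reconstruction_loss(x_true, x_pred, eps=1e-7):
    """Binary cross entropy between ground-truth and predicted occupancy grids."""
    x_pred = np.clip(x_pred, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(x_true * np.log(x_pred) + (1 - x_true) * np.log(1 - x_pred))

def kl_divergence(mu, log_var):
    """KL divergence between N(mu, sigma^2) and the standard normal prior N(0, I)."""
    return -0.5 * np.mean(np.sum(1 + log_var - mu**2 - np.exp(log_var), axis=-1))

def vae_loss(x_true, x_pred, mu, log_var, w_rec=1.0, w_kl=1.0):
    """Weighted VAE objective: reconstruction term plus distribution (KL) term."""
    return w_rec * bce_reconstruction_loss(x_true, x_pred) + w_kl * kl_divergence(mu, log_var)
```

A latent distribution that already matches the prior (zero mean, unit variance) contributes zero KL loss, so during training the reconstruction and distribution terms trade off against each other through the chosen weights.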
The 3D shape encoder maps an observation, represented as a voxel grid, to its latent vector $z$. The normal distribution parameters $\mu(X)$ and $\sigma(X)$ are predicted by the encoder, as is customary for variational autoencoder models. The architecture of the encoder consists of five convolutional layers, separated by batch-normalization [11] and ReLU layers [20]. The 3D shape decoder takes as input a single latent vector $z$ and predicts a voxelized representation of the shape. The decoder architecture mirrors that of the encoder.

3.0.2 Topogram Encoder
Given a generative model with encoder $E$ and decoder $D$, we can learn a topogram image encoder $T$, so that each topogram image $I$ is mapped to a coordinate location $z = T(I)$ such that the reconstructed shape $D(T(I))$ and the ground truth shape $X$ are as close as possible. The image encoder loss is the binary cross entropy (BCE) loss as defined in Equation 1.
The topogram encoder takes a topogram image and outputs a latent shape vector $z$. It consists of five convolutional layers, separated by batch-normalization [11] and rectified linear units (ReLU) [20].

3.0.3 Topogram and Mask Encoder
For each observation, given a topogram $I$ and a mask $M = P(X)$, where $P$ is defined to be an orthographic projection operator, we train the joint topogram and mask encoder $T_M$ that outputs $z = T_M(I, M)$ so that the decoded shape $D(z)$ and the ground truth shape $X$ are as close as possible. The loss of $T_M$ is defined to be the binary cross entropy (BCE) error, as defined in Equation 1. We also enforce an additional mask loss $L_{mask} = \mathrm{BCE}(M, P(D(z)))$ that ensures that the input mask and the projected mask of the predicted shape (i.e. $P(D(z))$) match.
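A minimal NumPy sketch of the orthographic projection operator and the mask loss. The projection axis and function names here are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def orthographic_projection(voxels, axis=1):
    """Project a binary voxel grid to a 2D silhouette by taking the maximum
    along one axis; the axis is assumed to match the topogram viewing direction."""
    return voxels.max(axis=axis)

def mask_loss(mask_true, voxels_pred, axis=1, eps=1e-7):
    """BCE between the annotated 2D mask and the projection of the predicted shape."""
    m = np.clip(orthographic_projection(voxels_pred, axis).astype(float), eps, 1 - eps)
    return -np.mean(mask_true * np.log(m) + (1 - mask_true) * np.log(1 - m))
```

Because the projection is a simple reduction over the voxel grid, the mask loss can supervise the 3D prediction directly from a 2D annotation.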
The topogram and mask encoder consists of a topogram encoder branch, a mask encoder branch, and a common combiner network (see Figure 1), so that the observations are mapped to a common latent coordinate $z$. The topogram encoder branch has the same architecture as the topogram encoder in Section 3.0.2 and maps the topogram to an intermediate feature vector $f_I$. The mask encoder branch receives a binary mask image, which it maps to a feature vector $f_M$ using five convolutional layers, separated by batch-normalizations [11] and rectified linear units (ReLU) [20]. $f_I$ and $f_M$ are then concatenated and run through the combiner network, consisting of a single fully connected layer, to predict a joint latent coordinate $z$.
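The concatenate-and-combine step can be sketched as below, with `W` and `b` standing in for the learned parameters of the single fully connected combiner layer (illustrative only):

```python
import numpy as np

def combine_features(f_topogram, f_mask, W, b):
    """Concatenate the two branch features and apply the fully connected
    combiner layer to predict the joint latent coordinate."""
    f = np.concatenate([f_topogram, f_mask])
    return W @ f + b
```

The combiner's output dimension equals the latent space dimension of the generative shape model, so its prediction can be decoded directly into a 3D shape.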
3.0.4 Combined Training
To train the models, we optimize all the components of the system together in an end-to-end training process using the combined objective

$$L = w_{rec} L_{rec} + w_{dist} L_{dist} + w_{topo} L_{topo} + \lambda L_{mask},$$

where $\lambda > 0$ if training the topogram-mask encoder, and $\lambda = 0$ when training the topogram-only encoder. Note that $L_{rec}$ is the reconstruction loss of the VAE and $L_{topo}$ is the 2D-3D reconstruction loss. It is also possible to train the above model without the shape encoder, i.e. with $w_{rec} = 0$ and $w_{dist} = 0$.
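The switching behaviour of the combined objective can be sketched as follows; the loss and weight names are placeholders for the terms described above, not the paper's code:

```python
def combined_objective(losses, use_mask_encoder, w):
    """Sum the weighted loss terms. The mask projection term is active only
    when training the topogram+mask encoder; otherwise its weight is zero."""
    total = (w['rec'] * losses['vae_reconstruction']
             + w['kl'] * losses['kl']
             + w['img'] * losses['image_reconstruction'])
    if use_mask_encoder:
        total += w['mask'] * losses['mask_projection']
    return total
```

Setting `w['rec']` and `w['kl']` to zero corresponds to the variant mentioned above that trains without the shape encoder.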
4 Experimental Results and Discussion
We perform extensive quantitative and qualitative experiments of our method on the difficult tasks of estimating the 3D shape of the human liver and predicting its volume. Due to the liver's heterogeneous and diffusive shape, automatic liver segmentation is a very complex problem. Using our method, we can accurately estimate the 3D shape of the liver from a 2D topogram image and, optionally, a 2D mask. We use voxel grids as our base representation, and visualize results using 2D projections or 3D meshes obtained using marching cubes [16].
We investigate the effect of shape context provided by the mask observations by evaluating a baseline where the 3D shape is predicted directly from the mask. We also quantitatively compare our method to an adversarial baseline approach [29].
4.1 Dataset
To conduct an experimental evaluation, we collected abdominal CT scans (3D volumetric images of the abdomen covering the liver organ) from several different hospital sites. The liver shapes were segmented using a volumetric segmentation approach [30], and topograms and masks were extracted via 2D projection. Examples from the dataset, as well as the provided annotations, are shown in Figure 2. The scans are split into training and testing sets.
We demonstrate several direct applications of our method: three-dimensional shape reconstruction with corresponding two-dimensional liver delineation through projection, and organ volume prediction.
4.2 Organ Shape Reconstruction from Topograms
Given a learned generative model of liver shapes and an image encoder which estimates a latent space vector given a topogram image (and mask, if given), we predict the 3D liver shape and project it back onto the topogram image plane to perform two-dimensional delineation. Visually delineating accurate shape from topograms is particularly difficult due to visual ambiguities, such as color contrast and fuzzy boundaries. Our method can predict the three-dimensional shape automatically from the topogram, and refine the prediction given a two-dimensional mask annotation.
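The delineation step can be sketched as a maximum projection of the predicted voxels followed by boundary extraction for overlay on the topogram. This sketch assumes SciPy is available and that the projection axis matches the topogram viewing direction:

```python
import numpy as np
from scipy.ndimage import binary_erosion

def delineation_outline(voxels_pred, axis=1):
    """Project the predicted 3D shape to the topogram plane and extract the
    silhouette boundary (mask minus its erosion) for overlay visualization."""
    mask = voxels_pred.max(axis=axis).astype(bool)
    return mask & ~binary_erosion(mask)
```

The resulting one-pixel-wide contour can be drawn over the topogram to show the inferred 2D segmentation.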
4.2.1 Qualitative Evaluation
In Figure 3, we visualize the 3D reconstruction results. The first column shows the input topogram, the second the ground truth 3D shape, the third the result of the topogram-only approach, the fourth the result of the topogram+mask approach, and the fifth and sixth columns show the projected masks of the two approaches, overlaid with the ground truth masks. Each row corresponds to a different example.
Both proposed methods are able to capture significant variation in the observed shapes, such as a prominent dome on the right lobe and the shape of the left lobe. The topogram+mask method is able to convey more topological details than the topogram-only method: an elongated interior tip, a protrusion off the left lobe, and overall topology in the example where the mask-based method corrects the hole artifact introduced by the topogram-only method. Overall, the surfaces in predictions from the mask-based method are visually closer to the ground truth.
We also project the 3D predictions directly onto the input topograms (see Figure 4). This allows us to visualize the corresponding inferred 2D segmentation. The shape reconstruction network (in both the topogram-only and topogram+mask methods) learns to emphasize characteristic parts of the organ shape, such as the curves in the right lobe and the interior tip.
4.2.2 Quantitative Evaluation
Several metrics can be used to quantitatively compare 3D shape reconstructions (see [4] for details). We provide a quantitative evaluation using two popular volume-based metrics (Intersection over Union (IoU) and the Dice coefficient) and a surface-based metric (Hausdorff distance) in Table 1. The topogram+mask approach outperforms the topogram-only approach according to all of the metrics, but especially according to the Hausdorff distance, which is very sensitive to shape variations such as critical cases of incorrectly predicted tips or bulges.
Metric (Mean)  Mask Only  Topogram Only  Topogram + Mask 

IoU  0.58  0.78  0.82 
Dice  0.73  0.87  0.90 
Hausdorff  28.28  7.10  5.00 
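The three metrics in Table 1 can be sketched as follows. The Hausdorff implementation is a brute-force version over sampled surface points, written for clarity rather than the exact evaluation code used in the paper:

```python
import numpy as np

def iou(a, b):
    """Intersection over Union between two binary voxel grids."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union

def dice(a, b):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(a, b).sum()
    return 2 * inter / (a.sum() + b.sum())

def hausdorff(pts_a, pts_b):
    """Symmetric Hausdorff distance between two point sets (surface samples)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

IoU and Dice reward volumetric overlap, while the Hausdorff distance penalizes the single worst surface deviation, which is why it reacts strongly to missing or spurious tips and bulges.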
4.2.3 Shape Context
It is important to investigate whether the provided mask supplies so much context that 3D shape prediction becomes a much easier task. We therefore train a mask-only baseline that learns to reconstruct the 3D shape directly from the mask (no topogram image provided). In Table 1, we compare the performance of this baseline and the two methods that receive the topogram as input. The mask-only method is unable to achieve the same quality of results as the topogram-based methods, producing significantly lower mean IoU and Dice scores and a much larger Hausdorff error. The topogram images contain important information, such as shape layout, that is complementary to the context extracted from masks, and thus both inputs are needed for high quality reconstruction.
4.3 Volume Calculation
Of particular interest in the medical community is the automatic volume measurement of main organs. Our method predicts the 3D shape, which we can directly use to measure organ volume. In Table 2, we evaluate our proposed approaches on the task of volume prediction. We use the volume of the voxelized 3D segmentation of the liver, obtained from segmentation of the 3D CT, as the ground truth. Given the 3D shape prediction, we measure the predicted volume as the number of voxels in the generated shape (which can be converted to milliliters (mL) using scanning configuration parameters). We report the relative volume prediction error $|V_p - V_g| / V_g$, where $V_p$ and $V_g$ are the volumes of the predicted and ground truth organs, respectively.
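A minimal sketch of the volume computation and the relative error metric. The function names are illustrative, and the voxel spacing is assumed to come from the scanner configuration:

```python
import numpy as np

def predicted_volume_ml(voxels, voxel_spacing_mm):
    """Organ volume from a binary voxel grid, given (dx, dy, dz) spacing in mm.
    1 mL = 1000 mm^3."""
    voxel_volume_mm3 = float(np.prod(voxel_spacing_mm))
    return voxels.sum() * voxel_volume_mm3 / 1000.0

def volume_error(v_pred, v_true):
    """Relative volume prediction error |Vp - Vg| / Vg."""
    return abs(v_pred - v_true) / v_true
```

For example, 1000 occupied voxels at 1 mm isotropic spacing correspond to 1 mL.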
On average, we are able to predict liver volume to 0.06 relative error with the topogram+mask method and to 0.10 relative error with the topogram-only method. The mask-only method is unable to predict volume accurately, since it cannot predict the correct 3D topology (see Section 4.2.3).
Metric  Mask Only  Topogram Only  Topogram + Mask 
Volume Error (relative)  0.34  0.10  0.06 
4.4 Comparison to Adversarial Approaches
We also compare our method to an adversarial baseline (3D-VAE-GAN [29]), another commonly used generative modelling approach. We train this baseline with the same architecture and hyperparameters described in [29]. We observe that the discriminator in this baseline typically encourages more uniform predictions than our VAE-based method, thus discouraging generation of more diverse shape topologies. Quantitatively, this method achieves lower quality results than both VAE-based methods (see Table 3), especially in surface-based error and volume error, due to its tendency to predict an average shape irrespective of the input.

Method  Volume Error (relative)  IoU  Dice  Hausdorff 
Ours (Topogram Only / Topogram + Mask)  0.10 / 0.06  0.78 / 0.82  0.87 / 0.90  7.10 / 5.00 
Adversarial (3D-GAN) [29]  0.21  0.61  0.75  10.50 
Performance Difference  109% / 250%  22% / 26%  14% / 17%  48% / 110%
4.5 Conclusion and Future Work
3D organ shape reconstruction from topograms is an extremely challenging problem in medical imaging. Among other challenges, it is difficult because the input X-ray images can contain projection artifacts that reconstruction methods need to handle, in addition to requiring prediction of the topology of occluded and unseen parts of the three-dimensional organ. The core insight of this work is that, despite the visual ambiguities present in this type of imagery, it is possible to predict 3D organ shape directly from topograms. It is also possible to improve the quality of the prediction by providing supplementary two-dimensional shape information in the form of masks.
This work is only a first step towards performing more accurate and reliable 3D organ shape reconstruction. It would be interesting to investigate the performance of our approach on organs other than the liver, such as the lung or heart, and to explore other types of user inputs and annotations that can improve the reconstruction quality. It would also be valuable to study why 2D to 3D mapping is possible, and what types of neural networks (in this work we focused on the VAE) are best suited for modelling the shape space and achieving high reconstruction accuracy. Further, by categorizing the dataset according to data variations, such as fatty liver, tumors, liver disease, age, or gender, one could study how these factors affect prediction accuracy. Finally, it would be important to analyze how X-ray can help improve reconstruction accuracy when 3D scans are available and extracting liver shape can be posed as a 3D shape segmentation problem. We hope this work will inspire other approaches that apply generative 3D modelling techniques to reconstructing and predicting organ shapes.
4.6 Acknowledgements
We thank Daguang Xu for help with anatomical part labelling and discussions; Thomas Funkhouser, Terrence Chen, Kai Ma, and members of the Princeton Graphics and Vision Group for helpful suggestions; and Sungheon Gene Kim, Linda Moy, Krzysztof Geras, and Kyunghyun Cho for discussions on medical applications of the proposed method. This work was supported by Siemens Healthcare and the NSF GRFP.
References
 [1] Albarqouni, S., Fotouhi, J., Navab, N.: X-ray in-depth decomposition: Revealing the latent structures. In: MICCAI. pp. 444–452. Springer (2017)
 [2] Balashova, E., Singh, V., Wang, J., Teixeira, B., Chen, T., Funkhouser, T.: Structure-aware shape synthesis. In: 3DV. pp. 140–149. IEEE (2018)
 [3] Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In: ECCV. pp. 561–578. Springer (2016)
 [4] Christ, P.F., Ettlinger, F., Grün, F., Elshaera, M.E.A., Lipkova, J., Schlecht, S., Ahmaddy, F., Tatavarty, S., Bickel, M., Bilic, P., et al.: Automatic liver and tumor segmentation of CT and MRI volumes using cascaded fully convolutional neural networks. arXiv preprint arXiv:1702.05970 (2017)
 [5] Dou, Q., Chen, H., Jin, Y., Yu, L., Qin, J., Heng, P.A.: 3D deeply supervised network for automatic liver segmentation from CT volumes. In: MICCAI. pp. 149–157. Springer (2016)
 [6] Foruzan, A.H., Chen, Y.W.: Improved segmentation of low-contrast lesions using sigmoid edge model. International Journal of Computer Assisted Radiology and Surgery 11(7), 1267–1283 (2016)
 [7] Gadelha, M., Maji, S., Wang, R.: 3D shape induction from 2D views of multiple objects. In: 3DV. pp. 402–411. IEEE (2017)
 [8] Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: ECCV. pp. 484–499. Springer (2016)
 [9] Häme, Y., Pollari, M.: Semi-automatic liver tumor segmentation with hidden Markov measure field model and non-parametric distribution estimation. Medical image analysis 16(1), 140–149 (2012)
 [10] Heimann, T., Van Ginneken, B., Styner, M.A., Arzhaeva, Y., Aurich, V., Bauer, C., Beck, A., Becker, C., Beichel, R., Bekes, G., et al.: Comparison and evaluation of methods for liver segmentation from CT datasets. IEEE transactions on medical imaging 28(8), 1251–1265 (2009)
 [11] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
 [12] Kar, A., Tulsiani, S., Carreira, J., Malik, J.: Categoryspecific object reconstruction from a single image. In: CVPR. pp. 1966–1974 (2015)
 [13] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [14] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes (2014)
 [15] Li, G., Chen, X., Shi, F., Zhu, W., Tian, J., Xiang, D.: Automatic liver segmentation based on shape constraints and deformable graph cut in CT images. IEEE Transactions on Image Processing 24(12), 5315–5329 (2015)
 [16] Lorensen, W.E., Cline, H.E.: Marching cubes: A high resolution 3D surface construction algorithm. In: ACM SIGGRAPH Computer Graphics. vol. 21, pp. 163–169. ACM (1987)

 [17] Lu, F., Wu, F., Hu, P., Peng, Z., Kong, D.: Automatic 3D liver location and segmentation via convolutional neural network and graph cut. International Journal of Computer Assisted Radiology and Surgery 12(2), 171–182 (2017)
 [18] Mayo-Smith, W.W., Hara, A.K., Mahesh, M., Sahani, D.V., Pavlicek, W.: How I do it: managing radiation dose in CT. Radiology 273(3), 657–672 (2014)

 [19] Mharib, A.M., Ramli, A.R., Mashohor, S., Mahmood, R.B.: Survey on liver CT image segmentation methods. The Artificial Intelligence Review 37(2), 83 (2012)
 [20] Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: ICML. pp. 807–814 (2010)
 [21] Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view CNNs for object classification on 3D data. In: CVPR. pp. 5648–5656 (2016)

 [22] Qin, B., Jin, M., Hao, D., Lv, Y., Liu, Q., Zhu, Y., Ding, S., Zhao, J., Fei, B.: Accurate vessel extraction via tensor completion of background layer in X-ray coronary angiograms. Pattern Recognition 87, 38–54 (2019)
 [23] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
 [24] Schertler, T., Scheffel, H., Frauenfelder, T., Desbiolles, L., Leschka, S., Stolzmann, P., Seifert, B., Flohr, T.G., Marincek, B., Alkadhi, H.: Dual-source computed tomography in patients with acute chest pain: feasibility and image quality. European radiology 17(12), 3179–3188 (2007)
 [25] Sharma, A., Grau, O., Fritz, M.: VConv-DAE: Deep volumetric shape learning without object labels. In: Computer Vision–ECCV 2016 Workshops. pp. 236–250. Springer (2016)
 [26] Sioutos, N., de Coronado, S., Haber, M.W., Hartel, F.W., Shaiu, W.L., Wright, L.W.: NCI thesaurus: a semantic model integrating cancer-related clinical and molecular information. Journal of biomedical informatics 40(1), 30–43 (2007)
 [27] Vicente, S., Carreira, J., Agapito, L., Batista, J.: Reconstructing PASCAL VOC. In: CVPR. pp. 41–48 (2014)
 [28] Wu, J., Xue, T., Lim, J.J., Tian, Y., Tenenbaum, J.B., Torralba, A., Freeman, W.T.: Single image 3D interpreter network. In: ECCV. pp. 365–382. Springer (2016)
 [29] Wu, J., Zhang, C., Xue, T., Freeman, B., Tenenbaum, J.: Learning a probabilistic latent space of object shapes via 3D generativeadversarial modeling. In: Advances in Neural Information Processing Systems. pp. 82–90 (2016)
 [30] Yang, D., Xu, D., Zhou, S.K., Georgescu, B., Chen, M., Grbic, S., Metaxas, D., Comaniciu, D.: Automatic liver segmentation using an adversarial image-to-image network. In: MICCAI. pp. 507–515. Springer (2017)
 [31] Zhang, Y., Miao, S., Mansi, T., Liao, R.: Task driven generative modeling for unsupervised domain adaptation: Application to X-ray image segmentation. In: MICCAI (2018)
 [32] Zhu, Y., Prummer, S., Wang, P., Chen, T., Comaniciu, D., Ostermeier, M.: Dynamic layer separation for coronary DSA and enhancement in fluoroscopic sequences. In: MICCAI. pp. 877–884. Springer (2009)