Deep Efficient End-to-end Reconstruction (DEER) Network for Low-dose Few-view Breast CT from Projection Data

by   Huidong Xie, et al.
Rensselaer Polytechnic Institute

Breast CT provides image volumes with isotropic resolution and high contrast, enabling detection of calcifications (down to a few hundred microns in size) and subtle density differences. Since the breast is sensitive to x-ray radiation, dose reduction for breast CT is an important topic, and low-dose few-view scanning is a main approach to this end. In this article, we propose a Deep Efficient End-to-end Reconstruction (DEER) network for low-dose few-view breast CT. The major merits of our network include high dose efficiency, excellent image quality, and low model complexity. By design, the proposed network can learn the reconstruction process with as few as O(N) parameters, where N is the size of an image to be reconstructed, which represents an orders-of-magnitude improvement relative to state-of-the-art deep-learning-based reconstruction methods that map projection data directly to tomographic images. As a result, our method does not require expensive GPUs to train and run. Also, validated on a cone-beam breast CT dataset prepared by Koning Corporation on a commercial scanner, our method demonstrates competitive performance relative to state-of-the-art reconstruction networks in terms of image quality.





I Significance Statement

Breast CT improves detection and characterization of breast cancer, with the potential to become a primary breast imaging tool. Currently, the average glandular dose of a typical breast CT scanner is between 7 and 13.9 mGy, while the radiation dose threshold set by the FDA is 6 mGy. Our Deep Efficient End-to-end Reconstruction (DEER) network cuts the nominal number of projections (300 views) down to a quarter of that number without compromising image quality by directly mapping sinogram data to CT images, lowering the radiation dose well under the FDA threshold. Also, the DEER network improves the computational complexity by orders of magnitude relative to state-of-the-art networks that map sinogram data to tomographic images.

II Introduction

According to the American Cancer Society, breast cancer remains the second leading cause of cancer death among women in the United States. Approximately 40,000 people die from breast cancer each year smith_improvement_2011. The chance of a woman developing this disease during her lifetime is 1 in 8. The wide use of x-ray mammography, which can detect breast cancer at an early stage, has helped reduce the death rate. Five-year relative survival rates by stage at diagnosis for breast cancer patients are 98% (local stage), 84% (regional stage), and 23% (distant stage) respectively henson_relationship_1991. These data indicate that detection at an early stage plays a crucial role in significantly improving the prognosis of breast cancer patients. Therefore, the development of high-performance breast imaging techniques will directly benefit these patients.

Mammography is a 2D imaging technique without depth information, which severely degrades image contrast. While breast tomosynthesis is a pseudo-3D imaging technique, breast CT provides an image volume of high quality and promises superior diagnostic performance. Indeed, CT is one of the most essential imaging modalities extensively used in clinical practice brenner_computed_2007. Although CT brings overwhelming healthcare benefits, it may potentially increase cancer risk due to the involved ionizing radiation. Since the breast is particularly sensitive to x-ray radiation, dose reduction for breast CT is directly relevant to healthcare. If the effective dose of routine CT examinations is reduced to 1 mSv per scan, the long-term risk of CT scans can be considered negligible. The average mean glandular dose of a typical breast CT scanner ranges between 7 and 13.9 mGy, while the standard radiation dose currently set by the Food and Drug Administration (FDA) is < 6 mGy. This gap demands major research efforts.

In the past years, several deep-learning-based low-dose CT denoising methods were proposed to reduce radiation dose with excellent results shan_3-d_2018; chen_low-dose_2017-1; shan_competitive_2019. In parallel, few-view CT is also a promising approach to reduce the radiation dose, especially for breast CT pacile_clinical_2015 and C-arm CT floridi_c-arm_2014; orth_c-arm_2009. Moreover, few-view CT may be implemented in mechanically stationary scanners in the future avoiding all problems associated with a rotating gantry. Recently, data-driven algorithms have shown a great promise to solve the few-view CT problem.

In this article, we propose a Deep Efficient End-to-end Reconstruction (DEER) network for low-dose few-view breast CT. The major merits of our network include high dose efficiency, excellent image quality, and low model complexity. In the proposed DEER method, a data-point-wise fully-connected layer learns the ray-tracing-type process, requiring as few as O(N) parameters, where N × N denotes the size of a reconstructed image. The complexity of the DEER network is significantly less than that of the prior-art networks, by up to several orders of magnitude. Our experimental results demonstrate that DEER produces competitive performance relative to these state-of-the-art methods.

Few-view CT is a hot topic in the field of tomographic image reconstruction. Because of the requirement imposed by the Nyquist sampling theorem landau_sampling_1967, reconstructing high-quality CT images from under-sampled data is traditionally considered impossible. When sufficient projection data are acquired, analytic methods such as filtered back-projection (FBP) wang_approximate_2007 are widely used for accurate image reconstruction. In the few-view CT circumstance, severe streak artifacts are introduced into analytically reconstructed images due to the incompleteness of projection data. To overcome this issue, various iterative techniques were proposed, which can incorporate prior knowledge into the image reconstruction process. Well-known methods include the algebraic reconstruction technique (ART) gordon_algebraic_1970, the simultaneous algebraic reconstruction technique (SART) andersen_simultaneous_1984, expectation maximization (EM), etc., and they can be enhanced with various penalty terms. Nevertheless, these iterative methods are time-consuming and still fail to produce satisfying results in many challenging cases. Recently, deep learning has become very popular due to the development of neural networks, high-performance computing (such as graphics processing units (GPUs)), and big data science and technology. In particular, deep learning has now become a new frontier of CT reconstruction research wang_perspective_2016; wang_guest_2015; wang_image_2018.
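To make the algebraic family concrete, here is a minimal SART-style iteration on a toy linear system; the system matrix, relaxation factor, and iteration count are illustrative assumptions, not the scanner model used in this work:

```python
import numpy as np

def sart(A, b, n_iter=200, relax=0.5):
    """Simplified SART-style iteration for A x = b.

    Each update back-distributes the residual of every measurement,
    normalized by the row and column sums of the system matrix A.
    """
    m, n = A.shape
    x = np.zeros(n)
    row_sum = A.sum(axis=1)          # per-measurement normalization
    col_sum = A.sum(axis=0)          # per-pixel normalization
    row_sum[row_sum == 0] = 1.0
    col_sum[col_sum == 0] = 1.0
    for _ in range(n_iter):
        residual = (b - A @ x) / row_sum
        x = x + relax * (A.T @ residual) / col_sum
    return x

# Tiny consistent example: 3 "rays" through 2 "pixels".
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([2.0, 3.0])
b = A @ x_true
x_rec = sart(A, b)
```

For a consistent, full-rank system like this, the iteration converges to the true solution; penalty terms (e.g., total variation) would be added inside the loop in a regularized variant.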

In the literature, only a few deep learning methods were proposed for reconstructing images directly from raw data. Zhu et al. zhu_image_2018 use fully-connected layers to learn the mapping from raw k-space data to the corresponding MRI image with O(N^4) parameters, where N × N denotes the size of a reconstructed image. There is no doubt that a similar technique can be implemented to learn the mapping from the projection domain to the image domain for CT or other tomographic imaging modalities, as clearly explained in our perspective article wang_perspective_2016. However, importing the whole sinogram into the network requires a huge amount of memory and represents a major computational challenge when training the network for a full-size CT image/volume on commercial GPUs. Moreover, using fully-connected layers to learn the mapping assumes that every single point in the data domain is related to every single point in the image domain. While this assumption is generally correct, it does not utilize the intrinsic structure of the tomographic imaging process. In the case of CT scanners, x-rays generate line integrals from different angles over a field of view (FOV) for image reconstruction. Therefore, there are various degrees of correlation between projection data within each view and at different orientations. Würfl et al. wurfl_deep_2018 replace the fully-connected layer in the network with a back-projection operator to reduce the computational burden. Even though their method reduces the memory cost by not storing a large matrix on the GPU, the back-projection process is no longer learnable. A recently proposed deep-learning-based CT reconstruction method li_learning_2019, known as the iCT-Net, uses multiple small fully-connected layers and incorporates the viewing-angle information in learning the mapping from sinograms to images. iCT-Net reduces the computational complexity from O(N^4) for the network by Zhu et al. to O(N_d^2), where N_d denotes the number of CT detector elements. In most CT scanners, N_d is equal to or greater than N. The O(N_d^2) complexity is still large for CT reconstruction.

Here we propose a Deep Efficient End-to-end Reconstruction (DEER) network for low-dose few-view breast CT. The major merits of our network include high dose efficiency, excellent image quality, and low model complexity. Computationally, the number of parameters required by DEER is as low as O(N). During the training process, the number of parameters is set to O(N × N_v), where N_v is the number of projection views. In the few-view CT case, N_v is much less than N_d, which compares favorably to the O(N_d^2) complexity of iCT-Net.

The proposed DEER is inspired by the well-known filtered back-projection mechanism and designed to learn a refined filtration and back-projection for data-driven image reconstruction. As a matter of fact, every point in the sinogram domain only relates to the pixels/voxels on a single x-ray path through the FOV. This means that line integrals acquired by different detector elements at a particular angle are not directly related to each other. Also, after an appropriate filtering operation, a filtered projection profile must be smeared back over the FOV. These two ray-oriented processes suggest that the reconstruction process can, to a large degree, be learned in a point-wise manner, which is the main idea behind the DEER network for reducing the memory burden. Moreover, to further alleviate the memory burden, the proposed DEER method learns the reconstruction process separately by splitting the input sinograms into two parts (one contains the values from oddly indexed detector elements and the other contains the values from the evenly indexed ones), reducing the number of required trainable parameters by a factor of 2.

The rest of the paper is organized as follows. In the next section, we present our DEER network. In the third section, we describe the experimental design, training data and reconstruction results. Finally, in the last section we discuss relevant issues and conclude the paper.

III Methodology

III.1 Proposed Framework

CT image reconstruction can be expressed as follows:

\mathbf{x} = \mathcal{T}^{-1}(\mathbf{y}), \qquad \mathbf{y} \in \{\mathbf{y}_f, \mathbf{y}_s\},

where \mathbf{x} is an image of N × N pixels, \mathbf{y} is the sinogram of measured data, the subscripts f and s stand for full-view and sparse-view respectively, and \mathcal{T}^{-1} denotes an inverse transform barrett_iii_1984; barrett_fundamentals_1988 such as FBP in the case of sufficient 2D projection data. Alternatively, CT image reconstruction can also be transformed into the problem of solving a system of linear equations. That is, an iterative solver can implement an inverse transform. Ideally, the FBP method produces satisfying results when sufficient high-quality projection data are available. However, when the number of linear equations is less than the number of unknown pixels/voxels, as in the few-view CT setting, image reconstruction becomes an underdetermined problem, and even an iterative algorithm cannot reconstruct satisfactory images in difficult cases. Recently, deep learning (DL) has provided a novel way to extract features of raw data for image reconstruction. With a deep neural network, training data can be utilized as strong prior knowledge to establish the relationship between a sinogram and the corresponding CT image, efficiently solving this underdetermined problem.

Fig. 1 shows the overall workflow of the proposed DEER network. This network is built in the Wasserstein Generative Adversarial Network (WGAN) framework arjovsky_wasserstein_2017, which is one of the most advanced architectures in the deep learning field. In this study, the proposed framework consists of two components: a generator network G and a discriminator network D. G aims at reconstructing images directly from a batch of few-view sinograms. D receives images from either G or a ground-truth dataset and intends to distinguish whether an input image is real (the ground-truth) or fake (from G). Both networks optimize themselves during the training process. If an optimized D can hardly distinguish fake images from real ones, then we say that the generator G can fool the discriminator D, which is the goal of WGAN. By design, the discriminator network also helps improve the texture of the final image and prevents over-smoothing.

Figure 1: Overall workflow of the proposed DEER network. The numbers below each block indicate the dimensionality of the block. The images are real examples. The display window for the final output is [-200,200] in the Hounsfield Unit for clear visualization of lesions. The blue boxes indicate various networks while the orange boxes are for different objective functions used to optimize the network.

Different from the vanilla generative adversarial network (GAN) goodfellow_generative_2014-1, WGAN replaces the cross-entropy loss function with the Wasserstein distance, improving the training stability. In the original WGAN framework, the 1-Lipschitz condition is enforced with weight clipping. However, it was pointed out gulrajani_improved_2017 that weight clipping may be problematic in WGAN, and it can be replaced with a gradient penalty, which is implemented in our proposed framework. Hence, the objective function of the network D is expressed as follows:

\min_{\theta_D}\; \mathbb{E}_{\mathbf{y}}\big[D(G(\mathbf{y}))\big] - \mathbb{E}_{\mathbf{x}}\big[D(\mathbf{x})\big] + \lambda\, \mathbb{E}_{\bar{\mathbf{x}}}\Big[\big(\lVert \nabla_{\bar{\mathbf{x}}} D(\bar{\mathbf{x}}) \rVert_2 - 1\big)^2\Big],

where \mathbf{y} and \mathbf{x} represent sparse-view sinograms and ground-truth images respectively, \mathbb{E}_{a}[b] denotes the expectation of b as a function of a, \theta_G and \theta_D represent the trainable parameters of networks G and D respectively, \bar{\mathbf{x}} = \epsilon\,\mathbf{x} + (1-\epsilon)\,G(\mathbf{y}), and \epsilon is uniformly sampled from the interval [0,1]. In other words, \bar{\mathbf{x}} represents images interpolated between fake and real images. \nabla_{\bar{\mathbf{x}}} D(\bar{\mathbf{x}}) denotes the gradient of D with respect to \bar{\mathbf{x}}, and \lambda is a parameter used to balance the Wasserstein distance term and the gradient penalty term. As suggested in goodfellow_generative_2014-1; arjovsky_wasserstein_2017; gulrajani_improved_2017, the networks G and D are updated alternately.
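To make the gradient-penalty term concrete, the following standalone numpy sketch evaluates it for a toy critic. The gradient is estimated by finite differences purely for illustration; an actual implementation would rely on automatic differentiation in TensorFlow:

```python
import numpy as np

def gradient_penalty(D, x_real, x_fake, lam=10.0, seed=0):
    """WGAN-GP penalty: lam * (||grad D(x_bar)||_2 - 1)^2,
    where x_bar is a random interpolation between real and fake samples.
    The gradient is estimated by central finite differences here."""
    rng = np.random.default_rng(seed)
    eps = rng.uniform()                      # epsilon ~ U[0, 1]
    x_bar = eps * x_real + (1.0 - eps) * x_fake
    h = 1e-5
    grad = np.zeros_like(x_bar, dtype=float)
    for i in range(x_bar.size):
        d = np.zeros_like(x_bar, dtype=float)
        d.flat[i] = h
        grad.flat[i] = (D(x_bar + d) - D(x_bar - d)) / (2 * h)
    norm = np.linalg.norm(grad)
    return lam * (norm - 1.0) ** 2

# For a linear critic D(x) = w . x the gradient is w everywhere,
# so the penalty depends only on ||w||, not on the interpolation point.
w = np.array([0.6, 0.8])                     # ||w|| = 1 -> near-zero penalty
D = lambda x: float(w @ x.ravel())
p = gradient_penalty(D, np.array([1.0, 2.0]), np.array([0.0, 0.0]))
```

The penalty is zero exactly when the critic's gradient norm is 1, which is how the 1-Lipschitz condition is softly enforced.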

III.2 Generator Network

The overall structure of the generator network G is illustrated in Fig. 4. The input to DEER is a batch of few-view fan-beam sinograms. Each fan-beam sinogram is first re-binned to a parallel-beam sinogram using linear interpolation. Then, the ramp filter is applied to the processed sinogram. According to the Fourier slice theorem bracewell_projection-slice_2003, low-frequency information is sampled at a denser rate than high-frequency information. If back-projection is performed directly, the reconstructed images will be blurry. The ramp filter is usually used to filter the sinogram and address this blurring issue. In DEER, ramp filtering is performed in the Fourier domain via multiplication to reduce training time. Then, the filtered sinogram data are passed into the generator network G, which learns a network-based back-projection and outputs a reconstructed image. This network can be divided into two components: back-projection and refinement. First, the filtered sinogram data are passed into the back-projection part of the network, which aims at reconstructing an image from projection data. The input sinogram data are actually separated into two parts: one contains the values from oddly indexed detector elements and the other contains the values from the evenly indexed ones. By doing so, the number of required trainable parameters is reduced by a factor of 2. As illustrated in Fig. 2, the reconstruction algorithm is inspired by the following intuition: every point in the sinogram only relates to pixel values on the associated x-ray path through the underlying image, and other pixels contribute little to it. With this intuition, the reconstruction process is learned in a point-wise manner using a point-wise fully-connected layer, and DEER can truly learn the back-projection process with as few as O(N) parameters, thereby reducing the memory overhead. Put differently, for a sinogram with N_v views, only N_v distinct small fully-connected layers are needed in the proposed network. The input to each of these small fully-connected layers is a single point in the sinogram domain, and the output is a line-specific vector of N elements. After this point-wise fully-connected layer, rotation and summation are applied to simulate the FBP method, putting all the learned lines back to where they should be. Bilinear interpolation gribbon_novel_2004 is used to keep the rotated images on a Cartesian grid. This network design allows the neural network to learn the reconstruction process with only O(N) parameters if all the small fully-connected layers share the same weights. However, due to the complexity of medical images and incomplete projection data, O(N) parameters are not sufficient to produce high-quality images. Therefore, we increase the number of parameters to O(N × N_v) for this point-wise learning network. That is, we use different sets of parameters for different angles to compensate for artifacts from bilinear interpolation and other factors. Moreover, the number of bias terms in the point-wise fully-connected layer is the same as the number of weights, to learn fine details in medical images. It should be noted that by learning in this point-wise manner, every single point in the sinogram becomes a training sample. After the reconstruction part, two images are obtained: one generated from the sinogram values of oddly indexed detector elements and the other from those of evenly indexed elements. The two images are concatenated together as the input to the refinement part of the network.
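The smear-rotate-sum intuition behind the back-projection part can be sketched in plain numpy as follows; the analytic smearing, image size, and scaling factor are illustrative stand-ins for what the point-wise fully-connected layers actually learn:

```python
import numpy as np

def rotate_bilinear(img, theta):
    """Rotate a square image by theta (radians) about its center,
    using bilinear interpolation on the inverse-mapped coordinates."""
    n = img.shape[0]
    c = (n - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # Inverse mapping: sample the source image at back-rotated coordinates.
    x = (xs - c) * np.cos(theta) + (ys - c) * np.sin(theta) + c
    y = -(xs - c) * np.sin(theta) + (ys - c) * np.cos(theta) + c
    x0 = np.clip(np.floor(x).astype(int), 0, n - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, n - 2)
    fx = np.clip(x - x0, 0.0, 1.0)
    fy = np.clip(y - y0, 0.0, 1.0)
    out = (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x0 + 1] * fx * (1 - fy)
           + img[y0 + 1, x0] * (1 - fx) * fy + img[y0 + 1, x0 + 1] * fx * fy)
    eps = 1e-9  # tolerate rounding at the image border
    inside = (x >= -eps) & (x <= n - 1 + eps) & (y >= -eps) & (y <= n - 1 + eps)
    return out * inside

def backproject(filtered_sino, thetas):
    """Smear each filtered view along its ray direction (here: analytically,
    by tiling), rotate it to its viewing angle, and sum over all views."""
    n = filtered_sino.shape[1]
    recon = np.zeros((n, n))
    for p, theta in zip(filtered_sino, thetas):
        smear = np.tile(p, (n, 1))   # constant along the ray direction
        recon += rotate_bilinear(smear, theta)
    return recon * np.pi / (2 * len(thetas))
```

In DEER, the tiling step is replaced by learned per-view fully-connected layers (one measured point in, one N-element line out), while the rotate-and-sum step with bilinear interpolation is kept as in this sketch.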

Inspired by the ResNeXt structure proposed by Xie et al. xie_aggregated_2016, the cardinality of the back-projection part was experimentally adjusted to 8 for improved performance, which means that multiple mappings are learned from the projection data and then added together. In the DEER network, the cardinality can be understood as the number of branches. As demonstrated in xie_aggregated_2016, increasing the cardinality of a network is more effective than increasing its depth or width when we want to enhance the network capacity. The ResNeXt structure also outperforms the well-known ResNet structure he_deep_2015.

Figure 2: Back-projection part of the network G, where the orange line denotes the evenly indexed detector elements containing a total of 512 measured values. The blue, yellow, and red curves indicate the trainable weights of the network; the same color denotes the same weight. The black dots are measured projection data, while the blue dots indicate the learned pixel values.

Images reconstructed in the back-projection part are fed into the refinement portion of the network G. Although the proposed filtration and back-projection parts do learn a refined FBP method, streak artifacts cannot be perfectly eliminated. The refinement part of G is used to remove the remaining artifacts. It is a typical U-net ronneberger_u-net:_2015 with conveying paths, built with the ResNeXt structure. U-net was originally designed for image segmentation and has been utilized in various medical imaging applications. For example, shan_3-d_2018; chen_low-dose_2017 used U-net with conveying paths for CT image denoising, jin_deep_2017; lee_deep-neural-network-based_2019 for few-view CT, and quan_compressed_2018 for sparse-data MRI donoho_compressed_2006. The conveying paths copy early feature maps and reuse them as part of the input to later layers. Concatenation is used to combine early and later feature maps along the channel direction. The network can therefore preserve high-resolution features. Each layer in the proposed U-net is followed by a rectified linear unit (ReLU). Kernels of the same size are used in both the convolutional and transpose-convolutional layers of the encoder-decoder network. A convolutional layer and a ResNeXt block are used to up-sample the resultant feature maps to the original image resolution. A stride of 2 is used for the down-sampling and up-sampling layers, and a stride of 1 is used for all other layers. To maintain the tensor size, zero-padding is used. The proposed ResNeXt block in the refinement part is illustrated in Fig. 3.


Figure 3: ResNeXt block used in the network G, where the orange arrows indicate a convolutional operation followed by a ReLU activation. The numbers below each block indicate its dimensionality.

III.3 Discriminator Network

The discriminator network D takes an image from either G or the ground-truth dataset, trying to distinguish whether the input is real or fake. The discriminator network has 6 convolutional layers with 64, 64, 128, 128, 256, and 256 filters respectively, followed by 2 fully-connected layers with 1024 and 1 neurons respectively. The leaky ReLU activation function is applied after each layer with a slope of 0.2 in the negative part. Convolution operations are performed with zero-padding in all convolutional layers. The stride equals 1 for odd layers and 2 for even layers.

Figure 4: Proposed network G, where the numbers below each block indicate its dimensionality. The images are real examples. The display window for the final image is [-300, 300] in the Hounsfield unit. Intermediate results are not normalized. Note that in the refinement part of the network, there is a ResNeXt block (gray boxes) after each convolutional operation. The green box presents the legends. For simplicity, the figure only shows the case where the cardinality is equal to 1.

III.4 Objective Functions for the Generator Network

The objective function used for optimizing the generator G involves the mean square error (MSE) chen_low-dose_2017; wolterink_generative_2017, an adversarial loss wu_cascaded_2017; yang_low-dose_2018, the structural similarity index (SSIM) zhou_wang_image_2004; you_structurally-sensitive_2018, and a perceptual loss johnson_perceptual_2016; yang_low-dose_2018. MSE is one of the most popular choices for many applications and can effectively suppress background noise wang_mean_2009, but it may result in over-smoothed images zhao_loss_2017. Moreover, MSE is not sensitive to image texture since it assumes white Gaussian background noise that is independent of image features zhou_wang_image_2004. The MSE loss is expressed as follows:

\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N_b W H} \sum_{i=1}^{N_b} \left\lVert \mathbf{x}^{(i)} - G(\mathbf{y}^{(i)}) \right\rVert_2^2,

where N_b, W, and H denote the batch size and the width and height of the involved images, and \mathbf{x} and G(\mathbf{y}) represent the ground-truth image and the image reconstructed by the network respectively. To compensate for the disadvantages of MSE and acquire better images, SSIM is introduced into the objective function. SSIM measures the structural similarity between two images over a sliding convolution window. The SSIM formula is expressed as follows:


\mathrm{SSIM}\big(\mathbf{x}, G(\mathbf{y})\big) = \frac{\big(2\mu_{\mathbf{x}}\mu_{G(\mathbf{y})} + c_1\big)\big(2\sigma_{\mathbf{x}G(\mathbf{y})} + c_2\big)}{\big(\mu_{\mathbf{x}}^2 + \mu_{G(\mathbf{y})}^2 + c_1\big)\big(\sigma_{\mathbf{x}}^2 + \sigma_{G(\mathbf{y})}^2 + c_2\big)},

where c_1 and c_2 are constants to stabilize the ratios when the denominators are too small, L stands for the dynamic range of pixel values, and \mu_{\mathbf{x}}, \mu_{G(\mathbf{y})}, \sigma_{\mathbf{x}}, \sigma_{G(\mathbf{y})}, and \sigma_{\mathbf{x}G(\mathbf{y})} are the means of \mathbf{x} and G(\mathbf{y}), the standard deviations of \mathbf{x} and G(\mathbf{y}), and the correlation between \mathbf{x} and G(\mathbf{y}) respectively. Then, the structural loss becomes the following:

\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}\big(\mathbf{x}, G(\mathbf{y})\big).
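A single-window version of this structural loss can be sketched as follows; the paper computes SSIM over a sliding convolution window, and the global window plus the common default constants k_1 = 0.01 and k_2 = 0.03 are simplifying assumptions here:

```python
import numpy as np

def ssim_global(x, y, L=1.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two images with dynamic range L.
    A real implementation would average SSIM over a sliding window;
    one global window is used to keep the sketch short."""
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_loss(x, y):
    """Structural loss: 1 - SSIM, minimized when the images match."""
    return 1.0 - ssim_global(x, y)
```

By construction the loss vanishes for identical images and grows as the local means, variances, or correlation diverge.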


The adversarial loss helps the generator network produce faithful images that are indistinguishable by the discriminator network. In reference to Eq. (2), the adversarial loss is expressed as follows:

\mathcal{L}_{\mathrm{adv}} = -\,\mathbb{E}_{\mathbf{y}}\big[D(G(\mathbf{y}))\big].


Finally, the perceptual loss is introduced as part of the objective function to preserve high-level features. It computes the distance between \phi(\mathbf{x}) and \phi(G(\mathbf{y})) in a high-level feature space defined by a feature-extraction function \phi, driving the network to generate images that have the visually desirable features of interest to radiologists. Following the ideas described in yang_low-dose_2018; sajjadi_enhancenet:_2017; dosovitskiy_generating_2016; johnson_perceptual_2016, the well-known VGG-19 network simonyan_very_2015 is chosen as the feature-extraction function \phi. The VGG-19 network contains 16 convolutional layers followed by 3 fully-connected layers. The output from the last convolutional layer is treated as the features used to compute the perceptual loss. Mathematically, the perceptual loss is formulated as follows:

\mathcal{L}_{\mathrm{VGG}} = \frac{1}{N_b} \sum_{i=1}^{N_b} \left\lVert \phi(\mathbf{x}^{(i)}) - \phi\big(G(\mathbf{y}^{(i)})\big) \right\rVert_2^2.


The overall objective function of G is then summarized as follows:

\mathcal{L}_{G} = \mathcal{L}_{\mathrm{adv}} + \lambda_1 \mathcal{L}_{\mathrm{MSE}} + \lambda_2 \mathcal{L}_{\mathrm{SSIM}} + \lambda_3 \mathcal{L}_{\mathrm{VGG}},

where \lambda_1, \lambda_2, and \lambda_3 are hyper-parameters to balance the different loss terms.
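The weighted combination can be sketched as follows, with placeholder λ weights and a toy single-window SSIM standing in for the windowed version; the feature extractor φ is passed in as precomputed feature maps rather than an actual VGG-19:

```python
import numpy as np

def _ssim(x, y, c1=1e-4, c2=9e-4):
    # Toy single-window SSIM (the paper uses a sliding window).
    mu_x, mu_y = x.mean(), y.mean()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (x.var() + y.var() + c2))

def generator_objective(x, g_of_y, d_score_fake, vgg_feats,
                        lam1=1.0, lam2=1.0, lam3=1.0):
    """Weighted sum of the four generator loss terms.
    d_score_fake is the critic output D(G(y)); vgg_feats is a pair
    (phi(x), phi(G(y))) of precomputed feature maps. The lambda
    weights are placeholders, not the tuned values from the paper."""
    l_adv = -np.mean(d_score_fake)                 # adversarial term
    l_mse = np.mean((x - g_of_y) ** 2)             # pixel-wise MSE term
    l_ssim = 1.0 - _ssim(x, g_of_y)                # structural term
    fx, fg = vgg_feats
    l_vgg = np.mean((fx - fg) ** 2)                # perceptual term
    return l_adv + lam1 * l_mse + lam2 * l_ssim + lam3 * l_vgg
```

For a perfect reconstruction the MSE, SSIM, and perceptual terms vanish and only the adversarial term remains.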

IV Experiments and Results

IV.1 Datasets and Data Pre-processing

A clinical female breast dataset was used to train and evaluate the performance of the proposed DEER method. The dataset was generated and prepared by Koning Corporation, and the data were acquired on a state-of-the-art breast CT scanner produced by Koning Corporation. In total, 18,378 CT images were acquired from 42 patients. All the images were reconstructed from 300 projections acquired at 42 peak kilovoltage (kVp) and were used as the ground-truth images to train the proposed network. The distance between the x-ray source and the patient is 650 millimeters, while the distance between the patient and the detector array is 273 millimeters. All the images are of the same matrix size. In total, 30 patients were randomly selected for training (14,028 images), and the remaining 10 patients (4,350 images) were used for testing/validation. For the patient data, fan-beam sinograms were generated through the fan-beam Radon transform kak_principles_2002 under the described acquisition conditions. Then, the fan-beam sinograms were converted to parallel-beam sinograms via linear interpolation. The interpolated sinograms were used as the input to the proposed DEER network.
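The fan-to-parallel rebinning step can be sketched as follows, using the standard relations θ = β + γ and t = R sin γ between fan-beam and parallel-beam ray coordinates; the geometry values in the example are illustrative, not the Koning scanner parameters:

```python
import numpy as np

def bilinear_sample(grid, xs, ys, x, y):
    """Bilinearly sample grid (indexed [x, y]) at the scalar point (x, y),
    where xs and ys are the increasing sample coordinates of the grid."""
    ix = int(np.clip(np.searchsorted(xs, x) - 1, 0, len(xs) - 2))
    iy = int(np.clip(np.searchsorted(ys, y) - 1, 0, len(ys) - 2))
    fx = float(np.clip((x - xs[ix]) / (xs[ix + 1] - xs[ix]), 0.0, 1.0))
    fy = float(np.clip((y - ys[iy]) / (ys[iy + 1] - ys[iy]), 0.0, 1.0))
    return (grid[ix, iy] * (1 - fx) * (1 - fy) + grid[ix + 1, iy] * fx * (1 - fy)
            + grid[ix, iy + 1] * (1 - fx) * fy + grid[ix + 1, iy + 1] * fx * fy)

def rebin_fan_to_parallel(fan, betas, gammas, thetas, ts, R):
    """Rebin a fan-beam sinogram fan[beta, gamma] onto a parallel-beam grid.
    The fan ray (beta, gamma) coincides with the parallel ray (theta, t)
    where theta = beta + gamma and t = R * sin(gamma)."""
    out = np.zeros((len(thetas), len(ts)))
    for i, theta in enumerate(thetas):
        for j, t in enumerate(ts):
            gamma = np.arcsin(np.clip(t / R, -1.0, 1.0))
            beta = theta - gamma
            out[i, j] = bilinear_sample(fan, betas, gammas, beta, gamma)
    return out
```

For a sinogram that is constant, the rebinned result stays constant, which is a quick sanity check on the interpolation weights.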

IV.2 Experimental Design

For hyper-parameter selection, the hyper-parameter \lambda used to balance the Wasserstein distance and the gradient penalty was set to 10, as suggested in the original paper gulrajani_improved_2017, while \lambda_1, \lambda_2, and \lambda_3 were experimentally adjusted on the validation dataset. A batch size of 3 was used for training. All code was implemented on the TensorFlow platform abadi_tensorflow:_nodate and run on an NVIDIA Titan RTX GPU. The network parameters were optimized with the Adam optimizer kingma_adam:_2015.

For comparison, we evaluated DEER against three state-of-the-art deep learning methods: FBPConvNet jin_deep_2017, DEAR-2D xie_deep_2019, and residual-CNN cong_deep-learning-based_2019. All three are image-domain methods, which take analytic FBP images as the input and attempt to remove few-view artifacts through convolutional layers. To the best of our knowledge, the network settings we used were the same as those described in the original publications. The dataset used to train all the networks was the same in this study, and all the patient images were preprocessed in the same way as described in the original papers. Since iCT-Net li_learning_2019 could not be trained on full-size images on our hardware, we did not compare with iCT-Net in this study. However, we did compare with iCT-Net in our previous work, where the image size was smaller xie_dual_2019; both networks were trained on the same images for a fair comparison, with the iCT-Net settings matching the defaults described in the original paper, and the conclusion favored our proposed network xie_dual_2019. We also compared the proposed method with the classic iterative method SART-TV lu_few-view_2012.

IV.3 Comparison with Other Deep-learning and Iterative Methods

To visualize the performance of the different methods, a few representative slices were selected from the testing dataset. Fig. 5 shows results reconstructed using different methods from 75-view projections. For better evaluation of the image quality, the regions of interest (ROIs) marked by the blue boxes in Fig. 5 are magnified in Figs. 6, 7 and 8. Four metrics, peak signal-to-noise ratio (PSNR) korhonen_peak_2012, SSIM zhou_wang_image_2004; you_structurally-sensitive_2018, root mean square error (RMSE) willmott_advantages_2005, and perceptual loss (PL) johnson_perceptual_2016; yang_low-dose_2018, were computed for quantitative assessment. The quantitative results are shown in Table 1.
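The RMSE and PSNR metrics can be computed as follows; the `data_range` argument is the assumed dynamic range of the normalized images:

```python
import numpy as np

def rmse(x, y):
    """Root mean square error between two images."""
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for the given dynamic range."""
    e = rmse(x, y)
    return float("inf") if e == 0 else 20.0 * np.log10(data_range / e)
```

Lower RMSE and PL, and higher PSNR and SSIM, indicate better agreement with the ground-truth.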

Figure 5: Representative slices reconstructed using different methods for the Koning dataset. (a) Ground-truth, (b) FBP, (c) SART-TV, (d) FBPConvNet, (e) residual-CNN, (f) DEAR-2D, (g) DEER. The blue boxes mark the regions of interest (ROIs). The display window is [-200, 200] HU for better visualization of breast details.
Figure 6: Zoomed-in ROIs (The blue boxes in the first row of Fig. 5). (a) Ground-truth, (b) FBP, (c) SART-TV, (d) FBPConvNet, (e) residual-CNN, (f) DEAR-2d, (g) DEER. The display window is [-200, 200] HU for better visualizing breast details. The blue and orange arrows indicate some subtle details.
Figure 7: Zoomed-in ROIs (The blue boxes in the second row of Fig. 5). (a) Ground-truth, (b) FBP, (c) SART-TV, (d) FBPConvNet, (e) residual-CNN, (f) DEAR-2d, (g) DEER. The display window is [-200, 200] HU for better visualizing breast details. The blue and orange arrows indicate some subtle details.
Figure 8: Zoomed-in ROIs (The blue boxes in the third row of Fig. 5). (a) Ground-truth, (b) FBP, (c) SART-TV, (d) FBPConvNet, (e) residual-CNN, (f) DEAR-2d, (g) DEER. The display window is [-200, 200] HU for better visualizing breast details. The blue and orange arrows indicate some subtle details.

The ground-truth images and the corresponding few-view FBP images are presented in Fig. 5a and Fig. 5b respectively. Streak artifacts are clearly visible in the FBP images. Figs. 5(d), (e), and (f) present results reconstructed using FBPConvNet, residual-CNN, and DEAR-2D respectively. These image-domain methods can effectively suppress streak artifacts, but they can potentially miss or smooth out subtle details that are crucial for diagnosis (as pointed out by the arrows in Fig. 6). Moreover, the image-domain methods may distort some subtle details that are correctly reconstructed using the FBP method (as pointed out by the orange arrows in Fig. 7). It is worth noting that FBPConvNet introduces artifacts that do not exist in the FBP image (the blue arrows in the second row of Fig. 7). For details that are almost indistinguishable in the few-view image (as pointed out by the orange arrows in Fig. 8), these image-domain methods are unable to recover them through convolutional operations. However, since these subtle image features are still embedded in the projection data, the proposed DEER can recover them, which can be important for clinical practice. Compared with the other deep learning methods, the proposed DEER demonstrates competitive performance in removing artifacts and preserving subtle but vital details. Thanks to the use of WGAN and the perceptual loss, the proposed DEER is also better at preserving image texture than the other methods. Lastly, SART-TV reconstruction from few-view fan-beam projection data is time-consuming and fails to produce high-quality results; the images reconstructed using SART-TV tend to over-smooth details.

Table 1: Quantitative assessments (averaged over the testing set) of FBP, FBPConvNet, residual-CNN, DEAR-2d, and DEER. For each metric, the best result is marked in bold.

iv.4 Learning with O(N) parameters

By learning the reconstruction process in a point-wise manner, the proposed method demonstrates an FBP-like behavior and can be applied to various numbers of views, especially few views. An experiment was performed to demonstrate this point. A network was built using O(N) parameters (1,024 trainable parameters in this case); it is identical to the backprojection part described above and was trained with the same objective function. WGAN was not used in this network in order to speed up training. This network is denoted as FBPNet. Hyperparameters for FBPNet were adjusted accordingly. The backprojected feature maps need to be scaled in inverse proportion to the number of views in the testing sinogram and then passed through a ReLU activation. This correction ensures that the reconstructed images are as bright as the ground-truth, and the ReLU activation ensures that the reconstructed images contain only positive values (the training phantom image was normalized to [0, 1] for training). The training set of this network contained only one sinogram, acquired from a computer-generated phantom image shown in Fig. 9(j). The sinogram contains 75 projections acquired at angles equally distributed over a half rotation. To demonstrate that the learned network does not overfit this single sinogram, we tested it on the Koning breast dataset. Please note that instead of using different sets of parameters for different angles, this network shares the same parameters across all angles, which allows it to generalize to any number of views. We also tested the learned network on sinograms acquired under different conditions, including few-view, full-view, and limited-angle acquisitions. The training time was within 10 minutes on an NVIDIA Titan RTX GPU. Results are shown in Fig. 9, and the corresponding quantitative assessments are given in Table 2.
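The view-count correction described above can be sketched as follows. This is a minimal NumPy illustration of an unfiltered, shared-parameter parallel-beam backprojection with the brightness rescaling and ReLU, not the authors' implementation; the function name, the nearest-neighbor interpolation, and the exact form of the scaling factor are assumptions.

```python
import numpy as np

def brightness_corrected_backprojection(sinogram, angles, img_size, n_train_views=75):
    """Naive (unfiltered) parallel-beam backprojection with a view-count
    brightness correction, mimicking how FBPNet rescales its backprojected
    feature maps when applied to a different number of views than it was
    trained on (75 views in the paper's experiment)."""
    n_views = len(angles)
    recon = np.zeros((img_size, img_size))
    center = (img_size - 1) / 2.0
    ys, xs = np.mgrid[0:img_size, 0:img_size]
    xs, ys = xs - center, ys - center
    for i, theta in enumerate(angles):
        # detector coordinate of every pixel for the view at angle theta
        t = xs * np.cos(theta) + ys * np.sin(theta)
        idx = np.clip(np.round(t + center).astype(int), 0, sinogram.shape[1] - 1)
        recon += sinogram[i, idx]  # smear the view back along its ray paths
    # scale inversely with the number of views so brightness matches training
    recon *= n_train_views / n_views
    return np.maximum(recon, 0.0)  # ReLU keeps the image non-negative

# A constant sinogram yields the same brightness for 30 or 75 views:
r30 = brightness_corrected_backprojection(
    np.ones((30, 64)), np.linspace(0, np.pi, 30, endpoint=False), 64)
r75 = brightness_corrected_backprojection(
    np.ones((75, 64)), np.linspace(0, np.pi, 75, endpoint=False), 64)
```

With the correction, both reconstructions equal 75 everywhere, illustrating why a network trained on 75 views can broadcast to other view counts.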
It is worth mentioning that the proposed FBPNet consistently outperforms FBP in terms of the structural measurement (SSIM). On the other hand, FBP can be better than FBPNet in terms of pixel-wise measurements (PSNR and RMSE). Moreover, the proposed FBPNet was trained with only one 75-view sinogram and was then applied to other acquisition conditions; its performance would improve if it were trained on more data acquired under the same conditions as the testing data.
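For reference, the pixel-wise measures used here follow their standard definitions. The sketch below is not the authors' evaluation code, and the function names are ours.

```python
import numpy as np

def rmse(ref, img):
    """Root-mean-square error between a reference and a test image."""
    return float(np.sqrt(np.mean((ref - img) ** 2)))

def psnr(ref, img, data_range=1.0):
    """Peak signal-to-noise ratio in dB; `data_range` is the maximum
    possible intensity (e.g., 1.0 for images normalized to [0, 1])."""
    e = rmse(ref, img)
    return float("inf") if e == 0 else 20 * np.log10(data_range / e)
```

SSIM, the structural measurement, is more involved (local means, variances, and covariances); `skimage.metrics.structural_similarity` provides a standard implementation.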

Figure 9: Results reconstructed by FBPNet and by FBP under various acquisition conditions, together with the phantom image used to train FBPNet. (a) Ground-truth (300 views), (b) FBP (75 views), (c) FBP (30 views), (d) FBP (30 limited-angle views), (e) DEER (75 views), (f) FBPNet (300 views), (g) FBPNet (75 views), (h) FBPNet (30 views), (i) FBPNet (30 limited-angle views), (j) phantom image used to train FBPNet. The display window is [-200, 200] HU for better visualizing breast details. The blue and orange arrows point to the same subtle details as indicated in Fig. 7.
FBP (300 views) | FBP (75 views) | FBP (30 views) | FBP (30 limited-angle views)
FBPNet (300 views) | FBPNet (75 views) | FBPNet (30 views) | FBPNet (30 limited-angle views)
Table 2: Quantitative assessments (PSNR, SSIM, and RMSE) of FBPNet and FBP. Parallel-beam FBP was implemented for the comparisons in this table. The measurements were obtained by averaging over the testing set.

V Discussions and Conclusion

In the future, few-view CT may be implemented on a mechanically stationary scanner cramer_stationary_2018 for broader healthcare applications. Current commercial CT scanners use one or two x-ray sources mounted on a rotating gantry and take hundreds of projections at different angles around the patient. The rotating mechanism is not only massive but also power-consuming due to the angular momentum involved in the rotation. As a result, current commercial CT scanners are largely inaccessible outside hospitals and clinics. Designing a module with multiple miniature x-ray sources is one approach to resolving this issue cramer_stationary_2018, and few-view CT then becomes a very attractive option.

This paper has introduced a novel approach for reconstructing CT images directly from under-sampled projection data, referred to as the Deep Efficient End-to-end Reconstruction (DEER) network. This approach features (1) the Wasserstein GAN framework for optimizing network parameters, (2) a convolutional encoder-decoder network with conveying paths, allowing the network to reuse earlier feature maps and preserve early high-resolution features, (3) a powerful ResNeXt-like architecture for improving performance while reducing the number of parameters, and (4) an experimentally optimized objective function. In addition to being conceptually simple and practically effective, DEER applies an end-to-end strategy to learn the mapping from the sinogram domain to the image domain at a significantly lower computational cost than prior art. Zhu et al. zhu_image_2018 published the first method for learning a network-based reconstruction algorithm for medical imaging. They used several fully-connected layers to learn the mapping between MRI raw data and the underlying image directly, but this approach poses a severe challenge for reconstructing full-size images due to the extremely large matrix multiplications in the fully-connected layers. Additionally, even if an improved version of this technique were applied to CT images, using fully-connected layers to learn the mapping does not exploit the structure of the information embedded in the sinograms, resulting in redundant network parameters. iCT-Net li_learning_2019 utilizes the angular information and thereby reduces the number of parameters relative to the O(N^4) required by a fully-connected mapping, where N is the image side length. Instead of feeding the whole sinogram directly into the network, iCT-Net uses fully-connected layers, each of which takes a single projection and reconstructs a corresponding intermediate image component. However, iCT-Net still requires an expensive professional GPU to train.
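As a back-of-the-envelope illustration of this scaling argument (assuming N denotes the image side length, and ignoring biases, channels, and the number of views):

```python
def fc_params(n):
    """A single fully-connected layer mapping an n*n sinogram to an n*n
    image stores one weight per (input, output) pair: (n^2)^2 = n^4."""
    return (n * n) ** 2

def pointwise_params(n):
    """If every sinogram point only influences the ~n pixels along its
    ray, and all views share the same weights, only O(n) parameters
    are needed for the backprojection step."""
    return n
```

For a 512-pixel-wide image, the fully-connected mapping needs 512^4 (about 6.9e10) weights, while the shared point-wise mapping needs on the order of 512, which is why the former demands professional GPUs and the latter does not.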
During a CT scan, x-rays are used to measure line integrals at different angles around the patient, and images are conventionally reconstructed with the analytic FBP method, which filters the projections and smears the filtered results back along the x-ray paths. Using FBP to reconstruct images from under-sampled data results in intolerable artifacts. The intuition behind DEER is to learn a better filtering and backprojection process with deep-learning techniques. DEER takes full advantage of the information embedded in the sinograms by utilizing the angular information, similar to what iCT-Net does, while also assuming that every point in the sinogram only contributes to the pixels along the associated x-ray path. With this intuition, DEER can be trained on a consumer-grade GPU with orders of magnitude fewer parameters than a fully-connected mapping. Moreover, instead of learning only one mapping from the sinogram, DEER allows the network to learn multiple mappings, gathering as much information as possible and thereby improving the imaging performance. Furthermore, with this design the reconstruction process can be learned with as few as O(N) parameters if all the angles share the same training parameters during the backprojection procedure.
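The "learned filtering" half of this idea can be illustrated with a single 1-D kernel shared across all views. In DEER the filter would be learned rather than fixed; the helper below is only a hypothetical sketch with a hand-picked smoothing kernel.

```python
import numpy as np

def filtered_views(sinogram, kernel):
    """Apply a shared 1-D filter to every projection (row) of the sinogram
    before backprojection. In a learned reconstruction, `kernel` would be
    a trainable parameter vector; here it is any fixed 1-D array."""
    return np.stack([np.convolve(view, kernel, mode="same") for view in sinogram])

# Example: a crude 3-tap kernel applied to a toy 2-view, 5-bin sinogram.
out = filtered_views(np.ones((2, 5)), np.array([0.5, 1.0, 0.5]))
```

Replacing the fixed ramp filter of FBP with such a trainable kernel (and a trainable backprojection weighting) is the sense in which DEER "learns a better filtering and backprojection process."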

The method presented in this paper opens several possibilities for future research. For example, (1) the input to the network could be a noisy sinogram so that, instead of denoising images after reconstruction, DEER performs the whole low-dose reconstruction procedure; and (2) DEER could be applied to the interior tomography problem wang_meaning_2013 by setting the length of the backprojection segment equal to the pre-determined field of view (FOV).

DEER can be readily extended to 3D cone-beam CT image reconstruction given sufficient computational power. As demonstrated in our previous work xie_deep_2019, multiple adjacent slices can be used as the input to the neural network, allowing it to capture spatial context and therefore produce better results. Since artifacts in medical images exist in 3D space, it is reasonable to learn a denoising or de-artifacting model with 3D rather than 2D input.
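The multi-slice input strategy can be sketched as below; the slice count (2k+1) and the clamping behavior at volume edges are assumptions, not details from xie_deep_2019.

```python
import numpy as np

def stack_adjacent_slices(volume, index, k=1):
    """Gather 2k+1 adjacent slices centered on `index` as input channels,
    so a 2-D network can see some through-plane context. Indices beyond
    the volume boundary are clamped to the nearest valid slice."""
    depth = volume.shape[0]
    idx = np.clip(np.arange(index - k, index + k + 1), 0, depth - 1)
    return volume[idx]  # shape: (2k+1, H, W), channels-first

# Example: a 4-slice toy volume; the first slice is duplicated at the edge.
vol = np.arange(16, dtype=float).reshape(4, 2, 2)
chunk = stack_adjacent_slices(vol, 0, k=1)
```

The stacked array can then be fed to a 2-D network as a (2k+1)-channel input, a common compromise between full 3-D convolutions and purely slice-by-slice processing.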

In conclusion, we have presented a novel network-based reconstruction algorithm for few-view CT. The proposed method outperforms previous deep-learning-based methods with a significantly lower memory burden and higher computational efficiency. In the future, we plan to further improve this network for direct cone-beam 3D breast CT reconstruction and to translate it into clinical applications.