TomoGAN: Low-Dose X-Ray Tomography with Generative Adversarial Networks

02/20/2019 ∙ by Zhengchun Liu, et al.

Synchrotron-based x-ray tomography is a noninvasive imaging technique that allows for reconstructing the internal structure of materials at high spatial resolutions. Here we present TomoGAN, a novel denoising technique based on generative adversarial networks, for improving the quality of reconstructed images for low-dose imaging conditions, as at smaller length scales where higher radiation doses are required to resolve sample features. Our trained model, unlike other machine-learning-based solutions, is generic: it can be applied to many datasets collected at varying experimental conditions. We evaluate our approach in two photon-budget-limited experimental conditions: (1) sufficient number of low-dose projections (based on Nyquist sampling), and (2) insufficient or limited number of high-dose projections. In both cases, angular sampling is assumed to be isotropic, and the photon budget throughout the experiment is fixed based on the maximum allowable radiation dose. Evaluation with both simulated and experimental datasets shows that our approach can reduce noise in reconstructed images significantly, improving the structural similarity score for simulation and experimental data with ground truth from 0.18 to 0.9 and from 0.18 to 0.41, respectively. Furthermore, the quality of the reconstructed images with filtered back projection followed by our denoising approach exceeds that of reconstructions with simultaneous iterative reconstruction.








X-ray computed tomography (CT) is a common noninvasive imaging modality for resolving the internal structure of materials at synchrotrons [bonse1996x]. In CT, 2D projection images of an object are collected at different views of the object around a common axis, and a numerical reconstruction process is applied afterwards to recover the object’s morphology in 3D. Although CT experiments at synchrotrons can collect data at high spatial and temporal resolution, in situ or dose-sensitive experiments require short exposure times to capture relevant dynamic phenomena or to avoid sample damage. These low-dose (LD) imaging conditions yield noisy measurements that significantly impact the quality of the resulting 3D reconstructions. Similar concerns arise when the number of projections is limited to meet speed and/or dose requirements, such as in conventional lab-based micro-CT systems, where measurements are collected at discrete rotations of the object. Thus, we want techniques that can map noisy reconstructions (due to fewer projections, lower resolutions, and/or shorter imaging times with noisy measurements) to an approximation of the ideal image.

Much research has been conducted on methods for improving the quality of noisy low-dose images. Broadly, these approaches fall into three categories: (i) methods for denoising measurements/raw data, for example, sinograms or projections [wang2005sinogram, wang2006penalized, manduca2009projection]; (ii) advanced iterative reconstruction algorithms, for example, model-based approaches [mohan:2014, vogel1996iterative, beister2012iterative]; and (iii) methods for denoising reconstructed images [ma2011low]. For (i) and (iii), various deep learning (DL) approaches have shown great promise [8332971, Lovric:xk5008, wolterink2017generative, deep-img, jimaging4110128, 2016arXiv160806993H]. In the context of synchrotron-based CT, for (i), Yang et al. [Yang2018] used a deep convolutional neural network (CNN) to denoise pre-reconstruction short-exposure-time projection images, with the network trained on a few long-exposure-time (i.e., high-dose) and short-exposure-time (i.e., low-dose) projection pairs. They achieved a 10-fold increase in signal-to-noise ratio, enabling the reliable tracing of brain structures in low-dose datasets. For (iii), Pelt et al. [jimaging4110128] trained a mixed-scale dense convolutional neural network [2016arXiv160806993H] in a supervised fashion to learn a mapping from low-dose to normal-dose reconstructions. They achieved impressive results on simulation datasets, but when evaluating on an experimental dataset, they used only different portions of the same dataset for training and testing.

DL techniques use multilayer (“deep”) neural networks (DNNs) to learn representations of data with multiple levels of abstraction. These techniques can discover intricate structure in a dataset by using a back-propagation algorithm to set the internal parameters that are used to transform data as they flow between network layers. Recent advances in DL, such as convolutional networks [LeCun2015], rectified linear units (ReLUs), batch normalization [BatchNorm], dropout [Dropout], and residual learning [residual], have enabled exciting new applications in many areas. DL techniques have been applied successfully to a range of scientific imaging problems, such as denoising, super-resolution, segmentation, and image enhancement and restoration [ARAUJO201913, dula:ML:2017, 2017arXiv171110925U, srgan, pix2pix2017].

Figure 1: Two different reconstructions of a noisy simulated dataset, constructed by subsampling 64 projections from a 1,024-projection simulated dataset containing foam features in a 3D volume. On the left, the results of conventional reconstruction, which are highly noisy. On the right, those same results after denoising with TomoGAN; the features are much more visible. In these images and others that follow, an inset shows details of a representative feature.

Figure 2: Two different reconstructions of a noisy experimental dataset, constructed by subsampling 64 projections from a 1,024-projection shale sample dataset. On the left, the results of conventional reconstruction, which are highly noisy. On the right, those same results after denoising with TomoGAN; the features are much more visible.

In this article we explore an alternative DL approach to image enhancement, namely, the use of generative adversarial networks (GANs). GAN approaches are unsupervised (or weakly supervised) and can learn from limited training data, which makes them especially suitable for experimental data collected at synchrotron light sources. In general, a GAN involves two neural networks, a generator and a discriminator, that contest with each other in a zero-sum-game framework [2014arXiv1406.2661G]. Training a GAN model involves a minimax game between the generator, which mimics the true distribution, and the discriminator, which distinguishes samples produced by the generator from real samples. This approach not only is more resistant to overfitting but also allows for quality enhancement with much less data than required for conventional supervised DL. GANs have been applied successfully in medical imaging [8353466, 8340157] but have not previously been used with high-resolution imaging techniques at synchrotrons. The challenge in the synchrotron context is that the high-resolution images produced include finely detailed features with high-frequency content. Approaches developed for medical images are typically insufficient since they are tailored to easily recognizable features with low-frequency content and, when applied to high-resolution images, can introduce undesired artifacts such as nonexistent features.

Our GAN-based method, TomoGAN, adapts the U-Net network architecture [8353466] to meet the specialized requirements of improving the quality of images generated by high-resolution tomography experiments at synchrotron light sources. We demonstrate that the TomoGAN model can be trained with limited data, performs well with high-resolution datasets, and generates greatly improved reconstructions of low-dose and noisy data, as shown in Figure 1 and Figure 2. We also show that our model can be applied to a variety of experimental datasets from different instruments, demonstrating that it is resilient to overfitting and has wide applicability in practice.

We extensively evaluate our approach with real-world tomography datasets in order to prove the applicability of the proposed method in practice. These experimental datasets are from different types of shale samples collected at different facilities by using the same technique but different imaging conditions (different x-ray sources and detectors). We simulate two scenarios: (1) picking a subset of the x-ray projections, to simulate a reduced number of projections as in a lab-based CT system, and (2) applying synthetic noise to the x-ray projections, to simulate short exposure times. Both scenarios lead to noisy reconstructed images. We use one dataset to train TomoGAN and then evaluate the trained model on others. We compare the denoised (DN) images with ground truth and measure the quality of denoised images using (1) the structural similarity (SSIM) index [ssim, Ching:2017:xdesign, scikit:2014] and (2) image pixel value plots. Our evaluation results show that our approach can significantly improve image quality by reducing the noise in reconstructed images. We believe that this approach will also be effective for improving reconstruction quality when the same sample structure is imaged with different techniques with different imaging contrasts, for example, in multimodal imaging systems.


We describe in turn the TomoGAN model architecture, the process by which we train a TomoGAN model, and the datasets and experimental setup used for evaluation.

Model architecture

Generally, the task of denoising a reconstructed image can be posed as that of translating the noisy image into a corresponding output image that represents exactly the same features, with the features in the enhanced image indistinguishable from those in a ground truth version. Machine learning models learn to minimize a loss function—an objective that scores the quality of results—and although the learning process is automatic, the model still must be told what needs to be minimized. If the model is (naively) asked to minimize the Euclidean distance between predicted and ground truth pixels, it will tend to produce blurry results [Nasrollahi2014], since the Euclidean distance is minimized by averaging all plausible outputs. With GANs [2014arXiv1406.2661G], we can instead specify a high-level goal such as “make the output indistinguishable from reality.” Thus, blurry images are not acceptable because they are obviously distinguishable from the real image.

Technically, a GAN is a class of deep generative models that aims to learn a target distribution in an unsupervised fashion [2018arXiv180704720K]. A GAN combines two neural networks, a generator (G) and a discriminator (D), which compete in a zero-sum game: G generates candidates that D evaluates, and those evaluations serve as feedback to G. Thus, GANs are designed to reach a Nash equilibrium at which neither of the two networks can reduce its cost without changing the other player’s parameters. In this paper, we train a DNN to create a generator model that maps a noisy reconstruction (i.e., we condition G on the noisy reconstruction as input, instead of a random value as in a standard GAN [2014arXiv1406.2661G]) into a form that can fool an adversarial model D that is trained to distinguish reconstructions of noisy projections from the enhanced reconstructions created by G. Thus, we use G to enhance images; D simply works as a helper to train G. A classic GAN generates samples from random noise inputs [2014arXiv1406.2661G]; in contrast, our TomoGAN network creates samples from closely related noisy inputs.


The TomoGAN generator network architecture, shown in Figure 3, is a variation of the U-Net architecture proposed for biomedical image segmentation by Ronneberger et al. [unet]. It comprises a down-sampling network (left) followed by an up-sampling network (right). It adapts the U-Net architecture in three main ways: (1) there are three (instead of four) down-sampling layers and up-sampling layers; (2) all convolution layers have zero padding in order to keep the same image size; and (3) the input is a stack of adjacent images (discussed in the next section), and eight convolution kernels are applied to the input.

In the down-sampling process, three sets of two convolution kernels (the three boxes) extract feature maps. A pooling layer then distills the feature maps to their most essential elements by keeping the maximum response in each pooling window. Ultimately, the feature maps are 1/8 of the original size: 128×128 in Figure 3. Successful training should result in the 128 channels of this feature map retaining the important features, which in Figure 15 seems to be the case.

In the up-sampling process, bilinear interpolation is used to expand feature maps. At each layer, high-resolution features from the down-sampling path (transmitted via Copy operations) are concatenated to the up-sampled output from the layer below to form a large number of feature channels. Thus the network can propagate context information to higher-resolution layers, so that the following convolution layer can learn to assemble a more precise output based on this information.
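The down-sampling, up-sampling, and Copy paths described above can be sketched with plain NumPy shape bookkeeping. This is an illustrative toy only: nearest-neighbor up-sampling stands in for bilinear interpolation, channel counts are simplified relative to Figure 3, and a 64×64 input replaces the 1024×1024 one.

```python
import numpy as np

def down(x):
    """2x2 max-pooling over the spatial axes of a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

def up(x):
    """Nearest-neighbor 2x up-sampling (stand-in for bilinear interpolation)."""
    return np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)

def skip_concat(decoder_map, encoder_map):
    """The Copy path: concatenate encoder features onto the decoder map
    along the channel axis."""
    return np.concatenate([decoder_map, encoder_map], axis=0)

x = np.random.rand(8, 64, 64)            # 8 channels after the input convolutions
bottleneck = down(down(down(x)))         # three levels: 64 -> 32 -> 16 -> 8
assert bottleneck.shape == (8, 8, 8)     # 1/8 of the original spatial size
merged = skip_concat(up(bottleneck), down(down(x)))
assert merged.shape == (16, 16, 16)      # the skip connection doubles the channels
```

The shape arithmetic mirrors the figure: three pooling steps shrink the spatial size by 8×, and each Copy operation doubles the channel count at the matching decoder level.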

Figure 3: The TomoGAN generator architecture is applied to images of size n×n. Here, we show its operation when n = 1024. Each bar corresponds to a multichannel feature map, with the number of channels shown at the top of the bar and the height and width at the lower left edge. The symbols (see legend) show the operations used to transform one feature map to the next. Zero padding is used to make each convolution layer’s output size equal to its input size. The Copy operations allow low-level information to shortcut across the network, thus improving the spatial resolution of the corresponding features.


The TomoGAN discriminator has six 2D 3×3 CNN layers and two fully connected layers. Each CNN layer is followed by a leaky rectified linear unit as the activation function. Following the same logic as in the generator, all convolutional layers in the discriminator have the same small 3×3 kernel size. Let CkSs-n denote a convolution layer with a kernel size of k, a stride of s, n output channels, and a leaky ReLU activation function. The discriminator network consists of C3S1-64, C3S2-64, C3S1-128, C3S2-128, C3S1-256, C3S2-256, one hidden fully connected layer with 512 neurons and leaky ReLU activation, and an output layer with one neuron and linear activation. There is no sigmoid cross-entropy layer at the end of the discriminator, and the Wasserstein distance is used for calculating the generator loss in order to improve training stability.
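As a reading aid for the CkSs-n notation, the hypothetical helper below parses the layer specs and computes the spatial size of the feature maps leaving the convolutional stack, assuming "same" zero padding so that only the strides shrink the spatial dimensions:

```python
def parse_layer(spec):
    """Parse a 'CkSs-n' convolution spec into (kernel, stride, out_channels)."""
    conv, channels = spec.split("-")
    return int(conv[1]), int(conv[3]), int(channels)

def output_shape(height, width, layers):
    """Spatial size and channel count after the conv stack, assuming 'same'
    zero padding so only the stride reduces the spatial dimensions."""
    channels = None
    for spec in layers:
        _, stride, channels = parse_layer(spec)
        height, width = -(-height // stride), -(-width // stride)  # ceil division
    return height, width, channels

layers = ["C3S1-64", "C3S2-64", "C3S1-128", "C3S2-128", "C3S1-256", "C3S2-256"]
# Three stride-2 layers halve a 1024x1024 input three times: 1024 -> 128.
print(output_shape(1024, 1024, layers))  # (128, 128, 256)
```

These 128×128×256 feature maps then feed the 512-neuron hidden layer and the single linear output neuron.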


Model training

We present the loss functions used in the TomoGAN discriminator and generator.

Discriminator loss

In addition to the original critic loss presented by Arjovsky et al. [2017arXiv170107875A], we add a gradient penalty, as suggested by Gulrajani et al. [improved-wgan], to Equation 1 for better training stability. Thus, the discriminator is trained to minimize

$$\ell_D = \frac{1}{N}\sum_{i=1}^{N}\Big[D\big(G(x_i)\big) - D(y_i) + \lambda_{gp}\big(\lVert\nabla_{\hat{y}_i}D(\hat{y}_i)\rVert_2 - 1\big)^2\Big], \qquad (1)$$

where $x_i$ is one noisy image (the $i$th in the minibatch), $N$ is the training minibatch size, and $y_i$ is the corresponding ground truth image. The interpolate $\hat{y}_i = \epsilon y_i + (1-\epsilon)\,G(x_i)$, where $\epsilon$ is a random number between 0 and 1. As in Gulrajani et al. [improved-wgan], we use $\lambda_{gp} = 10$ to balance the trade-off between the Wasserstein distance and the gradient penalty.
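A toy NumPy illustration of a gradient-penalized critic loss follows. It uses a linear critic D(x) = w·x so that the input gradient needed for the penalty is exactly w (real training would obtain it by automatic differentiation); all names here are illustrative, not TomoGAN's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=16)                  # weights of a toy linear critic D(x) = w.x
D = lambda x: x @ w

def critic_loss(real, fake, lam=10.0):
    """Wasserstein critic loss plus gradient penalty, evaluated on random
    interpolates between real and generated samples."""
    eps = rng.uniform(size=(real.shape[0], 1))
    interp = eps * real + (1.0 - eps) * fake
    # For a linear critic, the gradient of D w.r.t. its input is w everywhere.
    grad_norm = np.linalg.norm(np.broadcast_to(w, interp.shape), axis=1)
    penalty = lam * np.mean((grad_norm - 1.0) ** 2)
    return np.mean(D(fake)) - np.mean(D(real)) + penalty

real = rng.normal(size=(8, 16))          # stand-ins for ground truth images
fake = rng.normal(size=(8, 16))          # stand-ins for generator outputs
loss = critic_loss(real, fake)
```

The penalty term pushes the critic's gradient norm toward 1 on the interpolates, which is what stabilizes Wasserstein GAN training.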

Generator loss

An important requirement while denoising images is that the generator not introduce artificial features. Thus, previous approaches have found it beneficial to extend the loss function to be minimized by the GAN generator with a more traditional loss, such as the L2 distance [Nasrollahi2014]. The discriminator’s job remains unchanged, but the generator is tasked not only with fooling the discriminator but also with being near the ground truth output in an L2 sense. Although the L2 distance is well known to produce blurry results on image generation problems, it can capture low-frequency small features accurately [KimLL15a]. Thus, we also add L2 loss to enforce correctness for low-frequency structures [pix2pix2017]. Moreover, perceptual losses are also used to penalize any structure that differs between output and target. Thus, the generator loss is a weighted average of three losses: the original GAN adversarial loss $\ell_{adv}$, a perceptual loss $\ell_{perc}$, and a pixelwise mean L2-norm $\ell_{mse}$.

Adversarial loss.

As in the Wasserstein GAN [2017arXiv170107875A], we compute the adversarial loss as

$$\ell_{adv} = -\frac{1}{N}\sum_{i=1}^{N} D\big(G(x_i)\big),$$

where $G(x_i)$ is the enhanced version of the $i$th noisy image in the minibatch.

Perceptual loss.

To allow the generator to retain a visually desirable feature representation, we use the mean squared error (MSE) of features extracted by an ImageNet-pretrained VGG-19 network for the perceptual loss [srgan]. Specifically, we use the first 16 layers of the ImageNet-pretrained VGG network [vgg] to extract the feature representation $\phi(\cdot)$ of a given image. We then define the perceptual loss as the Euclidean distance between the feature representations of a ground truth image $y$ and the corresponding denoised image $G(x)$. Thus, the perceptual loss is calculated by

$$\ell_{perc} = \frac{1}{W_f H_f}\sum_{w=1}^{W_f}\sum_{h=1}^{H_f}\big(\phi(y)_{w,h} - \phi(G(x))_{w,h}\big)^2,$$

where $W_f$ and $H_f$ denote the dimensions of the feature maps extracted by the pretrained VGG network.

Since the VGG network is trained with natural images, namely, ImageNet, one may have concerns about how well it can perform on light source images. Yang et al.[8340157] demonstrated that the VGG network, when pretrained with the ImageNet dataset, can also serve as a good feature extractor for CT images.
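A minimal sketch of the feature-space loss follows, with a bank of fixed random 3×3 convolutions standing in for the frozen pretrained VGG layers. The stand-in extractor is an assumption for illustration only, not the VGG network itself.

```python
import numpy as np

def features(img, kernels):
    """Stand-in feature extractor: valid 3x3 convolutions with fixed kernels,
    playing the role of the frozen, pretrained VGG layers."""
    h, w = img.shape
    maps = []
    for k in kernels:
        out = np.zeros((h - 2, w - 2))
        for i in range(3):
            for j in range(3):
                out += k[i, j] * img[i:i + h - 2, j:j + w - 2]
        maps.append(out)
    return np.stack(maps)

def perceptual_loss(truth, denoised, kernels):
    """Mean squared distance between the two feature representations."""
    ft, fd = features(truth, kernels), features(denoised, kernels)
    return np.mean((ft - fd) ** 2)

rng = np.random.default_rng(1)
kernels = rng.normal(size=(4, 3, 3))     # four fixed "pretrained" filters
y = rng.random((32, 32))                 # toy ground truth slice
loss_same = perceptual_loss(y, y, kernels)   # identical images give zero loss
```

Because the extractor is fixed, the loss only measures how far apart the two images are in feature space, which is the role the frozen VGG plays in the real loss.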

Pixel-wise MSE.

The pixelwise MSE loss is calculated as

$$\ell_{mse} = \frac{1}{WH}\sum_{w=1}^{W}\sum_{h=1}^{H}\big(y_{w,h} - G(x)_{w,h}\big)^2,$$

where W and H are the image’s width and height, respectively. The final loss function that the generator must minimize is thus the weighted sum

$$\ell_G = \lambda_{adv}\,\ell_{adv} + \lambda_{perc}\,\ell_{perc} + \lambda_{mse}\,\ell_{mse}.$$
To reduce the chance of “mode collapse” [2014arXiv1406.2661G], we train the generator G once every four training steps of the discriminator D. Figure 4 shows the training pipeline. Once the model is trained, we can input a noisy image, and the generator outputs the enhanced image.
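The alternating update schedule can be sketched as follows, assuming the discriminator receives the more frequent updates (the usual arrangement for Wasserstein GANs); the update calls are placeholders, not TomoGAN's actual training functions.

```python
def train(steps, d_steps_per_g=4):
    """Alternating GAN update schedule: the discriminator is updated every
    step, the generator once per d_steps_per_g discriminator steps."""
    d_updates = g_updates = 0
    for step in range(steps):
        d_updates += 1                       # placeholder: update_discriminator()
        if (step + 1) % d_steps_per_g == 0:
            g_updates += 1                   # placeholder: update_generator()
    return d_updates, g_updates

print(train(100))  # (100, 25)
```

Keeping the discriminator ahead of the generator in this way gives the generator a more informative training signal and reduces the chance of collapsing onto a single mode.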

Figure 4: Model training pipeline. The slice index is chosen randomly for each training iteration, and the corresponding slices of the reconstructed low-dose and normal-dose projections are used as input. Once the model is trained, only the generator, G, is used to enhance the tomographic reconstructions.

Datasets and experimental setup

We used two simulated datasets and four experimental datasets. The experimental datasets comprise two shale samples, each imaged at two different facilities [B19, B18]. The datasets are provided for benchmarking purposes and were retrieved from TomoBank [DeCarlo:2018:tomobank].

Simulated datasets

Each simulated dataset [jimaging4110128] has different-sized foam features distributed randomly in a 3D volume. We use the ASTRA toolbox [ASTRA] to generate the corresponding x-ray projections, with the total number of simulated projections (i.e., the total number of angles) set to 1024 and each projection consisting of 1024×1024 pixels. We reconstruct the 3D volumes, of size 1024×1024×1024, with the filtered back projection algorithm in the TomoPy and ASTRA toolkits [gursoy2014tomopy, Pelt:pp5084], and we use the generated 2D slices in the 3D volumes for training and testing. Two different random seeds are used to generate two different datasets; we train with one and evaluate the trained model on the other.

To simulate the case with a limited number of x-ray projections, we subsample the projections in both simulated datasets to 1/2, 1/4, 1/8, and 1/16 of the total 1024 projections. For the case with a shorter exposure time, we simulate the measured photon counts by using the Beer-Lambert law. Specifically, background per-pixel photon counts are initially set to 100, 500, 1000, and a fourth, higher level, to simulate x-ray intensity. Then, the photon counts are resampled by using a Poisson distribution, and the new noisy measurements are used to generate the line integrals and noisy image reconstructions.
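The short-exposure simulation described above can be sketched as follows. The function name and the toy sinogram are illustrative, and clamping counts to at least one photon (to avoid taking the log of zero) is an implementation assumption.

```python
import numpy as np

def add_poisson_noise(line_integrals, i0, seed=0):
    """Simulate short-exposure projections: convert ideal line integrals to
    expected photon counts via the Beer-Lambert law, resample the counts from
    a Poisson distribution, and convert back to noisy line integrals."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(i0 * np.exp(-line_integrals))
    counts = np.maximum(counts, 1)        # assumption: avoid log(0) on dark pixels
    return -np.log(counts / i0)

# Toy sinogram of ideal line integrals (attenuation values in [0, 2]).
ideal = np.random.default_rng(2).uniform(0.0, 2.0, size=(64, 128))
for i0 in (100, 500, 1000):               # background photon counts per pixel
    noisy = add_poisson_noise(ideal, i0)  # noise shrinks as the budget i0 grows
```

Lower values of i0 give fewer detected photons per pixel, so the relative Poisson fluctuation grows and the reconstructed slices become noisier.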

Experimental datasets

Shale is a challenging material because of its multiphase composition, small grain size, low but significant amount of porosity, and strong shape- and lattice-preferred orientation. In this work, we use two shale sample datasets obtained from the North Sea (sample N1) and the Upper Barnett Formation in Texas (sample B1) [B18]. Each sample has been imaged at two different light source facilities: the Advanced Photon Source (APS) at Argonne National Laboratory and the Swiss Light Source (SLS) at the Paul Scherrer Institut. For ease of reference, we name the resulting four datasets N1-APS, N1-SLS, B1-APS, and B1-SLS.

When evaluating performance with a limited number of projections, we subsample to 1/2, 1/4, 1/8, and 1/16 of the total number of projections in each full dataset. We apply a noise model [boas2012ct] to simulate the impact of shorter exposure times on projection quality, and we add this simulated noise to the normal-exposure-time projections.

We arbitrarily select one dataset to train TomoGAN and use the other three datasets to evaluate its effectiveness and performance. To train on, or enhance, a given slice, we use a stack of adjacent slices centered on that slice as input to the generator G, where the stack depth is a tunable parameter. The generated (enhanced) images, as well as the corresponding slices of the normal-dose reconstruction, are then input to the discriminator D to compute the loss and update the model weights. We implemented TomoGAN with TensorFlow [abadi2016tensorflow] and used three NVIDIA Tesla V100 GPU cards for training. We used the ADAM algorithm [Adam] to train both the generator and discriminator, with a batch size of four images and a depth of three for each image.

Experimental Results

In this section, we evaluate the performance of TomoGAN with different configurations and various datasets. We first show the effectiveness of a depth parameter in TomoGAN. Next, we present the performance of TomoGAN with both simulated and experimental datasets. We then compare the computational requirements and image quality of filtered back projection (FBP) followed by TomoGAN against a simultaneous iterative reconstruction technique.

Effectiveness of using adjacent slices in image enhancement

Features observed in adjacent slices tend to be highly correlated, usually with similar shape but different size. Stacking several neighboring slices to form multiple channels for the first CNN layer is equivalent to using a multiple linear regressor as a filter to reduce noise. Thus we use a stack of d adjacent slices centered on the i-th slice as input to enhance that slice, where d is a hyperparameter.
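Building the d-slice input stack can be sketched as below; clamping indices at the volume boundaries is an assumption about how the first and last slices are handled.

```python
import numpy as np

def slice_stack(volume, i, d=3):
    """Return the d adjacent slices centered on slice i as a (d, H, W) input
    stack, clamping indices at the volume boundaries."""
    half = d // 2
    idx = np.clip(np.arange(i - half, i + half + 1), 0, volume.shape[0] - 1)
    return volume[idx]

vol = np.arange(5 * 4 * 4, dtype=float).reshape(5, 4, 4)  # toy 5-slice volume
assert slice_stack(vol, 2).shape == (3, 4, 4)             # d = 3 channels
assert np.array_equal(slice_stack(vol, 0)[0], vol[0])     # clamped at the edge
```

Each stack becomes a d-channel input to the first convolution layer, letting the network exploit the correlation between neighboring slices.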

Figure 5, which compares the quality of images when enhanced with d = 1 vs. d = 3, shows the effectiveness of using adjacent slices. We see that d = 3 yields better reconstruction quality than d = 1, especially when the original feature edge is not sharp (marked by a red rectangle).

(a) Ground truth.
(b) Noisy, no enhancement.
(c) TomoGAN-enhanced, d = 1.
(d) TomoGAN-enhanced, d = 3.
Figure 5: Effectiveness of using adjacent slices when enhancing reconstructed images.

TomoGAN versus supervised models

Previous works have used supervised machine learning techniques [Yang2018, jimaging4110128] to learn the mapping between reconstructions from noisy and noise-free images. Our approach of training a GAN model and using its generator to improve reconstructions is intended to make TomoGAN more resilient to overfitting. To test this approach, we created a version of TomoGAN in which we disable the adversarial loss by setting its weight to zero, so that the generator is trained in a supervised machine learning fashion. Figure 6 compares TomoGAN with the resulting supervised model.

(a) Ground truth.
(b) Noisy: Conventional.
(c) Noisy: Supervised.
(d) Noisy: TomoGAN.
Figure 6: Effectiveness of adversarial loss: enhancement of noisy reconstruction by (c) a supervised machine learning approach (TomoGAN without adversarial loss), and (d) TomoGAN.

We observe that features are shifted significantly in position in the supervised learning reconstruction. The red rectangle in Figure 6 makes this clear. Notice how the sphere delineated by the rectangle in (a) is shifted down in (c), to the extent that its upper border is visible. (The TomoGAN-enhanced image, in (d), is shifted only slightly.) Notice also how some small features (e.g., those highlighted by the red dotted circles) are merged and others (e.g., those highlighted by the red dotted ellipse) are difficult to see. Such artifacts are not tolerable in practice.

Simulated datasets

Here we show the effectiveness of TomoGAN in improving noisy reconstructed images. We consider both the fewer-projections case and the shorter-exposure-time case, each of which results in noisy measurement data and reconstructed images.

Fewer projections

In these experiments, we subsample the 1024 projections to 512, 256, 128, and 64 projections to simulate scenarios for step-scan CT systems in which a limited number of projections are collected to reduce either the total experiment time (e.g., to capture dynamic features) or the x-ray dose (e.g., for dose-sensitive samples). We perform the same subsampling for both simulated datasets. Then, once TomoGAN is trained on one dataset, we use its generator to enhance reconstructions based on subsampled projections of the other. In Figure 7, we zoom in on a randomly selected area and plot the pixel values of a single horizontal line to show the improvement in image quality. We see that the reconstruction quality with reduced numbers of projections is poor in the absence of enhancement; the enhancement with TomoGAN, however, yields images with quality comparable to that of the reconstructions based on 1024 projections.
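The subsampling itself can be sketched as uniform angular decimation of the projection stack; equal angular spacing is an assumption consistent with the isotropic sampling described in the abstract.

```python
import numpy as np

def subsample_projections(projections, keep):
    """Keep `keep` equally spaced projections out of the full angular scan,
    mimicking a step-scan system that acquires fewer views."""
    step = projections.shape[0] // keep
    return projections[::step][:keep]

full = np.zeros((1024, 64, 64))           # toy stack: 1024 projections of 64x64
for keep in (512, 256, 128, 64):
    sub = subsample_projections(full, keep)
    assert sub.shape[0] == keep           # 1/2, 1/4, 1/8, 1/16 of the views
```

Reconstructing from each decimated stack produces the progressively noisier inputs that TomoGAN is then asked to enhance.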



Figure 7: Conventional vs. TomoGAN-enhanced reconstructions of simulated data, subsampled to (512, 256, 128, 64) projections. In each group of three elements, the two images show conventional and TomoGAN reconstructions, while the plot shows conventional, TomoGAN, and ground truth values for the 200 pixels on the horizontal line in the top left image.

We use the structural similarity index (SSIM) [ssim] to perform quantitative comparisons of the quality of TomoGAN-enhanced reconstructions. Using a reconstruction of a full 1024-projection dataset as ground truth, we calculate for each noisy (fewer projections or fewer photons) dataset both (a) the SSIM between the ground truth and the noisy reconstruction and (b) the SSIM between the ground truth and the TomoGAN-enhanced version of that noisy reconstruction. The difference between these two quantities is the image quality improvement that results from the use of TomoGAN.
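For intuition, the SSIM formula can be evaluated globally over a whole image as below. Note that this is a simplified single-window variant; the standard implementation (e.g., in scikit-image) averages the same statistic over a sliding local window.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM: the standard luminance/contrast/structure formula
    evaluated once over the whole image instead of per local window."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx**2 + my**2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(3)
truth = rng.random((64, 64))
assert np.isclose(global_ssim(truth, truth), 1.0)   # identical images score 1
noisy = truth + 0.2 * rng.normal(size=truth.shape)  # additive noise lowers SSIM
```

An SSIM of 1 means a perfect structural match with the ground truth; added noise reduces the covariance term and drives the score down, which is exactly the gap that TomoGAN enhancement closes.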

We further calculated these two SSIM values for each slice in the reconstructed image that has features, thus allowing us to plot in Figure 9 the distribution of SSIM image quality scores for both the conventional and TomoGAN-enhanced images. We see that while image quality declines significantly with subsampling in the absence of TomoGAN enhancement, it remains high when enhancement is applied, even with high degrees of subsampling. The SSIM is improved significantly by TomoGAN, and the improvement is consistent across all the slices, suggesting that the method is reliable.

Shorter exposure times

Another way of reducing dose (or, equivalently, accelerating experiments) is to reduce x-ray exposure times. The use of shorter exposure times has attracted major attention for synchrotron-based tomography systems (e.g., Advanced Photon Source), because it can reduce data collection times significantly, as needed both to capture dynamical features and to reduce damage to organic samples from light source radiation. Reduced exposure time, however, compromises the signal-to-noise ratio, which affects final reconstruction image quality and may affect scientific findings.

In this experiment, we simulate x-ray projections with different photon intensities to mimic the effect of different exposure times. Specifically, we simulate with intensities of 100, 500, 1000, and a fourth, higher number of photons per pixel and, for each, compute conventional and TomoGAN-enhanced reconstructions. Figure 8 shows representative results, while Figure 9(b) quantifies the quality of the different reconstructed images via SSIM scores computed relative to ground truth. Ground truth here is a reconstruction based on the noise-free simulation data. We see that the trained TomoGAN model can enhance the reconstructions of noisy data to a quality comparable with that of reconstructions based on the noise-free data. Comparison of Figure 8 with Figure 7 shows that reducing exposure time leads to a different kind of noise from that produced by using fewer projections.



Figure 8: Conventional vs. TomoGAN-enhanced reconstructions of simulated data with intensity limited to four decreasing photon-per-pixel levels, down to 1000, 500, and 100 photons per pixel. Figure elements are as in Figure 7.
(a) Fewer x-ray projections.
(b) Shorter exposure time.
Figure 9: Per-slice SSIM similarities with ground truth simulated data, for both conventional and TomoGAN-enhanced reconstructions and for different degrees of subsampling.

Experimental datasets

A key issue to address when dealing with experimental datasets is whether we can train TomoGAN on data collected from one sample and then use the trained model on other samples and/or at other facilities. Thus, we train TomoGAN with data collected on one sample at one facility and then evaluate its performance on noisy versions (with both fewer projections and shorter exposure times) of the three other datasets. Since these other datasets include a different sample and x-ray projections collected at a different facility, this configuration mimics a practical use case.

Evaluation of TomoGAN is more difficult for the experimental datasets since there is no ground truth. Therefore, we consider the images reconstructed from the full (1024-projection) datasets as their corresponding ground truths.

Fewer projections

As we did for the simulated datasets, we subsample the full (1024-projection) experimental datasets and then use TomoGAN to enhance the reconstructions of the subsampled datasets. In Figure 10, we show conventional and TomoGAN-enhanced reconstructions (for varying degrees of subsampling) of a randomly selected area of a slice from one of the test datasets. The corresponding pixels of a randomly selected horizontal line within the selected areas are also presented for comparison. We see that image quality is improved significantly: even the small features are clearly visible in the enhanced images. The pixel value plots also show that the TomoGAN-enhanced images are comparable with the ground truth.



Figure 10: Conventional vs. TomoGAN-enhanced reconstructions of an experimental dataset, subsampled to (512, 256, 128, 64) projections. Figure elements are as in Figure 7.

For the most challenging cases, in which only 64 x-ray projections are available for reconstruction, Figure 11 shows the pixel values of a horizontal line that crosses an arbitrarily chosen feature in each of the four datasets. Comparing the enhanced (green) and ground truth (blue) lines in each case, we see that TomoGAN yields reconstructions comparable to the ground truth. The contrast of the enhanced images is clearly higher than that of the noisy reconstructions, and the variance of the feature pixel values (corresponding to the noise level) is reduced considerably relative to the noisy reconstructions. Moreover, not only is the experiment time reduced when using fewer projections (by a factor corresponding to the subsampling ratio), but the dataset size and computation cost for reconstruction are also reduced by the same fraction.

Figure 11: Pixel values of an arbitrarily chosen line in each of the four experimental datasets, subsampled to 64 projections.

We again use SSIM to quantify the quality of the conventional and TomoGAN-enhanced reconstructions. Recall that our model was trained with one dataset; here we evaluate the trained model on a different shale sample imaged at a different facility. The SSIM scores, shown in Figure 12(a), indicate that TomoGAN consistently improves image quality for all slices at each subsampling level. However, the overall quality scores for the TomoGAN-enhanced reconstructions are clearly not as good as those for the simulated data in Figure 9(a). We attribute this difference to the facts that the simulated dataset has a much better training dataset (ground truth) than does the experimental dataset and that the features in the simulated dataset are much simpler (only circles) than those in the experimental dataset.

The results in Figure 12 show metric scores that are consistent across slices, suggesting that our model behaves well for all slices. We claim that this property is important for a black-box predictor.
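The SSIM index used throughout can be sketched as follows. This is the single-window (global) form of the measure; library implementations such as skimage's `structural_similarity` average the same formula over local windows, but the terms are identical:

```python
import numpy as np

def ssim_global(x, y, drange=1.0):
    """Single-window SSIM: luminance, contrast, and structure terms.

    drange is the dynamic range of the images (1.0 for normalized data).
    """
    c1 = (0.01 * drange) ** 2   # stabilizer for the luminance term
    c2 = (0.03 * drange) ** 2   # stabilizer for contrast/structure
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Identical images score 1.0; added noise lowers the covariance term and thus the score, which is why noisy reconstructions score far below their denoised counterparts.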

Shorter exposure time

We also trained TomoGAN to enhance the quality of images reconstructed from short-exposure-time projections. First we applied the Poisson noise model [boas2012ct] to the four experimental datasets to create new (simulated) short-exposure-time datasets. Next, we used one of the experimental datasets, , plus its associated short-exposure-time datasets, to train TomoGAN. Then we used the trained model to enhance the noisy images of the other three datasets.
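A minimal sketch of such a shot-noise model, assuming normalized transmission images and a hypothetical incident photon flux (the flux value and shapes below are illustrative, not the paper's acquisition parameters):

```python
import numpy as np

def shorten_exposure(projs, full_flux, frac, seed=0):
    """Simulate a shorter exposure by resampling photon counts.

    projs: normalized transmission images in (0, 1].
    full_flux: incident photons per pixel at full exposure (assumption).
    frac: fraction of the normal exposure time, e.g. 1/16.
    """
    rng = np.random.default_rng(seed)
    flux = full_flux * frac
    counts = rng.poisson(projs * flux)   # Poisson shot noise at reduced dose
    return counts / flux                 # renormalize back to transmission

projs = np.clip(np.random.default_rng(2).random((4, 32, 32)), 0.05, 1.0)
noisy = shorten_exposure(projs, full_flux=10000, frac=1 / 16)
```

Lower `frac` means fewer expected counts per pixel and therefore higher relative noise, mimicking the 1/16-exposure case evaluated below.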

(a) Fewer x-ray projections.
(b) Shorter exposure time.
Figure 12: Per-slice SSIM similarities with a reconstruction of 1024-projection experimental data, for both conventional and TomoGAN-enhanced reconstructions and for different degrees of subsampling.

The SSIM-quantified qualities of the conventional and TomoGAN-enhanced reconstructions are shown in Figure 12(b). Comparing Figure 12 with Figure 9, we see that the improvement obtained for the experimental datasets is less than that obtained for the simulated dataset. The features in the experimental datasets are much more diverse (different shapes, sizes, and contrast) than those of the simulated dataset (only circles), and therefore denoising performance is lower for the experimental datasets. We also show in Figure 13, for the most challenging cases in which only 1/16 of the normal exposure time is used, the pixel values of a horizontal line that crosses an arbitrarily chosen feature in each of the four datasets.

Figure 13: Pixel values of an arbitrarily chosen feature in each of the four experimental datasets, with projections generated by using 1/16 of the normal exposure time. Feature shapes are different for each dataset.

Comparison with iterative methods

Iterative methods are more resilient to noise in projections [mohan:2015, geyer2015state], but they are computationally demanding and prohibitively expensive for large datasets [Schindera2013, Bicer2017, Wang2017, bicer2015rapid, wang:2016]. The filtered back projection algorithm takes 42 ms to reconstruct one slice (using the TomoPy [gursoy2014tomopy, Pelt:pp5084] toolkit), and TomoGAN takes 4 ms to enhance the reconstruction, for a total of 46 ms per slice. In contrast, the simultaneous iterative reconstruction technique (SIRT) takes 550 ms to reconstruct one slice with 400 iterations. These times were all measured on one NVIDIA Tesla V100-SXM2 graphics card. Moreover, as illustrated in Figure 14, iterative reconstruction does not provide better image quality than our method.
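To make the computational contrast concrete, the textbook SIRT update applies a normalized back-projection correction at every iteration, so its cost scales linearly with the iteration count, whereas filtered back projection is a single pass. A minimal numpy sketch on a toy dense system (this is the generic algorithm, not the implementation benchmarked above):

```python
import numpy as np

def sirt(A, b, n_iters=400):
    """Textbook SIRT: x <- x + C A^T R (b - A x), where R and C are
    inverse row- and column-sum normalizers of the system matrix A."""
    row = 1.0 / np.maximum(A.sum(axis=1), 1e-12)   # R diagonal
    col = 1.0 / np.maximum(A.sum(axis=0), 1e-12)   # C diagonal
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = x + col * (A.T @ (row * (b - A @ x)))  # one full fwd+back pass
    return x
```

Each iteration performs a forward projection (`A @ x`) and a back projection (`A.T @ ...`), which is why 400 SIRT iterations cost roughly two orders of magnitude more than one filtered back projection of the same slice.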

(a) SIRT + total variation post-processing.
(b) Filtered back projection + TomoGAN post-processing.
Figure 14: SIRT + total variation vs. TomoGAN: an image reconstructed from 64 simulated projections.

Model interpretation

A straightforward technique for understanding the workings of a deep convolutional neural network is to examine the outputs of its various layers, its feature maps [Zhou_2016], during the forward pass. Recall from Figure 3 that down-sampling reduces the image size to 1/8 of the original and increases the number of channels to 128. Successful training requires that the convolutional kernels retain the important features as the image size decreases. To qualitatively understand how the trained generator improves image quality, we show in Figure 15 the feature maps for an input (noisy) image, six representative channels from the bottommost layer (the last layer of down-sampling; red here indicates larger values and thus greater feature importance), and the final processed image. We focus on the bottommost layer because we expect it to show the most refined and important features selected by the preceding convolution kernels.

Figure 15: Feature maps from the TomoGAN network (Figure 3): (a) input (noisy) image; (b)–(g) six representative channels from the bottommost layer; (h) final processed image.

Examining Figure 15(b)–(g) in detail, we see that (b) tends to keep dense regions of the sample (white in (h)), while (e) detects nondense regions (air inside sample, black area in (h)); (c) detects the outer area in which there is no sample (should ideally be 0 in the output image); and (d) is similar to (b) but pays more attention to the details. Studying (h) closely, or (even better) examining Figure 1, which shows a zoomed-in version of the same image, we see that some areas contain many small features in the form of small holes. These small features could easily be lost in noise; interestingly, (f) and (g) seem to pay special attention to them. Intuitively, then, we may conclude that different channels (i.e., different convolution kernels) are responsible for capturing different types of features.
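The intuition that different channels specialize in different feature types can be illustrated with two hand-made kernels standing in for learned ones: a smoothing kernel that responds to dense regions and an edge kernel that responds to boundaries. All values here are illustrative, not taken from the trained network:

```python
import numpy as np

def conv2d(img, k):
    """Valid-mode cross-correlation (what conv layers actually compute)."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

# Toy "sample": a dense square on an empty background.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0

smooth = np.ones((3, 3)) / 9.0                                  # keeps dense regions
edge = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], float)    # Sobel-x, keeps boundaries

dense_map = conv2d(img, smooth)       # peaks inside the square
edge_map = np.abs(conv2d(img, edge))  # peaks on the square's vertical edges
```

The dense-region map is flat and maximal inside the square while the edge map is zero there and fires only at the boundary, mirroring how channels (b) and (e) versus (f) and (g) divide the work in Figure 15.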


We applied a generative adversarial network to improve the quality of images reconstructed from noisy x-ray tomographic projections. Experiments show that our model, TomoGAN, once trained on one sample, can then be applied effectively to other similar samples, even when those samples are imaged at a different facility and show different noise characteristics. Our results show that TomoGAN is general enough to provide visible improvement to images reconstructed from noisy projection datasets. We claim that this method has great potential for studying samples with dynamic features because it allows for good-quality reconstructions from experiments that run much faster than conventional experiments, either by collecting fewer projections or by using shorter exposure times per projection.



This work was supported in part by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This research used resources of the Advanced Photon Source, a U.S. Department of Energy (DOE) Office of Science User Facility operated for the DOE Office of Science by Argonne National Laboratory under Contract No. DE-AC02-06CH11357. We thank Daniel M. Pelt at Centrum Wiskunde & Informatica for sharing the simulated x-ray projection datasets.

Author contributions statement

Z.L. and T.B. designed the research. Z.L. built, trained, and tested the deep learning networks. I.F. contributed to the results analysis and presentation. All authors contributed to the analysis, discussion, and writing of the manuscript. All authors reviewed the manuscript.