
Self-Supervised Coordinate Projection Network for Sparse-View Computed Tomography

In the present work, we propose a Self-supervised COordinate Projection nEtwork (SCOPE) to reconstruct an artifact-free CT image from a single SV sinogram by solving the inverse tomography imaging problem. Compared with recent related works that solve similar problems using an implicit neural representation (INR) network, our essential contribution is an effective and simple re-projection strategy that pushes tomography image reconstruction quality beyond that of supervised deep-learning CT reconstruction methods. The proposed strategy is inspired by the simple relationship between linear algebra and inverse problems. To solve the under-determined linear equation system, we first introduce INR to constrain the solution space via an image continuity prior and obtain a rough solution. Second, we generate a dense-view sinogram that improves the rank of the linear equation system and produces a more stable CT image solution space. Our experimental results demonstrate that the re-projection strategy significantly improves image reconstruction quality (at least +3 dB in PSNR). Besides, we integrate the recent hash encoding into our SCOPE model, which greatly accelerates model training. Finally, we evaluate SCOPE on parallel- and fan-beam SVCT reconstruction tasks. Experimental results indicate that the proposed SCOPE model outperforms two latest INR-based methods and two popular supervised DL methods, quantitatively and qualitatively.


I Introduction

X-ray Computed Tomography (CT) is widely applied in clinical diagnosis, industrial non-destructive testing, and safety inspection [34, 2]. In recent years, CT has played a critical role in the auxiliary diagnosis and disease-course monitoring of COVID-19 pneumonia [16]. However, the high radiation exposure caused by frequent longitudinal CT scans may increase the lifetime risk of cancer, especially for patients undergoing disease monitoring via CT scans, such as for pneumonia and cancer [6, 1]. Therefore, reducing the radiation exposure of CT imaging is an urgent need for public health.

| Methods | Key Factors for SVCT | Input | Output | Learning Target | Encoding Strategy |
|---|---|---|---|---|---|
| GRFF [32] | INR Implicit Prior | SV sinogram | CT image | CT image | Fourier encoding |
| IntraTomo [38] | INR Implicit Prior + Explicit Prior | SV sinogram | CT image | CT image | Fourier encoding |
| NeRP [27] | INR Implicit Prior | A prior CT image + SV sinogram | CT image | CT image | Fourier encoding |
| CoIL [31] | DV Sinogram Generation | SV sinogram | DV sinogram | Sinogram | Linear encoding |
| SCOPE (Ours) | INR Implicit Prior + DV Sinogram Generation | SV sinogram | DV sinogram | CT image | Hash encoding |

TABLE I: Comparison of the proposed SCOPE with existing INR-based SVCT reconstruction methods.

Mathematically, the CT acquisition process can be formulated as a linear forward model:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \boldsymbol{\epsilon}, \tag{1}$$

where $\mathbf{y}$ is the measurement data (also known as the sinogram), $\mathbf{x}$ denotes the CT image to be reconstructed, $\mathbf{A}$ represents the CT system forward imaging model (e.g., the Radon transform operator for parallel X-ray beam CT), and $\boldsymbol{\epsilon}$ is the system noise. To reduce the imaging radiation dose, one can decrease the dimension of the measurement data, denoted as $\mathbf{y}_s$, an undersampling of sinogram $\mathbf{y}$. Reconstructing a CT image from the under-sampled sinogram is referred to as Sparse-View (SV) CT reconstruction, a highly ill-posed under-determined inverse imaging problem. Applying analytical reconstruction algorithms, such as Filtered Back-Projection (FBP) [10], to the SV sinogram $\mathbf{y}_s$ results in severe streaking artifacts in the reconstructed CT image.
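To make the forward model concrete, the following minimal sketch simulates Equation 1 for a parallel-beam geometry, with scikit-image's Radon transform playing the role of $\mathbf{A}$; the phantom, image size, and view counts are illustrative assumptions rather than the paper's exact setup.

```python
# A minimal sketch of the forward model of Equation 1 for a parallel-beam
# geometry, using scikit-image's Radon transform as the operator A. The
# phantom, image size, and view counts are illustrative assumptions, not
# the paper's exact setup.
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

x = resize(shepp_logan_phantom(), (256, 256))   # ground-truth image x

theta_dense = np.linspace(0.0, 180.0, 720, endpoint=False)  # dense views
theta_sv = np.linspace(0.0, 180.0, 60, endpoint=False)      # sparse views

y = radon(x, theta=theta_dense)      # full sinogram y = A x (noise omitted)
y_sv = radon(x, theta=theta_sv)      # under-sampled SV sinogram y_s

# FBP on the SV sinogram exhibits the severe streaking artifacts above.
x_fbp_sv = iradon(y_sv, theta=theta_sv, filter_name="ramp")
```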

To eliminate the streaking artifacts, conventional machine learning methods [29, 11, 30, 25] formulate the under-determined inverse imaging as a regularized optimization problem. Explicit image prior assumptions (e.g., minimal Total Variation (TV) [25] for inducing smoothness in the CT image) are adopted as regularization terms to restrict the search space and promote desired, consistent image solutions [15]. Recently, supervised Deep Learning (DL) methods [9, 7, 41, 13, 28, 14, 5] have shown great potential for SVCT reconstruction. Instead of directly solving the inverse imaging problem, a supervised DL reconstruction mostly employs a Convolutional Neural Network (CNN) to learn an end-to-end mapping from low-quality images to their high-quality reconstructions over a large dataset. For example, [9] proposed FBPConvNet, which trains a U-Net [24] to learn the residual from artifact-corrupted inputs to artifact-free outputs. It is known that the performance of supervised DL methods mostly depends on the scale and data distribution of the image pairs in the training dataset (i.e., a large-scale training dataset that covers more types of variation generally provides better performance). However, it is very challenging to build a comprehensive training dataset that includes all factors influencing SVCT images, such as different levels of SV undersampling, different beam types for measurement projection, imaging of different body tissues, and other unconstrained conditions. For real clinical applications, the performance of supervised DL methods might be quite limited; they may even fail in extreme cases (e.g., rare disease-caused image patterns) that are not covered by the training dataset.
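As a hedged illustration of this supervised paradigm, the sketch below trains a generic image-to-image network to predict the artifact residual from an FBP input; `model` is a placeholder for any CNN such as the U-Net used by FBPConvNet, and the training details are assumptions.

```python
# A hedged sketch of the FBPConvNet-style supervised paradigm: a CNN is
# trained on (FBP input, ground truth) pairs to predict the artifact
# residual. `model` is a placeholder image-to-image network (e.g., a
# U-Net); the residual formulation follows the description above.
import torch
import torch.nn.functional as F

def supervised_step(model, optimizer, x_fbp, x_gt):
    """One training step on a batch of (low-quality FBP, GT) image pairs."""
    optimizer.zero_grad()
    residual = model(x_fbp)       # CNN estimates the streaking-artifact pattern
    x_pred = x_fbp - residual     # residual correction of the FBP input
    loss = F.mse_loss(x_pred, x_gt)
    loss.backward()
    optimizer.step()
    return loss.item()
```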

Implicit Neural Representation (INR) has recently been proposed to model and represent 3D scenes from a sparse set of 2D views using coordinate-based deep neural networks in a self-supervised fashion. The core component of INR is a continuous implicit function parameterized by a Multi-Layer Perceptron (MLP). Benefiting from the image continuity prior imposed by the implicit function and the neural network architecture, INR has achieved superior performance in various vision problems (e.g., surface reconstruction [20, 4, 17], view synthesis [18, 39, 22], and image super-resolution [3, 33]).

For SVCT imaging, an early attempt by [32] indicated that INR could be applied to recover the CT image from a single SV sinogram without using any external data. Since then, more INR-based works [31, 27, 38, 23] have emerged. We summarize the recent works that solve the tomography inverse problem using INR-based methods in Table I to compare the design ideas and characteristics of the various methods more clearly. [31] proposed CoIL, which trains an INR to represent the SV sinogram and predicts the corresponding Dense-View (DV) sinogram based on the continuous nature of INR. The CT image reconstruction is then performed by applying FBP [10] to the DV sinogram. However, the coordinate space of the sinogram does not follow the intuitive orthogonal assumption of Fourier spatial encoding in the INR model. Thus, the CT reconstruction performance of CoIL is limited and not comparable with supervised methods. NeRP [27] proposed to utilize a series of longitudinal CT scans of the same subject to build the CT image from the SV sinogram. The INR is first trained in advance on a prior CT scan and then used as an image prior to recover high-quality CT images from the remaining SV sinograms. However, longitudinal CT scans are not always available, which may limit the model's application scenarios. [38] proposed IntraTomo, which consists of a sinogram prediction module and a geometry refinement module that are applied iteratively. The former estimates the CT image from the SV sinogram, while the latter combines explicit priors (TV and Non-local Means) into an optimization framework to refine the CT image. The iterative training strategy improves CT image quality but severely prolongs reconstruction time.

Compared with the works in Table I, the proposed method is most related to [32, 31]. However, two major limitations remain unsolved in those works: (i) the INR estimates the desired CT image by minimizing the loss between the network-predicted sinogram and the real measured sinogram. This paradigm is therefore more effective for sinogram generation than for CT image reconstruction. Due to the highly sparse sinogram, the MLP tends to approach an implicit function that overfits the SV sinogram, which manifests as noisy INR-represented CT images; (ii) due to the heavy computation of the coordinate-based deep MLP, image-specific INR-based CT reconstruction generally performs poorly in time efficiency.

Fig. 1: Workflow of the proposed SCOPE model.

In this paper, we propose a Self-supervised COordinate Projection nEtwork (SCOPE) to reconstruct an artifact-free CT image from a single SV sinogram by solving the inverse tomography imaging problem. Compared with existing related works [32, 38], one of our key contributions is a simple and effective re-projection strategy that significantly improves the reconstruction quality of tomography images. This strategy is inspired by the simple relationship between linear algebra and inverse problems. We consider the SVCT inverse imaging problem as an under-determined system of linear equations. The total number of X-rays involved in the sinogram is equivalent to the number of independent linear equations (i.e., the rank of matrix $\mathbf{A}$ in Equation 1). Thus, the number of free variables in the linear equations largely increases as the rank of matrix $\mathbf{A}$ decreases for an SV sinogram. By introducing INR, the solution space of the image $\mathbf{x}$ is efficiently constrained to a continuous space, yielding a satisfactory CT reconstruction from a highly sparse sinogram. However, this reconstruction is one unstable solution among the infinitely many solutions that satisfy the acquired SV sinogram $\mathbf{y}_s$, and it can be easily affected by network overfitting to the SV sinogram. To achieve a more stable solution, we propose the novel re-projection strategy to build a DV sinogram from this initial CT reconstruction. This process is equivalent to generating a higher-rank linear equation system for the inverse imaging task, which assists in finding a more stable solution with far fewer free variables in the CT image. Our experimental results demonstrate that, through the re-projection strategy, we can further suppress image noise while preserving image details in the resulting CT images, which significantly improves the image reconstruction quality (at least +3 dB in PSNR). In addition, learning high-frequency signals with a simple MLP is practically very difficult due to the spectral bias problem [21, 36]. Existing INR-based methods mostly combine pre-defined encoding modules (e.g., Fourier encoding in [32]) with a deep MLP to learn the implicit function, which results in heavy computational cost. To accelerate model training, we integrate the recent hash encoding [19] into our SCOPE model, enabling a shallow (three-layer) MLP to achieve superior fitting ability within about 1 minute of training. We conduct extensive experiments on two public datasets (AAPM and COVID-19) for model evaluation. Both qualitative and quantitative results indicate that SCOPE outperforms the two most recent INR-based methods (CoIL [31] and GRFF [32]) and two well-known supervised CNN-based models (FBPConvNet [9] and TF U-Net [7]). To the best of our knowledge, the proposed SCOPE is the first self-supervised method that outperforms supervised DL models for SVCT reconstruction. The main contributions of this work are summarized as follows:

  1. We propose SCOPE, which recovers a high-quality CT image from a single SV sinogram without involving any external data.

  2. We propose a simple and effective re-projection reconstruction strategy that significantly improves the resulting CT image quality.

  3. We integrate hash encoding [19] into SCOPE, which greatly accelerates model training and thus improves the model's practicality.

  4. We conduct extensive experiments, and the results indicate that SCOPE outperforms two latest INR-based methods and two well-known supervised DL methods, quantitatively and qualitatively.

II Methodology

Fig. 2: A toy example of different types of sample points in SVCT: Black sample points are scanned by multiple X-rays, whose pixel intensities are well constrained in the inverse imaging problem; Gray sample points are scanned by a single X-ray; White sample points are not scanned by any X-ray. The gray and white points are examples of free variable pixels, whose intensities are not tightly constrained in the inverse problem.

II-A Model Overview

In the proposed SCOPE model, we represent the desired CT image as a continuous function parameterized by a neural network:

$$f_\theta: \mathbf{p} = (x, y) \longrightarrow v, \tag{2}$$

where $\theta$ denotes the trainable parameters (weights and biases) of the network, $\mathbf{p} = (x, y)$ is any 2D spatial coordinate in the imaging plane, and $v$ is the corresponding image intensity at position $\mathbf{p}$ in the image $\mathbf{x}$. Based on the acquired SV sinogram $\mathbf{y}_s$, we then optimize the network to approximate the implicit function $f_\theta$ using the back-propagation gradient descent algorithm to minimize the objective below:

$$\theta^{\star} = \arg\min_{\theta} \mathcal{L}\left(\hat{\mathbf{y}}_s, \mathbf{y}_s\right), \tag{3}$$

where $\hat{\mathbf{y}}_s$ represents the predicted SV sinogram and $\mathcal{L}$ is the loss function that measures the discrepancy between the predicted SV sinogram $\hat{\mathbf{y}}_s$ and the acquired SV sinogram $\mathbf{y}_s$.

The key insight behind Equation 3 is to use the image continuity prior imposed by the implicit function and the neural network architecture to regularize the inverse imaging problem of SVCT and thus obtain the desired solution. After network training, the optimal image is theoretically represented by $f_{\theta^{\star}}$. However, due to the highly under-determined inverse imaging problem, the network tends to approach an implicit function that overfits the SV sinogram and thus fails to approximate the desired implicit function well, which manifests as substantial noise in the resulting CT image.

To this end, we propose a re-projection reconstruction strategy, in which the learned function $f_{\theta^{\star}}$ is used to generate a DV sinogram $\hat{\mathbf{y}}_d$. The final high-quality CT image is then reconstructed by applying FBP [10] to $\hat{\mathbf{y}}_d$. An essential insight is that the INR network's overfitting to the SV sinogram results in unexpected pixel intensity mutations in the reconstructed CT image. Figure 2 illustrates a toy example of the different types of sample points in SV-reconstructed CT. For example, the black sample points are scanned by multiple X-rays and can be considered as constrained by multiple linear equations. Thus, the INR network can accurately recover their image intensities through the constraints of the crossing projections. For the gray and white sample points scanned by only a few, or even no, X-rays, the pixel intensities are not tightly constrained in the inverse problem. Their pixel intensities are mostly approximated by the image continuity prior imposed by the implicit function and are easily affected by overfitting to the sparse measurements of the sinogram. Therefore, the learned function may output pixel intensity mutations at those free-variable positions due to the overfitting problem. Although these mutations manifest similarly to image noise, they do not follow any typical distribution, so the benefit of inserting a common denoising regularization term is limited [38]. The most effective strategy to suppress free-variable mutations is thus to generate a higher-rank linear equation system that tightly constrains the pixel intensities in the CT image and shares the same solution space as the SV sinogram. We therefore propose to generate a DV sinogram from the learned function $f_{\theta^{\star}}$. The workflow of the proposed SCOPE model is shown in Figure 1.
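As a rough numerical companion to the toy example of Figure 2, the sketch below back-projects a sinogram of ones to estimate how much ray coverage each pixel receives under sparse sampling; pixels with low coverage correspond to the loosely constrained free variables discussed above. The detector size and view count are illustrative assumptions.

```python
# Estimate per-pixel ray coverage under sparse-view sampling by
# back-projecting an all-ones sinogram (unfiltered back-projection).
import numpy as np
from skimage.transform import iradon

theta_sv = np.linspace(0.0, 180.0, 60, endpoint=False)   # 60 sparse views
n_det = 363                                              # detector bins (assumed)
ones_sino = np.ones((n_det, len(theta_sv)))
# Higher values = pixels crossed by more rays (Fig. 2 "black" points);
# low values indicate the loosely constrained free-variable pixels.
coverage = iradon(ones_sino, theta=theta_sv, filter_name=None)
```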

II-B Learning Implicit Function

Figure 1 A demonstrates the pipeline of learning the implicit function with a neural network. Given an SV sinogram $\mathbf{y}_s \in \mathbb{R}^{K_s \times N}$, where $K_s$ and $N$ are the number of projection views and the number of X-rays per view respectively, we first build a total of $K_s \times N$ X-rays from the sparse projection views (i.e., $N$ X-rays per view). Next, we feed the spatial coordinates of sample points along the SV X-rays into the implicit function $f_\theta$ to produce the corresponding image intensities. Finally, we compute the predicted projection of each of the X-rays by a summation operator as below:

$$\hat{\mathbf{y}}_s(\alpha, \rho) = \sum_{\mathbf{p} \in r(\alpha, \rho)} f_\theta(\mathbf{p}), \tag{4}$$

where $\alpha$ indexes the sparse projection views, $\rho$ indexes the positions of X-rays in the detector, and $r(\alpha, \rho)$ denotes the set of sample points along the corresponding X-ray.

Since the summation operator (Equation 4) is differentiable, the neural network used to parameterize the implicit function can be optimized by the back-propagation gradient descent algorithm to minimize the loss between the predicted projections and the real projections from the SV sinogram $\mathbf{y}_s$. In this work, we employ the $\ell_2$ norm as the loss function, which is defined as below:

$$\mathcal{L} = \frac{1}{K' \cdot N'} \sum_{i=1}^{K'} \sum_{j=1}^{N'} \left( \hat{\mathbf{y}}_s(\alpha_i, \rho_j) - \mathbf{y}_s(\alpha_i, \rho_j) \right)^2, \tag{5}$$

where $K'$ and $N'$ are respectively the number of sampled projection views and the number of sampled X-rays per view at each training iteration.
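The following PyTorch sketch mirrors one training iteration of Equations 4 and 5 for a parallel-beam geometry; `f_theta` is the coordinate network of Equation 2, while the ray parameterization and the number of sample points per ray are our own illustrative assumptions.

```python
# A PyTorch sketch of one SCOPE-style training iteration (parallel-beam):
# sample K' = 3 views and N' = 10 detector positions, sample S points
# along each ray, query f_theta, and sum intensities into projections.
import torch

def sample_ray_points(angles, offsets, n_samples=256):
    """Coordinates of points along parallel-beam rays inside [-1, 1]^2."""
    t = torch.linspace(-1.0, 1.0, n_samples)                    # along the ray
    dirs = torch.stack([torch.cos(angles), torch.sin(angles)], dim=-1)
    normals = torch.stack([-torch.sin(angles), torch.cos(angles)], dim=-1)
    origins = offsets[None, :, None] * normals[:, None, :]      # (K', N', 2)
    return origins[..., None, :] + t[None, None, :, None] * dirs[:, None, None, :]

def train_step(f_theta, optimizer, sinogram_sv, view_angles, det_offsets):
    vi = torch.randint(len(view_angles), (3,))    # K' = 3 sampled views
    ri = torch.randint(len(det_offsets), (10,))   # N' = 10 sampled rays/view
    pts = sample_ray_points(view_angles[vi], det_offsets[ri])  # (3, 10, S, 2)
    intensities = f_theta(pts.reshape(-1, 2)).reshape(pts.shape[:-1])
    y_pred = intensities.sum(dim=-1)              # summation operator, Eq. (4)
    loss = ((y_pred - sinogram_sv[vi][:, ri]) ** 2).mean()   # L2 loss, Eq. (5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```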

II-C Re-projection Reconstruction

Figure 1 B shows the workflow of the proposed re-projection reconstruction strategy, in which the learned implicit function $f_{\theta^{\star}}$ is used to generate the DV sinogram $\hat{\mathbf{y}}_d$, and the final high-quality CT image is then reconstructed from the DV sinogram. More specifically, we first build $K_d \times N$ X-rays from $K_d$ dense projection views (i.e., $N$ X-rays per view). Then, the spatial coordinates of the sample points along the DV X-rays are fed into the learned function $f_{\theta^{\star}}$ to predict the corresponding image intensities. Similarly, the projections of the X-rays are calculated by the summation operator (Equation 4), and the DV sinogram $\hat{\mathbf{y}}_d$ is thus generated. Inspired by data consistency in accelerated MRI reconstruction [37], we combine the estimated DV sinogram $\hat{\mathbf{y}}_d$ with the acquired SV sinogram $\mathbf{y}_s$ to generate the final DV sinogram. In particular, we replace the projection profiles at the corresponding views in the DV sinogram $\hat{\mathbf{y}}_d$ with those of the SV sinogram $\mathbf{y}_s$. Finally, we apply FBP [10] to the final DV sinogram to reconstruct the artifact-free CT image.
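A hedged sketch of this step follows, assuming a `render_sinogram` helper that evaluates the Equation 4 summation over the dense views (it can be built from `sample_ray_points` above) and returns a NumPy array of shape (detectors, views); the nearest-view matching used for data consistency is an implementation assumption.

```python
# Re-projection sketch: render a dense-view sinogram with the trained
# network, overwrite the predicted profiles at the measured sparse views
# (data consistency), and reconstruct the final image with FBP.
import numpy as np
import torch
from skimage.transform import iradon

@torch.no_grad()
def reproject_reconstruct(f_theta, sinogram_sv, sv_angles_deg, n_dense=720):
    dv_angles = np.linspace(0.0, 180.0, n_dense, endpoint=False)
    dv_sino = render_sinogram(f_theta, dv_angles)   # predicted DV sinogram
    # Data consistency: keep measured profiles at the acquired views
    # (nearest dense-view index, an implementation assumption).
    for k, a in enumerate(sv_angles_deg):
        j = int(np.argmin(np.abs(dv_angles - a)))
        dv_sino[:, j] = sinogram_sv[:, k]
    # Final artifact-free image via FBP on the combined DV sinogram.
    return iradon(dv_sino, theta=dv_angles, filter_name="ramp")
```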

Fig. 3: The architecture of the neural network used for parameterizing the implicit function, which consists of the hash encoding [19] and a three-layer MLP.

II-D Network Architecture

As shown in Figure 3, the network used for learning the implicit function consists of an encoding module (via hash encoding [19]) and a three-layer MLP. The network first maps the input coordinate $\mathbf{p}$ to a feature vector $\gamma_{\Omega}(\mathbf{p})$ and then converts the feature vector to the image intensity $v$. Formally, this process can be expressed as below:

$$v = \mathrm{MLP}_{\Phi}\left(\gamma_{\Omega}(\mathbf{p})\right), \tag{6}$$

where $\Phi$ and $\Omega$ represent respectively the trainable parameters of the MLP and the hash encoding. They are optimized simultaneously to estimate the implicit function.

II-D1 Hash Encoding

The universal approximation theorem [8] proves that a pure MLP can theoretically approximate any complicated function. However, fitting high-frequency signals with a pure MLP is practically very difficult due to the spectral bias problem [21, 36]. To alleviate this issue, many encoding strategies [19, 31, 32, 18] have been proposed to map low-dimensional inputs into high-dimensional feature vectors, which allows the subsequent MLP to capture high-frequency components more easily and thus reduces the approximation error. In SCOPE, we adopt the recent hash encoding. Unlike pre-defined encoding rules (e.g., position encoding [18]), hash encoding assigns a trainable feature to each input coordinate. This adaptive encoding strategy is task-specific, which allows the use of a shallow MLP while achieving powerful fitting ability. For the input coordinate grid, hash encoding first builds $L$ levels of multi-resolution feature maps $\{\mathbf{F}_1, \dots, \mathbf{F}_L\}$. Here $\mathbf{F}_l$ is the feature map at the $l$-th level, where each element is a trainable feature vector of length $F$. Then, each feature map is mapped into a hash table of size $T$ to reduce the memory footprint. After the hash table construction, given an input coordinate $\mathbf{p}$, we compute its feature vector at the $l$-th level via interpolation of the surrounding grid-vertex features. Then, we concatenate the $L$ per-level feature vectors to produce the final feature vector $\gamma_{\Omega}(\mathbf{p})$. More details about the hash encoding can be found in [19]. Table II lists the hyper-parameters of the hash encoding used in our SCOPE model.
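The sketch below is a simplified 2D variant of the multi-resolution hash encoding of [19]; unlike the reference implementation, it hashes every level (where [19] uses dense grids at coarse levels), and the default hyper-parameter values are illustrative assumptions since Table II's exact values are not reproduced here.

```python
# A simplified 2D multi-resolution hash encoding in the spirit of [19]:
# L levels of trainable tables of size T with F features per entry,
# bilinear interpolation of the four surrounding vertex features, and
# concatenation across levels.
import torch
import torch.nn as nn

class HashEncoding2D(nn.Module):
    def __init__(self, L=16, T=2**19, F=2, n_min=16, n_max=512):
        super().__init__()
        growth = (n_max / n_min) ** (1.0 / max(L - 1, 1))
        self.res = [int(n_min * growth**l) for l in range(L)]  # per-level grids
        self.T = T
        self.tables = nn.Parameter(torch.empty(L, T, F).uniform_(-1e-4, 1e-4))

    def forward(self, p):                      # p: (B, 2) coordinates in [0, 1]^2
        feats = []
        for l, n in enumerate(self.res):
            g = p * n                          # level-l grid-space coordinates
            g0 = g.floor().long()              # lower-left grid vertex
            w = g - g0.float()                 # bilinear interpolation weights
            f = 0.0
            for dx in (0, 1):
                for dy in (0, 1):
                    vx, vy = g0[:, 0] + dx, g0[:, 1] + dy
                    idx = (vx ^ (vy * 2654435761)) % self.T   # spatial hash [19]
                    wgt = ((w[:, 0] if dx else 1 - w[:, 0]) *
                           (w[:, 1] if dy else 1 - w[:, 1]))
                    f = f + wgt[:, None] * self.tables[l][idx]
            feats.append(f)
        return torch.cat(feats, dim=-1)        # concatenated feature, (B, L*F)
```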

| Hyper-Parameter | Symbol | Value |
|---|---|---|
| Number of levels | L | |
| Hash table size | T | |
| Number of feature dimensions per entry | F | |
| Coarsest resolution | N_min | |
| Finest resolution | N_max | |

TABLE II: Hyper-parameters of the hash encoding [19] used in SCOPE.

II-D2 Three-Layer MLP

After the hash encoding, the 2D input coordinate $\mathbf{p}$ is encoded into the high-dimensional feature vector $\gamma_{\Omega}(\mathbf{p})$. Then, a three-layer MLP is used to convert the feature vector to the image intensity $v$. The two hidden layers in the MLP have 64 neurons each and are followed by ReLU activations, while the output layer is followed by a Sigmoid activation.
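Putting the two modules together, a minimal sketch of the network of Equation 6, reusing the `HashEncoding2D` sketch above:

```python
# Hash encoding followed by a three-layer MLP: two 64-neuron ReLU hidden
# layers and a Sigmoid output mapping intensities to [0, 1].
import torch.nn as nn

class ScopeNet(nn.Module):
    def __init__(self, L=16, F=2):
        super().__init__()
        self.encoder = HashEncoding2D(L=L, F=F)    # gamma_Omega in Eq. (6)
        self.mlp = nn.Sequential(                  # MLP_Phi in Eq. (6)
            nn.Linear(L * F, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, p):                          # p: (B, 2) coordinates
        return self.mlp(self.encoder(p)).squeeze(-1)
```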

II-E Training Parameters

To train the proposed SCOPE model, at each iteration we first randomly sample 3 views (i.e., $K' = 3$ in Equation 5) from the sparse projection views and then randomly sample 10 X-rays per view (i.e., $N' = 10$ in Equation 5). We adopt the Adam optimizer [12] to minimize the loss function. The learning rate decays by a factor of 0.5 every 500 epochs. The total number of training epochs is 5000, which takes only about 5 minutes on a single NVIDIA RTX 3060 GPU. It is worth noting that all the training parameters above are kept the same across different cases, such as different X-ray beam types and numbers of input views.
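A sketch of this training schedule follows. The 0.5 decay every 500 epochs and the 5000-epoch budget follow the text; the initial learning rate and Adam moment parameters are common defaults and therefore assumptions, and `train_step` with its data tensors is as in the Section II-B sketch.

```python
# Training schedule sketch: Adam with StepLR decay (x0.5 every 500 epochs),
# 5000 epochs in total.
import torch

model = ScopeNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.5)

for epoch in range(5000):
    # One iteration per epoch shown for brevity; each iteration samples
    # 3 views and 10 X-rays per view, as described above.
    loss = train_step(model, optimizer, sinogram_sv, view_angles, det_offsets)
    scheduler.step()
```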

III Experiments

To evaluate the proposed SCOPE model, we perform the following three experiments: (i) we investigate the effectiveness of the re-projection reconstruction strategy; (ii) we validate the effectiveness of the hash encoding [19]; (iii) we compare SCOPE with five other reconstruction methods quantitatively and qualitatively.

| Function | Hyper-parameter | Value |
|---|---|---|
| radon | theta | |
| iradon | theta | |
| | output_size | |
| fanbeam | D | |
| | FanRotationIncrement | |
| | FanSensorSpacing | |
| ifanbeam | D | |
| | FanRotationIncrement | |
| | FanSensorSpacing | |
| | OutputSize | |
  • The values depend on the number of projection views and the size of the raw slice.

TABLE III: Hyper-parameters of the four built-in functions in MATLAB used for data simulation.

III-A Dataset & Pre-processing

III-A1 AAPM Dataset

The AAPM dataset used in our experiments is built from the normal-dose part of the 2016 low-dose CT challenge AAPM dataset (https://www.aapm.org/GrandChallenge/LowDoseCT/), which consists of twelve 3D CT volumes acquired from twelve subjects. Specifically, we extract 1171 2D slices from the 3D CT volumes on the axial view and then split these slices into three parts: 1069 slices from ten subjects for the training set, 98 slices from one subject for the validation set, and 4 slices from one subject for the test set. The training and validation sets are prepared only for optimizing the two supervised CNN-based baselines (FBPConvNet [9] and TF U-Net [7]), while the other methods (FBP [10], CoIL [31], GRFF [32], and our SCOPE) directly recover the corresponding high-quality CT image from a single SV sinogram.

III-A2 COVID-19 Dataset

The COVID-19 dataset [26] is a large-scale CT dataset consisting of 3D CT volumes from 1000+ patients with confirmed COVID-19 infections. One 3D CT volume from the COVID-19 dataset is employed as additional test data. We select 4 axial slices from the volume as 4 test samples.

III-A3 Dataset Simulation

For parallel and fan X-ray beam SVCT reconstruction, we follow the strategies in [9, 7, 28] to simulate pairs of low-quality and high-quality CT images. Specifically, we first generate sinograms of different view counts (720, 120, 90, and 60) by projecting the raw slices using the built-in MATLAB functions radon and fanbeam, respectively. Then, we transfer the sinograms back to CT images using the built-in MATLAB functions iradon and ifanbeam, respectively. Detailed hyper-parameters of the four functions are listed in Table III. The images reconstructed from 720 views are used as Ground Truth (GT), while the images reconstructed from 120, 90, and 60 views are used as input images corresponding to three different sparsity factors (6×, 8×, and 12×). Note that parallel and fan X-ray beam SVCT are considered two independent reconstruction tasks; thus, all training and test processes are conducted separately.
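Since the paper's simulation uses the MATLAB built-ins of Table III, the following is only an approximate Python analogue of the parallel-beam part of the pipeline, using scikit-image.

```python
# Approximate Python analogue of the MATLAB radon/iradon simulation.
import numpy as np
from skimage.transform import radon, iradon

def simulate_pair(raw_slice, n_views):
    """Project a raw slice to n_views, then reconstruct with FBP."""
    theta = np.linspace(0.0, 180.0, n_views, endpoint=False)
    sino = radon(raw_slice, theta=theta)
    recon = iradon(sino, theta=theta, filter_name="ramp",
                   output_size=raw_slice.shape[0])
    return sino, recon

# 720 views give the GT image; 120 / 90 / 60 views give the inputs at
# 6x / 8x / 12x sparsity.
```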

III-B Compared Methods & Evaluation Metrics

III-B1 Compared Methods

We compare the proposed SCOPE model with five SVCT reconstruction methods: (i) FBP [10], a classical analytical reconstruction algorithm; (ii) CoIL [31], an INR-based method. Since the output of CoIL is a DV sinogram, we apply FBP to the generated DV sinogram to reconstruct the CT image; (iii) GRFF [32], an INR-based method with a Gaussian random Fourier feature encoding strategy; (iv) FBPConvNet [9], a supervised DL method based on U-Net [24]; (v) TF U-Net [7], a supervised DL method based on Tight Frame U-Net. We train FBPConvNet and TF U-Net on the training set of the AAPM dataset with the Adam optimizer [12] and a mini-batch size of 8. The learning rate gradually decreases over the training epochs. The total number of training epochs is set to 500, and the best model is saved by checkpoints during the training process. The two INR-based methods (CoIL and GRFF) are implemented following their original papers.

III-B2 Evaluation Metrics

To quantitatively measure the performance of the compared methods, we calculate the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) [35]. They are the two most widely used objective image quality metrics in low-level vision tasks. PSNR is defined based on pixel-by-pixel distance, while SSIM measures structural similarity using the mean and variance of images. Moreover, we also compute LPIPS [40], a DL-based objective perceptual similarity metric.
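A sketch of how the three metrics can be computed for images normalized to [0, 1], using scikit-image for PSNR/SSIM and the lpips package for LPIPS; the AlexNet LPIPS backbone here is an assumption, as the paper does not state it.

```python
# Evaluation-metric sketch: PSNR, SSIM (scikit-image) and LPIPS [40].
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # DL-based perceptual similarity [40]

def evaluate(pred, gt):
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).float()[None, None].repeat(1, 3, 1, 1)
    lp = lpips_fn(to_t(pred) * 2 - 1, to_t(gt) * 2 - 1).item()  # expects [-1, 1]
    return psnr, ssim, lp
```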

Fig. 4: Quantitative results of the proposed SCOPE model with two reconstruction strategies on the COVID-19 dataset for parallel (A) and fan (B) X-ray beam SVCT of 60, 90, and 120 views.

III-C Effectiveness of Re-projection Reconstruction

First, we investigate the effectiveness of the proposed re-projection strategy. After network training, we adopt the following two strategies to recover the final CT image: (i) No Re-projection, where we directly feed all the image coordinates into the MLP to produce the corresponding image intensities; (ii) Re-projection, where we employ the MLP to generate DV sinograms (360, 480, 640, 720, and 1440 views) and then apply the FBP algorithm [10] to each DV sinogram to reconstruct the CT images.

Figure 4 shows the quantitative results on the COVID-19 dataset for parallel and fan X-ray beam SVCT of 60, 90, and 120 views. Overall, the re-projection reconstruction strategy significantly improves performance in all cases. For example, PSNR improves by about 3 dB for fan X-ray beam SVCT reconstruction of 60 views. More importantly, there is a common trend across all cases: the model performance gradually increases as the re-projection views increase from 360 to 720 but slightly decreases as they increase from 720 to 1440. Our explanation is: (i) with fewer than 720 views, the projections are not dense enough; although the highest-frequency intensity mutations are completely removed, sub-high-frequency image details are also partially lost; (ii) with more than 720 views, the projections are over-dense, which results in incomplete removal of the highest-frequency intensity mutations and thus sub-optimal performance. Therefore, we set the number of re-projection views to 720 in this paper, but it is worth noting that this parameter may need to be adjusted for specific cases. Figure 5 demonstrates the qualitative results on a test sample for fan X-ray beam SVCT of 90 views. We observe that the image from the direct reconstruction (i.e., No Re.) contains substantial noise, while the results from our re-projection strategy are clean and closer to the GT images.

Fig. 5: Qualitative results (zoom regions and their absolute error maps) of the proposed SCOPE model with two reconstruction strategies on a test sample from the COVID-19 dataset for fan X-ray beam SVCT of 90 views.

III-D Effectiveness of Hash Encoding

Next, we validate the effectiveness of the hash encoding [19]. The proposed SCOPE model is compared with three different encoding modules: (i) No Encoding, a pure nine-layer MLP without any encoding module; (ii) Position Encoding, a nine-layer MLP with position encoding [18]; (iii) Hash Encoding, a three-layer MLP with hash encoding.

| X-ray Beam | Views | No En. | Pos. En. | Ha. En. |
|---|---|---|---|---|
| Parallel | 60 / 90 / 120 | | | |
| Fan | 60 / 90 / 120 | | | |

TABLE IV: Quantitative results (PSNR/SSIM/LPIPS) of the SCOPE model with three encoding modules on the COVID-19 dataset for parallel and fan X-ray beam SVCT of 60, 90, and 120 views.
Fig. 6: Qualitative results (zoom regions and their absolute error maps) of our SCOPE model with three encoding modules on a test sample (#95) of the COVID-19 dataset for fan X-ray beam SVCT of 90 views. Here the numbers in parentheses denote training epochs.
Fig. 7: Performance curves of SCOPE with three encoding modules over training epochs on a test sample (#95) of the COVID-19 dataset for fan X-ray beam SVCT of 90 views.

Table IV shows the quantitative results on the COVID-19 dataset for parallel and fan X-ray beam SVCT of 60, 90, and 120 views. The results show that, compared with no encoding, both position encoding and hash encoding significantly improve the model performance in terms of all three metrics in all cases. For example, PSNR improves by 13.59 dB (39.56 vs. 25.97) and 15.79 dB (41.76 vs. 25.97), respectively, for fan X-ray beam SVCT reconstruction of 90 views. This is due to the spectral bias problem [21, 36] (i.e., a pure MLP is biased toward learning low-frequency signals during practical training). Thus, encoding modules are critical for improving the MLP's ability to learn high-frequency signals. Besides, we observe that hash encoding slightly outperforms position encoding in most cases. For instance, PSNR improves by 1.66 dB (37.93 vs. 36.27) for fan X-ray beam SVCT reconstruction of 60 views. Figure 6 shows the qualitative results on a test sample (#95) for fan X-ray beam SVCT reconstruction of 90 views. Overall, hash encoding achieves the best image quality and the fastest reconstruction speed. Benefiting from the shallower MLP (3 layers vs. 9), hash encoding takes only about 1 minute to reach the performance that position encoding obtains in about 12 minutes, i.e., roughly a 12× acceleration. We also show the performance curves of the SCOPE model with the three encoding modules over training epochs in Figure 7. Clearly, hash encoding produces the best performance.

| Dataset | Views | FBP | CoIL | GRFF | FBPConvNet | TF U-Net | SCOPE (Ours) |
|---|---|---|---|---|---|---|---|
| AAPM | 60 / 90 / 120 | | | | | | |
| COVID-19 | 60 / 90 / 120 | | | | | | |

TABLE V: Quantitative results (PSNR/SSIM/LPIPS) of different methods on the AAPM and COVID-19 datasets for parallel X-ray beam SVCT of 60, 90, and 120 views.
Fig. 8: Qualitative results (zoom regions and their absolute error maps) of different methods on a test sample (#109) of the AAPM dataset for parallel X-ray beam SVCT of 90 views.
Fig. 9: Qualitative results (zoom regions and their absolute error maps) of different methods on a test sample (#90) of the COVID-19 dataset for parallel X-ray beam SVCT of 90 views.

| Dataset | Views | FBP | CoIL | GRFF | FBPConvNet | TF U-Net | SCOPE (Ours) |
|---|---|---|---|---|---|---|---|
| AAPM | 60 / 90 / 120 | | | | | | |
| COVID-19 | 60 / 90 / 120 | | | | | | |

TABLE VI: Quantitative results (PSNR/SSIM/LPIPS) of different methods on the AAPM and COVID-19 datasets for fan X-ray beam SVCT of 60, 90, and 120 views.
Fig. 10: Qualitative results (zoom regions and their absolute error maps) of different methods on a test sample (#104) of the AAPM dataset for fan X-ray beam SVCT of 90 views.
Fig. 11: Qualitative results (zoom regions and their absolute error maps) of different methods on a test sample (#95) of the COVID-19 dataset for fan X-ray beam SVCT of 90 views.

III-E Comparison with Other Methods

We compare the proposed SCOPE model with the five baselines on the AAPM and COVID-19 datasets for parallel and fan X-ray beam SVCT reconstruction. Since FBPConvNet [9] and TF U-Net [7] are supervised DL methods, we train them on the training set of the AAPM dataset. The other four methods (FBP [10], CoIL [31], GRFF [32], and our SCOPE model) are image-specific, and thus they directly reconstruct the corresponding high-quality CT image from each SV sinogram. Note that parallel and fan X-ray beam SVCT are considered two independent reconstruction tasks, and thus all training and test processes are conducted separately.

III-E1 Parallel X-ray Beam SVCT

Table V shows the quantitative results of the compared methods on the two datasets for parallel X-ray beam SVCT of 60, 90, and 120 views. On the AAPM dataset, our SCOPE produces the best performance in most cases. Compared with the two supervised DL methods (FBPConvNet [9] and TF U-Net [7]), SCOPE obtains minor performance improvements. For instance, PSNR improves by 0.27 dB (42.18 vs. 41.95) and 0.44 dB (42.18 vs. 41.74), respectively, for 90 input views. On the COVID-19 dataset, however, we observe that FBPConvNet and TF U-Net suffer severe performance drops. This is mainly due to the domain shift problem (i.e., the training and test data do not share the same distribution). In comparison, our SCOPE model still produces excellent reconstruction results on the COVID-19 data because it is image-specific. For example, the difference in PSNR between SCOPE and FBPConvNet is up to +3.92 dB (40.57 vs. 36.65) for 90 input views. Figures 8 and 9 show the qualitative results on two test samples (#109 and #90) from the two datasets for parallel X-ray beam SVCT of 90 views. On test sample #109 from the AAPM dataset, neither FBP [10] nor CoIL [31] produces satisfactory results; their reconstructions still include many streaking artifacts. GRFF [32] yields a smooth result that loses some image details. In comparison, FBPConvNet, TF U-Net, and SCOPE all recover desirable images that are hardly distinguishable from the GT image. On test sample #90 from the COVID-19 dataset, the two supervised models obtain sub-optimal results with moderate streaking artifacts, while our SCOPE model still produces a high-quality image that is closest to the GT image.

III-E2 Fan X-ray Beam SVCT

Table VI shows the quantitative results of the compared methods on the two datasets for fan X-ray beam SVCT of 60, 90, and 120 views. We observe that the proposed SCOPE and GRFF [32] respectively produce the best and second-best performance in terms of all three metrics in all cases. For example, on the AAPM dataset with 90 input views, SCOPE and GRFF respectively achieve 40.92 dB and 37.54 dB, while TF U-Net [7] obtains only 32.47 dB in terms of PSNR. It is somewhat unexpected that FBPConvNet [9] and TF U-Net cannot produce satisfactory performance on the AAPM dataset even though they are trained on it. We conjecture that, for learning an end-to-end mapping as in the supervised DL methods, fan X-ray beam CT is a more difficult task than parallel X-ray beam CT given the same number of input views. In our experiments, for sinograms of the same number of projection views, the results of fan X-ray beam CT include more severe streaking artifacts than those of parallel X-ray beam CT after applying the FBP algorithm [10], while FBPConvNet [9] and TF U-Net [7] directly learn the inverse mapping from artifact-corrupted inputs to artifact-free outputs. Therefore, they cannot be expected to perform as well for fan X-ray beam CT as for parallel X-ray beam CT. In contrast, GRFF [32] and SCOPE train a neural network to learn the implicit function of the unknown CT image by computing the loss on the SV sinogram (i.e., they do not manipulate image information directly). Thus, they both work well for different types of X-ray beams. Figures 10 and 11 show the qualitative results on two test samples (#104 and #95) from the two datasets for fan X-ray beam SVCT reconstruction of 90 views. We see that the four compared baselines fail to recover good results. The results from the FBP algorithm [10] and CoIL [31] include severe streaking artifacts, while FBPConvNet [9] and TF U-Net [7] produce overly smooth results. GRFF [32] obtains the second-best results, which lose some image details. Only the proposed SCOPE greatly removes streaking artifacts while preserving fine image details.

IV Conclusion

In this work, we propose SCOPE, a self-supervised INR-based method for SVCT reconstruction. Like previous INR works [32], SCOPE represents the desired CT image as an implicit continuous function and trains a neural network to learn this function by minimizing the prediction error on the acquired SV sinogram. Benefiting from the image continuity prior imposed by the implicit function and the neural network architecture, the function can be estimated. However, the solution is not optimal due to the overfitting problem. To this end, we propose a simple and effective re-projection strategy that greatly improves the resulting CT image quality. Besides, we adopt the recent hash encoding [19] in SCOPE, which greatly accelerates model training. Experimental results on two publicly available datasets indicate that the proposed SCOPE model is not only superior to two latest INR-based methods but also outperforms two well-known supervised CNN-based methods, qualitatively and quantitatively.

References

  • [1] D. J. Brenner and E. J. Hall (2007) Computed tomography—an increasing source of radiation exposure. New England journal of medicine 357 (22), pp. 2277–2284. Cited by: §I.
  • [2] H. Chen, Y. Zhang, M. K. Kalra, F. Lin, Y. Chen, P. Liao, J. Zhou, and G. Wang (2017) Low-dose ct with a residual encoder-decoder convolutional neural network. IEEE transactions on medical imaging 36 (12), pp. 2524–2535. Cited by: §I.
  • [3] Y. Chen, S. Liu, and X. Wang (2021) Learning continuous image representation with local implicit image function. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8628–8638. Cited by: §I.
  • [4] Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5939–5948. Cited by: §I.
  • [5] Q. Ding, H. Ji, H. Gao, and X. Zhang (2021) Learnable multi-scale fourier interpolation for sparse view ct image reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 286–295. Cited by: §I.
  • [6] E. Hall and D. Brenner (2008) Cancer risks from diagnostic radiology. The British journal of radiology 81 (965), pp. 362–378. Cited by: §I.
  • [7] Y. Han and J. C. Ye (2018) Framing u-net via deep convolutional framelets: application to sparse-view ct. IEEE transactions on medical imaging 37 (6), pp. 1418–1429. Cited by: §I, §I, §III-A1, §III-A3, §III-B1, §III-E1, §III-E2, §III-E.
  • [8] K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural networks 2 (5), pp. 359–366. Cited by: §II-D1.
  • [9] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser (2017) Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing 26 (9), pp. 4509–4522. Cited by: §I, §I, §III-A1, §III-A3, §III-B1, §III-E1, §III-E2, §III-E.
  • [10] A. C. Kak and M. Slaney (2001) Principles of computerized tomographic imaging. SIAM. Cited by: §I, §I, §II-A, §II-C, §III-A1, §III-B1, §III-C, §III-E1, §III-E2, §III-E.
  • [11] K. Kim, J. C. Ye, W. Worstell, J. Ouyang, Y. Rakvongthai, G. El Fakhri, and Q. Li (2014) Sparse-view spectral ct reconstruction using spectral patch-based low-rank penalty. IEEE transactions on medical imaging 34 (3), pp. 748–760. Cited by: §I.
  • [12] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §II-E, §III-B1.
  • [13] H. Lee, J. Lee, H. Kim, B. Cho, and S. Cho (2018) Deep-neural-network-based sinogram synthesis for sparse-view ct image reconstruction. IEEE Transactions on Radiation and Plasma Medical Sciences 3 (2), pp. 109–119. Cited by: §I.
  • [14] Y. Li, K. Li, C. Zhang, J. Montoya, and G. Chen (2019) Learning to reconstruct computed tomography images directly from sinogram data under a variety of data acquisition conditions. IEEE transactions on medical imaging 38 (10), pp. 2469–2481. Cited by: §I.
  • [15] R. Liu, Y. Sun, J. Zhu, L. Tian, and U. Kamilov (2021) Zero-shot learning of continuous 3d refractive index maps from discrete intensity-only measurements. arXiv preprint arXiv:2112.00002. Cited by: §I.
  • [16] C. Long, H. Xu, Q. Shen, X. Zhang, B. Fan, C. Wang, B. Zeng, Z. Li, X. Li, and H. Li (2020) Diagnosis of the coronavirus disease (covid-19): rrt-pcr or ct?. European journal of radiology 126, pp. 108961. Cited by: §I.
  • [17] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4460–4470. Cited by: §I.
  • [18] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp. 405–421. Cited by: §I, §II-D1, §III-D.
  • [19] T. Müller, A. Evans, C. Schied, and A. Keller (2022) Instant neural graphics primitives with a multiresolution hash encoding. arXiv preprint arXiv:2201.05989. Cited by: item 3, §I, Fig. 3, §II-D1, §II-D, TABLE II, §III-D, §III, §IV.
  • [20] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 165–174. Cited by: §I.
  • [21] N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310. Cited by: §I, §II-D1, §III-D.
  • [22] D. Rebain, W. Jiang, S. Yazdani, K. Li, K. M. Yi, and A. Tagliasacchi (2021) Derf: decomposed radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14153–14161. Cited by: §I.
  • [23] A. W. Reed, H. Kim, R. Anirudh, K. A. Mohan, K. Champley, J. Kang, and S. Jayasuriya (2021) Dynamic ct reconstruction from limited views with implicit neural representations and parametric motion fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2258–2268. Cited by: §I.
  • [24] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I, §III-B1.
  • [25] L. I. Rudin, S. Osher, and E. Fatemi (1992) Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena 60 (1-4), pp. 259–268. Cited by: §I.
  • [26] S. Shakouri, M. A. Bakhshali, P. Layegh, B. Kiani, F. Masoumi, S. Ataei Nakhaei, and S. M. Mostafavi (2021) COVID19-ct-dataset: an open-access chest ct image repository of 1000+ patients with confirmed covid-19 diagnosis. BMC Research Notes 14 (1), pp. 1–3. Cited by: §III-A2.
  • [27] L. Shen, J. Pauly, and L. Xing (2022) NeRP: implicit neural representation learning with prior embedding for sparsely sampled image reconstruction. IEEE Transactions on Neural Networks and Learning Systems (), pp. 1–13. External Links: Document Cited by: TABLE I, §I.
  • [28] T. Shen, X. Li, Z. Zhong, J. Wu, and Z. Lin (2019) R-net: recurrent and recursive network for sparse-view ct artifacts removal. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 319–327. Cited by: §I, §III-A3.
  • [29] E. Y. Sidky, C. Kao, and X. Pan (2006) Accurate image reconstruction from few-views and limited-angle data in divergent-beam ct. Journal of X-ray Science and Technology 14 (2), pp. 119–139. Cited by: §I.
  • [30] E. Y. Sidky and X. Pan (2008) Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization. Physics in Medicine & Biology 53 (17), pp. 4777. Cited by: §I.
  • [31] Y. Sun, J. Liu, M. Xie, B. Wohlberg, and U. S. Kamilov (2021) CoIL: coordinate-based internal learning for tomographic imaging. IEEE Transactions on Computational Imaging 7 (), pp. 1400–1412. External Links: Document Cited by: TABLE I, §I, §I, §II-D1, §III-A1, §III-B1, §III-E1, §III-E2, §III-E.
  • [32] M. Tancik, P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems 33, pp. 7537–7547. Cited by: TABLE I, §I, §I, §I, §II-D1, §III-A1, §III-B1, §III-E1, §III-E2, §III-E, §IV.
  • [33] J. Tang, X. Chen, and G. Zeng (2021) Joint implicit image function for guided depth super-resolution. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4390–4399. Cited by: §I.
  • [34] G. Wang, H. Yu, and B. De Man (2008) An outlook on x-ray ct research and development. Medical physics 35 (3), pp. 1051–1064. Cited by: §I.
  • [35] Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. External Links: Document Cited by: §III-B2.
  • [36] Z. J. Xu, Y. Zhang, T. Luo, Y. Xiao, and Z. Ma (2019) Frequency principle: fourier analysis sheds light on deep neural networks. arXiv preprint arXiv:1901.06523. Cited by: §I, §II-D1, §III-D.
  • [37] B. Yaman, S. A. H. Hosseini, S. Moeller, J. Ellermann, K. Uğurbil, and M. Akçakaya (2020) Self-supervised learning of physics-guided reconstruction neural networks without fully sampled reference data. Magnetic resonance in medicine 84 (6), pp. 3172–3191. Cited by: §II-C.
  • [38] G. Zang, R. Idoughi, R. Li, P. Wonka, and W. Heidrich (2021) IntraTomo: self-supervised learning-based tomography via sinogram synthesis and prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1960–1970. Cited by: TABLE I, §I, §I, §II-A.
  • [39] K. Zhang, G. Riegler, N. Snavely, and V. Koltun (2020) Nerf++: analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492. Cited by: §I.
  • [40] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In CVPR. Cited by: §III-B2.
  • [41] Z. Zhang, X. Liang, X. Dong, Y. Xie, and G. Cao (2018) A sparse-view ct reconstruction method based on combination of densenet and deconvolution. IEEE transactions on medical imaging 37 (6), pp. 1407–1417. Cited by: §I.