Deep learning-based synthetic-CT generation in radiotherapy and PET: a review

02/04/2021
by   Maria Francesca Spadea, et al.

Recently, deep learning (DL)-based methods for the generation of synthetic computed tomography (sCT) have received significant research attention as an alternative to classical ones. We present here a systematic review of these methods by grouping them into three categories, according to their clinical applications: I) to replace CT in magnetic resonance (MR)-based treatment planning, II) facilitate cone-beam computed tomography (CBCT)-based image-guided adaptive radiotherapy, and III) derive attenuation maps for the correction of Positron Emission Tomography (PET). Appropriate database searching was performed on journal articles published between January 2014 and December 2020. The DL methods' key characteristics were extracted from each eligible study, and a comprehensive comparison among network architectures and metrics was reported. A detailed review of each category was given, highlighting essential contributions, identifying specific challenges, and summarising the achievements. Lastly, the statistics of all the cited works from various aspects were analysed, revealing the popularity and future trends, and the potential of DL-based sCT generation. The current status of DL-based sCT generation was evaluated, assessing the clinical readiness of the presented methods.


Code Repositories

overview_sct

Repository to collect all the references on generation of synthetic computed tomography (sCT) with deep learning/convolutional networks. Generated from Spadea MF, Maspero M et al Med. Phys. 2021



I. Introduction

The impact of medical imaging in the diagnosis and therapy of oncological patients has grown significantly over the last decades Husband2016. Especially in radiotherapy (RT) Beaton2019, imaging plays a crucial role in the entire workflow, from treatment simulation to patient positioning and monitoring Verellen2007, jaffray2012image, seco2015imaging, IAEA2017.
Traditionally, computed tomography (CT) is considered the primary imaging modality in RT, since it provides accurate, high-resolution patient geometry and enables the direct electron density conversion needed for dose calculations Seco2006. X-ray-based imaging, including planar imaging and cone-beam computed tomography (CBCT), is widely adopted for patient positioning and monitoring before, during or after dose delivery jaffray2012image. Along with CT, functional and metabolic information, mainly derived from positron emission tomography (PET), is commonly acquired, allowing tumour staging and improving tumour contouring Unterrainer2020pet. Magnetic resonance imaging (MRI) has also proved its added value for the delineation of tumours and organs-at-risk (OAR) thanks to its superb soft tissue contrast Dirix2014, Schmidt2015.
To benefit from the complementary advantages offered by different imaging modalities, MRI is generally registered to CT Devic2012. However, residual misregistration and differences in patient set-up may introduce systematic errors that would affect the accuracy of the whole treatment Nyholm2009, Ulin2010.
Recently, MR-only based RT has been proposed Fraas1987, Lee2003, Nyholm_counter to eliminate residual registration errors. Furthermore, it can simplify and speed up the workflow and decrease the patient's exposure to ionising radiation, which is particularly relevant for repeated simulations Kapanen2013 or fragile populations, e.g. children. Also, MR-only RT may reduce overall treatment costs Owrangi2018 and workload Karlsson2009. Additionally, the development of MR-only techniques can be beneficial for MRI-guided RT Lagendijk2014.

The main obstacle regarding the introduction of MR-only radiotherapy is the lack of tissue attenuation information, required for accurate dose calculations Nyholm2009, Jonsson2010. Many methods have been proposed to convert MR to CT-equivalent representations, often known as synthetic CT (sCT), for treatment planning and dose calculation. These approaches are summarised in two specific reviews on this topic Edmund2017, Johnstone2018 or in broader reviews about MR-only radiotherapy and proton therapy Owrangi2018, Wafa2018, Kerkmeijer2018, Bird2019, Hoffmann2020.

Additionally, similar techniques to derive sCT from a different imaging modality have been envisioned to improve the quality of CBCT taasti2020developments. Cone-beam computed tomography plays a vital role in image-guided adaptive radiation therapy (IGART), for both photon and proton therapy. However, due to severe scatter noise and truncated projections, image reconstruction is affected by several artefacts, such as shading, streaking and cupping ZhuCTBCTnoise, ZhuCTBCTscatter. For this reason, daily CBCT has not commonly been used for online plan adaptation. The conversion of CBCT to CT would allow accurate dose computation and improve the quality of IGART provided to the patients.

Finally, sCT estimation is also crucial for PET attenuation correction. Accurate PET quantification requires a reliable photon attenuation correction (AC) map, usually derived from CT. In the new PET/MRI hybrid scanners, this step is not immediate, and MRI to sCT translation has been proposed to solve the MR attenuation correction (MRAC) issue. Besides, standalone PET scanners can benefit from the derivation of sCT from uncorrected PET mehranian2016vision, mecheter2020mr, Catana2020.

In recent years, the derivation of sCT from MR, PET or CBCT has attracted increasing interest, driven by artificial intelligence algorithms such as machine learning and deep learning (DL) lecun2015deep. This paper aims to perform a systematic review and summarise the latest developments, challenges and trends in DL-based sCT generation methods. Deep learning is a branch of machine learning, itself a field of artificial intelligence, that uses neural networks to generate hierarchical representations of the input data and learn a specific task without the need for hand-engineered features Goodfellow2016. Recent reviews have discussed the application of deep learning in radiotherapy Meyer2018, Sahiner2018, Boon2018, Wang2019rev, Boldrini2019, Jarrett2019, Kiser2019, and in PET attenuation correction Catana2020. Convolutional neural networks (CNNs), which are the most successful type of models for image processing Krizhevsky2012, Litjens2017, have been proposed for sCT generation since 2016 nie2016estimating, with a rapidly increasing number of published papers on the topic. However, DL-based sCT generation has not been reviewed in detail, except for applications in PET lee2020review. With this survey, we aim to summarise the latest developments in DL-based sCT generation, highlighting the contributions based on the applications and providing detailed statistics on trends in imaging protocols, DL architectures, and performance achieved. Finally, the clinical readiness of the reviewed methods will be discussed.

II. Material and Methods

A systematic review of techniques was carried out using the PRISMA guidelines. PubMed, Scopus and Web of Science databases were searched from January 2014 to December 2020 using defined criteria (for more details see the Appendix). Studies related to radiation therapy, with either photons or protons, and to attenuation correction for PET were included when dealing with sCT generation from MRI, CBCT or PET. This review considered external beam radiation therapy, therefore excluding investigations focusing on brachytherapy. Conversion methods based on basic machine learning techniques were not considered in this review, which covers only deep learning-based approaches. Also, the generation of dual-energy CT was not considered, nor was the direct estimation of corrected attenuation maps from PET. Finally, conference proceedings were excluded: proceedings can contain valid methodologies; however, the large number of relevant abstracts and the incomplete reporting of information were considered not suitable for this review. After the database search, duplicated articles were removed and records were screened for eligibility. A citation search of the identified articles was performed.

Each included study was assigned to a clinical application category. The selected categories were:

  I. MR-only RT;

  II. CBCT-to-CT for image-guided (adaptive) radiotherapy;

  III. PET attenuation correction.

For each category, an overview of the methods was constructed in the form of tables (the tables presented in this review have been made publicly accessible at https://matteomaspero.github.io/overview_sct). The tables capture salient information of DL-based sCT generation approaches, as schematically depicted in Figure 1.

Figure 1: Schematic of deep learning-based sCT generation study. The input images/volumes, either being MRI (green), CBCT (yellow) or PET (red) are converted by a convolutional neural network (CNN) into sCT. The CNN is trained to generate sCT similar to the target CT (blue). Several choices can be made in terms of network architecture, configuration, data pairing. After the sCT generation, the output image/volume is evaluated with image- and task-specific metrics.

Independent of the input image (MRI, CBCT or PET), the chosen architecture (CNN) can be trained with paired or unpaired input data and different configurations. In this review, we define the following configurations: 2D (single slice, 2D, or patch, 2Dp) when training was performed considering transverse (tra), sagittal (sag) or coronal (cor) images; 2D+ when independently trained 2D networks in different views were combined during or after inference; multi-2D (m2D, also known as multi-plane) when slices from different views, e.g. transverse, sagittal and coronal, were provided to the same network; 2.5D when training was performed with neighbouring slices provided to multiple input channels of one network; 3D when volumes were considered as input (the whole volume, 3D, or patches, 3Dp). The architectures generally considered are introduced in the next section (II.A.). The sCTs are generated by running inference with the trained network, or with an ensemble (ens) of trained networks, on an independent test set. Finally, the quality of the sCT can be evaluated with image-based or task-specific metrics (II.B.).
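The configurations defined above differ mainly in how the patient volume is sampled before it reaches the network. The following is a minimal NumPy sketch of that sampling, not code from any reviewed study. The (z, y, x) axis order, the function names and the edge handling by clipping are all assumptions made for illustration.

```python
import numpy as np

# Assumed axis order of the volume: (z, y, x), so that axis 0 indexes
# transverse slices, axis 1 coronal slices and axis 2 sagittal slices.
VIEW_TO_AXIS = {"tra": 0, "cor": 1, "sag": 2}


def sample_2d(volume, index, view="tra"):
    """2D configuration: a single slice in the requested view."""
    return np.take(volume, index, axis=VIEW_TO_AXIS[view])


def sample_25d(volume, index, n_neighbours=1):
    """2.5D configuration: neighbouring transverse slices stacked as channels.

    Indices outside the volume are clipped to the first/last slice
    (an illustrative choice, other padding strategies are possible).
    """
    z = np.arange(index - n_neighbours, index + n_neighbours + 1)
    z = np.clip(z, 0, volume.shape[0] - 1)
    return volume[z]  # shape: (2 * n_neighbours + 1, y, x)


def sample_3d_patch(volume, corner, size):
    """3Dp configuration: a sub-volume starting at `corner` with shape `size`."""
    z, y, x = corner
    dz, dy, dx = size
    return volume[z:z + dz, y:y + dy, x:x + dx]


# Hypothetical volume standing in for an MRI, CBCT or PET scan.
volume = np.random.default_rng(0).normal(size=(40, 64, 48))
tra_slice = sample_2d(volume, 10, "tra")
sag_slice = sample_2d(volume, 10, "sag")
stack = sample_25d(volume, 0, n_neighbours=1)
patch = sample_3d_patch(volume, corner=(4, 8, 8), size=(16, 32, 32))
```

In this sketch a multi-2D (m2D) network would receive the outputs of `sample_2d` for several views, whereas 2D+ would train one network per view and merge their predictions afterwards.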

For each sCT generation category, we compiled tables providing a summary of the published techniques, including the key findings of each study and other pertinent factors, indicated here: the anatomic site investigated; the number of patients included; relevant information about the imaging protocol; the DL architecture; the configuration chosen to sample the patient volume (2D, 2D+, m2D, 2.5D or 3D); the use of paired or unpaired data during network training; the radiation treatment adopted, where appropriate; and the most popular metrics used to evaluate the quality of sCT (see II.B.).

The year of publication for each category was noted according to the date of first online appearance. Statistics in terms of popularity of the mentioned fields were calculated with pie charts for each category. Specifically, we subdivided the papers according to the anatomical region they dealt with: abdomen, brain, head & neck (H&N), thorax, pelvis and whole body; where available, tumour site was also reported. A discussion of the clinical feasibility of each methodology and observed trends follows.

The most common network architectures and metrics are introduced in the next sections to facilitate the interpretation of the tables.

II.A. Deep learning for image synthesis

Medical image synthesis can be formulated as an image-to-image translation problem, where a model that maps an input image (A) to a target image (B) has to be found Yu2020. Among all the possible strategies, DL methods have dramatically improved the state of the art Wang2020review.

Figure 2: Deep learning architectures used for image-to-image translation. In the most straightforward configurations (CNN and U-Net, top left and right, respectively), a single loss function between input and output images is computed. GANs (bottom) use more than one CNN and loss to train the generator (G). Cycle-GANs enable unsupervised learning by employing multiple GANs and cycle-consistency losses.

DL approaches mostly used to synthesise sCT belong to the class of CNNs, where convolutional filters are combined through weights (also called parameters) learned during training. The depth is provided by using multiple layers of filters Lecun2015. The training is regulated by finding the "optimal" model parameters according to the search criterion defined by a loss function. Many CNN-based architectures have been proposed for image synthesis, with the most popular being U-nets ronneberger2015u and generative adversarial networks (GANs) goodfellow2014generative (see Figure 2). U-net presents an encoding and a decoding path with additional skip connections to extract and reconstruct image features, thus learning to go from A to B. In the simplest GAN architecture, two networks compete: a generator (G) that is trained to obtain synthetic images (synthetic B) similar to the input set, and a discriminator (D) that is trained to classify whether a synthetic B is real or fake, improving G's performance. GANs learn a loss that combines both tasks, resulting in realistic images Isola2017. Given these premises, many variants of GANs can be arranged, with U-net being employed as a possible generator in the GAN framework. We will not detail all possible configurations since this is outside the scope of this review, and we refer the interested reader to wu2017survey, creswell2018generative, yi2019generative. A particular derivation of GAN, called cycle-consistent GAN (cycle-GAN), is worth mentioning. Cycle-GANs opened the era of unpaired image-to-image translation zhu2017unpaired. Here, two GANs are adopted with their related loss terms: one going from A to B, called the forward pass (forw), and the second going from B to A, called the backward pass (back) (Figure 2 bottom right). Two consistency losses are introduced, aiming at minimising the difference between A and its cycle-reconstructed version and between B and its cycle-reconstructed version, enabling unpaired training.
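To make the cycle-consistency idea concrete, the sketch below computes the two consistency terms for a pair of generators. It is a toy illustration under stated assumptions: the generators are plain Python callables rather than trained networks, the L1 (mean absolute) difference is used as in the original cycle-GAN formulation zhu2017unpaired, and the adversarial terms are omitted. The names `g_forw`, `g_back` and the toy generators are hypothetical.

```python
import numpy as np


def cycle_consistency_loss(a, b, g_forw, g_back):
    """Sum of the two cycle-consistency terms (L1 norm).

    a, b   : unpaired image batches from domains A and B
    g_forw : callable mapping A -> B (forward generator)
    g_back : callable mapping B -> A (backward generator)
    """
    a_cycled = g_back(g_forw(a))  # A -> synthetic B -> reconstructed A
    b_cycled = g_forw(g_back(b))  # B -> synthetic A -> reconstructed B
    loss_a = np.mean(np.abs(a - a_cycled))
    loss_b = np.mean(np.abs(b - b_cycled))
    return float(loss_a + loss_b)


# Toy generators (assumptions, standing in for trained CNNs).
def toy_forw(x):
    return 2.0 * x + 1.0


def toy_back_exact(x):
    return (x - 1.0) / 2.0  # exact inverse of toy_forw


def toy_back_biased(x):
    return x / 2.0  # imperfect inverse, leaves a residual offset


rng = np.random.default_rng(0)
a = rng.normal(size=(4, 16, 16))
b = rng.normal(size=(4, 16, 16))  # unpaired: no voxel-wise link to `a` needed

loss_exact = cycle_consistency_loss(a, b, toy_forw, toy_back_exact)
loss_biased = cycle_consistency_loss(a, b, toy_forw, toy_back_biased)
```

Because each term compares an image only with its own reconstruction, no voxel-wise correspondence between A and B is required, which is what makes unpaired training possible.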

II.B. Metrics

An overview of the metrics used to assess and compare the reviewed publications’ performances is summarised in Table 1.

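Category / Metric

Image similarity
  MAE = \frac{1}{n}\sum_{i=1}^{n} |CT_i - sCT_i|, with n = voxel number in ROI
  PSNR = 10 \log_{10}(L^2 / MSE)
  SSIM = \frac{(2\mu_{sCT}\mu_{CT} + c_1)(2\sigma_{sCT,CT} + c_2)}{(\mu_{sCT}^2 + \mu_{CT}^2 + c_1)(\sigma_{sCT}^2 + \sigma_{CT}^2 + c_2)}, with \mu = mean, \sigma = variance/covariance, L = dynamic range, and c_1 = (k_1 L)^2, c_2 = (k_2 L)^2

Geometry accuracy
  DSC = \frac{2\,|Seg_{CT} \cap Seg_{sCT}|}{|Seg_{CT}| + |Seg_{sCT}|}

Task specific: MR-only, CBCT-to-CT
  DD = difference between D_{sCT} and D_{CT}, with D = dose
  DPR = % of voxels with DD < x% in a ROI
  GPR = % of voxels with \gamma < 1 in a ROI
  DVH = difference of specific points in dose-volume histogram plot

Task specific: PET reconstruction
  absolute and relative error of the PET reconstruction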

Table 1: Overview of the most popular metrics reported in the literature, subdivided into image similarity, geometric accuracy and task-specific metrics, and their category.

Image similarity The most straightforward way to evaluate the quality of the sCT is to calculate the similarity of the sCT to the ground truth/target CT. The calculation of voxel-based image similarity metrics implies that sCT and CT are aligned by translation, rigid (rig), affine (aff) or deformable (def) registrations. The most common similarity metrics are reported in Table 1 and include: mean absolute error (MAE), sometimes referred to as mean absolute prediction error, peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM). Other less common metrics are the cross-correlation (CC) and normalised cross-correlation (NCC), along with the (root) mean squared error ((R)MSE).
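As a worked illustration of the voxel-based metrics, the sketch below computes MAE, RMSE and PSNR between an aligned CT and sCT with NumPy, using the standard textbook definitions. The optional ROI mask, the function names and the choice of dynamic range passed to PSNR are assumptions; the reviewed studies differ in these choices.

```python
import numpy as np


def _select(ct, sct, mask):
    """Return CT and sCT voxels inside the ROI (whole volume if mask is None)."""
    if mask is None:
        return ct.ravel(), sct.ravel()
    return ct[mask], sct[mask]


def mae(ct, sct, mask=None):
    """Mean absolute error in HU over the ROI."""
    c, s = _select(ct, sct, mask)
    return float(np.mean(np.abs(c - s)))


def rmse(ct, sct, mask=None):
    """Root mean squared error in HU over the ROI."""
    c, s = _select(ct, sct, mask)
    return float(np.sqrt(np.mean((c - s) ** 2)))


def psnr(ct, sct, dynamic_range, mask=None):
    """Peak signal-to-noise ratio in dB for a given dynamic range (assumed input)."""
    c, s = _select(ct, sct, mask)
    mse = np.mean((c - s) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(dynamic_range ** 2 / mse))


# Toy example: the sCT is off by a constant 10 HU everywhere.
ct = np.zeros((4, 4, 4))
sct = ct + 10.0
body = np.zeros_like(ct, dtype=bool)
body[1:3] = True  # hypothetical ROI mask

mae_all = mae(ct, sct)
mae_roi = mae(ct, sct, mask=body)
rmse_all = rmse(ct, sct)
psnr_all = psnr(ct, sct, dynamic_range=1000.0)
```

With a constant 10 HU offset and an assumed dynamic range of 1000 HU, the MSE is 100 HU² and the PSNR is 10·log10(10⁶/100) = 40 dB, which shows how strongly PSNR depends on the dynamic range that is chosen.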

Geometric accuracy Along with voxel-based metrics, the geometric accuracy of the generated sCT can also be assessed; in this context, using binary masks can facilitate such a task. For example, the dice similarity coefficient (DSC) is a widespread metric that assesses the accuracy of depicting specific tissue classes/structures, e.g. bones, fat, muscle, air and body. In this context, DSC is calculated after having applied a threshold to CT and sCT and, if necessary, morphological operations on the binary masks. Other image-based metrics can be subdivided according to the application, and will be presented in the appropriate sub-category in the following sections.
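The thresholding-then-overlap procedure described above can be sketched as follows. The 200 HU bone threshold is purely an illustrative assumption (the threshold varies among studies), and morphological clean-up of the masks is omitted.

```python
import numpy as np


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denominator = a.sum() + b.sum()
    if denominator == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return float(2.0 * np.logical_and(a, b).sum() / denominator)


BONE_THRESHOLD_HU = 200.0  # assumed value, for illustration only

# Toy intensities in HU for four voxels.
ct = np.array([0.0, 300.0, 500.0, 100.0])
sct = np.array([0.0, 250.0, 150.0, 400.0])

bone_ct = ct > BONE_THRESHOLD_HU    # [False, True, True, False]
bone_sct = sct > BONE_THRESHOLD_HU  # [False, True, False, True]
dsc_bone = dice(bone_ct, bone_sct)
```

Here one voxel is classified as bone in both images and each mask contains two bone voxels, so DSC = 2·1/(2+2) = 0.5.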

Task-specific metrics Additionally, task-specific metrics can be considered. For example, in the case of MR-only RT and CBCT-to-CT for adaptive RT, the accuracy of dose calculation on sCT is generally compared to CT-based dose through dose difference (DD), dose pass rate (DPR), gamma analysis Low2010 via gamma pass rate (GPR) and, in the case of proton RT, range shift (RS) analysis paganetti2012range. Also, the differences among clinically relevant dose-volume histogram (DVH) points are often reported. Dose calculations are performed for either photon (x) or proton (p) RT. For sCT for PET attenuation correction, the absolute and relative errors of the PET reconstruction are generally reported along with the difference in standard uptake values (SUV).
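The two simplest dose metrics, DD and DPR, can be sketched as below. Several choices here are assumptions rather than conventions taken from the reviewed papers: the sign of DD (sCT minus CT), its normalisation to a reference dose such as the prescription, and the definition of the ROI as voxels receiving more than a fraction of that reference dose. Gamma analysis is deliberately left out because it also requires a spatial search.

```python
import numpy as np


def dose_difference_percent(dose_ct, dose_sct, reference_dose):
    """Voxel-wise DD as a percentage of a reference dose (assumed normalisation)."""
    return 100.0 * (dose_sct - dose_ct) / reference_dose


def dose_pass_rate(dose_ct, dose_sct, reference_dose,
                   tolerance_percent=1.0, roi_fraction=0.1):
    """Percentage of ROI voxels with |DD| below the tolerance.

    The ROI is taken as voxels where the CT-based dose exceeds
    `roi_fraction` of the reference dose (an assumed dose threshold).
    """
    dd = dose_difference_percent(dose_ct, dose_sct, reference_dose)
    roi = dose_ct > roi_fraction * reference_dose
    return float(100.0 * np.mean(np.abs(dd[roi]) < tolerance_percent))


# Toy dose values in Gy for four voxels; 60 Gy is a hypothetical prescription.
dose_ct = np.array([60.0, 60.0, 30.0, 3.0])
dose_sct = np.array([60.3, 61.5, 30.0, 9.0])
prescription = 60.0

dd = dose_difference_percent(dose_ct, dose_sct, prescription)
roi = dose_ct > 0.1 * prescription
mean_dd_roi = float(np.mean(dd[roi]))
dpr_1 = dose_pass_rate(dose_ct, dose_sct, prescription, tolerance_percent=1.0)
```

In this toy case the low-dose voxel (3 Gy) is excluded by the 10% dose threshold, the remaining DD values are 0.5%, 2.5% and 0%, and two of three voxels pass the 1% tolerance. Changing the dose threshold changes both the mean DD and the DPR, which is one reason reported values are hard to compare across studies.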

Please note that differences could occur in the region-of-interest (ROI) where the metrics are calculated. For example, MAE can be computed on the whole predicted volume, in a volume of interest (VOI) or in a cropped volume. In addition, the implementation of the metric computation can change. For example, dose-based metrics such as GPR can be calculated on different dose thresholds and with 2D or 3D algorithms, and the values chosen to threshold the CT/sCT for DSC may vary across the literature. In the following sections, we will highlight the possible differences and speculate on their impact.

III. Results

Database searching led to 91 records on PubMed, 98 on Scopus and 218 on Web of Science. After duplicate removal and content check, 83 eligible papers were found.
Figure 3 summarises the number of articles published by year, grouped in 51, 15 and 17 articles for MR-only RT (category I), CBCT-to-CT for adaptive RT (category II), and sCT for PET attenuation correction (category III), respectively. The first conference paper appeared in 2016 nie2016estimating. Given that we excluded conference papers from our search, we found that the first work was published in 2017 and, in general, the number of articles increased over the years, except for CBCT-to-CT and sCT for PET attenuation correction, which were stable in the last years.

Figure 3: (Top) Number of published articles grouped by application and year; (middle) pie charts of the anatomical regions investigated for each application; (bottom) bar plot of the publications binned per the total number of patients included in the study.

Figure 3 shows that brain, pelvis and H&N were the most popular anatomical regions investigated in deep learning-based sCT for MR-only RT, covering 80% of the studies. For CBCT-to-CT, H&N and pelvic regions were the most popular, being investigated in 75% of the studies. Finally, for PET AC, H&N was investigated in most of the studies, followed by the pelvic region, together covering 75% of the publications.

The total number of patients included in the studies was variable, but most studies dealt with fewer than 50 patients for all three categories. The three largest studies included 402 Andres2020 (I), 328 eckl2020evaluation (II) and 193 patients Peng2020 (I), while the smallest studies included ten patients Qian2020 and another ten volunteers Xu2019multi (I).

Most papers included adult patients. Paediatric (paed) patients represent a more heterogeneous dataset for network training, and the feasibility of sCT generation in this population has been investigated first for attenuation correction in PET ladefoged2019deep (79 patients) and more recently for photon and proton RT Maspero2020, Florkow2020dose.

All the models were trained to perform a regression task from the input to sCT, except for two studies where networks were trained to segment the input image into a pre-defined number of classes, performing a segmentation task Jeon2019, bradshaw2018feasibility.

In most of the works, training was performed in a paired manner, with unpaired training investigated in 13/83 articles. Four studies compared paired against unpaired training Fu2020, Peng2020, Li2020comp, Xu2020. The 2D networks were the most common over the three categories, being adopted in about 61% of the cases, followed by 3D (24%), 2.5D (10%) and 2D+ (6%) configurations. In some studies, multiple configurations were investigated, for example Fu2019, Neppl2019, Fu2020. GANs were the most popular architectures (45 times), followed by U-nets (36) and other CNNs. Note that a U-net may be employed as the GAN generator; such cases were counted as GANs.

All the investigations employed registration between sCT and CT to evaluate the quality of the sCT, except for Xu et al. Xu2020 and Fetty et al. Fetty2020, where metrics were defined to assess the quality of the sCT in an unpaired manner, e.g. Frechet inception distance (FID).

Main findings are reported in Table 2 for studies on sCT for MR-only RT without dosimetric evaluations, in Table 3a, 3b for studies on sCT for MR-only RT with dosimetric evaluations, in Table 4 for studies on CBCT-to-CT for IGART, and in Table 5 for studies on PET attenuation correction. Tables are organised by anatomical site and tumour location where available. Studies investigating the independent training and testing of several anatomical regions are reported for each specific site Xu2020, Xiang2018, Cusumano2020, eckl2020evaluation, harms2019paired. Studies using the same network to train or test data from different scanners and anatomy are reported at the bottom of the table maspero2020single, zhang2020improving. Detailed results based on these tables are presented in the following sections subdivided for each category.

III.A. MR-only radiotherapy

The first work ever published in this category, and among all the categories, was by Han in 2017 Han2017, who proposed using a paired U-net for brain sCT generation. One year later, the first work with a dosimetric evaluation was presented by Maspero et al. Maspero2018, investigating a 2D paired GAN trained on prostate patients and evaluated on prostate, rectal and cervical cancer patients.

width=center Tumor Patients MRI DL method Reg Image-similarity Reference site train val test x-fold field sequence conf arch MAE PSNR SSIM others [T] [HU] [dB] Abd Abdomen 10 10 LoO n.a. mDixon 2D pair GAN def 613 CC Xu2019Xu2019multi Abdomen 160 LoO n.a. n.a. 2D pair GAN rig 5.10.5 .90.43 (F/M)SIM IS … Xu2020Xu2020 Brain Brain 18 6x 1.5 3D T1 GRE 2D pair U-net rig 8517 MSE, ME Han2017Han2017 Brain 16 LoO n.a. T1 2.5Dp pair CNN+ rig 859 27.31.1 Xiang2018Xiang2018 Brain 15 5x 1.0 T1 Gd 2D pair CNN GAN def 10211 8910 25.41.1 26.61.2 .79.03 .83.03 tissues Emami2018Emami2018 Brain 98CT 84MR 10 3 3D T2 2D pair/unp GAN aff 193 65.40.9 .25.01 Jin2019Jin2019 Brain 24 LoO n.a. T1 3Dp pair GAN rig 569 26.62.3 NCC, HD body Lei2019Lei2019mri Brain 33 LoO n.a. T1 2D unp GAN No 9.00.8 .750.77 (F/M)SIM IS … Xu2020Xu2020 Brain 28 2 15 1.5 n.a. 2D pair GAN aff 13412 24.00.9 .76.02 Yang2020Yang2020 81 11 8x 1.5 3D T1 GRE 2D pair U-net aff 45.48.5 43.02.0 .65.05 metrics for air Massa2020Massa2020 Brain 3D T1 GRE Gd 44.67.4 43.41.2 .63.03 air, bones, 2D T2 SE 45.78.8 43.41.2 .64.03 soft tissues; 2D T2 FLAIR 51.24.5 44.91.2 .61.04 DSC bones Brain 28 6 1.5 T2 2D pair U-net rig 654 28.80.6 .972.004 same metrics for Li2020Li2020comp 2D unp GAN 946 26.30.6 .9550.007 synthetic MRI Head & Neck Nasophar 23 10 1.5 T2 2D pair U-net def 13124 MAE ME tissue/bone Wang2019Wang2019 HN 28 4 8x 1.5 2D T1Gd, T2 2D pair GAN aff 7615 29.11.6 .92.02 DSC MAE bone Tie2020Tie2020 HN 60 30 3 T1 2D unp GAN n.a. 19.60.7 62.40.5 .780.2 Kearney2020Kearney2020 HN 7 8 LoO 1.5 3D T1, T2 2D pair GAN def 8349 ME Largent2020Largent2020 HN 10 LoO 1.5 3D T1, T2 2D pair GAN def 42-62 RMSE, CC Qian2020Qian2020 HN 32 8 5x 3 3D UTE 2D pair U-net def 10421 DSC, spatial corr Su2020Su2020 Pelvis Prostate 22 LoO n.a. T1 2.5Dp pair CNN+ rig 433 33.50.8 Xiang2018Xiang2018 Pelvis 20 LoO n.a. 
3D T2 3Dp pair GAN rig 5116 24.52.6 NCC, HD body Lei2019Lei2019mri Prostate 20 5x 1.5 2D T1 TSE 2D pair 3Dp pair U-net def 415 385 DSC bone Fu2019Fu2019 Pelvis human 27 3x 3 3D T1 GRE 3Dp U-net def 328 36.51.6 MAE/DSC bone Florkow2019Florkow2020 Pelvis canine 18 1.5 mDixon pair 364 36.11.7 surf dist0.5 mm Pelvis 15 4 5x 3 3D T2 2D pair CNN U-net def 386 439 29.51.2 28.21.6 .96.01 .95.01 ME, PCC Bahrami2020Bahrami2020 Pelvis 100 3 2D T2 FSE 2D unp GAN No FID Fetty2020Fetty2020 Thor Breast 14 2 LoO n.a. n.a. 2D pair U-net def DSC .74-.76 Jeon2019Jeon2019  
volunteers, not patients; to segment CT into 5 classes; multiple combinations of Dixon images were investigated but omitted here; dataset from http://www.med.harvard.edu/AANLIB/; robustness to training size was investigated. Abbreviations: val=validation, x-fold=cross-fold, conf=configuration, arch=architecture, GRE=gradient echo, (T)SE=(turbo) spin-echo, mDixon=multi-contrast Dixon reconstruction, LoO=leave-one-out, (R)MSE=(root) mean squared error, ME=mean error, DSC=dice similarity coefficient, (N)CC=(normalised) cross-correlation.

Table 2: Overview of sCT methods for MR-only radiotherapy with sole image-based evaluation.

width=center Tumor Patients MRI DL method Reg Image-similarity Plan Dose Reference site train val test x- field sequence conf arch MAE PSNR others DD GPR DVH others fold [T] [HU] [dB] [%] [%] Abdomen Liver 21 LoO 3 3D T1 3Dp GAN def 7318 22.73.6 NCC 99.41.0 1% range LiuY2019Liu2019 GRE pair Abdomen 12 4x 0.3 GRE 2D pair GAN def 9019 27.41.6 x 0.6 98.71.5 0.15 Fu2020Fu2020 1.5 2D unp 9430 27.22.2 +B 0.6 98.51.6 Abdomen 46 31 3x 3 3D T1 2.5D pair U-net syn 7918 MAE ME x 2Gy Liu2020Liu2020 GRE rig organs Abdomen 39 19 0.35 GRE 2D pair U-net def 7918 ME x+B 0.1 98.71.1 2.5% Cusumano2020Cusumano2020 Abdomen 54 18 12 3x 1.5 3D T1 3Dp U-net def 6213 30.01.8 ME, DSC x 0.1 99.70.3 2% beam Florkow2020Florkow2020dose paed 3 GRE, T2 TSE pair tissues 0.5 96.24.0 3% depth Brain Brain 26 2x 1.5 3D T1 m2D pair CNN rig 6711 ME tissues x -0.10.3 99.80.7 beam Dinkla2018Dinkla2018 GRE DSC dist body depth Brain 40 10 1.5 3D T1 2D pair CNN def 7523 DSC x 0.20.5 99.2 LiuF2019LiuF2019 GRE Gd Brain 54 9 14 5x 1.5 2D T1 2D pair GAN rig 4711 each x -0.70.5 99.20.8 1% 2D/3D Kazemifar2019Kazemifar2019 SE Gd fold Brain 55 28 4 1.5 3D T1 2D pair U-net rig 11626 ME x ,982 range Neppl2019Neppl2019 GRE 3Dp pair 13732 ,973 Brain 25 2 25 1.5 3D T1 3Dp GAN rig 557 ME x 2 98.43.5 1.65% range Shafai2019Shafai2019 GRE pair DSC Brain 47 13 5x 3 T1 2D pair U-net rig 8115 ME air, x 2.30.1 align Gupta2019Gupta2019 tissues CBCT Brain 12 2 1 LoO 3 3D T1 2D+ pair U-net rig 547 ME, DSC 0.000.01 range Spadea2019Spadea2019 GRE tissues Brain 15 5x n.a. 
T1, T2 2Dp pair GAN def 10824 tissues x 0.7 99.21.0 1% beam Koike2019Koike2020 FLAIR depth Brain 30 10 20 3x 1.5 3D T1 2D+ pair GAN rig 6114 26.71.9 ME DSC x -0.10.3 99.50.8 1% beam Maspero2020Maspero2020 paed 3 GREGd SSIM 0.10.4 99.61.1 3% depth Brain 66 11 5x 1.5 2D T1 2D unp GAN rig 7811 0.30.3 99.21.0 3% beam Kazemifar2020Kazemifar2020dose SE Gd depth Brain 242 81 79 3 3D T1 3Dp CNN def 8122 tissues x 0.130.13 99.60.3 0.15 Andres2020Andres2020 1.5 GREGd pair U-net 9021 0.310.18 99.40.5  
comparison with other architectures has been provided; trained in 2D on multiple views and aggregated after inference; robustness to training size was investigated; multiple combinations (also Dixon reconstruction, where present) of the sequences were investigated but omitted; data from multiple centers

Table 3a: Overview of sCT methods for MR-only radiotherapy with image-based and dose evaluation.

width=center Tumor Patients MRI DL method Reg Image-similarity Plan Dose Reference site train val test x- field sequence conf arch MAE PSNR others DD GPR DVH others fold [T] [HU] [dB] [%] [%] Pelvis Prostate 36 15 3 T2 TSE 2D pair U-net def 305 ME tissues x 0.160.09 99.4 0.2Gy Chen2018Chen2018 Prostate 39 4x 3 3D T2 2D pair U-net def 338 ME DSC dist body x -0.010.64 98.50.7 Arabi2018Arabi2018 Prostate 17 LoO 1.5 T2 3Dp GAN rig 5117 24.22.5 NCC, bone: -0.070.07 986 1% range, LiuY2019bLiuY2019b unp dist, uniform peak, Prostate 25 14 3x 3 3D T2 2D pair U-net def 348 tissues x 1% 99.21 1% Largent2019Largent2019 TSE GAN 348 ME 1% 99.11 Pelvis 11 8 3 T2 2D pair GAN def 496 ME x 0.70.4 99.21.0 1.5% Boni2020Boni2020 1.5 TSE organs Pelvis 26 15 10+19 0.35 3D T2 2.5D pair GAN def 414 31.41 ME MSE x 1 1.5% Fetty2020Fetty2020dose 1.5/3 bone Pelvis 39 14 0.35 GRE 2D pair U-net def 5412 tissues x+B 0.5 99.00.7 1% Cusumano2020Cusumano2020 Rectum 46 44 1.5 3D T2 2D pair GAN def 357 ME x 0.8 99.80.1 1% Bird2020Bird2021 bone Head & Neck HN 34 3x 1.5 3D T2 3Dp U-net def 759 ME x -0.070.22 95.62.9 Dinkla2019Dinkla2019 TSE pair DSC bone HN 15 12 3 T1 2Dp GAN def 682 SSIM 0.5 98 0.5 Klages2019Klages2020 GRE pair RMSE HN 30 15 3 T1Gd 2D pair GAN rig 7012 29.41.3 SSIM -0.30.2 97.80.9 Qi2020Qi2020 T2 TSE U-net 7112 29.21.3 DSC, DRR -0.20.2 97.61.3 HN 135 10 28 3 3D T1 2D pair GAN def 709 ME, DSC x -0.10.3 98.71.0 1.5% beam Peng2020Peng2020 GRE 2D unp 1018 tissues 0.10.4 98.51.1 1.5% depth H&N 27 3x 3 3D T1 2D+ GAN def 654 ME p 0.2 93.53.4 1.5% NTCP Thummerer2020Thummerer2020comparison GRE pair DSC RS Thor Breast 12 18 LtO 1.5 3D GRE 2Dp GAN def 9411 NCC 0.5 98.43.5 DRR Olberg2019Olberg2019 mDixon pair 10315 dis bone Multiple sites with one network Prostate 32 27 3 3D T1 2D pair GAN rig 606 ME x -0.30.4 99.40.6 1% Maspero2018Maspero2018 Rectum 18 1.5 GRE 565 -0.30.5 98.51.1 Cervix 14 1.5/3 mDixon 596 -0.10.3 99.61.9  
comparison with other architectures has been provided; trained in 2D on multiple views and aggregated after inference; robustness to training size was investigated; multiple combinations (also Dixon reconstruction, where present) of the sequences were investigated but omitted; data from multiple centers

Table 3b: Overview of sCT methods for MR-only radiotherapy with image-based and dose evaluation.

Considering the imaging protocol, we can observe that most of the MRI were acquired at 1.5 T (51.9%), followed by 3 T (42.6%), and the remaining 6.5% at 1 T or 0.35/0.3 T. The most popular MRI sequences adopted depends on the anatomical site: T1 gradient recalled-echo (T1 GRE) for abdomen and brain; T2 turbo spin-echo (TSE) for pelvis and H&N. Unfortunately, for more than ten studies either sequence or magnetic field were not adequately reported.
Generally, a single MRI sequence is used as input. However, eight studies investigated using multiple input sequences or Dixon reconstructions Xu2019multi, Tie2020, Kearney2020, Florkow2020, Koike2020, Maspero2018, Florkow2020dose, Olberg2019 based on the assumption that more input contrast may facilitate sCT generation. Some studies compared the performance of sCT generation depending on the sequence acquired. For example, Massa et al. Massa2020 compared sCT from the most adopted MRI sequences in the brain, e.g. T1 GRE with (+Gd) and without Gadolinium (-Gd), T2 SE and T2 fluid-attenuated inversion recovery (FLAIR), obtaining lowest MAE and highest PSNR for T1 GRE sequences with Gadolinium administration. Florkow et al. Florkow2020 investigated how the performance of a 3D patch-based paired U-net was impacted by different combinations of T1 GRE images along with its Dixon reconstructions, finding that using multiple Dixon images is beneficial in the human and canine pelvis. Qi et al. Qi2020 studied the impact of combining T1 (Gd) and T2 TSE obtaining that their 2D paired GAN model trained on multiple sequences outperformed any model on a single sequence.
When focusing on the DL model configuration, we found that 2D models were the most popular, followed by 3D patch-based and 2.5D models. Only one study adopted a multi-2D (m2D) configuration Dinkla2018. Three studies also investigated combining sCTs from multiple 2D models after inference (2D+), showing that 2D+ is beneficial compared to a single 2D view Spadea2019, Klages2020, Maspero2020. When comparing the performance of 2D against 3D models, Fu et al. Fu2019 found that a modified 3D U-net outperformed a 2D U-net, while Neppl et al. Neppl2019 reported one month later that their 3D U-net under-performed a 2D U-net, not only on image similarity metrics but also considering photon and proton dose differences. These contradicting results will be discussed later. Paired models were the most adopted, with only ten studies investigating unpaired training Xu2020, Jin2019, Li2020comp, Kearney2020, Fetty2020, Kazemifar2020dose, LiuY2019b, Fu2020, Peng2020, Yang2020. Interestingly, Li et al. Li2020comp compared a 2D U-net trained in a paired manner against a cycle-GAN trained in an unpaired manner, finding that image similarity was higher with the U-net. Similarly, two other studies compared 2D paired against unpaired GANs, achieving slightly better similarity and lower dose difference with paired training in the abdomen Fu2020 and H&N Peng2020. Mixed paired/unpaired training was proposed by Jin et al. Jin2019, who found such a technique beneficial compared with either paired or unpaired training alone. To improve unpaired training, Yang et al. Yang2020 found that structure-constrained loss functions and spectral normalisation improved the performance of unpaired training in the pelvic and abdominal regions.
An interesting study on the impact of the direction of patch-based 2D slices, patch size and GAN architecture was conducted by Klages et al. Klages2020, who reported that 2D+ is beneficial compared with single-view (2D) training, that the choice between overlapping and non-overlapping patches is not crucial, and that, given good registration, training paired GANs outperforms unpaired training (cycle-GANs).
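The 2D+ strategy discussed above (running 2D models along several slice directions and combining the resulting sCT volumes after inference) can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of any cited study: `predict_slice` is a hypothetical stand-in for a trained 2D network, and the voxel-wise median is one possible aggregation rule chosen here for simplicity.

```python
import numpy as np

def predict_slice(mr_slice):
    """Hypothetical stand-in for a trained 2D MR-to-sCT network.

    A fixed linear intensity mapping is used so that the example runs.
    """
    return 2.0 * mr_slice - 1000.0

def predict_along_axis(volume, axis):
    """Apply the 2D model slice by slice along one axis of a 3D volume."""
    moved = np.moveaxis(volume, axis, 0)              # put the slice axis first
    pred = np.stack([predict_slice(s) for s in moved])
    return np.moveaxis(pred, 0, axis)                 # restore the original layout

def predict_2d_plus(volume, axes=(0, 1, 2)):
    """Combine the per-view sCT volumes after inference (voxel-wise median)."""
    views = [predict_along_axis(volume, ax) for ax in axes]
    return np.median(np.stack(views), axis=0)

mr = np.random.default_rng(0).uniform(0.0, 500.0, size=(8, 10, 12))
sct = predict_2d_plus(mr)
```

With a real network, the three views would generally disagree near slice discontinuities, which is where the aggregation step is expected to help.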
If we now turn to the architectures employed, we can observe that GANs were used in the majority of the studies (55%), followed by U-nets (35%) and other CNNs (10%). A detailed examination of different 2D paired GANs against U-nets with different loss functions by Largent et al. Largent2019 showed that U-nets and GANs could achieve similar image- and dose-based performance. Fetty et al. Fetty2020dose focused on comparing different generators of a 2D paired GAN against the performance of an ensemble of models, finding that the ensemble was overall better than single models, being more robust when generalising to data from different scanners/centres. When considering CNN architectures, it is worth mentioning the 2.5D dilated CNNs used by Dinkla et al. Dinkla2018, where the m2D training was claimed to increase the robustness of inference in a 2D+ manner while maintaining a large receptive field and a low number of weights.

An interesting aspect investigated by five studies is the impact of the training size Maspero2020, Andres2020, Olberg2019, Peng2020, Yang2020, which will be further reviewed in the discussion section.

Finally, when considering the metric performances, we found that 21 studies reported only image similarity metrics, and 30 also investigated the accuracy of sCT-based dose calculation on photon RT (19), proton RT (8), or both (3). Two studies performed treatment planning, considering the contribution of magnetic fields Cusumano2020, Fu2020, which is crucial for MR-guided RT. Also, only four publications studied the robustness of sCT generation in a multicentric setting Andres2020, Boni2020, Maspero2020, Bird2021.

Overall, DL-based sCT resulted in DD on average 1% and GPR >95%, except for one study Thummerer2020comparison. For each anatomical site, the metrics on image similarity and dose were not always calculated consistently. This aspect will be detailed in the next section.
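To make the dose metrics concrete, the sketch below computes a mean relative dose difference and a dose pass rate (the fraction of voxels whose dose difference stays within a tolerance). It is not a gamma analysis, which would additionally include a distance-to-agreement criterion. The normalisation to the prescribed dose, the 10% dose threshold, the 1% tolerance and all numbers are assumptions made for illustration only.

```python
import numpy as np

def dose_metrics(dose_ct, dose_sct, prescribed, threshold=0.10, tolerance=0.01):
    """Mean dose difference and dose pass rate, both in % of the prescription.

    Only voxels receiving more than `threshold` x prescribed dose in the
    CT-based plan are evaluated (the threshold choice is an assumption).
    """
    roi = dose_ct > threshold * prescribed
    diff = (dose_sct[roi] - dose_ct[roi]) / prescribed   # relative difference
    mean_dd = 100.0 * diff.mean()
    pass_rate = 100.0 * np.mean(np.abs(diff) <= tolerance)
    return mean_dd, pass_rate

dose_ct = np.array([0.5, 10.0, 30.0, 60.0, 60.0])    # Gy, toy values
dose_sct = np.array([0.6, 10.2, 30.9, 60.3, 59.7])   # Gy, recalculated on sCT
dd, dpr = dose_metrics(dose_ct, dose_sct, prescribed=60.0)
```

Changing `threshold` or the normalisation changes both numbers for the same pair of dose distributions, which is one reason why reported values are hard to compare across studies.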

III.B. CBCT-to-CT generation

CBCT-to-CT conversion via DL is the most recent CT synthesis application, with the first paper published in 2018 kida2018cone. Some of the works (5 out of 15) focused only on improving CBCT image quality for better IGRT kida2018cone, harms2019paired, chen2020synthetic, kida2020visual, yuan2020convolutional. The remaining 10 proved the validity of the transformation with dosimetric studies for photons liang2019generating, li2019preliminary, Maspero2020, Liu2020, barateau2020comparison, eckl2020evaluation, for protons Thummerer2020comparison, and for both photons and protons Landry2019comparing, kurz2019cbct, zhang2020improving.
Only three studies investigated unpaired training kurz2019cbct, liang2019generating, maspero2020single; in eleven cases, paired training was implemented by matching the CBCT and ground truth CT by rigid or deformable registration. In Eckl et al. eckl2020evaluation, however, CBCT and CT were not registered for the training

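Columns: tumor site; patients (train, val, test, x-fold); DL method (conf, arch); registration (Reg); image similarity (MAE [HU], PSNR [dB], SSIM, others); plan; dose (DD [%], DPR [%], GPR [%], DVH, others); reference. Values are listed in column order; empty cells are omitted.
Abd, Pancreas; patients: 30, LoO; method: 3Dp pair GAN; reg: def; image similarity: 56.9±13.8, 28.8±2.5, .71±.03, NCC, SNU; plan: x; dose: <1Gy; Liu2020 liu2020cbct
Thor, Thorax; patients: 53, 15; method: 2D pair GAN; reg: def; image similarity: 94±32, ME, DSC, HD, tis; plan: x; dose: 76.7±17.3, 93.8±5.9, 2.6; Eckl2020 eckl2020evaluation
Brain / Pelvis; patients: 24 / 20, LoO; method: 3Dp pair GAN; reg: rig; image similarity: 13±2 / 16±5, 37.5±2.3 / 30.7±3.7, NCC, SNU; plan: No; Harms2019 harms2019paired
Prostate; patients: 16, 4, 5x; method: 2D pair U-net; reg: def; image similarity: 50.9, .967, SNU, RMSE; plan: No; Kida2018 kida2018cone
Prostate; patients: 27, 7, 8; method: 2D pair U-net; reg: def; image similarity: 58, ME; plan: x; dose: 98.4, 99.5, DPR / 88.5, 96.5, DPR, RS; Landry2019 Landry2019comparing
Prostate; patients: 18, 8, 4x; method: 2D ens unp GAN; reg: rig; image similarity: 87±5, ME; plan: x; dose: 99.9±0.3, 80.5±5, 95.9±2.0, <1.5%, 1%, DPR, DPR, RS; Kurz2019 kurz2019cbct
Prostate; patients: 16, 4; method: 2D pair GAN; reg: rig; image similarity: SSIM, diffROI; plan: No; Kida2019 kida2020visual
Pelvis; patients: 205, 15; method: 2D pair GAN; reg: def; image similarity: 42±5, ME, DSC, HD, tis; plan: x; dose: 88.9±9.3, 98.5±1.7, 1; Eckl2020 eckl2020evaluation
H&N; patients: 81, 9, 20; method: 2D unp GAN; reg: def; image similarity: 29.9±4.9, 30.7±1.4, .85±.03, RMSE, phantom; plan: x; dose: 98.4±1.7, 96.3±3.6; Liang2019 liang2019generating
Nasophar; patients: 50, 10, 10; method: 2D pair U-net; reg: rig; image similarity: 6-27, ME, organs; plan: x; dose: 0.2±0.1, 95.5±1.6, 1%; Li2019 li2019preliminary
H&N; patients: 30, 7, 7; method: 2D pair U-net; reg: rig; image similarity: 18.98, 33.26, 0.8911, RMSE, tissues; plan: No; Chen2019 chen2020synthetic
H&N; patients: 50, 10; method: 2.5D pair U-net; reg: rig; image similarity: 49.28, 14.25, .85, SNR; plan: No; Yuan2020 yuan2020convolutional
H&N; patients: 22, 11, 3x; method: 2D pair U-net; reg: def; image similarity: 36±6, ME, DSC, SNU; dose: -0.1±0.3, 98.1±1.2, RS; Thummerer2020 thummerer2020CBCT
H&N; patients: 30, 14; method: 2D pair GAN; reg: def; image similarity: 82.4±10.6, ME, tissues; plan: x; dose: 91.0±5.3, 1Gy, 1%; Barateau2020 barateau2020comparison
H&N; patients: 25, 15; method: 2D pair GAN; reg: def; image similarity: 77.2±16.6, ME, DSC, HD, tis; plan: x; dose: 91.5±4.3, 95.0±2.4, 2.4; Eckl2020 eckl2020evaluation
Multiple sites with one network
HN / Lung / Breast; patients: 15 / 15 / 15, 8 / 8 / 8, 10 / 10 / 10; method: 2D unp GAN; reg: rig; image similarity: 53±12 / 83±10 / 66±18, 30.5±2.2 / 28.5±1.6 / 29.0±2.1, .81±.04 / .78±.04 / .76±.02, ME; plan: x; dose: 0.1±0.5 / 0.2±0.9 / 0.1±0.4, 97.8±1 / 94.9±3 / 92±8, <2%; Maspero2020 maspero2020single
Pelvis, H&N; patients: 135, 15, 15, 10, 10x; method: 2.5D pair GAN; reg: def; image similarity: 24±5 / 24±4, 20.1±3.4 / 22.8±3.4; plan: x; dose: <1%, RS; Zhang2020 zhang2020improving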
comparison with other architectures has been provided; dose pass rate (DPR) 1% or = ; DPR 2% or = ; DPR 3% or = ; trained in 2D on multiple views and aggregated after inference; different nets were trained and the different outputs were weighted to obtain the final sCT; robustness to training size was investigated.

Table 4: Overview of sCT methods for adaptive radiotherapy with CBCT.

phase, as the authors claimed the first fraction CBCT was geometrically close enough to the planning CT for the network. Deformable registration was then performed for the image similarity analysis. In this work, the quality of contours propagated to sCT from CT was compared to manual contours drawn on the CT to assess each step of the IGART workflow: image similarity, anatomical segmentation and dosimetric accuracy. The network, a 2D cycle-GAN implemented in vendor-provided research software, was independently trained and tested on different sites (H&N, thorax and pelvis), leading to the best results for the pelvic region.

Other authors studied training a single network with different anatomical regions. In Maspero et al. maspero2020single, the authors compared the performance of three cycle-GANs trained independently on three anatomical sites (H&N, breast and lung) against a single network trained with all the anatomical sites together, finding similar results in terms of image similarity.
Zhang et al. zhang2020improving trained a 2.5D conditional GAN zhu2017unpaired with feature matching on a large cohort of 135 pelvic patients. Then, they tested the network on an additional 15 pelvic patients acquired with a different CT scanner and ten H&N patients. The network predicted sCT with similar MAE for both testing groups, demonstrating the potential to transfer pre-trained models to different anatomical regions. They also compared different GAN flavours and U-net, finding the latter statistically worse than any GAN configuration.
Three works tested unpaired training with cycle-GANs liang2019generating, kurz2019cbct, maspero2020single. In particular, Liang et al. liang2019generating compared unsupervised training among cycle-GAN, DCGAN radford2015unsupervised and PGGAN karras2017progressive on the same dataset, finding the first to perform better in terms of both image similarity and dose agreement.

Regarding the body region, most of the studies focused on the H&N and pelvic regions. Liu et al. liu2020cbct investigated CBCT-to-CT conversion in the framework of breath-hold stereotactic pancreatic radiotherapy, where they trained a 3D patch cycle-GAN introducing an attention gate (AG) oktay2018attention to deal with moving organs. They found that the cycle-GAN with AG performed better than U-net and cycle-GAN without AG. Moreover, the DL approach led to a statistically significant improvement of replanning on sCT vs. CBCT, although some residual discrepancies were still present for this particular anatomical site.

III.C. PET attenuation correction

DL methods for deriving sCT for PET AC have been published since 2017 leynes2017direct. Two possible image translations are available in this category: i) MR-to-CT for MR attenuation correction (MRAC) where 14 papers were found; ii) uncorrected PET-to-CT, with three published articles.

In the first case, most methods have been tested with paired data in H&N (9 papers) and the pelvic region (4 papers), except Baydoun et al. Baydoun2020, who investigated the thoracic region. The number of patients used for training ranged between 10 and 60. Most of the MR images employed in these studies were acquired directly through 3T PET/MRI hybrid scanners, where specific MR sequences, such as UTE (ultra-short echo time) and ZTE (zero echo time), are used to enhance short-T2 tissues, such as cortical bone, and Dixon reconstruction is employed to derive fat and water images.
Leynes et al. leynes2017direct compared the Dixon-based sCT vs the sCT predicted by a U-net receiving both Dixon and ZTE images. Results showed that DL prediction reduced the RMSE in corrected PET SUV by a factor of 4 for bone lesions and 1.5 for soft tissue lesions. Following this first work, other authors showed the improvement of DL-based AC over the traditional atlas-based MRAC proposed by the vendors gong2018attenuation, Jang2018deep, torrado2019dixon, ladefoged2019deep, blanc2019attenuation, Baydoun2020, gong2020b, also comparing several network configurations pozaruk2020augmented, gong2020a.

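Columns: region; patients (train, val, test, x-fold); MRI (field [T], contrast); DL method (conf, arch); registration (Reg); image similarity (MAE [HU], DSC); PET-related (tracer, PETerr [%]) and others; reference. Values are listed in column order; empty cells are omitted.
Pelvis; patients: 10, 16; MRI: 3, Dixon, ZTE; method: 3Dp pair U-net; reg: def; PET/others: 18F-FDG, 68Ga-PSMA, RMSE, SUV diff; Leynes2017 leynes2017direct
Pelvis; patients: 15, 4, 4; MRI: 3, T1 GRE, Dixon; method: 2D pair U-net; reg: def; PET/others: 18F-FDG, 1.8±2.4, 1.7±2.0, 1.8±2.4, 3.8±3.9, μ-map diff; Torrado2019 torrado2019dixon
Pelvis; patients: 12, 6; MRI: 3, T1 GRE, T2 TSE; method: 3Dp pair CNN; reg: def; image similarity: .99±.00, .48±.21, .94±.01, .88±0.03, .98±0.01; PET/others: 18F-FDG, RMSE; Bradshaw2018 bradshaw2018feasibility
Prostate; patients: 18, 10; MRI: 3, Dixon; method: 2D pair GAN; reg: def; PET/others: 68Ga-PSMA, .75±.64, .52±.62, SSIM, μ-map diff; Pozaruk2020 pozaruk2020augmented
Head; patients: 30, 10, 5; MRI: 1.5, T1 GRE Gd; method: 2D pair CNN; reg: def; image similarity: .971±.005, .936±.011, .803±.021; PET/others: n.a., -0.7±1.1; Liu2018 liu2018deep
Head; patients: 30+6, 8; MRI: 1.5+3, UTE; method: 2D pair U-net; reg: def; image similarity: .76±.03, .96±.01, .88±.01; PET/others: 18F-FDG, 1; Jang2018 Jang2018deep
H&N; patients: 32, 8, 5 (Dixon) and 12, 2, 7 (ZTE); MRI: 3, Dixon / ZTE; method: 2D pair U-net; reg: rig; image similarity: 13.8±1.4, .76±.04 (Dixon) and 12.6±1.5, .80±.04 (ZTE); PET/others: 18F-FDG, 3; Gong2018 gong2018attenuation
Head paed; patients: 60, 19, 4; MRI: 3, mDixon +UTE; method: 3Dp pair U-net; reg: rig; image similarity: .90±.07; PET/others: 18F-FET, biol tumor vol, SUV; Ladefoged2019 ladefoged2019deep
Head; patients: 40, 2; MRI: 3, T1 GRE; method: 3Dp pair GAN; reg: def; image similarity: 101±40, 302±79, 407±228, 10±5, .80±.07; PET/others: 18F-FDG, 3.2±3.4, 1.2±13.8, 3.2±13.6, 3.2±13.6, rel vol dif, surf dist, ME, RMSE, PSNR, SSIM, SUV; Arabi2019 arabi2019novel
Head; patients: 44, 11, 11; MRI: 1.5, T1 GRE; method: 2.5D pair U-net; reg: rig; PET/others: 11C-WAY, 11C-DASB, -0.49±1.7, -1.52±.73, synt μ-map, kin anal; Spuhler2019 spuhler2019synthesis
Head; patients: 23, 47; MRI: 3, ZTE; method: 3Dp pair U-net; reg: def; image similarity: .81±.03; PET/others: 18F-FDG, -0.25.6, Jac; Blanc-Durand2019 blanc2019attenuation
Head; patients: 32, 4; MRI: 3, Dixon; method: 3Dp pair GAN; reg: def; image similarity: 15.8±2.4%, .74±.05; PET/others: 18F-FDG, -1.013, SUV; Gong2020a gong2020a
Head; patients: 35, 5; MRI: 3, mDixon UTE; method: 2.5D pair U-net; reg: rig; image similarity: 10.94.01%, .87±.03; PET/others: 11C-PiB, 18F-MK, 2; Gong2020b gong2020b
Thorax; patients: 14, LoO; MRI: 3, Dixon; method: 2D pair GAN; reg: def; image similarity: 67.45±9.89; PET/others: 18F-NaF, PSNR, SSIM, RMSE; Baydoun2020 Baydoun2020
Other than MR-based sCT
Body; patients: 100, 28; input: PET, no att corrected; method: 2D pair U-net; reg: Y; image similarity: 111±16, .94±.01; PET/others: 18F-FDG, -0.6±2.0, abs err; Liu2018 Liu2018pet
Body; patients: 80, 39; input: PET, no att corrected; method: 3Dp pair GAN; reg: Y; image similarity: 109±19, .87±.03; PET/others: 18F-FDG, 1.0, NCC, PSNR, ME; Dong2019 Dong2019synthetic
Body; patients: 100, 25; input: PET, no att corrected; method: 2.5D pair GAN; reg: Y; PET/others: 18F-FDG, -0.8±8.6, SUV, ME; Armanious2020 Armanious2020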
comparison with other architectures has been provided; data from another MRI sequence used as pre-training; patients acquired with a different scanner; MRI data from hybrid PET/MRI scanner; in SUV max; in SUV mean; in air or bowel gas; in the bony structures; in the soft tissue; in the fatty tissue; in water; trained to segment the CT/sCT into classes; expressed in terms of Jaccard index and not DSC; multiple combinations (also Dixon reconstruction, where present) of the sequences were investigated but omitted; intrinsically registered: PET-CT data.

Table 5: Overview of sCT methods for PET AC.

Torrado et al. torrado2019dixon pre-trained their U-net on 19 healthy brains acquired with GRE MRI and subsequently trained the network using Dixon images of colorectal and prostate cancer patients. They showed that pre-training led to faster training with a slightly smaller residual error than random initialisation of the U-net weights.
Pozaruk et al. pozaruk2020augmented proposed data augmentation, over 18 prostate cancer patients, by perturbing the deformation field used to match the MR/CT pair fed to the network. They compared the performance of a GAN with augmentation vs 1) Dixon-based MRAC, 2) Dixon + bone segmentation from the vendor, 3) U-net with and 4) without augmentation. They found significant differences between the 3 DL methods and the classic MRAC routines. The GAN with augmentation performed slightly better than the U-net with/without augmentation, although the differences were not statistically significant.
Gong et al. gong2020a used unregistered MR/CT pairs for a 3D patch cycle-GAN, comparing the results vs atlas-based MRAC and a CNN trained with registered pairs. Both DL methods performed better than atlas MRAC in DSC, MAE and PETerr; no significant difference was found between CNN and cycle-GAN. They concluded that cycle-GAN has the potential to overcome the need for a perfectly aligned dataset for training. However, it requires more input data to improve the output.
Baydoun et al. Baydoun2020 tried different network configurations (VGG16 simonyan2014very, VGG19 simonyan2014very, and ResNet he2016deep) as a benchmark against a 2D conditional GAN receiving either two Dixon inputs (water and fat) or four (water, fat, in-phase and opposed-phase). The GAN always performed better than VGG19 and ResNet, with more accurate results obtained with four inputs.

In the effort to reduce the time for image acquisition and patient discomfort, some authors proposed to obtain the sCT directly from diagnostic T1- or T2-weighted images, using images from either standalone MRI scanners liu2018deep, Arabi2018, spuhler2019synthesis or hybrid machines bradshaw2018feasibility. In particular, Bradshaw et al. bradshaw2018feasibility trained a combination of three CNNs with T1 GRE and T2 TSE MRI (single sequence or both) to derive an sCT stratified in classes (air, water, fat and bone), which was compared with the scanner default MRAC output. The RMSE on PET reconstruction computed on SUV and was significantly lower with the deep learning method and T1/T2 input. More recently, Gong et al. gong2020b tested on a brain patient cohort a CNN with either or Dixon and multiple echo UTE (mUTE) as input. The latter outperformed the others. Liu et al. liu2018deep trained a CNN to predict CT tissue classes from diagnostic 1.5 T T1 GRE images of 30 patients. They tested on 10 independent patients of the same cohort, whose results are reported in Table 5 in terms of DSC. Then, they predicted sCT for 5 patients acquired prospectively with a 3T MRI/PET scanner (T1 GRE) and computed the PETerr, which was <1%. They concluded that DL approaches are flexible and promising to be applied also to heterogeneous datasets acquired with different scanners and settings.

DL methods have also been proposed to estimate sCT from uncorrected PET. Thanks to the larger number of single PET exams, these methods have been tested on full-body acquisitions and larger patient populations (up to 100 for training and 39 for testing). Although the global MAE is higher than in site-specific MR-to-CT studies (about 110 HU vs 10-15 HU), PETerr is below 1% on average, demonstrating the validity of the approach for the scope of PET AC.
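The PET-related error used to judge sCT for attenuation correction can be illustrated with a short sketch. The definition below, a voxel-wise relative difference between the PET reconstructed with sCT-based AC and the PET reconstructed with CT-based AC, is an assumption for illustration; as noted in the discussion, the studies differ in whether the error is computed on SUV, maximum SUV or larger volumes of interest. The arrays and the background threshold are toy values.

```python
import numpy as np

def relative_pet_error(suv_ref, suv_sct, mask):
    """Voxel-wise relative error [%] of sCT-corrected vs reference PET in a mask."""
    ref = suv_ref[mask]
    return 100.0 * (suv_sct[mask] - ref) / ref

suv_ct_ac = np.array([[2.0, 4.0], [0.0, 8.0]])      # PET corrected with the real CT
suv_sct_ac = np.array([[2.02, 3.96], [0.1, 8.0]])   # PET corrected with the sCT
mask = suv_ct_ac > 0.5                              # skip background voxels (assumed threshold)

err = relative_pet_error(suv_ct_ac, suv_sct_ac, mask)
mean_err, mean_abs_err = err.mean(), np.abs(err).mean()
```

In this toy case the signed mean error is close to zero while the mean absolute error is not, which shows why it matters whether a study reports a bias or an absolute error.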

IV. Discussion

This review encompassed DL-based approaches to generate sCT from other radiotherapy imaging modalities, focusing on published journal articles. The research topic was earlier introduced at conferences from 2016 nie2016estimating. Since 2016, we have observed increasing interest in using DL for sCT generation. The success of DL methods is probably related to the growth of available computational resources in the last decade, which allowed training on large-volume datasets Lecun2015, thus achieving fast image translation (i.e. in the order of a few seconds vanDyk2020), making it possible to apply DL in clinical cases and demonstrate its feasibility for clinical scenarios. In this review, we considered three clinical purposes for deriving sCT from other image modalities, which are discussed in the following:

  I. MR-only RT. The generation of sCT for MR-only RT with DL is the most populated category. Its papers demonstrate the potential of using DL for sCT generation from MRI. Several training techniques and configurations have been proposed. For anatomical regions such as the pelvis and brain/H&N, high image similarity and dosimetric accuracy can be achieved for photon RT and proton therapy. In regions strongly affected by motion Stemkens2018, Paganelli2018, e.g. abdomen and thorax, the first feasibility studies seem to be promising LiuY2019b, Olberg2019, Fu2020, Cusumano2020, Florkow2020dose. However, no study has yet proposed the generation of DL-based 4D sCT, as has been done with non-DL-based methods Freedman2019. An exciting application is DL-based sCT generation for the paediatric population, which is considered more radiation-sensitive than the adult population Goodman2019 and could benefit enormously from MR-only RT, especially in the case of repeated simulations Karlsson2009. The methods for sCT generation for brain Maspero2020 and abdominal Florkow2020dose cases achieved encouraging photon and proton RT results. The geometric accuracy of sCT needs to be thoroughly tested to enable the clinical adoption of sCT for treatment planning purposes, especially when MRI or sCT are used to substitute CT for position verification purposes. So far, the number of studies that investigated this aspect for DL-based sCT is still scarce. Only Gupta et al. Gupta2019 for brain and Olberg et al. Olberg2019 for breast cancer have investigated this aspect, assessing the accuracy of alignment based on CBCT and digitally reconstructed radiography, respectively. Future studies are required to strengthen the clinical use of sCT. MR-only RT can potentially allow for daily image guidance and plan adaptation in the context of MR-guided radiotherapy Lagendijk2014, where the accuracy of dose calculation in the presence of the magnetic field needs to be assessed before clinical implementation. So far, the studies investigating this aspect are still few, e.g. for abdominal Fu2020 and pelvic tumours Cusumano2020, and only considered low magnetic fields. The results are promising, but we advocate for further studies on additional anatomical sites and magnetic field strengths.

  II. CBCT-to-CT for image-guided (adaptive) radiotherapy. In-room CBCT imaging is widespread in photon and proton radiation therapy for daily patient setup Boda2011. However, CBCT is not commonly exploited for daily plan adaptation and dose recalculation due to the artefacts associated with scatter and reconstruction algorithms that affect the quality of the electron density predicted by CBCT elstrom2011evaluation. Traditional methods to cope with this issue have been based on image registration peroni2012automatic, veiga2015cone, on scatter correction park2015proton, on look-up tables to rescale HU intensities kurz2016feasibility and on histogram matching arai2017feasibility. The introduction of DL for converting CBCT to sCT has substantially improved image quality, leading to faster results than image registration and analytical corrections thummerer2020CBCT. Speed is a crucial aspect for the translation of the method into the clinical routine. However, one of the problems arising in CBCT-to-CT conversion for clinical application is the different field of view (FOV) between CBCT and CT. Usually, the training is performed by registering, cropping and resampling the volume to the CBCT size, which is smaller than the planning CT.
    Nonetheless, for replanning purposes, the limited FOV may prevent transferring the plan to the sCT. When this is the case, some authors have proposed to assign water-equivalent density within the CT body contour for the missing information barateau2020comparison. In other cases, the sCT patch has been stitched to the planning CT to cover the entire dose volume maspero2020single. Ideally, appropriate FOV coverage should be employed when transferring the plan for online adaptive RT. Besides the dosimetric aspect, improved image quality leads to more accurate image guidance for patient set-up and OAR segmentation, all necessary steps for online adaptive radiotherapy, especially for anatomical sites prone to large movements, as speculated by Liu et al. liu2020cbct in the framework of pancreatic treatments. CBCT-to-CT conversion has been demonstrated for both photon and proton radiotherapy, where the setup accuracy and dose calculation are even more relevant to avoid range shift errors that could jeopardise the benefit of treatment paganetti2012range. Because there is an intrinsic error in converting HU to relative proton stopping power goma2018revisiting, it has been shown that deep learning methods can translate CBCT directly to stopping power harms2020cone. This approach has not been covered in this review, but it is an interesting approach that will probably lead to further investigations.

  III. PET attenuation correction. The sCT in this category is obtained either from MR or from uncorrected PET. In the first case, the work’s motivation is to overcome the current limitations in generating attenuation maps (μ-maps) from MR images in MRI/PET hybrid acquisitions, where the bone contribution is miscalculated izquierdo2014comparison. In the second case, the limits to overcome are different: i) to avoid extra radiation dose when only the PET exam is required, ii) to avoid misregistration errors when standalone CT and PET machines are used, iii) to be independent of the MR contrast in MRI/PET acquisitions. Regardless of the network configuration, the MRI used as input, or the number of patients included in the studies, DL-based sCT has always outperformed the current MRAC methods available in commercial software. The results of this review support the idea that DL-based sCT will substitute current AC methods, being also able to overcome most of the limitations mentioned above. These aspects seem to contradict the stable number of papers in this category in the last three years. Nonetheless, we have to consider that the recent trend has been to directly derive the μ-map from uncorrected PET via DL. Because this review considered only image-to-CT translation, these works were not included but can be found in a recent review by Lee lee2020review. However, it is worth mentioning a recent study from Shiri et al. shiri2020deep, where the largest patient cohort ever (1150 patients split into 900 for training, 100 for validation and 150 for testing) was used for this purpose. Direct μ-map prediction via DL is a promising opportunity which may direct future research efforts in this context.
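For the limited-FOV issue raised in the CBCT-to-CT item above, the two work-arounds mentioned there (stitching the sCT into the planning CT, or assigning water-equivalent density within the body contour) can be sketched as below. This is a minimal illustration, not the procedure of the cited studies: the function name, the masks and the HU values of 0 for water and -1000 for air are assumptions, and all arrays are assumed to be registered on the same grid.

```python
import numpy as np

def stitch_sct(planning_ct, sct, fov_mask, body_mask=None, water_hu=0.0):
    """Fill voxels outside the CBCT field of view.

    Inside `fov_mask` the sCT is used. Outside it, the planning CT is used,
    or water-equivalent HU within `body_mask` when that mask is given.
    """
    if body_mask is None:
        outside = planning_ct                               # stitch to planning CT
    else:
        outside = np.where(body_mask, water_hu, -1000.0)    # water in body, air elsewhere
    return np.where(fov_mask, sct, outside)

planning_ct = np.full((4, 4), 40.0)      # toy planning CT [HU]
sct = np.full((4, 4), 25.0)              # toy sCT [HU], valid only inside the FOV
fov = np.zeros((4, 4), dtype=bool)
fov[1:3, 1:3] = True                     # CBCT FOV smaller than the CT
body = np.ones((4, 4), dtype=bool)
body[0, 0] = False                       # one voxel outside the body contour

stitched = stitch_sct(planning_ct, sct, fov)
water_filled = stitch_sct(planning_ct, sct, fov, body_mask=body)
```

Either variant yields a volume covering the full dose grid; which one is preferable depends on how much the anatomy outside the FOV has changed since the planning CT.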

Deep learning considerations and trends
The number of patients used for training the networks is quite variable, ranging from a minimum of 7 (in I) Qian2020 to a maximum of 205 (in II) eckl2020evaluation and 242 (in I) Andres2020. In most cases, the patient number is limited by the availability of training pairs. Data augmentation, in the form of linear and non-linear transformations shorten2019survey, is performed to increase the training accuracy, as demonstrated by Pozaruk et al. pozaruk2020augmented. However, only a few publications investigated the impact of increasing the training size Maspero2020, Peng2020, Olberg2019, yuan2020convolutional, Andres2020, finding that image similarity increases when training with up to fifty patients. This provides some indication of the minimum number of patients to include in the training to achieve state-of-the-art performance. The optimal patient number may also depend on the anatomical site and its inter- and intra-fraction variability. Besides, attention should be dedicated to balancing the training set, as performed in Maspero2020, Andres2020. Otherwise, the network may overfit, as previously demonstrated for segmentation tasks li2019overfitting.
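For paired training, augmentation has to preserve the voxel correspondence between input and target, so the same random transformation is applied to both images. The sketch below uses a random flip and an integer translation as simple examples of linear transformations. These choices, the fill values and the function name are assumptions for illustration; the cited studies use other transformations, e.g. perturbations of the deformation field.

```python
import numpy as np

def augment_pair(mr, ct, rng, max_shift=5, air_hu=-1000.0):
    """Apply one random flip and translation identically to an MR/CT slice pair."""
    flip = rng.random() < 0.5
    dy, dx = (int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))

    def transform(img, fill):
        out = img[:, ::-1] if flip else img
        out = np.roll(out, (dy, dx), axis=(0, 1))   # returns a new array
        # np.roll wraps around: overwrite the wrapped-in border with `fill`
        if dy > 0:
            out[:dy, :] = fill
        elif dy < 0:
            out[dy:, :] = fill
        if dx > 0:
            out[:, :dx] = fill
        elif dx < 0:
            out[:, dx:] = fill
        return out

    return transform(mr, 0.0), transform(ct, air_hu)

rng = np.random.default_rng(42)
mr = rng.uniform(0.0, 500.0, size=(32, 32))
ct = 2.0 * mr - 1000.0          # toy "CT" with exact voxel correspondence to mr
mr_aug, ct_aug = augment_pair(mr, ct, rng)
```

Drawing the transformation parameters once and reusing them for both images is the essential point; sampling them separately for MR and CT would break the pairing.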

GANs were the most popular architecture, but we cannot conclude that they are the best network scheme for sCT. Indeed, some studies compared U-net or other CNNs vs GANs, finding that GANs performed statistically better Baydoun2020, zhang2020improving; others found similar results pozaruk2020augmented, gong2020a or even worse performance gong2020b, Li2020comp. We can speculate that, as demonstrated by Largent2019, a vital role is played by the loss function which, despite being the effective driver for network learning, has been investigated less than the network architecture, as highlighted for image restoration Zhao2018. Another important aspect is the growing trend, except for category III, in unpaired training (5 and 7 papers in 2019 and 2020, respectively). The quality of the registration when training in a paired manner influences the quality of deep learning-based sCT generation Florkow2019impact. In this sense, unpaired training offers an option to alleviate the need for well-matched training pairs. When comparing paired vs unpaired training, we observed that paired training led to slightly better performance. However, the differences were not always statistically significant Peng2020, Yang2020, Li2020comp. As pointed out by Yang et al. Yang2020, unsupervised training decreases the semantic information in going from one domain to another. Such an issue may be solved by introducing a structure-consistency loss, which extracts structural features from the image, defining the loss in the feature space. Yang et al.’s results showed improvements in this sense relative to other unsupervised methods. They also showed that pre-registering unpaired MR-CT further improves the results of unsupervised training, which can be an option when input and target images are available, but perfect alignment is not achievable. In some cases, unpaired training was even demonstrated to be superior to paired training Wolterink2017.
A trend that has recently emerged is the use of architectures initially designed for unpaired training, e.g. cycle-GAN, for paired training Lei2019mri, harms2019paired.
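Unpaired training with a cycle-GAN relies on a cycle-consistency term: an image translated to the other domain and back should reproduce the original, so no voxel-wise matched pair is needed. The sketch below computes only this L1 term and omits the adversarial losses. The two "generators" are hypothetical intensity mappings used to keep the example runnable, not trained networks.

```python
import numpy as np

def cycle_consistency_loss(g_mr2ct, g_ct2mr, mr_batch, ct_batch):
    """L1 cycle-consistency term of a cycle-GAN (adversarial terms omitted)."""
    mr_cycle = g_ct2mr(g_mr2ct(mr_batch))   # MR -> sCT -> reconstructed MR
    ct_cycle = g_mr2ct(g_ct2mr(ct_batch))   # CT -> sMR -> reconstructed CT
    return np.abs(mr_cycle - mr_batch).mean() + np.abs(ct_cycle - ct_batch).mean()

# Hypothetical toy "generators": simple intensity mappings, not trained networks.
def g_mr2ct(mr):
    return 2.0 * mr - 1000.0

def g_ct2mr_exact(ct):
    return (ct + 1000.0) / 2.0              # exact inverse of g_mr2ct

def g_ct2mr_biased(ct):
    return (ct + 1000.0) / 2.0 + 10.0       # inverse with a constant offset

rng = np.random.default_rng(0)
mr = rng.uniform(0.0, 500.0, size=(4, 16, 16))       # unpaired batches: no voxel
ct = rng.uniform(-1000.0, 1500.0, size=(4, 16, 16))  # correspondence is assumed

loss_exact = cycle_consistency_loss(g_mr2ct, g_ct2mr_exact, mr, ct)
loss_biased = cycle_consistency_loss(g_mr2ct, g_ct2mr_biased, mr, ct)
```

The loss vanishes only when the two mappings invert each other. It constrains the pair of generators but not the anatomical plausibility of the intermediate sCT, which is where additional structure-related losses come in.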
Focusing on the body sites, we observed that most of the investigations were conducted in the brain, H&N and the pelvic regions while fewer studies are available for the thorax and the abdomen, representing a more challenging patient population due to the organ motion rehman2020deep.

In the results of MR-only RT, we found contradicting results regarding the best performing spatial configuration for the papers that directly compared 2D vs 3D training Fu2019, Neppl2019. It is clear that 2D+ increases the sCT quality compared to single 2D views Spadea2019, Maspero2020; however, when comparing 2D against 3D training, the patch size is an important aspect Klages2020. 3D deep networks require a larger number of training parameters than 2D networks Singh20203d and, for sCT generation, the approaches adopted have used patch sizes much smaller than the whole volume, probably hindering the contextual information considered. Generally, downsampling approaches have been proposed to increase the receptive field of the network, e.g. for segmentation tasks Kamnitsas2016, but they have not been applied to sCT generation. We believe this will be an interesting area of research.

Regarding the latest developments from the deep learning perspective, in 2018 Oktay et al. oktay2018attention proposed a new mechanism, called attention gate (AG), to focus on target structures that can vary in shape and size. Liu et al. liu2020cbct incorporated the AG in the generator of a cycle-GAN to learn organ variation from CBCT-CT pairs in the context of pancreas adaptive RT, showing that its contribution significantly improved the prediction compared to the same network without AG. Other papers also adopted attention Yang2020, Kearney2020. Embedding has also been proposed to increase the expressivity of the network and was applied by Xiang et al. Xiang2018 (I). As the AG mechanism is a way to focus the network on specific portions of the image, it can potentially open the path for new research topics. In 2019, Schlemper and colleagues schlemper2019attention evaluated the AG for different tasks in medical image processing: classification, object detection and segmentation. So, we can envision that in online IGART such a mechanism could lead to multi-task applications, such as deriving sCT while delineating the structures of interest.
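The attention-gate idea can be sketched in a few lines: a gating signal and the skip-connection features are combined additively, passed through a ReLU and a sigmoid, and the resulting map rescales the skip features voxel by voxel. The code below is a simplified illustration with random, untrained weights and without bias terms or resampling. The shapes, weight names and the use of 1x1 convolutions expressed as `einsum` are assumptions, and it is not the implementation of the cited works.

```python
import numpy as np

def attention_gate(x, g, w_x, w_g, psi):
    """Additive attention gate on feature maps of shape (channels, H, W).

    x: skip-connection features, g: gating signal (assumed already resampled
    to the grid of x). w_x, w_g and psi act as 1x1 convolutions.
    """
    q = np.einsum("ic,chw->ihw", w_x, x) + np.einsum("ic,chw->ihw", w_g, g)
    q = np.maximum(q, 0.0)                                          # ReLU
    alpha = 1.0 / (1.0 + np.exp(-np.einsum("i,ihw->hw", psi, q)))   # sigmoid
    return alpha * x, alpha                                         # broadcast over channels

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))     # skip features, 8 channels
g = rng.normal(size=(4, 16, 16))     # gating signal, 4 channels
w_x = rng.normal(size=(6, 8))        # hypothetical random weights
w_g = rng.normal(size=(6, 4))
psi = rng.normal(size=6)

gated, alpha = attention_gate(x, g, w_x, w_g, psi)
```

Because `alpha` lies between 0 and 1, the gate can only attenuate features; after training, regions irrelevant to the target structure are expected to receive values close to 0.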

Benefits and challenges for clinical implementations
Deep learning-based sCT generation may reduce the need for additional or non-standard MRI sequences, e.g. UTE or ZTE, which could shorten the total acquisition time and speed up the workflow or increase patient throughput. As already mentioned, speed is particularly interesting for MR-guided RT, but it is considered crucial for adaptive RT in II too. Regarding categories II and III, the generation of DL-based sCT possibly enables dose reduction during imaging by reducing the need for CT in case of anatomical changes (in II) or by possibly reducing the amount of radioactive material injected (in III).

Finally, it is worth commenting on the current status of the clinical adoption of DL-based sCT. We could not find evidence that any of the methods considered are now clinically implemented and used. We speculate that this is probably related to the fact that the field is still relatively young, with the first publications dating only from 2017, and that clinical implementation generally takes years, if not decades Keeling2020, Bertholet2020. Additionally, as already mentioned, for categories I/II the impact of sCT on position verification still needs to be thoroughly investigated. Also, the implementation may be easier for category III if the methods were directly integrated into the scanner by the vendor. In general, the involvement of vendors may streamline the clinical adoption of DL-based sCT. In this sense, we can report that vendors are currently active in validating their methods in research settings, e.g. for brain Andres2020 and pelvis Bird2021 in I, and for H&N, thorax and pelvis in II eckl2020evaluation. In the last month, Palmer et al. Palmer2021synt also reported using a pre-released version of a DL-based sCT generation approach for H&N in MR-only RT. Another essential aspect that needs to be satisfied is compliance with the currently adopted regulations MDR2017, where vendors can offer vital support Fiorino2020technology.

A key aspect of clinical implementation is the precise definition of the requirements that a DL-based solution needs to satisfy before being accepted. If we consider the reported metrics, we cannot find uniform criteria on what to report and how to report it. Multiple metrics have been defined, and it is not clear on which regions of interest they should be computed. For example, image-based similarity was reported within the body contour or in tissues generally defined by different thresholds; for task-specific metrics, the methods employed are even more heterogeneous. For example, in I and II, gamma pass rates can be computed in 2D or 3D, and different dose threshold levels have been employed, e.g. 10%, 30%, 50% or 90% of the prescribed or the maximum dose. In III, the PET reconstruction error can be computed on the SUV, on the max SUV or in larger VOIs, making it difficult to compare the performance of different network configurations. We think that this lack of standardisation in reporting the results is also detrimental to clinical adoption. A first attempt at revising the metrics currently adopted has been made by Liesbeth et al. Liesbeth2020. However, this is still insufficient, considering the differences in how such metrics can be calculated and reported. In this sense, we advocate for consensus-based requirements that may facilitate reporting in future clinical trials Liu2020reporting. Also, no public datasets arranged in the form of grand challenges (https://grand-challenge.org/) are available to enable a fair and open evaluation of different approaches.
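To illustrate why the choice of region of interest and dose threshold matters, the following is a minimal sketch in plain Python on toy data. It is not taken from any of the reviewed studies: the function names, the HU thresholds used to define the body and bone masks, the 1% tolerance and the toy values are all assumptions made for illustration. The dose comparison is a simple dose-difference pass rate, which is a simplification of gamma analysis since it ignores the distance-to-agreement criterion. Real pipelines would operate on 3D arrays rather than flat lists.

```python
from statistics import fmean


def masked_mae(sct, ct, mask):
    """Mean absolute error (HU) over the voxels selected by mask."""
    diffs = [abs(s - c) for s, c, m in zip(sct, ct, mask) if m]
    if not diffs:
        raise ValueError("mask selects no voxels")
    return fmean(diffs)


def dose_pass_rate(dose_sct, dose_ct, threshold_fraction, tolerance_fraction=0.01):
    """Percentage of voxels above a dose threshold with a dose difference within tolerance.

    Both fractions are relative to the maximum reference dose (an assumption;
    studies may instead normalise to the prescribed dose).
    """
    d_max = max(dose_ct)
    selected = [(a, b) for a, b in zip(dose_sct, dose_ct)
                if b >= threshold_fraction * d_max]
    if not selected:
        raise ValueError("threshold selects no voxels")
    passed = sum(1 for a, b in selected if abs(a - b) <= tolerance_fraction * d_max)
    return 100.0 * passed / len(selected)


# Toy HU values: two air voxels, three soft-tissue voxels, two bone voxels.
ct = [-1000, -1000, 0, 40, 60, 300, 1200]
sct = [-1000, -990, 10, 30, 60, 200, 1000]

whole_image = [True] * len(ct)
body = [v > -500 for v in ct]  # hypothetical body-contour threshold
bone = [v > 250 for v in ct]   # hypothetical bone threshold

mae_by_roi = {
    "whole image": masked_mae(sct, ct, whole_image),
    "body": masked_mae(sct, ct, body),
    "bone": masked_mae(sct, ct, bone),
}

# Toy dose values in Gy; the low-dose voxel is the only one outside tolerance.
dose_ct = [0.5, 6.0, 20.0, 40.0, 50.0]
dose_sct = [1.5, 6.2, 20.4, 40.3, 50.2]

pass_rate_by_threshold = {
    t: dose_pass_rate(dose_sct, dose_ct, t) for t in (0.0, 0.1, 0.5, 0.9)
}
```

On this toy example, the same sCT yields a different MAE for each mask, and the pass rate rises once the low-dose voxel is excluded by the threshold, which is why the ROI and threshold need to be stated alongside any reported value.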
To date, four studies have investigated the performance of sCT in a multi-centre setting Maspero2020, Boni2020, Fetty2020dose, Bird2021. These studies concern only MR-only RT. Future work should also assess the performance of DL-based sCT generation for II and III in such settings. A user cannot readily judge the quality of an sCT, except when it is clearly inferior. Therefore, software-based quality assurance (QA) procedures should be put in place. It would be useful to have a phantom that allows regular QA procedures, as is the case for CT QA Mutic2003. This would be relatively straightforward for II; however, for MR-based sCT, the manufacturing of phantoms is quite challenging due to the need for contrast in both MRI and CT images. Recently, the first phantoms have been proposed for this task Gallas2015, Niebuhr2019, Singhrao2020, Colvill2020, showing the potential of additive manufacturing.

Alternatively, it would be valuable if a CNN could automatically generate a metric to assess the quality of sCTs, as already proposed, for example, for automatic segmentation Chen2020cnn. In this sense, Bragman et al. Bragman2018uncertainty proposed using uncertainty for such a task, adopting a multi-task network and a Bayesian probabilistic framework. More recently, two other works proposed estimating uncertainty either from the combination of independently trained networks Maspero2020 or via dropout-based variational inference Hemsley2020. So far, the field of uncertainty estimation with deep learning Abdar2020review has only been superficially explored for sCT generation. It would be interesting to see future work focusing on developing criteria for the automatic identification of failure cases using uncertainty prediction. Patients with inaccurate sCTs could then be flagged for a CT rescan or, if deemed feasible, for manual adjustment of the sCT.
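As a rough illustration of how uncertainty could drive such a flagging criterion, the sketch below takes the sCTs predicted by several independently trained networks, uses the voxel-wise standard deviation across them as an uncertainty proxy, and flags a case when the mean uncertainty inside a mask exceeds a limit. This is not the method of any cited work: the function names, the use of the population standard deviation, the 20 HU limit and the toy values are all assumptions for illustration, and any real criterion would need clinical validation.

```python
from statistics import fmean, pstdev


def voxelwise_uncertainty(predictions):
    """Per-voxel standard deviation (HU) across the ensemble's sCT predictions."""
    if len(predictions) < 2:
        raise ValueError("at least two predictions are needed")
    return [pstdev(values) for values in zip(*predictions)]


def flag_for_review(predictions, mask, max_mean_std_hu):
    """Return (flagged, mean_std), with mean_std averaged over the masked voxels."""
    uncertainty = voxelwise_uncertainty(predictions)
    inside = [u for u, m in zip(uncertainty, mask) if m]
    if not inside:
        raise ValueError("mask selects no voxels")
    mean_std = fmean(inside)
    return mean_std > max_mean_std_hu, mean_std


# Three hypothetical networks predicting four voxels; the last voxel is air
# and is excluded by the body mask.
body_mask = [True, True, True, False]
limit_hu = 20.0  # hypothetical acceptance limit, not a validated value

agreeing = [
    [0, 40, 300, -1000],
    [2, 42, 302, -1000],
    [4, 44, 304, -1000],
]
disagreeing = [
    [0, 40, 300, -1000],
    [0, 40, 900, -1000],
    [0, 40, -300, -1000],
]

agree_flag, agree_std = flag_for_review(agreeing, body_mask, limit_hu)
disagree_flag, disagree_std = flag_for_review(disagreeing, body_mask, limit_hu)
```

With these toy values, the ensemble that agrees stays below the limit, whereas the one whose members disagree strongly on a single voxel is flagged. A dropout-based variant would replace the independently trained networks with repeated stochastic forward passes of one network.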

Beyond sCT for radiotherapy
During the database search, we found other possible applications of DL-based image generation that fall outside the categories mentioned so far or outside radiotherapy. For example, Kawahara et al. Kawahara2020 proposed to generate synthetic dual-energy CT from CT to assess the body material composition using 2D paired GANs. Also, commercial solutions have started to be evaluated for the generation of DL-based sCT from MRI for lesion detection in suspected sacroiliitis Jans2020mri or to facilitate surgical planning of the spine Staartjes2021. Another interesting application is the generation of sCT to facilitate multi-modal image registration, as proposed by Mckenzie et al. Mckenzie2020.

Additionally, the methods reviewed here for generating sCT can be applied to translating other image modalities. Interesting examples in the RT realm are provided by Jiang et al. Jiang2019cross, who investigated using MRI-to-CT translation to increase the robustness of segmentation, and Kieselmann et al. Kieselmann2020, who generated synthetic MRI from CT to enable the training of segmentation networks that exploit the wealth of delineations available on another modality. A detailed review of other image-to-image translation applications in radiotherapy has recently been compiled by Wang et al. Wang2020review.

V. Conclusion

The deep learning-based generation of sCT has been reviewed for three applications: I) replacing CT in MR-based treatment planning, II) facilitating CBCT-based adaptive radiotherapy, and III) deriving attenuation maps for the correction of PET. A detailed review of each category was presented, providing a comprehensive comparison among DL-based methods in terms of the most popular metrics reported. The essential contributions were highlighted, identifying specific challenges. We found that DL-based sCT generation is an active and growing area of research. For several anatomical sites, e.g. H&N/brain and pelvis, sCT generation seems to be feasible with deep learning. While deep learning-based sCT generation techniques are up-and-coming, comprehensive commissioning and QA of these techniques are critical prior to, and essential during, clinical deployment to ensure patient safety.

VI. Acknowledgements

Matteo Maspero is grateful to prof.dr.ir. Cornelis (Nico) A.T. van den Berg, head of the Computational Imaging Group for MR diagnostics & therapy, Center for Image Sciences, UMC Utrecht, the Netherlands for the general support provided during this manuscript’s compilation.

VII. Conflict of interest

None of the authors has any conflict of interest to disclose.

Appendix

The query used in selected databases - PubMed, Scopus and Web of Science - in the fields (Title/Abstract/Keywords) was the following (Figure 4):

(("radiotherapy") OR ("radiation therapy") OR ("proton therapy") OR ("oncology") OR ("imaging") OR ("radiology") OR ("healthcare") OR ("CBCT") OR ("cone-beam CT") OR ("PET") OR ("attenuation correction") OR ("attenuation map")) AND (("synthetic CT") OR ("syntheticCT") OR ("synthetic-CT") OR ("pseudo CT") OR ("pseudoCT") OR ("pseudo-CT") OR ("virtual CT") OR ("virtualCT") OR ("virtual-CT") OR ("derived CT") OR ("derivedCT") OR ("derived-CT") OR (sCT)) AND (("deep learning") OR ("convolutional network") OR ("CNN") OR ("GAN") OR ("GANN") OR (artificial intelligence));
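The query has a regular structure: three groups of alternatives joined by AND, covering the application domain, the synthetic-CT synonyms and the deep learning terms. A minimal sketch of how such a string could be assembled from term lists is given below; the helper name `or_group` and the variable names are hypothetical, and database-specific field tags (Title/Abstract/Keywords) are deliberately left out because their syntax differs between PubMed, Scopus and Web of Science.

```python
def or_group(quoted_terms, unquoted_terms=()):
    """Join terms with OR inside one parenthesised group; quoted terms are exact phrases."""
    parts = [f'("{term}")' for term in quoted_terms]
    parts += [f"({term})" for term in unquoted_terms]
    return "(" + " OR ".join(parts) + ")"


domain_terms = [
    "radiotherapy", "radiation therapy", "proton therapy", "oncology",
    "imaging", "radiology", "healthcare", "CBCT", "cone-beam CT", "PET",
    "attenuation correction", "attenuation map",
]

# Spaced, joined and hyphenated spellings of each synthetic-CT synonym.
sct_terms = [
    f"{prefix}{separator}CT"
    for prefix in ("synthetic", "pseudo", "virtual", "derived")
    for separator in (" ", "", "-")
]

dl_terms = ["deep learning", "convolutional network", "CNN", "GAN", "GANN"]

query = " AND ".join([
    or_group(domain_terms),
    or_group(sct_terms, unquoted_terms=["sCT"]),
    or_group(dl_terms, unquoted_terms=["artificial intelligence"]),
])
```

Keeping the terms in lists makes it easier to rerun the search with an extended time window or additional synonyms.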

Figure 4: Schematic of the search inclusion/exclusion criteria adopted for this review selecting the time window, keywords, type of article, content and the three categories defined.

VIII. Acronyms and abbreviations

μ-map: attenuation map; AC: attenuation correction; aff: affine; AG: attention gate; CBCT: cone-beam computed tomography; CC: cross-correlation; CNNs: convolutional neural networks; cor: coronal; CT: computed tomography; cycle-GAN: cycle-consistent GAN; DD: dose difference; def: deformable; DL: deep learning; DPR: dose pass rate; DSC: dice similarity coefficient; DVH: dose-volume histogram; ens: ensemble; FLAIR: fluid-attenuated inversion recovery; FOV: field of view; GANs: generative adversarial networks; Gd: Gadolinium; GPR: gamma pass rate; GRE: gradient recalled-echo; H&N: head and neck; IGART: image-guided adaptive radiation therapy; MAE: mean absolute error; MR: magnetic resonance; MRAC: MR attenuation correction; MRI: magnetic resonance imaging; MSE: mean squared error; mUTE: multiple echo UTE; NCC: normalised cross-correlation; OAR: organ-at-risk; p: proton; paed: paediatric; PET: positron emission tomography; PET: absolute error PET reconstruction; PET: relative error PET reconstruction; PSNR: peak signal-to-noise ratio; rig: rigid; RMSE: root mean squared error; ROI: region-of-interest; RS: range shift; RT: radiotherapy; sag: sagittal; sCT: synthetic computed tomography; SSIM: structural similarity index; SUV: standard uptake values; tra: transverse; TSE: turbo spin-echo; UTE: ultra-short echo time; VOI: volume of interest; x: photon.

References