OncoNet: Weakly Supervised Siamese Network to automate cancer treatment response assessment between longitudinal FDG PET/CT examinations

08/03/2021 ∙ by Anirudh Joshi, et al. ∙ Stanford University

FDG PET/CT imaging is a resource-intensive examination critical for managing malignant disease and is particularly important for longitudinal assessment during therapy. Approaches to automate longitudinal analysis face many challenges, including a lack of available longitudinal datasets, the complexity of large multimodal imaging examinations, and the need for detailed annotations for traditional supervised machine learning. In this work we develop OncoNet, a novel machine learning algorithm that assesses treatment response from 1,954 pairs of sequential FDG PET/CT exams through weak supervision using the maximum standardized uptake values (SUVmax) in associated radiology reports. OncoNet demonstrates an AUROC of 0.86 and 0.84 on internal and external institution test sets respectively for determination of change between scans, while also showing strong agreement with clinical scoring systems with a kappa score of 0.8. We also curated a dataset of 1,954 paired FDG PET/CT exams designed for response assessment for the broader machine learning in healthcare research community. Automated assessment of radiographic response from FDG PET/CT with OncoNet could provide clinicians with a valuable tool to rapidly and consistently interpret change over time in longitudinal multi-modal imaging exams.




1 Introduction

Cancer is one of the leading causes of death worldwide, and accurate diagnosis, staging and restaging are essential to optimize therapeutic management. Advanced medical imaging techniques such as positron emission tomography (PET) coupled with computed tomography (CT) are integral to cancer diagnosis and to the assessment of treatment response. PET is the most sensitive non-invasive imaging modality, capable of detecting picomolar amounts of radiolabeled sugar molecules trapped in cancer cells, while CT provides high tissue resolution for precise localization. The clinical interpretation of PET/CT scans involves synthesizing multiple data sources: clinical information, the metabolic findings from PET and the anatomic information from CT.

In clinical practice, radiologists and nuclear medicine physicians must interpret consecutive PET/CT examinations to determine whether a cancer patient receiving treatment is appropriately responding to therapy. They do this by measuring whether the amount of malignant tissue is decreasing, unchanged or increasing across exams. This process is chiefly qualitative and sometimes subjective, but oncology treatment planning increasingly demands standardized, quantitative data [4]. Further, the process is extremely labor intensive and time-consuming, and a stark rise in utilization of FDG-PET/CT imaging suggests that automation technologies would have high impact in clinical oncologic imaging workflows.

Deep learning approaches have produced state of the art results for automated interpretation of various medical imaging modalities [9, 8, 19]. Prior work in deep learning for PET/CT has demonstrated the ability of automated systems to detect and estimate locations of abnormalities in individual PET/CT imaging exams [6]. However, as discussed above, there is a pressing need for automated methods capable of comparing across consecutive oncologic imaging studies and estimating changes in disease burden. If successful, automation of this clinically important task could improve routine clinical oncologic imaging workflows, enhance standardized quantification of imaging biomarkers for oncologic therapy trials, and contribute to operationalization of communications regarding response to therapy to referring clinicians and patients.

Existing approaches that compare consecutive medical images over time have been validated on 2D imaging modalities such as radiography and retinal fundus imaging [1, 16]. However, applying these techniques to PET/CT presents significant new challenges: (1) a single PET/CT exam is composed of hundreds of PET and CT image slices, so training a model to compare multiple complex PET/CT examinations is methodologically challenging and computationally expensive, (2) established scoring systems for measuring longitudinal changes are subjective, so human readers often produce inconsistent scores, (3) clinical information describing longitudinal changes in consecutive PET/CT imaging studies is reported in narrative text reports, which makes it challenging to extract meaningful labels for deep learning training in large datasets, and (4) existing work in PET/CT is limited by small reported experimental datasets or reliance on phantoms, which limits scientific advancement.

The purpose of this study is to model the task of longitudinal treatment response assessment on volumetric multi-modality oncologic PET/CT imaging examinations, automatically determining disease progression in pairs of FDG-PET/CT studies obtained before and after treatment.

Our contributions include (1) a dataset of 1,954 labeled pairs of PET/CT imaging examinations, using detailed clinical reports, for multi-modal model development; (2) OncoNet, a 3D deep learning model that performs treatment response assessment at AUROC 0.85 [0.7-0.95] on the internal test set; (3) external validation of OncoNet on multi-institutional data; and (4) comparison of OncoNet predictions to clinical scoring produced by a board certified radiologist. We hope our annotated dataset and methodological contributions serve as a stepping stone towards achieving automated, quantitative oncologic imaging evaluation over time with broad implications for cancer care.


2 Methods

2.1 Data

Training Data: This retrospective study was approved by our institutional review board with waived patient consent. Our training dataset consists of 2572 deidentified FDG PET/CT scans from 656 patients, leading to 1954 paired longitudinal scans. The exams were administered at Stanford hospital between 2003 and 2010. Each FDG-PET/CT exam also included a free-text unstructured report compiled by a board-certified radiologist at the time of the examination. The image resolution of the FDG-PET scan was 128 × 128 pixels, while that of the CT scan was 512 × 512 pixels.

The exams were split into training (1888), validation (33), and test sets (33). The validation and test sets were sampled at random with uniform probability to capture the class distribution expected in a clinical setting. The exams were split by patient, ensuring that there was no patient overlap between the training, validation, and test sets. The validation set was used to evaluate prototypes during model development, to tune hyperparameters using average AUROC, and for early stopping during model training. All reported metrics were computed on the test set, unless otherwise specified.

The training set annotations were extracted using a rules based heuristic on the radiology reports. Radiology reports contain Findings sections which detail sections of the scan (Head and Neck, Thorax, Abdomen) along with information on lesions identified, measurements and SUV values. We propose to use the maximum SUVs (SUVmax) recorded for lesions in the thorax before and after treatment as weak supervision for OncoNet. If the difference in SUVmax values between the scans is greater than 25%, the label is tumor progression; if it is less than -25%, the label is resolution; and if it lies between -25% and 25%, the label is stable, in accordance with the Lugano 2014 criteria for tumor evaluation, assessment, and response prediction using 18F-FDG PET/CT [24]. The validation and test set SUV annotations are determined by a board certified radiologist reviewing the radiology reports. For each exam, the radiologist assigned a single SUV score that corresponded to the most metabolically active lesion reported. The categorical label was determined as above using the Lugano 2014 criteria.
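The labeling rule above can be sketched as a small function. This is an illustrative sketch, not the authors' extraction code: the function names are hypothetical, and it assumes the "difference" is the percent change relative to the baseline SUVmax and that a change of exactly ±25% counts as stable.

```python
def percent_change(baseline_suvmax: float, followup_suvmax: float) -> float:
    """Percent change in SUVmax from baseline to follow-up.

    Assumption: the change is expressed relative to the baseline value.
    """
    if baseline_suvmax <= 0:
        raise ValueError("baseline SUVmax must be positive")
    return 100.0 * (followup_suvmax - baseline_suvmax) / baseline_suvmax


def label_response(baseline_suvmax: float, followup_suvmax: float,
                   threshold: float = 25.0) -> str:
    """Map a pair of SUVmax values to progression / resolution / stable."""
    change = percent_change(baseline_suvmax, followup_suvmax)
    if change > threshold:
        return "progression"
    if change < -threshold:
        return "resolution"
    # Assumption: boundary values (exactly +/-25%) are treated as stable.
    return "stable"
```

For example, a baseline SUVmax of 8.0 followed by 4.0 is a 50% decrease and would be labeled resolution under these assumptions.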

Internal Test Set: Our thorax test set consisted of 46 scans from 13 patients, leading to 33 paired longitudinal scans. The exams were sampled randomly from the original dataset and contain 11 pairs of each of the three classes.

External Test Set: For our external validation we used a public dataset (ACRIN 6668) from The Cancer Imaging Archive, which contains studies from 242 patients. The study was conducted as a multicenter trial with the goal of determining whether PET SUV uptake in non-small cell lung carcinoma (NSCLC) was a useful predictor of long-term clinical outcome (survival) after definitive chemoradiotherapy. Restricting to cases where the metadata recorded SUVmax for the thorax led to a subset of 60 patients. From this subset we filtered out scans where the PET and CT did not take place on the same date or had different numbers of slices. For the subset of the dataset that we selected, 18 patients saw significant reductions in SUVmax of the hottest lesion, categorizing them all as tumor resolution, 1 patient was characterized as tumor progression and 1 patient was characterized as stable. All of the exams demonstrated an improvement based on a 25% reduction in SUVmax values post treatment.

For the longitudinal perturbation experiment, the baseline and follow-up scans are reversed and provided as input to the model with the flipped label.
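A minimal sketch of this perturbation is shown below. The function and dictionary names are hypothetical, and the sketch assumes that a stable pair keeps its label when the scan order is reversed.

```python
# Label obtained when the baseline and follow-up scans swap places.
# Assumption: "stable" is symmetric and therefore unchanged.
FLIPPED_LABEL = {
    "progression": "resolution",
    "resolution": "progression",
    "stable": "stable",
}


def flip_pair(baseline, followup, label):
    """Return the time-reversed pair together with its flipped label."""
    return followup, baseline, FLIPPED_LABEL[label]
```

Applied to a test set dominated by resolution pairs, this yields a set dominated by progression pairs, which makes it possible to check that a model responds to the direction of change rather than to the class prior.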

2.2 Network Architecture

OncoNet consists of three main components: an Encoder, a Decoder, and a Classifier Head. The Encoder is formed using the Inflated Inception V1 3D convolutional neural network (I3D) pretrained on Kinetics using optical flow [3]. The final classification layer was removed, making the output a 3-dimensional encoding of the input scan. The shape of the encoding depends on the number of slices in the original exam. The Decoder consists of a soft attention mechanism and a linear classification layer. The soft attention is a dot product between each voxel in the encoded representation and a learned weight matrix. A softmax is applied to the scores computed by the dot product, and a linear combination of the voxels, weighted by the resulting attention weights, is computed. The intuition behind the soft attention is to place higher weight on certain voxels of the exam in a data-driven manner while determining change in tumor burden. This encoder-decoder structure has shown effectiveness in prior work on PET/CT abnormality detection [6].

The encoder and decoder weights are shared during each forward pass for each exam in the pair of exams. The output representations from the decoder are used to compute a difference representation which is a flattened tensor of dimension (hidden size,) which is then passed into a classifier head to determine response to treatment. The classifier head is formed of two linear layers with a ReLU activation function in between.

Figure 1: Network architecture for OncoNet. Each PET/CT exam in a pair is run through the encoder-decoder forward pass with weights shared between the two exams in the pair. The decoder representations are diffed and the resulting representation is passed through a classifier head to determine Progression, Resolution and Stable
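The forward pass described in this section can be illustrated with a dependency-free sketch. It is not the authors' implementation. It assumes the I3D encoder has already produced one feature vector per voxel, and it simplifies the learned attention weight matrix to a single vector. It omits the decoder's linear layer and assumes the difference is taken as follow-up minus baseline. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def soft_attention_pool(voxels: Sequence[Vector], attn_w: Vector) -> Vector:
    """Softmax-weighted sum of voxel feature vectors (soft attention)."""
    scores = [sum(v_i * w_i for v_i, w_i in zip(v, attn_w)) for v in voxels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(voxels[0])
    return [sum(a * v[d] for a, v in zip(weights, voxels)) for d in range(dim)]


def linear(x: Vector, weight: Sequence[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per row of `weight`."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def siamese_logits(baseline_voxels, followup_voxels, attn_w, w1, b1, w2, b2):
    """Shared-weight pooling of both exams, difference, two-layer head."""
    rep_base = soft_attention_pool(baseline_voxels, attn_w)
    rep_follow = soft_attention_pool(followup_voxels, attn_w)
    # Assumption: difference is follow-up minus baseline.
    diff = [f - b for f, b in zip(rep_follow, rep_base)]
    hidden = relu(linear(diff, w1, b1))
    return linear(hidden, w2, b2)  # one logit per response class
```

Because the same attention weights are used for both exams, two identical exams produce a zero difference vector, so the output then depends only on the classifier head's biases.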

2.3 Training Procedure

The model is trained with a batch size of 2 for a maximum of 30 epochs, using early stopping based on validation loss. The learning rate is 0.0001 with an Adam optimizer, and the learning rate is decayed by a factor of 0.1 every 10 epochs. Dropout is used as regularization during training in the classifier head. Cross entropy loss is used for supervision and the models are trained using PyTorch 1.4 on 2 NVIDIA TITAN V GPUs.
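The schedule described above can be sketched in plain Python. The step decay follows the stated values (base rate 0.0001, factor 0.1, every 10 epochs), assuming zero-indexed epochs. The early-stopping rule and its `patience` parameter are hypothetical, since the text does not say how many epochs without improvement trigger a stop.

```python
def learning_rate(epoch: int, base_lr: float = 1e-4,
                  step_size: int = 10, gamma: float = 0.1) -> float:
    """Step decay: multiply the rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def should_stop(val_losses, patience: int = 3) -> bool:
    """Stop when the last `patience` epochs did not beat the earlier best.

    `patience` is an illustrative assumption, not a value from the paper.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before
```

With these defaults the rate is 1e-4 for epochs 0 to 9, 1e-5 for epochs 10 to 19, and 1e-6 for epochs 20 to 29.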

Data augmentation is also used during training to improve generalization and prevent overfitting. Three types of data augmentation are used. We randomly crop each slice of the exam to 200 x 200 and rescale it to 224 x 224, which downsamples the 512 x 512 CT image and upsamples the 128 x 128 PET image. We also introduce rotations in all slices of the CT, followed by a random 200 x 200 crop and rescaling to 224 x 224.

2.4 Model Evaluation

The test set SUV annotations are determined by a board certified radiologist reviewing the radiology reports. For each exam, the radiologist assigned a single SUV score that corresponded to the most metabolically active lesion present in the scan. The categorical labels for tumor response were determined by the percentage difference in SUVmax scores for the paired scans pre and post treatment. If the percentage difference was greater than 25%, the response was determined to be “progression”. If the percentage was less than -25%, the response was determined to be “resolution” and values in between were determined to be “stable”.

Each model was evaluated using AUROC on the predictions compared to the test set labels. AUCs were computed for each class and each region individually. F1 scores were also computed. 95% confidence intervals were computed using 1000 bootstrap replicates for average AUROC across classes. The models were also evaluated on an external public dataset from additional institutions in a similar manner. The SUVmax values for the external dataset were derived from the dataset metadata directly and not through radiology reports.
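A percentile bootstrap for AUROC can be sketched as follows. This is a simplified illustration for one binary (one-vs-rest) class, whereas the paper reports intervals for the AUROC averaged across classes. The function names, the fixed seed, and the choice to skip resamples that contain only one class are assumptions.

```python
import random
from typing import Optional, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(scores: Sequence[float], labels: Sequence[int],
                 n_replicates: int = 1000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    for _ in range(n_replicates):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        value = auroc([scores[i] for i in idx], [labels[i] for i in idx])
        if value is not None:  # assumption: skip single-class resamples
            stats.append(value)
    if not stats:
        raise ValueError("no valid bootstrap replicates")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi
```

With test sets as small as those used here (33 pairs), intervals computed this way are expected to be wide, which is consistent with the ranges reported in the results.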

We visualize the model outputs using gradient based Guided Backpropagation saliency maps, which compute the gradients of the loss with respect to the original pixels in each 2D slice of the exam. These gradients are plotted to show the pixels to which the model prediction is most sensitive.

2.5 Deauville Score Agreement

A board certified radiologist reviewed each of the exams in the internal Thorax test set and assigned scores based on the Deauville criteria. The Deauville criteria are an internationally accepted scoring system that evaluates FDG avidity of a tumor mass as seen on FDG-PET. The criteria specify a 5 point scale:

  • Score 1: No uptake above the background

  • Score 2: Uptake ≤ mediastinum

  • Score 3: Uptake > mediastinum but ≤ liver

  • Score 4: Uptake moderately increased compared to the liver at any site

  • Score 5: Uptake markedly increased compared to the liver at any site

To measure agreement we evaluate based on two stratification strategies. 1) We compare OncoNet’s agreement with determination of worse/not worse by mapping the Deauville scores 1,2,3,4 to OncoNet’s resolution and stable classes and the score of 5 to OncoNet’s progression class. 2) We compare OncoNet’s agreement with resolution and progression by mapping the resolution class to Deauville score 1,2,3 and the progression class to Deauville score 5.
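The two stratification strategies amount to simple lookups, sketched below. The function and constant names are hypothetical. In the second strategy the sketch returns `None` for a Deauville score of 4, reflecting that this score has no counterpart in the resolution-versus-progression comparison.

```python
# Strategy 1: collapse OncoNet's three classes into worse / not worse.
MODEL_TO_WORSE = {
    "progression": "worse",
    "stable": "not worse",
    "resolution": "not worse",
}


def deauville_to_worse(score: int) -> str:
    """Strategy 1: Deauville 5 is 'worse'; scores 1 to 4 are 'not worse'."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError("Deauville score must be between 1 and 5")
    return "worse" if score == 5 else "not worse"


def deauville_to_response(score: int):
    """Strategy 2: scores 1 to 3 map to resolution, 5 to progression.

    Returns None for score 4, which is excluded from this comparison.
    """
    if score in (1, 2, 3):
        return "resolution"
    if score == 5:
        return "progression"
    if score == 4:
        return None
    raise ValueError("Deauville score must be between 1 and 5")
```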


3 Results

3.1 Weakly-supervised siamese-style training enables automated treatment response predictions in the thorax

Annotating every lesion in thousands of pairs of CT scans to compare tumor progression is a highly costly and time consuming endeavour for healthcare systems. To allow any healthcare system to train OncoNet, we leverage information recorded in readily-available radiology reports as weak supervision for treatment response. Maximum standardized uptake values (SUVmax) of metabolically active tumors are extracted from reports through rules based heuristics as described in methods, and the differences in the SUVmax values between the scans are categorized as progression, resolution and stable.

OncoNet uses an encoder-decoder architecture previously validated on the task of anatomically-resolved PET/CT abnormality detection [6]. Like a siamese neural network [13], OncoNet computes decoder representations from two forward passes (one for each PET/CT scan in the pair). Finally, it computes the elementwise difference between the two representations and feeds it into a final classification layer. As an ablation, we study an approach that computes a difference between the PET/CT imaging exams prior to passing into the encoder-decoder network and runs a single pass through the network. This single pass approach has been studied in prior literature in automated disease progression prediction [2].

We find that OncoNet scores 0.84 [0.75-0.95] AUROC on the task of longitudinal change prediction on thorax and outperforms the single pass approach by 0.14 AUROC (p < 0.01). From Fig 2 we see that OncoNet scores 0.93 [0.84-1.0] AUROC on Progression, 0.81 [0.56-0.98] on Resolution and 0.78 [0.59-0.95] on Stable. 95% confidence intervals are computed using 1000 bootstrap replicates.

In Figure 3 we produce guided backpropagation saliency maps over each slice of the PET/CT to understand which voxels OncoNet’s classification is most sensitive to. The top row shows an instance of disease progression where saliency is concentrated on the new tumor in the lower left. The middle row shows a stable tumor nodule. There is no spike in saliency on the stable tumor, suggesting that the model is focused not solely on abnormality detection but on change in abnormality. Finally, in the bottom row we see that when there is a reduction in the disease from the previous scan, the saliency identifies the region where the tumor resolved.

Figure 2: ROC curves for OncoNet test performance. (Left) Microaveraged ROC curves over the three classes with the grey bounds indicating 2 standard deviations computed by bootstrap sampling on 1000 samples. (Right) The individual class ROC curves for Progression (AUROC: 0.93), Resolution (AUROC: 0.81) and Stable (AUROC: 0.78).

Model        AUROC             F1                Precision         Recall
OncoNet      0.85 [0.7-0.95]   0.70 [0.54-0.85]  0.73 [0.56-0.88]  0.70 [0.55-0.85]
Single Pass  0.67 [0.53-0.82]  0.54 [0.36-0.72]  0.57 [0.37-0.76]  0.55 [0.40-0.72]

Table 1: Class averaged performance metrics along with 95% confidence intervals computed by bootstrap sampling over 1000 samples. OncoNet significantly outperforms the single pass ablation (p < 0.01).

Figure 3: Examples of saliency maps produced by OncoNet on true positives from the test set. The first and third columns visualize the PET overlaid on the CT for the baseline and follow-up scans. The second and fourth columns visualize the saliencies overlaid on the CT for the baseline and follow-up. The top row shows the emergence of a new tumor (progression) in the bottom left part of the scan, and the saliency focuses on the region of the change. In the middle row, the same tumor is seen across both scans, indicating stable disease. The saliency is faint given that OncoNet focuses on change in disease and not the presence/absence of disease. In the bottom row the tumor present in the baseline regresses, and the saliency focuses on the region where the tumor reduced.

3.2 OncoNet maintains performance on data from external institutions

We evaluated OncoNet on a public external dataset from The Cancer Imaging Archive contributed by the ACRIN Cooperative Group [18]. Details on inclusion criteria and the dataset are included in methods.

OncoNet maintains internal test set performance by scoring an average AUROC of 0.84 on the external test set. 18 out of 20 exams present in the external test set are tumor resolution, so we propose a method to evaluate generalization of deep learning models for longitudinal analysis by flipping the ordering of scans in the pair. We flip the progression and resolution pairs of exams resulting in a dataset of 18 progression, 1 resolution and 1 stable. We find that OncoNet scores an AUROC of 0.86 on the flipped external test set. We extend this evaluation to our internal test set by flipping the progression and resolution exams. We find that OncoNet scores average AUROC of 0.81 on the perturbed internal test set.

The single pass ablation performs worse on the external set than on the internal set by 0.11 AUROC, but maintains performance on the flipped perturbation for both internal and external test sets. See Table 2 for complete metrics.

Model        Data Set  Flipped  AUROC  F1    Precision  Recall
OncoNet      Internal  No       0.85   0.70  0.73       0.70
OncoNet      External  No       0.84   0.81  0.91       0.75
OncoNet      External  Yes      0.86   0.92  0.70       0.78
OncoNet      Internal  Yes      0.81   0.61  0.63       0.64
Single Pass  Internal  No       0.67   0.54  0.57       0.55
Single Pass  External  No       0.56   0.59  0.89       0.45
Single Pass  External  Yes      0.86   0.80  0.86       0.75
Single Pass  Internal  Yes      0.67   0.54  0.59       0.54

Table 2: Class averaged performance metrics to compare performance across internal and external test sets along with the flipped perturbation

3.3 OncoNet predictions correlate with the Deauville clinical scoring system

A board certified radiologist compared each paired FDG-PET/CT scan in the test set and scored the scans based on the Deauville five point scale. The scoring system is routinely used in clinical practice to quantify treatment response in FDG PET/CT.

A score of 1,2,3 indicates that there was response to treatment with a granular breakdown provided in the methods. A score of 4 could indicate a partial response where the metabolic activity has either remained constant or reduced but not below the level in the liver. A score of 5 indicates a new tumor or increased metabolic activity compared to the previous scan.

We study whether OncoNet’s predictions agree with the radiologist based on two levels of stratification. (1) In clinical practice stratifying a patient in categories 1,2,3,4 vs 5 is important for the radiologist to determine whether patients are either getting worse or not. We evaluate agreement with 5 by using the label corresponding to disease progression. Labels corresponding to resolved or stable would agree with categories 1,2,3,4. (2) Since there isn’t a direct one to one mapping between the “stable” model class and the Deauville scoring system we also evaluate another stratification to study whether using the outputs of “resolution” and “progression” correlate to categories 1,2,3 and category 5 respectively. We selected all the exams where the model predicts “resolved” and “progressed” and of those selected the exams that received scores of 1,2,3,5.

We computed the Cohen’s Kappa agreement between the model outputs and the clinical scoring and found kappa values of (1) 0.80 and (2) 0.73.
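For reference, Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. The sketch below is a generic implementation of that standard definition, not the authors' evaluation code, and the function name is hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)
```

Here one rater would be the model's mapped prediction for each pair and the other the radiologist's mapped Deauville category.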

3.4 OncoNet can be applied to other anatomical regions in the scan

To determine if OncoNet’s network architecture and training algorithm generalize to other anatomical regions of the exam, we evaluate OncoNet and the single pass ablation on a dataset of abdominal exams along with their associated SUVmax. We find that OncoNet scores 0.80 [0.67-0.90] AUROC on treatment response prediction in the abdomen and pelvis, with class-specific performance of 0.85 [0.72-0.97] AUROC on Progression, 0.74 [0.57-0.90] on Resolution and 0.80 [0.63-0.93] on Stable (see Fig 4). Overall, OncoNet outperforms the single pass approach by 0.32 AUROC (p < 0.01).

Model        AUROC             F1                Precision         Recall
OncoNet      0.80 [0.67-0.90]  0.65 [0.49-0.80]  0.67 [0.50-0.82]  0.65 [0.51-0.78]
Single Pass  0.48 [0.29-0.56]  0.23 [0.11-0.36]  0.20 [0.08-0.34]  0.27 [0.14-0.43]

Table 3: Class averaged performance metrics on the abdomen along with 95% confidence intervals computed by bootstrap sampling over 1000 samples.

Figure 4: ROC curves for OncoNet test performance on the abdomen. (Left) Microaveraged ROC curves over the three classes with the grey bounds indicating 2 standard deviations computed by bootstrap sampling on 1000 samples. (Right) The class-specific ROC curves for Progression (AUROC: 0.85), Resolution (AUROC: 0.74) and Stable (AUROC: 0.80).


4 Discussion

The purpose of this study was to model the task of longitudinal treatment response prediction on multi-slice, multi-modality, multi-class oncologic imaging examinations to achieve automated determination of disease progression, improvement/response, or stability using pairs of FDG PET/CT studies obtained before and after treatment. A siamese-style neural network (OncoNet) classified treatment response between consecutive PET/CT thoracic examinations with AUROC of 0.85 and generalized to an external dataset with AUROC 0.84. In addition, OncoNet achieved AUROC 0.80 when evaluating disease response in the abdomen. When mapping predictions to a common qualitative clinical scoring system for reporting change in disease over time (Deauville), the kappa agreement score between OncoNet and a subspecialist-trained radiologist was 0.8, higher than values reported in the literature between expert readers for the same task. Finally, the labeled training dataset of 1954 imaging studies is made available to the open research and education communities to further innovate on this important problem.

In the clinical interpretation of medical imaging, comparing consecutive medical image examinations is critical to evaluate disease severity and change over time, and is especially vital in oncologic imaging and management of therapy. Prior work leveraging convolutional Siamese neural networks was shown to be successful in automated evaluation of disease severity and change over time in consecutive imaging studies in knee radiographs and retinal fundus images; this established the feasibility of comparing paired images from the same patient at two time points using a Siamese neural network, allowing a continuous measure of change between images without manual localization of the pathology of interest [16]. However, this approach was limited: it estimated only a binary output (same vs different), achieving AUC 0.90 in evaluating knee osteoarthritis change, and could accommodate only single 2-dimensional image examinations. To date there have been no published works exploring the automated quantitative comparative analysis of consecutive multi-slice multi-modality imaging examinations (i.e. CT, MR etc) such as the use case in this work with FDG PET/CT scans.

Earlier work toward an end-to-end framework utilizing a weakly supervised approach to lesion detection and localization in PET/CT was found to be capable of excellent performance in leveraging an individual multi-slice imaging examination [6]. While this work represents innovation in automated analysis of FDG PET/CT using deep learning techniques, consideration of only the individual examination, without the context of change over time in consecutive studies, ultimately lessens the clinical impact, because comparative quantification of disease over time, especially during oncologic therapy, is a chief indication for performing FDG PET/CT imaging. Toward that goal, prior efforts at automating the quantification of FDG-PET disease progression have consisted largely of semi-automated approaches requiring significant manual input to achieve quantification [17]. For example, the Auto-PERCIST software, based on traditional rules based software methodologies, can extract quantitative data for relevant imaging pathology (SUVmax, volume, etc) and was shown to reduce inter-reader variability. However, this system demands much of the interpreting physician, requiring user input for lesion identification, manual registration of comparison examinations, and human expert localization and selection of the reference tumor as the basis for comparison across studies. Other work used a CNN-based deep-learning approach to achieve automated segmentation of lung tumors in thoracic FDG-PET images based on phantom images or on small datasets without an end-to-end approach [15]. By contrast, our approach is fully automated and end-to-end: it reports progression, stable disease, or response without requiring any manual input from the human expert, achieves state of the art inter-rater agreement with human experts, and is robust to external populations with varying scanner parameters, protocols, etc.

Prior work in automated disease progression has used class activation maps to visualize the progression of disease over multiple days for detecting COVID-19 from CT [20]. Other work on MRI to automatically assess treatment response primarily focused on using deep learning to segment tumor regions and using the intermediate extracted features to correlate to progression [11]. Alternative approaches assigned severity scores to retinopathy imaging that were tracked longitudinally to determine response [23]. Such approaches, however, need extensive segmentation annotations, which are prohibitively expensive for modalities like PET/CT. They are also limited by not assessing the treatment response directly from data in an end-to-end fashion. OncoNet leverages pretraining for abnormality detection and derives its supervision signal directly from associated radiology reports, making it highly scalable and free of bias from hand crafted intermediate features.

FDG PET/CT has become indispensable in the routine clinical management of cancer patients and in therapeutic clinical trials [22, 5, 21]. Response to cancer treatment is determined by serial size and SUV measurements of index cancerous lesions seen on PET/CT scans; the percentage of change in the measurements between scans is used to monitor response to therapy and demands standardized and reproducible assessments for meaningful comparisons and conclusions across multiple trials. For example, the PET Response Criteria in Solid Tumors 1.0 (PERCIST 1.0) was proposed in 2009 as a method to standardize the assessment of tumor response and includes the assessment of SUVmax on PET [25, 10]. But poor inter-reader agreement using scoring systems across examinations has been widely reported, as low as 0.14 (range 0.14-0.68) under a variety of experimental settings and comparison methods; the agreement rate is likely lower in clinical practice than in ideal study settings [12, 27, 14]. Such variability is an often cited hurdle to broader utilization of quantitative FDG PET/CT for response assessment, especially in examining early treatment response-related changes [4, 26]. In practice, SUVmax is reasonably easy to determine with many forms of software and, as mentioned above, software-assisted measurement can reduce inter-reader variability across scans, but it requires manual input. OncoNet demonstrated automated agreement with a board certified radiologist with Cohen’s kappa 0.8, and as the workflow requires no manual input, this approach could be used at scale for a variety of clinical and trial outcome determinations.

Further, given the proven value of FDG PET/CT in oncology and the consequent continued increase in study volume, OncoNet could be used for rapid communication of high-level results to clinicians and patients, providing timely information regarding interval disease state to inform clinical decision making and ease clinical burdens. Cancer patients are routinely asked to return to hospital-based imaging departments for FDG PET/CT imaging and, for convenience, schedule clinical appointments with an oncologist during the same visit. This leads to significant challenges, as the oncologist requires knowledge of the disease state from the imaging study, but study results may not yet be available during the patient’s visit, which leads to increased follow-up calls, calls to the nuclear imaging medical specialist, and patient anxiety while waiting for results. And while FDG PET/CT is intrinsically a quantitative imaging technique, in practice assessments of cancer response remain largely qualitative; thus many scoring systems have been developed as, for example, in lymphoma, where quantitative PET data are converted into a five-point qualitative scale [7]. We found that OncoNet reached human-expert level agreement for treatment response on a three-point scale (i.e. progression, stable, response). Routine use of OncoNet for simplified categorical measures of disease state may lead to improved consistency and also help address the challenges of direct patient access to medical imaging results, as mandated under the final rule of the 21st Century Cures Act, by providing simplified quantitative outcome measures for tracking oncologic disease over time.

This study has several important limitations. First, the retrospective study design comes with well-established shortcomings and inherent limitations. Second, the deep learning model described was developed and trained on data from a single large academic institution; while robust external test evaluation was performed, additional study is needed to comprehensively understand the generalizability of our model and to inform the direction of future work. Third, the evaluation of this approach considered only a few of the many use cases for FDG PET/CT, and this should be kept in mind when applying the methods and results to other predictive tasks. Lastly, while our results are promising, delivering production-ready models in their final clinical form is beyond the scope of this study, and additional work is needed before deploying such models in clinical practice.


5 Conclusion

In conclusion, this work describes the development of OncoNet as an end-to-end approach for quantitatively assessing longitudinal treatment response on multi-slice, multi-modality oncologic FDG PET/CT imaging examinations. OncoNet achieved an AUROC of 0.85 on automated determination of disease resolution, stability or progression using pairs of FDG PET/CT studies obtained before and after treatment, with robust external validation (AUROC 0.84). OncoNet further achieved agreement with a board-certified radiologist, with a kappa of 0.8. OncoNet’s methodology and associated annotated dataset are designed to enable automated quantitative oncologic imaging evaluation over time, with potentially broad implications for cancer care, and they contribute to the broader machine learning in healthcare research community.


We would like to acknowledge the GE Blue Sky team (Elizabeth Philps, Omri Ziv, Gil Kovalsky, Melissa Desnoyers, Shai Kremer) for their financial support for this industry-academic collaboration.

Author information.

Matthew P Lungren is a visiting researcher at Microsoft and provides consulting services to Philips, Segmed, Centaur, Bunker Hill, and Nines Radiology; received research funding for this work from GE Healthcare and a research grant from the National Library of Medicine of the NIH.

A. S. Chaudhari has provided consulting services to SkopeMR, Inc., Subtle Medical, Chondrometrics GmbH, Image Analysis Group, Edge Analytics, ICM, and Culvert Engineering; is a shareholder of Subtle Medical, LVIS Corporation, and Brain Key; and receives research support from GE Healthcare and Philips.

The authors declare no conflict of interest. Correspondence should be addressed to anirudhjoshi@cs.stanford.edu.




  • [1] F. Arcadu, F. Benmansour, A. Maunz, J. Willis, Z. Haskova, and M. Prunotto (2019) Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ digital medicine 2 (1), pp. 1–9. Cited by: §1.
  • [2] Y. Arzhaeva, M. Prokop, K. Murphy, E. M. van Rikxoort, P. A. de Jong, H. A. Gietema, M. A. Viergever, and B. van Ginneken (2010) Automated estimation of progression of interstitial lung disease in ct images. Medical physics 37 (1), pp. 63–73. Cited by: §3.1.
  • [3] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.2.
  • [4] Q. Ding, X. Cheng, L. Yang, Q. Zhang, J. Chen, T. Li, and H. Shi (2014) PET/ct evaluation of response to chemotherapy in non-small cell lung cancer: pet response criteria in solid tumors (percist) versus response evaluation criteria in solid tumors (recist). Journal of thoracic disease 6 (6), pp. 677. Cited by: §1, §4.
  • [5] D. S. Ettinger, D. E. Wood, D. L. Aisner, W. Akerley, J. Bauman, L. R. Chirieac, T. A. D’Amico, M. M. DeCamp, T. J. Dilling, M. Dobelbower, et al. (2017) Non–small cell lung cancer, version 5.2017, nccn clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network 15 (4), pp. 504–535. Cited by: §4.
  • [6] S. Eyuboglu, G. Angus, B. N. Patel, A. Pareek, G. Davidzon, J. Long, J. Dunnmon, and M. P. Lungren (2021) Multi-task weak supervision enables anatomically-resolved abnormality detection in whole-body fdg-pet/ct. Nature communications 12 (1), pp. 1–15. Cited by: §1, §2.2, §3.1, §4.
  • [7] A. Gallamini, S. F. Barrington, A. Biggi, S. Chauvie, L. Kostakoglu, M. Gregianin, M. Meignan, G. N. Mikhaeel, A. Loft, J. M. Zaucha, et al. (2014) The predictive role of interim positron emission tomography for hodgkin lymphoma treatment outcome is confirmed using the interpretation criteria of the deauville five-point scale. Haematologica 99 (6), pp. 1107. Cited by: §4.
  • [8] S. Huang, T. Kothari, I. Banerjee, C. Chute, R. L. Ball, N. Borus, A. Huang, B. N. Patel, P. Rajpurkar, J. Irvin, et al. (2020) PENet—a scalable deep-learning model for automated diagnosis of pulmonary embolism using volumetric ct imaging. NPJ digital medicine 3 (1), pp. 1–9. Cited by: §1.
  • [9] J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al. (2019) Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 590–597. Cited by: §1.
  • [10] O. Joo Hyun, M. A. Lodge, and R. L. Wahl (2016) Practical percist: a simplified guide to pet response criteria in solid tumors 1.0. Radiology 280 (2), pp. 576. Cited by: §4.
  • [11] P. Kickingereder, F. Isensee, I. Tursunova, J. Petersen, U. Neuberger, D. Bonekamp, G. Brugnara, M. Schell, T. Kessler, M. Foltyn, et al. (2019) Automated quantitative tumour response assessment of mri in neuro-oncology with artificial neural networks: a multicentre, retrospective study. The Lancet Oncology 20 (5), pp. 728–740. Cited by: §4.
  • [12] R. Kluge, L. Chavdarova, M. Hoffmann, C. Kobe, B. Malkowski, F. Montravers, L. Kurch, T. Georgi, M. Dietlein, W. H. Wallace, et al. (2016) Inter-reader reliability of early fdg-pet/ct response assessment using the deauville scale after 2 cycles of intensive chemotherapy (oepa) in hodgkin’s lymphoma. PloS one 11 (3), pp. e0149072. Cited by: §4.
  • [13] G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §3.1.
  • [14] V. Kumar, K. Nath, C. G. Berman, J. Kim, T. Tanvetyanon, A. A. Chiappori, R. A. Gatenby, R. J. Gillies, and E. A. Eikman (2013) Variance of standardized uptake values for fdg-pet/ct greater in clinical practice than under ideal study settings. Clinical nuclear medicine 38 (3), pp. 175. Cited by: §4.
  • [15] K. Leung, W. Marashdeh, R. Wray, S. Ashrafinia, A. Rahmim, M. Pomper, and A. Jha (2018) A deep-learning-based fully automated segmentation approach to delineate tumors in fdg-pet images of patients with lung cancer. Journal of Nuclear Medicine 59 (supplement 1), pp. 323–323. Cited by: §4.
  • [16] M. D. Li, K. Chang, B. Bearce, C. Y. Chang, A. J. Huang, J. P. Campbell, J. M. Brown, P. Singh, K. V. Hoebel, D. Erdoğmuş, et al. (2020) Siamese neural networks for continuous disease severity evaluation and change detection in medical imaging. NPJ digital medicine 3 (1), pp. 1–9. Cited by: §1, §4.
  • [17] S. J. Lim, H. Wang, J. P. Leal, H. G. Shu, R. L. Wahl, et al. (2021) Quantitation of cancer treatment response by 2-[18 f] fdg pet/ct: multi-center assessment of measurement variability using auto-percist™. EJNMMI research 11 (1), pp. 1–9. Cited by: §4.
  • [18] M. Machtay, F. Duan, B. A. Siegel, B. S. Snyder, J. J. Gorelick, J. S. Reddin, R. Munden, D. W. Johnson, L. H. Wilf, A. DeNittis, et al. (2013) Prediction of survival by [18f] fluorodeoxyglucose positron emission tomography in patients with locally advanced non–small-cell lung cancer undergoing definitive chemoradiation therapy: results of the acrin 6668/rtog 0235 trial. Journal of clinical oncology 31 (30), pp. 3823. Cited by: §3.2.
  • [19] A. Park, C. Chute, P. Rajpurkar, J. Lou, R. L. Ball, K. Shpanskaya, R. Jabarkheel, L. H. Kim, E. McKenna, J. Tseng, et al. (2019) Deep learning–assisted diagnosis of cerebral aneurysms using the headxnet model. JAMA network open 2 (6), pp. e195600–e195600. Cited by: §1.
  • [20] J. Pu, J. K. Leader, A. Bandos, S. Ke, J. Wang, J. Shi, P. Du, Y. Guo, S. E. Wenzel, C. R. Fuhrman, et al. (2021) Automated quantification of covid-19 severity and progression using chest ct images. European Radiology 31 (1), pp. 436–446. Cited by: §4.
  • [21] J. G. Ravenel, K. E. Rosenzweig, J. Kirsch, M. E. Ginsburg, J. P. Kanne, L. L. Kestin, J. A. Parker, A. Rimner, A. G. Saleh, and T. H. Mohammed (2014) ACR appropriateness criteria non-invasive clinical staging of bronchogenic carcinoma. Journal of the American College of Radiology 11 (9), pp. 849–856. Cited by: §4.
  • [22] G. A. Silvestri, A. V. Gonzalez, M. A. Jantz, M. L. Margolis, M. K. Gould, L. T. Tanoue, L. J. Harris, and F. C. Detterbeck (2013) Methods for staging non-small cell lung cancer: diagnosis and management of lung cancer: american college of chest physicians evidence-based clinical practice guidelines. Chest 143 (5), pp. e211S–e250S. Cited by: §4.
  • [23] S. Taylor, J. M. Brown, K. Gupta, J. P. Campbell, S. Ostmo, R. P. Chan, J. Dy, D. Erdogmus, S. Ioannidis, S. J. Kim, et al. (2019) Monitoring disease progression with a quantitative severity scale for retinopathy of prematurity using deep learning. JAMA ophthalmology 137 (9), pp. 1022–1028. Cited by: §4.
  • [24] R. L. Van Heertum, R. Scarimbolo, J. G. Wolodzko, B. Klencke, R. Messmann, F. Tunc, L. Sokol, R. Agarwal, J. A. Strafaci, and M. O’Neal (2017) Lugano 2014 criteria for assessing fdg-pet/ct in lymphoma: an operational approach for clinical trials. Drug design, development and therapy 11, pp. 1719. Cited by: §2.1.
  • [25] R. L. Wahl, H. Jacene, Y. Kasamon, and M. A. Lodge (2009) From recist to percist: evolving considerations for pet response criteria in solid tumors. Journal of nuclear medicine 50 (Suppl 1), pp. 122S–150S. Cited by: §4.
  • [26] R. L. Wahl, K. Zasadny, M. Helvie, G. D. Hutchins, B. Weber, and R. Cody (1993) Metabolic monitoring of breast cancer chemohormonotherapy using positron emission tomography: initial evaluation. Journal of Clinical Oncology 11 (11), pp. 2101–2111. Cited by: §4.
  • [27] A. Weisman, I. Lee, H. Im, K. McCarten, S. Kessel, C. Schwartz, K. Kelly, V. Santoro-Fernandes, R. Jeraj, S. Cho, et al. (2020) Machine learning-based assignment of deauville scores is comparable to interobserver variability on interim fdg pet/ct images of pediatric lymphoma patients. Journal of Nuclear Medicine 61 (supplement 1), pp. 1434–1434. Cited by: §4.