Pulmonary nodule management strategy influences the cost-effectiveness of a lung cancer screening program. It remains difficult to differentiate high-risk nodules from low-risk ones based on morphologic characteristics. To help radiologists and clinicians make precise clinical decisions for each patient, researchers have proposed several categorical management recommendations and scoring systems based on morphology, diameter or volume in recent years, e.g., NCCN [16], Fleischner [9] and Lung-RADS [12]. However, tumor growth is such a complicated process that more advanced strategies are worth exploring to facilitate precision medicine. Emerging deep learning technology offers a potential alternative: developing an end-to-end lung nodule management system in a data-driven fashion. Although numerous studies [5, 17, 18] have explored end-to-end approaches to predict malignancy scores of lung nodules, only a few studies [1, 4] address the lung nodule follow-up problem, and these provide only black-box predictions without intuitive explanations. There is also a study [11] on predicting tumor growth with a model-free appearance modeling approach using a probabilistic U-Net; however, it cannot provide any quantitative assessment of tumor risk.
In this study, we aim at a unified approach to predicting the growth of lung nodules, with both high-quality visual appearances and accurate quantitative malignancy scores. The core of our approach is a WarpNet, which predicts a displacement field (or motion field) from a baseline volume toward a future volume. With this field, we obtain not only the predicted visual appearance of the future volume by warping the baseline volume, but also the future segmentation mask by warping the baseline mask, which can be used for quantitative assessment of tumor growth. This approach is inspired by VoxelMorph [2], where the displacement field for registration is conditioned on both the baseline and future volumes; in contrast, our predictive displacement field is conditioned only on the baseline volume and can be dynamically estimated. Moreover, a TextureNet is designed to refine textural details of the outputs from WarpNet. We introduce a Temporal Encoding Module and a Warp Segmentation Loss to encourage time-aware and malignancy-aware representation learning. The whole network, named Nodule Follow-Up Prediction Network (NoFoNet), establishes a unified framework that produces both high-quality visual appearances and accurate quantitative malignancy assessment for lung nodule follow-up. Experiments on our in-house follow-up dataset from two medical centers validate the effectiveness of NoFoNet.
2 Materials and Methods
2.1 Task Formalization and Dataset
We aim at a unified framework to predict the future volume of a lung nodule, given any time interval and a baseline volume. An in-house dataset is collected, containing 622 LDCT scans from 246 patients (114 males and 132 females) with a total of 315 long-standing pulmonary nodules. Each patient has at least two time points of thin-slice LDCT (slice thickness 1.25mm), with time intervals ranging from 30 to 1351 days (median 136 days). We take every pair of time points of a nodule as a sample (for example, if a nodule has 3 follow-up scans at time points $(t_0, t_1, t_2)$, we choose the time point pairs $(t_0, t_1)$, $(t_0, t_2)$ and $(t_1, t_2)$ as 3 samples), resulting in 731 pairs. The age of the patients at first examination ranges from 23 to 97 years (median 62 years). The segmentation VOI of each selected nodule (diameter from 3mm to 30mm) is delineated by an expert radiologist and checked by another.
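The pairing rule described above (every two time points of a nodule form one sample) can be sketched as follows. This is a minimal illustration rather than the authors' data pipeline; the function name `follow_up_pairs` and the scan days are hypothetical.

```python
from itertools import combinations


def follow_up_pairs(time_points):
    """Return every (baseline, follow-up) pair, baseline earlier than follow-up."""
    ordered = sorted(time_points)
    return list(combinations(ordered, 2))


# Hypothetical scan days for one nodule with three time points.
pairs = follow_up_pairs([0, 120, 400])
intervals = [later - earlier for earlier, later in pairs]
print(pairs)      # three samples from three time points
print(intervals)  # the time interval attached to each sample
```

A nodule with n time points therefore contributes n(n-1)/2 samples, which is why 622 scans can yield more pairs than scans.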
We pre-process the data as follows : CT scans are resampled isotropically into . The voxel intensity is normalized to from the Hounsfield unit (HU), using the mapping function . Each data sample is a cubic volume image with the size of , which covers the size of all nodules in our study.
2.2 NoFoNet: Nodule Follow-Up Prediction Network
To model the growth of nodules, we develop a Nodule Follow-Up Prediction Network (NoFoNet, see Fig. 1) consisting of a WarpNet and a TextureNet for spatial and texture (intensity) transformations, respectively. An integrated Temporal Encoding Module (TEM) encodes the follow-up time interval into the lesion representation. As we will show later, the WarpNet and TextureNet model the shape and texture variation of nodule growth well.
Given a pair of follow-up input and target images $I_0$ and $I_t$ with time interval $\Delta t$ (unless otherwise specified, "image" here and later in this article refers to a cubic volume image with a nodule in the center), with corresponding nodule segmentation maps $S_0$ and $S_t$, the WarpNet with parameters $\theta$ first predicts a smooth voxel-wise displacement field $u$ for spatial transformation. Following the registration literature, we have the warp function $\phi = \mathrm{id} + u$, where $\mathrm{id}$ is the identity function. We apply the warp function to $I_0$ to get the warped image $I_w$, and denote this as $I_w = I_0 \circ \phi$. Similarly, the warped segmentation map is $S_w = S_0 \circ \phi$. The TextureNet with parameters $\psi$ takes a concatenation of two volumes as input and generates a voxel-wise residual $r$. Then we get the result $\hat{I}_t = I_w + r \odot S_w$, where $r \odot S_w$ denotes the residual cropped by the warped segmentation. The overall formulation of our NoFoNet is as follows:

$$\phi = \mathrm{id} + u, \qquad I_w = I_0 \circ \phi, \qquad S_w = S_0 \circ \phi, \qquad \hat{I}_t = I_w + r \odot S_w. \tag{1}$$
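The warp-then-refine composition described in this section can be illustrated on a one-dimensional toy signal. The sketch below is not the authors' implementation: it assumes linear interpolation, clamps sampling positions to the grid, and uses hand-written displacement and residual values in place of the outputs of WarpNet and TextureNet. All function names are hypothetical.

```python
import math


def warp_1d(signal, displacement):
    """Sample `signal` at x + u(x), i.e. apply the warp function phi = id + u."""
    n = len(signal)
    warped = []
    for x in range(n):
        pos = min(max(x + displacement[x], 0.0), n - 1.0)  # clamp to the grid
        lo = int(math.floor(pos))
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        warped.append((1.0 - frac) * signal[lo] + frac * signal[hi])
    return warped


def predict_follow_up(image, mask, displacement, residual):
    """Warp image and mask with the same field, then add the masked residual."""
    warped_image = warp_1d(image, displacement)
    warped_mask = warp_1d(mask, displacement)
    predicted = [iw + r * sw
                 for iw, r, sw in zip(warped_image, residual, warped_mask)]
    return predicted, warped_mask


# Toy baseline: a nodule occupying voxels 2 and 3.
image = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
mask = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
# Hand-written field that pulls intensities outward, so the nodule grows.
displacement = [0.0, 1.0, 0.0, 0.0, -1.0, 0.0]
residual = [0.1] * 6

predicted, warped_mask = predict_follow_up(image, mask, displacement, residual)
print(warped_mask)
print(predicted)
```

Because the same field warps both the image and the mask, the predicted appearance and the predicted segmentation stay geometrically consistent, and the residual only changes intensities inside the warped nodule.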
2.3 Temporal Encoding Module (TEM)
Since the time interval between two follow-up scans can vary widely, inspired by positional encoding [15] we develop a Temporal Encoding Module (TEM) to embed time interval information into the prediction model. Due to the limited dataset size, we discretize the interval using a time mapping function with an upper cut-off value of 20, since most of the intervals are less than 600 days. Sine and cosine functions with different frequencies are used in the TEM to encode temporal information:

$$TE(t, 2i) = \sin\left(\frac{t}{10000^{2i/C}}\right), \qquad TE(t, 2i+1) = \cos\left(\frac{t}{10000^{2i/C}}\right), \tag{2}$$

where $t$ is the discretized time interval, $C$ is the total number of channels of the encoded feature map and $i$ is the dimension. That is, the even/odd dimensions of the temporal encoding are generated by sin/cos functions with different wavelengths (from $2\pi$ to $10000 \cdot 2\pi$), so that the relative time information is encoded in a redundant way. Besides, the encoded values lie within a fixed numerical interval because sinusoids are bounded. These two properties help the temporal encoding generate a more meaningful high-dimensional representation space.
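A sinusoidal temporal encoding of this kind can be sketched in a few lines. The sketch assumes the base of 10000 used by the original positional encoding, and it assumes that discretization is a simple fixed-width binning; the bin width `bin_days` is a hypothetical parameter, while the cut-off of 20 is the value stated in the text.

```python
import math


def discretize_interval(days, bin_days, cutoff=20):
    """Map an interval in days to an integer index capped at `cutoff`.

    `bin_days` is a hypothetical bin width; only the cut-off comes from the text.
    """
    return min(int(days // bin_days), cutoff)


def temporal_encoding(t, channels, base=10000.0):
    """Even channels use sin, odd channels use cos, with growing wavelengths."""
    encoding = []
    for k in range(channels):
        i = k // 2  # channels 2i and 2i+1 share one frequency
        angle = t / (base ** (2 * i / channels))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


t = discretize_interval(136, bin_days=30)   # hypothetical 30-day bins
code = temporal_encoding(t, channels=8)
print(t, [round(v, 4) for v in code])
```

Every entry stays within [-1, 1], and a zero interval always maps to the fixed pattern 0, 1, 0, 1, and so on, which gives the network a stable reference point for "no elapsed time".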
2.4 WarpNet for Spatial Transformation
As the core of our method, WarpNet predicts a displacement field to model the shape variation of nodule growth, which is similar to motion prediction in video tasks [6, 8]. The architecture of WarpNet is a CNN similar to U-Net [14] with skip connections, and the temporal encoding from TEM is connected to the bottom of WarpNet.
The loss function for training WarpNet has four terms: a similarity loss between the warped images and the target images, a segmentation loss between the warped segmentation maps and the target maps, a smoothness loss for the deformation field, and a regularization loss for the output of WarpNet when the time interval is zero. In summary, WarpNet is trained by minimizing a weighted sum of these four terms, where the regularization term is evaluated on the spatial warp function predicted at zero time interval. All loss functions are designed as follows:
a) Similarity loss and regularization loss: In our experiments, we find that for spatial transformation the normalized cross correlation (NCC) loss leads to more reasonable and robust results than the MSE loss. The NCC loss between the warped image and the target image, and the regularization loss for the warp function predicted at zero time interval, are defined as:
b) Segmentation loss: We use the Dice loss to constrain the similarity between the warped segmentation mask and the target mask:
c) Smoothness loss: Considering that the contour of a nodule changes continuously as it grows, we use a diffusion regularization loss to encourage the smoothness of the displacement field:
where finite differences between neighboring voxels are used to approximate the spatial gradients (along the three spatial dimensions x, y and z).
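For orientation, the three kinds of loss named in a) to c) can be sketched in their textbook forms. This is not the paper's exact formulation: the sketch assumes a global (whole-volume) NCC negated to form a loss, a soft Dice loss of the form 1 - 2|A∩B|/(|A|+|B|), and a diffusion regularizer computed as the mean squared forward difference of the displacement field. All function names are hypothetical.

```python
import math


def ncc(a, b, eps=1e-8):
    """Global normalized cross correlation of two flattened images."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (math.sqrt(var_a * var_b) + eps)


def ncc_loss(a, b):
    """Assumption: the loss is the negated correlation, so lower is better."""
    return -ncc(a, b)


def dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss between two flattened masks with values in [0, 1]."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - 2.0 * intersection / (sum(pred) + sum(target) + eps)


def smoothness_loss(field):
    """Mean squared forward difference of a field stored as field[x][y][z] = (ux, uy, uz)."""
    nx, ny, nz = len(field), len(field[0]), len(field[0][0])
    total, count = 0.0, 0
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                for dx, dy, dz in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    px, py, pz = x + dx, y + dy, z + dz
                    if px < nx and py < ny and pz < nz:
                        here, there = field[x][y][z], field[px][py][pz]
                        total += sum((q - p) ** 2 for p, q in zip(here, there))
                        count += 1
    return total / count if count else 0.0


# A 2x2x2 field whose x-displacement grows linearly along x.
linear_field = [[[(float(x), 0.0, 0.0) for _ in range(2)] for _ in range(2)]
                for x in range(2)]
print(ncc_loss([0, 1, 2, 3], [0, 2, 4, 6]))
print(dice_loss([1, 1, 0, 0], [1, 0, 0, 0]))
print(smoothness_loss(linear_field))
```

NCC is invariant to affine intensity changes between the two images, which is one plausible reason it behaves more robustly than MSE for a purely spatial transformation that cannot alter intensities.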
2.5 TextureNet for Texture Transformation
In addition to the shape variation, there is also a texture variation in nodule growth caused by the change of the CT value distribution of nodules. A TextureNet is therefore needed to estimate the residual between the warped image and the target image. TextureNet follows the architecture of WarpNet. To train TextureNet, we need an intensity similarity loss between the textured images (see Eq. 1) and the target images, and a regularization loss for the predicted residual when the time interval is zero. The texture transformation is thus learned by minimizing the similarity loss plus a weighted regularization loss. We choose the MSE loss to encourage maximal intensity similarity. The loss functions of TextureNet are defined as:
2.6 Implementation Details
NoFoNet can use any CNN architecture for WarpNet and TextureNet, and we use the network design of 
in this work. All experiments in this study are run on an NVIDIA Titan X GPU and an Intel i7-6700 CPU. Our code is based on Python 3.7.3 and PyTorch 1.2.0. We use and for the loss weights in Eq. 3 and Eq. 7. Online data augmentation, including rotation and flipping along a random axis, is applied to the input images. Each part of NoFoNet is trained using the Adam optimizer [7] with an initial learning rate of 0.001 for 200 epochs. Specifically, we emphasize the similarity loss inside the segmentation map to put more attention on the nodule.
3.1 Evaluation Protocol
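| Method | PSNR | PSNR (nodule) | Dice | Sensitivity | Specificity | G-mean |
|---|---|---|---|---|---|---|
| +Warp Seg Loss | 18.1952 | 43.2464 | 0.6474 | 0.8594 | 0.8805 | 0.8699 |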
Our NoFoNet is trained to predict what the nodule may look like after a certain time interval, from which we can determine whether it is a PD (progressive disease, i.e., significant growth of the nodule in size) case. Since some nodules in our dataset have multiple follow-ups, we stipulate that a nodule is judged as a PD case as long as one of its follow-up pairs (see Sec. 2.1) meets a specific criterion, which is determined with the help of two senior radiologists.
Let $V_0$ and $V_1$ denote the two nodule volumes of a follow-up pair with time interval $\Delta t$. The criterion is as follows: (1) Considering that fast-growing nodules carry higher risk, we calculate the average volume growth rate and set a threshold of 1. (2) Since some cases have a growth rate below 1 but eventually grow significantly in size, we also set a threshold of 200 for the volume difference and a threshold on the relative volume difference. In summary, a nodule is classified as a PD case if one of its observed follow-up pairs exceeds the growth-rate threshold, or exceeds both the volume-difference and relative-volume-difference thresholds.
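The decision rule can be written as a short predicate. Several details below are assumptions rather than the paper's definitions: the growth rate is taken as volume difference divided by time interval, the relative difference as volume difference divided by baseline volume, and the comparisons as strict. The relative-difference threshold is not recoverable from the text, so it is a required argument here, and the value 0.5 in the example is purely hypothetical, as are the volumes and intervals.

```python
def pair_is_pd(v0, v1, interval, rel_threshold,
               rate_threshold=1.0, diff_threshold=200.0):
    """Apply the two-part criterion to one follow-up pair."""
    diff = v1 - v0
    growth_rate = diff / interval      # assumed definition of the average rate
    relative_diff = diff / v0          # assumed definition of the relative change
    fast_growth = growth_rate > rate_threshold
    large_growth = diff > diff_threshold and relative_diff > rel_threshold
    return fast_growth or large_growth


def nodule_is_pd(pairs, rel_threshold):
    """A nodule is PD as soon as any of its follow-up pairs meets the criterion."""
    return any(pair_is_pd(v0, v1, dt, rel_threshold) for v0, v1, dt in pairs)


# Hypothetical (baseline volume, follow-up volume, interval) triples.
pairs = [(100.0, 110.0, 90.0), (100.0, 350.0, 400.0)]
print(nodule_is_pd(pairs, rel_threshold=0.5))
```

The second branch catches slow but substantial growth: the second pair above grows at well under one unit per time step, yet its absolute and relative increases are both large.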
The 315 nodules are divided into two parts according to the aforementioned criterion, resulting in 64 positive cases (PD, significant growth) and 251 negative cases (non-PD, stable or shrinking). We split our dataset randomly into 5 groups based on patients and perform 5-fold cross validation to evaluate our models.
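A patient-level split keeps all nodules of one patient in the same fold, so that no patient contributes to both training and test data. The following is a minimal sketch under that reading; the round-robin assignment, the seed and the identifiers are hypothetical, and the paper does not state whether folds were stratified by PD status.

```python
import random


def patient_level_folds(nodule_to_patient, n_folds=5, seed=0):
    """Assign whole patients to folds, then place each nodule with its patient."""
    patients = sorted(set(nodule_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = [[] for _ in range(n_folds)]
    for nodule, patient in sorted(nodule_to_patient.items()):
        folds[fold_of_patient[patient]].append(nodule)
    return folds


# Hypothetical mapping: 12 nodules spread over 7 patients.
mapping = {f"nodule{i:02d}": f"patient{i % 7}" for i in range(12)}
folds = patient_level_folds(mapping)
for index, fold in enumerate(folds):
    print(index, fold)
```

In each cross-validation round, one fold serves as the test set and the remaining four as the training set.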
3.2 Performance Analysis
In this section, we present quantitative and qualitative results. Table 1 shows the performance of our models and baselines under 5-fold cross validation. Note that U-Net with or without TEM predicts output images directly, so it only has PSNR and nodule PSNR (PSNR in the nodule parts) between output and target images. As shown in Appendix Fig. A.1, the U-Net baselines generate predicted images with low visual quality. Notably, when the segmentation loss between warped and target segmentation maps is added, WarpNet predicts more accurate displacement fields, resulting in a higher Dice coefficient between warped and target segmentation maps and better PD/non-PD classification performance than WarpNet without the segmentation loss. We use the geometric mean (G-mean) of sensitivity, TP/(TP+FN), and specificity, TN/(TN+FP), as the main evaluation index for the unbalanced dataset. The TextureNet in NoFoNet improves the visual quality of the warped images and achieves higher PSNR and nodule PSNR scores, as visually shown in Fig. 2.
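The G-mean can be computed directly from the confusion-matrix counts. The sketch below uses hypothetical counts; the function name `g_mean` is also hypothetical.

```python
import math


def g_mean(tp, fn, tn, fp):
    """Return sensitivity, specificity and their geometric mean."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Hypothetical counts for an unbalanced test set.
sensitivity, specificity, score = g_mean(tp=8, fn=2, tn=45, fp=5)
print(round(sensitivity, 4), round(specificity, 4), round(score, 4))
```

Unlike accuracy, the G-mean collapses toward zero when either class is poorly recognized, which suits a dataset with 64 positive and 251 negative cases. Assuming the last three values of the "+Warp Seg Loss" row are sensitivity, specificity and G-mean, they are mutually consistent: the square root of 0.8594 × 0.8805 is approximately 0.8699.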
Fig. 2 shows the results of spatial transformation for input images by WarpNet and voxel-wise texture addition for warped images by TextureNet. We select a PD case and a non-PD case to demonstrate the performance of NoFoNet on different types of nodules. It can be seen that TextureNet is able to refine the warped images from WarpNet and increase the intensity similarity between the predicted and target nodules. Please refer to Appendix Fig. A.1 for more comparison results (including results from U-Net).
Fig. 3 illustrates the continuous prediction results of two nodules using WarpNet. Results with the same follow-up time interval as the targets are highlighted in red. We choose a PD (progressive disease) case and a non-PD case for contrast, to show that our WarpNet can represent both significant growth and stabilization of nodule size well. For the PD case, the model also generates reasonable nodules as the time interval changes, and the variation tendency is plausible, indicating the effectiveness of TEM.
We develop the NoFoNet, a unified network to predict follow-up lung nodules. By explicitly learning spatial transformation and texture transformation, it yields high-quality visual appearances and accurate malignancy scores, with validated effectiveness on an in-house dataset from two clinical centers.
A limitation of this study is that we only model tumor size as the indicator of malignancy. However, according to the TNM tumor staging system, tumor size (T), lymph node (N) and metastasis (M) are all considered in tumor prognosis assessment. In future studies, we will incorporate the N and M information to develop a more advanced risk stratification system for lung nodule follow-up.
- [1] Ardila, D., Kiraly, A.P., Bharadwaj, S., Choi, B., Reicher, J.J., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G., Naidich, D.P., Shetty, S.: End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine 25, 954–961 (2019)
- [2] Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging 38(8), 1788–1800 (2019)
- [3] Cressman, S., Lam, S.C.T., Tammemagi, M.C., Evans, W.K., Leighl, N.B., Regier, D.A., Bolbocean, C., Shepherd, F.A., Tsao, M.S., Manos, D., Liu, G., Atkar-Khattra, S., Cromwell, I., Johnston, M.R., Mayo, J.R., Mcwilliams, A., Couture, C., English, J.S.C., Goffin, J.R., Hwang, D.M., Puksa, S., Roberts, H., Tremblay, A., Maceachern, P., Burrowes, P., Bhatia, R., Finley, R.J., Goss, G.D., Nicholas, G., Seely, J.M., Sekhon, H.S., Yee, J., Amjadi, K., Cutz, J.C., Ionescu, D.N., Yasufuku, K., Martel, S., Soghrati, K., Sin, D.D., Tan, W.C., Urbański, S., Xu, Z., Peacock, S.J.: Resource utilization and costs during the initial years of lung cancer screening with computed tomography in Canada. Journal of Thoracic Oncology (2014)
- [4] Huang, P., Lin, C.T., Li, Y., Tammemagi, M.C., Brock, M.V., Atkar-Khattra, S., Xu, Y., Hu, P., Mayo, J.R., Schmidt, H., Gingras, M., Pasian, S., Stewart, L., Tsai, S.S.H., Seely, J.M., Manos, D., Burrowes, P., Bhatia, R., Tsao, M.S., Lam, S.: Prediction of lung cancer risk at follow-up screening with low-dose CT: a training and validation study of a deep learning method. The Lancet Digital Health (2019)
- [5] Hussein, S., Cao, K., Song, Q., Bagci, U.: Risk stratification of lung nodules using 3D CNN-based multi-task learning. In: IPMI (2017)
- [6] Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: NIPS. pp. 6915–6924 (2017)
- [7] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [8] Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: ICCV. pp. 648–657 (2017)
- [9] MacMahon, H., Naidich, D.P., Goo, J.M., Lee, K.S., Leung, A.N.C., Mayo, J.R., Mehta, A.C., Ohno, Y., Powell, C.A., Prokop, M., Rubin, G.D., Schaefer-Prokop, C., Travis, W.D., Schil, P.E.Y.V., Bankier, A.A.: Guidelines for management of incidental pulmonary nodules detected on CT images: From the Fleischner Society 2017. Radiology 284(1), 228–243 (2017)
- [10] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch (2017)
- [11] Petersen, J., Jäger, P.F., Isensee, F., Kohl, S.A., Neuberger, U., Wick, W., Debus, J., Heiland, S., Bendszus, M., Kickingereder, P., et al.: Deep probabilistic modeling of glioma growth. In: MICCAI. pp. 806–814. Springer (2019)
- [12] Pinsky, P.F., Gierada, D.S., Black, W.L., Munden, R.F., Nath, H., Aberle, D.R., Kazerooni, E.A.: Performance of Lung-RADS in the National Lung Screening Trial. Annals of Internal Medicine 162, 485–491 (2015)
- [13] Pinsky, P.F., Gierada, D.S., Nath, P., Kazerooni, E.A., Amorosa, J.: National Lung Screening Trial: variability in nodule detection rates in chest CT studies. Radiology 268(3), 865–873 (2013)
- [14] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
- [15] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS. pp. 5998–6008 (2017)
- [16] Wood, D.E.: National Comprehensive Cancer Network (NCCN) clinical practice guidelines for lung cancer screening. Thoracic Surgery Clinics 25(2), 185–197 (2015)
- [17] Xie, Y., Xia, Y., Zhang, J., Feng, D.D., Fulham, M.J., Cai, T.W.: Transferable multi-model ensemble for benign-malignant lung nodule classification on chest CT. In: MICCAI (2017)
- [18] Yang, J., Fang, R., Ni, B., Li, Y., Xu, Y., Li, L.: Probabilistic radiomics: Ambiguous diagnosis with controllable shape analysis. In: MICCAI. pp. 658–666. Springer (2019)
- [19] Zhao, A., Balakrishnan, G., Durand, F., Guttag, J.V., Dalca, A.V.: Data augmentation using learned transformations for one-shot medical image segmentation. In: CVPR. pp. 8543–8553 (2019)