Spatial-Temporal Convolutional LSTMs for Tumor Growth Prediction by Learning 4D Longitudinal Patient Data

02/23/2019 ∙ by Ling Zhang, et al. ∙ NetEase, Inc

Prognostic tumor growth modeling via medical imaging observations is a challenging yet important problem in precision and predictive medicine. Traditionally, this problem is tackled through mathematical modeling and evaluated using relatively small patient datasets. Recent convolutional networks (ConvNets) have demonstrated higher accuracy than mathematical models in predicting future tumor volumes, which indicates that deep learning may have great potential for addressing this problem. The state-of-the-art work models the cell invasion and mass-effect of tumor growth by training separate ConvNets on 2D image patches. Nevertheless, such a 2D modeling approach cannot make full use of the spatial-temporal imaging context of the tumor's longitudinal 4D (3D + time) patient data. Moreover, previous methods are incapable of predicting clinically relevant tumor properties other than the tumor volume. In this paper, we formulate the tumor growth process through convolutional LSTMs (ConvLSTM) that extract the tumor's static imaging appearances and simultaneously capture its temporal dynamic changes within a single network. We extend ConvLSTM into the spatial-temporal domain (ST-ConvLSTM) by jointly learning the inter-slice 3D contexts and the longitudinal dynamics. Our approach can incorporate other non-imaging patient information in an end-to-end trainable manner. Experiments are conducted on the largest 4D longitudinal tumor dataset to date, comprising 33 patients. Results validate that the proposed ST-ConvLSTM model produces a Dice score of 83.2%±5.1% and an RVD of 11.2%±10.8% for predicting future tumor volumes, both significantly better than the compared traditional linear model, ConvLSTM, and generative adversarial network (GAN). Last, our new method enables the prediction of both cell density and CT intensity numbers.




I Introduction

Tumor growth modeling using medical images of longitudinal studies is a challenging yet important problem in precision and predictive medicine because it may lead to better tumor treatment management and surgical planning for patients. Conventionally, this task has been studied extensively through sophisticated mathematical modeling [1, 2, 3, 4, 5, 6, 7], which accounts for both cell invasion and mass-effect using reaction-diffusion equations and bio-mechanical models. The actual tumor growth can then be predicted by personalizing the established model with tumor physiological parameters derived from clinical imaging, such as morphology, metabolic rate, and cell density. While these methods yield informative results, most of them have not been able to utilize the underlying statistical distributions of tumor growth patterns in the studied patient population. The number of mathematical model parameters is often very limited (e.g., 5 in [6]), which might not be sufficient to model the inherent complexities of growing tumors.

Furthermore, two alternative approaches have been proposed to predict tumor growth. 1) Assuming that the future tumor growth pattern follows its past trend, optical flow computing can be used to estimate previous voxel-level tumor motions and, subsequently, to predict the future deformation field via an autoregressive model. The entire future brain MR scan can thus be generated, but the resulting tumor volume still needs to be measured manually. This approach may also over-simplify the essential challenge because it only infers the future tumor imaging in a linear way. 2) To address this issue, machine learning offers a potential way to incorporate the population trend into tumor growth modeling. The pioneering study models glioma growth patterns as a pixel classification problem, applying a traditional machine learning pipeline of hand-crafted feature extraction and selection followed by classifier training. However, the conventional statistical techniques used in that study are not capable of achieving satisfactory prediction accuracy on the complex task of tumor growth prediction (e.g., both precision and recall are 59.8% in predicting the glioma growth).


Recently, a statistical and deep learning framework [10] and two-stream convolutional neural networks (ConvNets) [11] have shown better performance than the mathematical modeling approach [6] on the same pancreatic tumor dataset. More importantly, the latter study [11] demonstrates the effectiveness of deep ConvNets in characterizing the two fundamental processes of tumor growth, cell invasion and mass-effect.

In [10], image patch based ConvNets extract deep image features that are late-fused with clinical factors, followed by a support vector machine (SVM) classifier using all features. Such a separated process may not fully exploit the inherent correlations between the deep image features and clinical factors. The two-stream ConvNet architecture [11] treats the prediction as a local patch-based classification task, which does not consider the global information of the tumor structure and its surrounding spatial-temporal context. Both methods make predictions based on 2D image slices, whereas tumor growth modeling is in fact a “3D+time” problem. Additionally, [10, 11] cannot predict other clinically relevant properties, such as tumor cell density and radiodensity in Hounsfield units (HU). Last, due to the difficulties in collecting longitudinal tumor data and the complexities of data preprocessing, both studies are conducted and evaluated using a relatively small dataset consisting of only ten patients.

In this paper, we propose a novel deep learning approach that incorporates both 3D spatial and temporal image properties and clinical information into one single deep neural network. Our main contributions are summarized as follows. (1) A novel spatial-temporal Convolutional Long Short-Term Memory (ST-ConvLSTM) network is proposed to jointly learn the intra-slice spatial structures, the inter-slice correlations in 3D contexts, and the temporal dynamics in time sequences. (2) Compared to previous machine (deep) learning based methods [9, 10, 11] that utilize 2D image patches and predict the future tumor volume only, our new model is holistic image-based and enables the prediction of future tumor imaging properties, i.e., future cell density and CT intensity numbers, for relevant clinical diagnosis. (3) Other clinical information, such as time intervals, can be fully integrated into our end-to-end trainable deep learning framework. (4) To the best of our knowledge, we construct the largest longitudinal pancreatic tumor growth dataset (33 patients) to date, more than 3 times larger than that of previous state-of-the-art work [11, 10].

II Related Work

In recent computer vision developments, the task of future image frame prediction (i.e., predicting a visual pattern of RGB raw pixels given a short video sequence) has attracted great research interest [12, 13, 14, 15, 16, 17]. It is closely related to unsupervised feature learning and can enable intelligent agents to react to their environments. Table I briefly summarizes recent representative deep learning based approaches to this problem. Four key technical components are mainly exploited: convolutional LSTMs (ConvLSTM), generative adversarial networks (GAN), encoder-decoder networks, and motion (mostly optical flow) cues.

ConvLSTM GAN Encoder-Decoder Motion
LSTM [18] - - -
ConvLSTM [19] - - -
BeyondMSE [20] - - -
Autoencoder [21] -
CDNA [22] - -
MCNet [12] -
PredNet [23] - - -
STNet [13]
VPN [14] - -
Hierarchical [24] - - -
S2S-GAN [25] - - -
DVF [26] - -
DM-GAN [15]
PredRNN [16] - - -
SNCCL [27] - - -
Two-stream [28] - -
Spatial-motion [17]
TABLE I: Deep Learning Based Future Image Frame Prediction Methods and Their Key Techniques. ConvLSTM: Convolutional Long Short-Term Memory; GAN: Generative Adversarial Network.

LSTM [29] is designed for next time-step status prediction in a temporal sequence, and can be naturally extended to predict the subsequent frames of a video from previous ones [18]. ConvLSTM [19] was then proposed to preserve the spatial structure in both the input-to-state and state-to-state transitions. Subsequently, ConvLSTM became the backbone model of several video prediction approaches [21, 22, 12, 23, 13, 14, 15, 16, 17], each of which adds further improvements. For example, 1) optical flow is introduced in an encoder-ConvLSTM-decoder framework [21] to explicitly model the temporal dynamics; 2) ConvLSTM is reformulated to predict motions from the current pixels to the next pixels [22], with the goal of alleviating blurry predicted images; 3) ConvLSTM is integrated into encoder-decoder networks to estimate the discrete joint distributions of the RGB pixels, which achieved the highest accuracy on the moving digits dataset [14]; 4) a new spatiotemporal LSTM unit [16] is designed to memorize both temporal and spatial representations, thus obtaining better performance than the conventional LSTM.

In addition to ConvLSTM, ConvNets integrated with GAN-based image generators [20, 25, 27] represent the other thread of promising solutions, and are especially effective at sharpening blurry image predictions. Encoder-decoder networks [21, 12, 13, 14, 24, 26, 15], which typically contain multiple convolutional layers for subsampling and several deconvolutional layers for upsampling, commonly serve as backbone architectures for image-to-image prediction. Comprehensive discussions of the above techniques are given in [13, 15, 17], where state-of-the-art quantitative performance is presented using video, vehicle, and pedestrian datasets.

Fig. 1: Image processing pipeline of constructing the tumor dataset for one time point.

ConvLSTM has also been employed for 3D medical image segmentation, where it is an effective way of treating the 3D volume as a sequence of consecutive 2D slices [30, 31, 32]. Compared to 2D ConvNet-based segmentation, ConvLSTM tends to be more robust and consistent across slices, since 3D contextual information is memorized and propagated in the z-direction.

Beyond the problems of 3D medical image segmentation (directly on 3D volumetric data scans) and natural video prediction (using 2D image+time sequences), tumor growth prediction is processed on 4D longitudinal volumetric patient imaging scans. Desirable prediction models should not only recall the temporal evolution trend, but also keep consistent with the tumor’s 3D spatial contexts. Motivated by this assumption, we propose a novel spatial-temporal Convolutional Long Short-Term Memory (ST-ConvLSTM) network to explicitly capture their dependencies among 2D image slices, through the recurrent analysis over spatial and temporal dimensions concurrently.

III Methods

III-A Construction of 4D Longitudinal Tumor Dataset

Our 4D longitudinal tumor imaging dataset consists of dual-phase contrast-enhanced CT volumes at three time points for each patient. As shown in Fig. 1, for each time point, the pre- and post-contrast (arterial phase) 3D CT images are first registered using the ITK implementation of mutual information based B-spline registration [33]. The segmentation is performed manually by a medical trainee using ITK-SNAP [34] on the post-contrast CT (as those tumors can be better evaluated in the arterial phase), under the supervision of an experienced radiologist. Three image feature channels are derived: 1) intracellular volume fraction (ICVF) images representing the cell density, normalized to [0, 100] (see [5] for details of the ICVF calculation); 2) post-contrast CT images in the soft-tissue window [-100, 200] HU, linearly transformed to [0, 255]; 3) the binary tumor segmentation mask (0 or 255). A sequence of image patches of 32×32 pixels, centered on the 3D tumor centroid, is cropped to cover the entire tumor (most pancreatic tumors in our dataset are under 3 cm, roughly 30 pixels, in diameter). The cropping is repeated for the three ICVF-CT-Mask channels (right panel in Fig. 1) and forms an RGB image as illustrated in Fig. 2. The dataset is prepared for every tumor volume at each time point, and imaging volumes at different times are aligned using the segmented 3D tumor centroids to build the spatial-temporal sequence dataset for training and testing.
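The CT windowing and centroid-centered cropping described above can be sketched in NumPy as follows. This is only a minimal illustration: the function names `window_ct` and `crop_patches`, the zero-padding at the image border, and the `(slices, rows, cols)` array layout are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def window_ct(hu, lo=-100.0, hi=200.0):
    """Clip HU values to the window [lo, hi] and rescale linearly to [0, 255]."""
    clipped = np.clip(hu.astype(np.float64), lo, hi)
    return (clipped - lo) / (hi - lo) * 255.0

def crop_patches(volume, mask, size=32):
    """Crop a size x size patch from every slice, centered on the 3D mask centroid.

    volume, mask: arrays of shape (slices, rows, cols); mask is non-zero inside the tumor.
    """
    # Centroid of all tumor voxels; only the in-plane coordinates are needed here.
    _, cy, cx = [int(round(c)) for c in np.argwhere(mask > 0).mean(axis=0)]
    half = size // 2
    # Zero-pad in-plane (an assumption) so patches near the border keep the full size.
    padded = np.pad(volume, ((0, 0), (half, half), (half, half)), mode="constant")
    # After padding, original index cy sits at cy + half, so the patch starts at cy.
    return padded[:, cy:cy + size, cx:cx + size]
```

Applying `crop_patches` to the ICVF, windowed CT, and mask volumes with the same mask, and stacking the three results with `np.stack([...], axis=-1)`, would give one 3-channel image per slice.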

III-B Spatial-Temporal Convolutional LSTM

III-B1 Convolutional LSTM

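LSTM [18] operates on temporal sequences of 1D vectors, and can reconstruct the input sequences or predict the future sequences. An LSTM unit contains a memory cell $C_t$, an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, and an output state $H_t$. Compared with the conventional LSTM, ConvLSTM is capable of modeling 2D spatio-temporal image sequences by explicitly encoding their 2D spatial structures into the temporal domain, replacing LSTM's fully connected transformations with local spatial convolutions [19, 31]. The main equations of ConvLSTM are as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W^{X}_{i} * X_t + W^{H}_{i} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W^{X}_{f} * X_t + W^{H}_{f} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W^{X}_{o} * X_t + W^{H}_{o} * H_{t-1} + b_o\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W^{X}_{c} * X_t + W^{H}_{c} * H_{t-1} + b_c\right) \\
H_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
$$

where $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent non-linearities, $*$ is the convolution operator, $\odot$ is the Hadamard product, and $W$ and $b$ denote the learned convolution kernels and biases. The input $X_t$, cell $C_t$, hidden state $H_t$, and gates $i_t$, $f_t$, $o_t$ are all 3D tensors with the dimensions (rows, columns, feature maps). The memory cell is the key module, which acts as an accumulator of the state information controlled by the gates.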
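A single ConvLSTM update can be sketched in plain NumPy as follows. This is a minimal illustration that assumes the common formulation without peephole connections and with bias terms; the helper names `conv2d_same` and `convlstm_step` and the layout of `params` are hypothetical choices for the example, not the authors' implementation.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same'-padded 2D convolution (cross-correlation, as in deep learning libraries).

    x: (rows, cols, in_ch); w: (k, k, in_ch, out_ch) with odd k.
    """
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros(x.shape[:2] + (w.shape[3],))
    for dy in range(k):
        for dx in range(k):
            # Shifted input window times the kernel slice for this offset.
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1], :] @ w[dy, dx]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, params):
    """One ConvLSTM update; params maps a gate name to (W_x, W_h, b)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return conv2d_same(x, w_x) + conv2d_same(h_prev, w_h) + b

    i, f, o = sigmoid(pre("i")), sigmoid(pre("f")), sigmoid(pre("o"))
    g = np.tanh(pre("c"))          # candidate cell content
    c = f * c_prev + i * g         # Hadamard products: gated accumulation
    h = o * np.tanh(c)
    return h, c
```

With all kernels and biases set to zero, every gate equals 0.5 and the candidate is 0, so the cell simply halves its previous content, which is a convenient sanity check for the gating logic.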

Fig. 2: Left: The proposed Spatial-Temporal Convolutional LSTM (ST-ConvLSTM, or ST-CLSTM) network for learning 4D longitudinal data to predict tumor growth. In this example, 3 time points (each with 4 spatially adjacent image slices and each slice is a 3-channel color image) are shown. Right: The ST-ConvLSTM unit.

III-B2 ST-ConvLSTM Network and Unit

Given the ICVF-CT-Mask maps at time 1 and time 2, the aim is to predict the ICVF-CT-Mask maps at time 3 (Fig. 2). Directly applying ConvLSTM over the temporal domain could capture the 2D dynamics of the tumor for growth prediction. Furthermore, the spatial consistency of 3D volume data and its nature as a sequence of 2D image slices make it possible to extend ConvLSTM to the 3D spatial domain.

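Instead of simply concatenating the spatial 2D slices, we propose a new Spatial-Temporal Convolutional LSTM (ST-ConvLSTM) network, illustrated in Fig. 2 (left panel), that simultaneously learns the spatial consistency patterns of neighboring image slices and the temporal dynamics across different time points. In this network, each ST-ConvLSTM unit takes input from one image slice at one time point in the 4D space, and receives the hidden states from both the horizontal direction (the same slice location at the previous time) and the vertical direction (the previous adjacent slice at the current time). For example, the unit $(t, s)$ in Fig. 2 (left panel) corresponds to the $s$-th slice at time $t$, and receives the hidden states $H_{t-1,s}$ from unit $(t-1, s)$ and $H_{t,s-1}$ from unit $(t, s-1)$. Along with the current input image slice $X_{t,s}$, the ST-CLSTM unit can predict the future slice $\hat{X}_{t+1,s}$ and generate its hidden state $H_{t,s}$. In each ST-CLSTM unit (right in Fig. 2), since there are two different candidates generated from the spatial and temporal domains, respectively, two forget gates $f^{T}_{t,s}$ and $f^{S}_{t,s}$ are used to add them when updating the unit state. The activations of an ST-ConvLSTM unit at $(t, s)$ are as follows:

$$
\begin{aligned}
i_{t,s} &= \sigma\left(W^{X}_{i} * X_{t,s} + W^{T}_{i} * H_{t-1,s} + W^{S}_{i} * H_{t,s-1} + b_i\right) \\
f^{T}_{t,s} &= \sigma\left(W^{X}_{f^{T}} * X_{t,s} + W^{T}_{f^{T}} * H_{t-1,s} + W^{S}_{f^{T}} * H_{t,s-1} + b_{f^{T}}\right) \\
f^{S}_{t,s} &= \sigma\left(W^{X}_{f^{S}} * X_{t,s} + W^{T}_{f^{S}} * H_{t-1,s} + W^{S}_{f^{S}} * H_{t,s-1} + b_{f^{S}}\right) \\
o_{t,s} &= \sigma\left(W^{X}_{o} * X_{t,s} + W^{T}_{o} * H_{t-1,s} + W^{S}_{o} * H_{t,s-1} + b_o\right) \\
C_{t,s} &= f^{T}_{t,s} \odot C_{t-1,s} + f^{S}_{t,s} \odot C_{t,s-1} + i_{t,s} \odot \tanh\left(W^{X}_{c} * X_{t,s} + W^{T}_{c} * H_{t-1,s} + W^{S}_{c} * H_{t,s-1} + b_c\right) \\
H_{t,s} &= o_{t,s} \odot \tanh\left(C_{t,s}\right)
\end{aligned}
$$

where the input $X_{t,s}$, cell $C_{t,s}$, hidden states $H_{t-1,s}$ and $H_{t,s-1}$, and gates $i_{t,s}$, $f^{T}_{t,s}$, $f^{S}_{t,s}$, $o_{t,s}$ are all 3D tensors with dimensions of (rows, columns, feature maps).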

The ST-ConvLSTM unit (1,1) does not have any preceding units in either the spatial or the temporal direction, and units at the time 1 level do not have preceding units in the temporal direction. Zero activations are fed into these units. The output hidden state of the last unit at the time 1 level carries all the tumor information at time 1, thus bringing the global context to time 2 through the link connecting it to the first unit at time 2. It is worth mentioning that the ST-ConvLSTM network is flexible in that it can easily be extended to receive more input time points or to predict longer future steps by recursively applying the model.
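The recurrence over the (time, slice) grid, including the zero boundary states and the link from the last unit of one time level to the first unit of the next, can be sketched as follows. This is a rough NumPy illustration under stated assumptions: each unit is taken to combine the cell states of its temporal and spatial predecessors through two forget gates, and all names (`st_convlstm_step`, `run_grid`, the `params` layout with keys `"fT"` and `"fS"`) are hypothetical rather than the authors' code.

```python
import numpy as np

def conv2d_same(x, w):
    """'Same'-padded 2D cross-correlation; x: (rows, cols, in_ch), w: (k, k, in_ch, out_ch)."""
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros(x.shape[:2] + (w.shape[3],))
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + x.shape[0], dx:dx + x.shape[1], :] @ w[dy, dx]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def st_convlstm_step(x, h_t, c_t, h_s, c_s, params):
    """One ST-ConvLSTM update.

    h_t, c_t: states of the temporal predecessor (t-1, s).
    h_s, c_s: states of the spatial predecessor (t, s-1).
    params maps a gate name to (W_x, W_T, W_S, b).
    """
    def pre(name):
        w_x, w_t, w_s, b = params[name]
        return conv2d_same(x, w_x) + conv2d_same(h_t, w_t) + conv2d_same(h_s, w_s) + b

    i, o = sigmoid(pre("i")), sigmoid(pre("o"))
    f_t, f_s = sigmoid(pre("fT")), sigmoid(pre("fS"))   # temporal and spatial forget gates
    c = f_t * c_t + f_s * c_s + i * np.tanh(pre("c"))
    return o * np.tanh(c), c

def run_grid(frames, params, hidden):
    """frames: (T, S, rows, cols, ch). Returns hidden states of shape (T, S, rows, cols, hidden)."""
    T, S, R, C, _ = frames.shape
    zero = np.zeros((R, C, hidden))
    H = np.zeros((T, S, R, C, hidden))
    Cc = np.zeros_like(H)
    for t in range(T):
        for s in range(S):
            # Temporal predecessor: zeros at the first time level.
            h_t, c_t = (H[t - 1, s], Cc[t - 1, s]) if t > 0 else (zero, zero)
            if s > 0:
                h_s, c_s = H[t, s - 1], Cc[t, s - 1]
            elif t > 0:
                # First unit of a time level is linked to the last unit of the previous level.
                h_s, c_s = H[t - 1, S - 1], Cc[t - 1, S - 1]
            else:
                h_s, c_s = zero, zero
            H[t, s], Cc[t, s] = st_convlstm_step(frames[t, s], h_t, c_t, h_s, c_s, params)
    return H
```

Because each unit depends only on units at earlier times or earlier slices, the grid can be traversed in a simple nested loop; a decoder would then map each hidden state to the predicted future slice.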

Fig. 3: The end-to-end network architecture of our proposed encoder-ST-ConvLSTM-decoder for tumor growth prediction.

III-B3 End-to-End Architecture

We embed the ST-ConvLSTM unit in an encoder-decoder architecture [22, 14] to make end-to-end predictions, as shown in Fig. 3, which replaces the ST-ConvLSTM unit in Fig. 2. Specifically, each frame in the 4D spatial-temporal space is recurrently passed into the encoder, which consists of four convolutional layers, to encode a feature map. Along with the image features, clinical factors also have a non-negligible influence on predicting the future image. We integrate the related factors into our model by spatially tiling the factors (i.e., a $d$-dim vector) as a feature map with $d$ channels ($d$=1 in this paper, where only the time interval is added), which is then concatenated to the output of the encoder layer that possesses the smallest number of channels. The concatenated feature map is then fed into a standard ST-ConvLSTM unit (Fig. 2) with a 3×3 kernel and 8 hidden states for the spatial-temporal modeling. As such, the ST-ConvLSTM determines the future state by jointly considering the compact spatial information of the current slice, the states of slices from previous times and adjacent locations, and the clinically relevant factor(s). After that, the decoder with four deconvolutional layers generates the future frame $\hat{X}_{t+1,s}$. Because a smaller transitional kernel helps ConvLSTM to capture smaller motions [19], we use a 3×3 convolutional operator, taking into account the prior knowledge that the pancreatic tumors in our dataset are slow-growing.
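The spatial tiling of clinical factors can be sketched as follows. This is a minimal NumPy illustration; the function name `append_factors`, the 8×8 spatial size, the channel count, and the scaling of the time interval are hypothetical values chosen for the example.

```python
import numpy as np

def append_factors(feature_map, factors):
    """Tile a d-dim factor vector over the spatial grid and concatenate it as d extra channels.

    feature_map: (rows, cols, channels); factors: sequence of d scalars.
    """
    rows, cols, _ = feature_map.shape
    factors = np.asarray(factors, dtype=feature_map.dtype).reshape(1, 1, -1)
    tiled = np.broadcast_to(factors, (rows, cols, factors.shape[-1]))
    return np.concatenate([feature_map, tiled], axis=-1)

# Example: an encoded 8x8 feature map with 4 channels plus one factor (a scaled time interval).
encoded = np.zeros((8, 8, 4))
with_interval = append_factors(encoded, [0.5])
```

Every spatial location then sees the same factor value, so the recurrent unit can condition its state update on it without any separate fusion stage.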

III-B4 Network Training and Testing

During training, tumor image slices from time 1 and time 2 are fed as inputs into our network according to their corresponding spatial-temporal locations. Image slices from time 2 and time 3 are used to compute the training loss. The objective function of our ST-ConvLSTM network is designed to minimize the $\ell_2$ loss between the predicted frames and the true future frames at time 2 and time 3 (other losses, such as $\ell_1$ and GDL [20], have been tried, but $\ell_2$ produces empirically better results in our preliminary experiment):

$$
L = \sum_{s=1}^{S} \sum_{t=2}^{3} \left\| \hat{X}_{t,s} - X_{t,s} \right\|_2^2
$$

where $\hat{X}_{t,s}$ and $X_{t,s}$ are the predicted and true frames of slice $s$ at time $t$, and $S$ is the spatial sub-sequence length (set to 5 in our current method).
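A frame-wise training objective of this kind can be sketched as follows, assuming a squared-error ($\ell_2$) loss summed over the predicted frames at time 2 and time 3 and over the slices of one sub-sequence. The squared-error form, the function name `sequence_l2_loss`, and the array layout are assumptions made for the illustration.

```python
import numpy as np

def sequence_l2_loss(pred, target):
    """Sum of squared errors over all frames.

    pred, target: (T, S, rows, cols, ch), holding the frames at times 2..3
    for the S slices of one spatial sub-sequence.
    """
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return float(np.sum(diff ** 2))

# Example: 2 target times, S = 5 slices, 32x32 patches, 3 channels (ICVF, CT, mask).
pred = np.zeros((2, 5, 32, 32, 3))
target = np.zeros((2, 5, 32, 32, 3))
loss = sequence_l2_loss(pred, target)
```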

In testing, each spatial sequence (at time 1 and time 2) is divided into several sub-sequences, which are fed into our model to generate predictions for time 3. These sub-sequences can be either overlapping or non-overlapping. In our preliminary experiment, no significant differences are observed between the two, so we use non-overlapping sub-sequences for efficiency. In addition, our model can be flexibly extended to make predictions at later time points based on time 1 and time 2, e.g., predicting time 4 by directly setting the value of the clinical factor (depicted as the “factors” in Fig. 3) to the time interval between time 2 and time 4.
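Splitting a slice sequence into sub-sequences can be sketched as follows. This is a small illustration only: the function name `split_subsequences` is hypothetical, and the handling of leftover slices at the end (an extra window aligned to the last slice) is an assumption, since the text does not specify it.

```python
def split_subsequences(num_slices, length=5, overlap=0):
    """Return (start, stop) index pairs of sub-sequences covering range(num_slices)."""
    step = length - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than length")
    starts = list(range(0, max(num_slices - length, 0) + 1, step))
    # Assumption: if trailing slices are left uncovered, add one window ending at the last slice.
    if starts[-1] + length < num_slices:
        starts.append(num_slices - length)
    return [(s, min(s + length, num_slices)) for s in starts]

windows = split_subsequences(10)             # non-overlapping windows of 5 slices
overlapping = split_subsequences(9, overlap=3)
```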

IV Experiments

IV-A Data and Protocol

Dataset: Thirty-three patients (thirteen males and twenty females; each with a pancreatic neuroendocrine tumor, PNET) are collected from the von Hippel-Lindau (VHL) clinical trial at the National Institutes of Health. Generally, these tumors are not surgically treated until they reach 3 cm in diameter. In our dataset, each patient has at least three time points (eleven of these patients have a 4th time point) of dual-phase contrast-enhanced CT imaging, with a time interval of 398±90 days (average±std). The CT voxel sizes range between mm mm. The average age of patients and the average volume of tumors at time 1 are 50±11 years and 1.7±1.7 cm³, respectively. Summary statistics for all 33 patients are shown in Table II. These tumors keep growing in general, but the growth speed is lower in the 2nd-3rd time period. From the 1st to the 2nd time point, only one tumor shrinks. This number rises to twelve from the 2nd to the 3rd time point.

 | 1st-2nd Days | 1st-2nd Growth (%) | 2nd-3rd Days | 2nd-3rd Growth (%) | Size (cm³, 3rd)
Average | 379±68 | 24.0±23.1 | 416±105 | 8.8±19.7 | 2.2±2.2
[min, max] | [168, 553] | [-10.5, 95.6] | [221, 804] | [-23.2, 68.8] | [0.1, 9.0]
TABLE II: Tumor information at the 1st, 2nd, and 3rd time points of 33 patients.

For the prediction of later future stages (i.e., directly predicting time 4 given only time 1 and time 2), only 11 patients have real imaging data and time interval information at time 4. For each of the remaining 22 patients, we simply assume that the interval between time 3 and time 4 equals the interval between time 2 and time 3, in order to investigate the effectiveness of the time interval feature in our predictive model on a larger patient cohort (all patient data).

IV-B Implementation Details and Compared Methods

Training Details: Four data augmentation schemes are applied to enrich our dataset. Besides the original axial image slice sequences, for each 4D ICVF-CT-Mask volumetric sequence we 1) reformat the volume into coronal and sagittal slices, 2) rotate it (at 90 degree intervals), 3) translate it (randomly, 2 pixels in plane), and 4) reverse the spatial order. The augmentation results in 172,296 training sub-sequences in total. These operations add more variation to the augmented dataset and improve the generalization capability. We train our ST-ConvLSTM models for 5 epochs with a batch size of 16. Each data point has 5 slices at 3 time points. We use the ADAM optimizer [35] for neural network optimization with an initial learning rate of 10

. A thresholding value of 128 is used on the predicted probability map of mask channel to obtain a binary tumor mask.
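The in-plane rotation, translation, and slice-order reversal, together with the final mask thresholding, can be sketched in NumPy as follows. This is an illustration under assumptions: translation is implemented with `np.roll` (which wraps around at the border), the threshold comparison uses `>=`, the coronal and sagittal reformatting is not covered, and the function names are hypothetical.

```python
import numpy as np

def augment(seq, rng, max_shift=2):
    """Return augmented copies of a sequence of shape (T, S, rows, cols, ch)."""
    out = []
    for k in range(4):
        # In-plane rotations by 0, 90, 180, and 270 degrees.
        out.append(np.rot90(seq, k=k, axes=(2, 3)))
    # Random in-plane translation of up to max_shift pixels (wraps around at the border).
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out.append(np.roll(seq, shift=(int(dy), int(dx)), axis=(2, 3)))
    # Reversed spatial (slice) order.
    out.append(seq[:, ::-1])
    return out

def binarize_mask(mask_channel, threshold=128):
    """Threshold the predicted mask channel (values in [0, 255]) to a binary 0/255 mask."""
    return (np.asarray(mask_channel) >= threshold).astype(np.uint8) * 255

rng = np.random.default_rng(0)
sequence = np.zeros((3, 5, 32, 32, 3))   # 3 time points, 5 slices, 32x32 patches, 3 channels
copies = augment(sequence, rng)
```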

Comparison: For model comparison, we implement the current clinical practice of a default linear growth model, the conventional ConvLSTM [19], and another major deep learning method for video prediction, BeyondMSE (GAN) [20]. The linear growth model assumes that tumors keep their past growing trend in the future. As detailed in [11], the past radial expansion/shrinkage distances on tumor boundaries are used to expand/shrink the current tumor boundary as the future prediction. The ConvLSTM uses the same architecture as in Fig. 3 (but captures only the temporal dependencies) and is trained with the same network optimization setting as ST-ConvLSTM. In the BeyondMSE framework, a multi-scale fully convolutional ConvNet is used as the future image generator, and a multi-scale ConvNet as the discriminator. The generator receives two past images as input and outputs one future image, while the discriminator receives all three images as input to classify whether they are real or fake. Our implementation uses the same network architecture and parameter setting as in [20]. Both ConvLSTM and BeyondMSE are trained for 5 epochs on the same augmented dataset as ST-ConvLSTM. All the aforementioned models are implemented in TensorFlow [36], and all experiments are performed on a DELL TOWER 7910 workstation with a 2.40 GHz Xeon E5-2620 v3 CPU, 32 GB RAM, and an NVIDIA TITAN X Pascal GPU with 12 GB of memory. Note that compared to previous machine (deep) learning based tumor growth prediction methods [9, 10, 11] that merely utilize 2D image patches and only predict the future tumor volume, our new ST-ConvLSTM model is holistically 4D (volumetric+time) image-based and enables the prediction of future tumor imaging properties, such as future cell density and CT intensity numbers, to assist relevant clinical diagnosis.

Fig. 4: An illustrated example shows the prediction results of CT, mask/volume, and ICVF of a tumor by ST-ConvLSTM and ConvLSTM. Note that the tumor contours are superimposed on the ground truth CT images at time 3. Red: ground truth boundaries; Green: predicted tumor boundaries. In this example, consecutive image slices with the spatial interval of two slices are shown for better visualization of the spatial changes/differences.
Method | Volume-Dice (%) | Volume-RVD (%) | ICVF-RMSE (%) | CT-HUdiff. (%)
Linear | 73.0±6.2 [60.2, 85.1] | 22.8±18.3 [5.1, 75.2] | - | -
ConvLSTM [19] | 82.1±5.8 [65.6, 90.4] | 14.1±12.4 [1.2, 50.4] | 13.7±8.4 [6.8, 42.4] | 10.4±8.3 [0.6, 32.4]
ST-ConvLSTM | 83.2±5.1* [69.7, 91.1] | 11.2±10.8* [0.3, 46.5] | 14.0±8.5 [7.4, 41.4] | 10.2±8.5 [0.0, 35.0]
TABLE III: Overall quantitative performance on 33 patients under 3-fold cross-validation – Baseline linear predictive model, ConvLSTM [19], and our ST-ConvLSTM. Results are reported as: mean±std [min, max]. * indicates a statistically significant difference of our method compared to other methods.

IV-C Evaluation Methods

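We evaluate our model using three-fold cross-validation. In each fold, 22 patients are used for training and the remaining 11 patients for testing. The performance of tumor prediction is evaluated at the 3rd time point using the Dice coefficient and RVD (relative volume difference) [10, 11, 6] for tumor volume, RMSE (root-mean-squared error) for ICVF [6], and diff.HU (difference of average HU values) for CT value. Both RMSE and diff.HU are evaluated within the true positive volume. Paired t-tests are conducted to compare our new model with the other methods.

$$
\mathrm{Dice} = \frac{2 \times \mathrm{TPV}}{V_{pred} + V_{gt}}, \qquad
\mathrm{RVD} = \frac{\left| V_{pred} - V_{gt} \right|}{V_{gt}},
$$
$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{p \in \mathrm{TPV}} \left( \mathrm{ICVF}_{pred}(p) - \mathrm{ICVF}_{gt}(p) \right)^2}, \qquad
\mathrm{diff.HU} = \frac{\left| \mathrm{HU}^{\mathrm{TPV}}_{pred} - \mathrm{HU}^{\mathrm{TPV}}_{gt} \right|}{\mathrm{HU}^{\mathrm{TPV}}_{gt}}
$$

where TPV (true positive volume) is the overlapping volume between the predicted tumor volume $V_{pred}$ and the ground truth tumor volume $V_{gt}$, and $n$ is the number of pixels in TPV. $\mathrm{ICVF}(p)$ represents the ICVF value of a pixel $p$. HU represents the average Hounsfield units within a volume.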
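The volume and ICVF metrics can be sketched in NumPy as follows. Dice and RVD follow their usual definitions; computing the ICVF RMSE as an absolute error on the [0, 100] ICVF scale inside the true positive volume is an assumption, and the function names are hypothetical.

```python
import numpy as np

def volume_metrics(pred_mask, gt_mask):
    """Return (Dice, RVD) in percent for two binary masks of equal shape."""
    pred, gt = np.asarray(pred_mask).astype(bool), np.asarray(gt_mask).astype(bool)
    tpv = np.logical_and(pred, gt).sum()          # true positive volume (voxel count)
    v_pred, v_gt = int(pred.sum()), int(gt.sum())
    dice = 2.0 * tpv / (v_pred + v_gt)
    rvd = abs(v_pred - v_gt) / v_gt
    return 100.0 * dice, 100.0 * rvd

def icvf_rmse(pred_icvf, gt_icvf, pred_mask, gt_mask):
    """RMSE of ICVF values, restricted to the true positive volume."""
    tp = np.logical_and(np.asarray(pred_mask).astype(bool), np.asarray(gt_mask).astype(bool))
    diff = np.asarray(pred_icvf, dtype=np.float64)[tp] - np.asarray(gt_icvf, dtype=np.float64)[tp]
    return float(np.sqrt(np.mean(diff ** 2)))
```

For example, a prediction of 6 voxels that overlaps a 4-voxel ground truth in 2 voxels gives a Dice of 40% and an RVD of 50%.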

IV-D Quantitative Results

The visual example in Fig. 4 shows the prediction results of the future CT scan, tumor mask/volume, and ICVF obtained by ST-ConvLSTM and ConvLSTM. In this case, compared with the conventional ConvLSTM, our ST-ConvLSTM generates more spatially consistent predictions for CT, mask, and ICVF, and therefore achieves better accuracy under all quantitative metrics. Table III compares the overall performance of our ST-ConvLSTM model with that of ConvLSTM and the linear model on 33 patients. For the volume prediction, ST-ConvLSTM produces a Dice score of 83.2% and an RVD of 11.2%. Both are significantly better than those of ConvLSTM (p<0.01 and p<0.05) and the linear predictive model (p<0.001 and p<0.01). Furthermore, our model yields an RMSE of 14.0% for tumor cell density prediction and a diff.HU of 10.2% for radiodensity prediction (the differences from ConvLSTM on these two metrics are not statistically significant).

Figure 5 compares the prediction results of our ST-ConvLSTM with BeyondMSE (GAN) [20]. In this case, BeyondMSE shows noticeably worse performance in predicting tumor volume, but generates less blurry CT and ICVF images (by visual observation). Table IV lists the overall prediction performance of BeyondMSE, where the proposed method significantly outperforms BeyondMSE in terms of Dice, RVD, and ICVF-RMSE.

Fig. 5: An example of image slices shows the prediction results of CT, mask/volume, and ICVF of a tumor by ST-ConvLSTM and BeyondMSE (GAN). Note that the tumor contours are superimposed on the ground truth CT images at time 3. Red: ground truth boundaries; Green: predicted tumor boundaries.
Method | Volume-Dice (%) | Volume-RVD (%) | ICVF-RMSE (%) | CT-HUdiff. (%)
BeyondMSE (GAN) [20] | 79.3±5.7 [65.6, 90.4] | 20.9±14.4 [1.2, 50.4] | 19.7±12.0 [6.8, 42.4] | 10.7±8.1 [0.6, 32.4]
ST-ConvLSTM | 83.2±5.1* [69.7, 91.1] | 11.2±10.8* [0.3, 46.5] | 14.0±8.5* [7.4, 41.4] | 10.2±8.5 [0.0, 35.0]
TABLE IV: Comparison between our ST-ConvLSTM and BeyondMSE (GAN) [20] on 33 patients under 3-fold cross-validation. Results are presented as: mean±std [min, max]. * indicates a statistically significant difference.

Fig. 6 shows the prediction results at an even later time step using ST-ConvLSTM for all 33 patients. As a result, 78.8% of tumors are predicted to keep growing at later time points, i.e., the predicted volume at time 4 is larger than that at time 3. For the 11 tumors which have ground truth measures of tumor volume at time 4, our prediction produces an RVD of 37.2%±42.5%.

Fig. 6: Prediction results at an even later time point. Upper panel: volume prediction results at time 4 based on time 1 and time 2 for all 33 patients. For 26 out of 33 (78.8%) patients, the tumor is predicted to keep growing (i.e., tumor size at time 4 larger than at time 3). Lower panel: CT, mask, and ICVF predictions at time 4 for a slice from case 6. Note that time 3’s results are also shown for reference; the tumor contours are superimposed on the predicted CT images at time 4 (since the ground truth CT images at time 4 are not available for this patient). Red: ground truth boundaries; Green: predicted tumor boundaries.

On average, our method takes hrs for training and 0.2 seconds for prediction per tumor. It is faster than the statistical and deep learning framework (3.5 hrs for training and 4.8 mins for prediction [10]) in both training and inference, and faster than the two-stream ConvNets [11] in prediction but slower in training.

IV-E Discussion

Deep learning based precision and predictive medicine is an emerging research area, and has been shown to be capable of outperforming traditional mathematical modeling based methods for tumor growth prediction. This suggests its great potential for solving this complicated but important problem. Because of the tremendous difficulties of collecting longitudinal tumor data, most previous studies are evaluated on a relatively small dataset (i.e., 10 patients). A larger and more representative patient dataset is desired to evaluate the prediction performance. Our novel model, the ST-ConvLSTM network, differs significantly from the most recent statistical and deep learning framework [10] and two-stream ConvNets [11] in several key aspects. Firstly, it uses a single recurrent neural network to explicitly and jointly model the temporal changes and spatial consistency (i.e., in 4D space), rather than separate invasion and expansion networks that model the temporal information only (i.e., 2D+time) [10, 11]. Secondly, it makes predictions at the holistic image level instead of the local image patch level, integrating the global spatial context of the tumor structure while being more computationally efficient. Thirdly, it enables the prediction of both future images and the associated imaging properties, including CT scan, tumor cell density, and radiodensity, as demonstrated in this paper. Fourthly, it uses an encoder-decoder deep neural network architecture that incorporates imaging features and clinical factors (such as the time interval) in an end-to-end learning framework, rather than in a late feature fusion stage. Fifthly, to the best of our knowledge, we provide the largest longitudinal tumor dataset (33 patients) to date, together with comprehensive quantitative evaluation results against three other prediction methods: the linear model, ConvLSTM [19], and GAN [20]. Finally, we extend our deep learning based method to make it capable of predicting any later time point (beyond time point 3).

One of our main contributions is the novelty of the proposed ST-ConvLSTM architecture. Compared to the previous state-of-the-art ConvLSTM [19] model for temporal modeling of 2D image sequences across different time points, we substantially extend ConvLSTM into the spatial-temporal 4-dimensional space by jointly learning both the temporal evolution of tumor growth and the spatial information of 3D consistency. In particular, the adjacent 2D CT slices are also modeled by ConvLSTM (slice by slice) to ensure their spatial consistency. In addition, the global context of the previous time point is fed to the current time point. Therefore, each ST-ConvLSTM unit makes its prediction based not only on its local spatial and temporal neighbors, but also on the whole information of past states. As a result, our ST-ConvLSTM is able to generate a sequence of images with better 4D properties than ConvLSTM, e.g., producing statistically higher accuracy in volume prediction, as shown in Table III. An illustrative example can be observed in Fig. 4: ST-ConvLSTM generates more consistent tumor morphology and structure for CT, mask, and ICVF predictions than ConvLSTM, whose predictions of tumor morphology are irregular. An alternative way of using ConvLSTM for the 4D prediction task is to simply stack 2D CT slices as different input channels and model the temporal relation using LSTM. However, such a method cannot exploit either the inter-slice correlations in 3D contexts or the temporal dynamics across time points. The simple linear predictive model performs the worst among all compared methods. This is in agreement with the fact that pancreatic neuroendocrine tumors demonstrate nonlinear growth patterns [37, 38].

Besides ConvLSTM, BeyondMSE (GAN) [20] is another deep learning model for future frame prediction. Benefiting from image gradient based optimization and adversarial losses, GAN can generate less blurry future image predictions, as shown in Fig. 5. However, GAN has much lower quantitative prediction performance than our method. One reason may be that GAN does not explicitly model the temporal dynamics, whereas LSTM has inherent temporal “memory” units. GAN-based tumor prediction can nevertheless capture the tumor growth trend to some extent. For example, in Fig. 5, from time 1 to time 2, the tumor invasion happens mostly in its lower part, so GAN predicts that the tumor continues to infiltrate downward at time 3. In fact, the tumor slows its growth in that direction at time 3. Our ST-ConvLSTM model learns the spatial-temporal information jointly and can leverage the information of the current slice’s global and local neighbors, which results in more robust prediction. On the other hand, the GAN-based method may be more prone to overfitting on our task. The network architectures proposed in [20] can be over-complicated for the relatively small-sized data studied in this work.
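As a rough illustration of why a gradient-based term discourages blur, the sketch below implements a gradient difference loss in the spirit of [20] and compares a sharp edge with a blurred one. The function name `gradient_difference_loss`, the exponent `alpha`, and the toy images are assumptions made for this example, not the settings used in the experiments.

```python
import numpy as np

def gradient_difference_loss(pred, target, alpha=1.0):
    """Penalize mismatched absolute image gradients between two 2D images."""
    loss = 0.0
    for axis in (0, 1):
        grad_pred = np.abs(np.diff(pred, axis=axis))
        grad_true = np.abs(np.diff(target, axis=axis))
        loss += np.sum(np.abs(grad_true - grad_pred) ** alpha)
    return float(loss)

# Toy images: a sharp vertical edge and a blurred version of it.
sharp = np.array([[0.0, 0.0, 1.0, 1.0]] * 2)
blurry = np.array([[0.0, 1 / 3, 2 / 3, 1.0]] * 2)

mse = float(np.mean((blurry - sharp) ** 2))
gdl = gradient_difference_loss(blurry, sharp)
print(f"MSE={mse:.4f}  GDL={gdl:.4f}")
```

The blurred image incurs only a small mean squared error, while the gradient term is comparatively large because the edge has been spread over several pixels.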

For longer-term prediction of tumor growth, our model predicts that 78.8% of tumors keep growing at time 4. This is in accordance with the natural history of PNET tumors, around 20% of which decrease over a median follow-up duration of 4 years [37]. This indicates that the time interval feature is effective in guiding our predictive model toward sensible prediction results. The prediction accuracy at a later future time point (i.e., RVD=37.2% at time 4) is much lower than that of the next predictable time step (i.e., RVD=11.2% at time 3). This is expected, since it is indeed harder to precisely predict tumor growth trends and patterns after a longer period of time, for example, around two years later in our data. As a reference, a recent mathematical modeling based tumor growth prediction method [7] reports relative volume errors for later time predictions ranging from 45% to 123% for breast carcinoma. Another solution for predicting further into the future is to recursively apply the two-time-input model as in [25], i.e., predicting the outcome of time 4 based on time 2 and the predicted time 3 results.
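The recursive scheme and the volume error can be sketched as follows. Here RVD is assumed to be the unsigned volume difference relative to the ground-truth volume, in percent, and `linear_extrapolation` is only a toy stand-in for a trained two-input predictor. All names are hypothetical.

```python
import numpy as np

def relative_volume_difference(pred_mask, true_mask):
    """Unsigned volume difference relative to the ground truth, in percent."""
    v_pred = float(np.count_nonzero(pred_mask))
    v_true = float(np.count_nonzero(true_mask))
    return 100.0 * abs(v_pred - v_true) / v_true

def predict_recursively(model, first, second, n_steps):
    """Feed each prediction back as the most recent input of a two-input model."""
    history = [first, second]
    for _ in range(n_steps):
        history.append(model(history[-2], history[-1]))
    return history[2:]

def linear_extrapolation(prev, curr):
    """Toy stand-in for a trained predictor: continue the last observed change."""
    return curr + (curr - prev)

# Tumor volumes at time 1 and time 2 (arbitrary units), rolled forward to times 3 and 4.
volumes = predict_recursively(linear_extrapolation, 1.0, 1.5, n_steps=2)
print(volumes)

# RVD on toy binary masks: 12 predicted voxels against 10 true voxels.
true_mask = np.zeros((1, 5, 4), dtype=bool)
true_mask[0, :, :2] = True
pred_mask = np.zeros((1, 5, 4), dtype=bool)
pred_mask[0, :3, :] = True
print(relative_volume_difference(pred_mask, true_mask))
```

In the recursive setting, any error in the time 3 prediction is passed into the time 4 input, which is one reason errors tend to grow with the prediction horizon.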

There are some future directions which may further improve our method. First, the loss function used in our model is the major cause of blurry predictions. Adversarial training [20] can increase the sharpness of the predicted image and is straightforward to incorporate into our ST-ConvLSTM network, by using a discriminator to determine whether the generated future image sequence is real or fake during training. Second, alternative network architectures, such as skip and residual connections [22, 14], may complement our current encoder-decoder network as the backbone.
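A minimal sketch of how an adversarial term could be combined with a reconstruction term is given below, using the standard binary cross-entropy GAN objective. The mean squared error reconstruction term, the weight `adv_weight`, and all function names are assumptions made for illustration. The discriminator itself is not modeled, only its output probabilities.

```python
import numpy as np

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy between predicted probabilities and a constant label."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(prob) + (1.0 - label) * np.log(1.0 - prob))))

def discriminator_loss(d_real, d_fake):
    """The discriminator should output 1 for real sequences and 0 for generated ones."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(pred, target, d_fake, adv_weight=0.05):
    """Reconstruction error plus a weighted term for fooling the discriminator."""
    recon = float(np.mean((pred - target) ** 2))
    return recon + adv_weight * bce(d_fake, 1.0)

# Toy values: discriminator probabilities for two real and two generated sequences.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.3, 0.4])
target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
print(discriminator_loss(d_real, d_fake))
print(generator_loss(pred, target, d_fake))
```

During training the two losses would be minimized alternately, the first with respect to the discriminator and the second with respect to the prediction network.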

V Conclusion

In this paper, we have employed and substantially extended ConvLSTM [19] in the 4-dimensional spatial-temporal domain for the task of modeling 4D longitudinal tumor data. The novel ST-ConvLSTM network jointly learns the intra-slice structures, the inter-slice 3D contexts, and the temporal dynamics. We report notably higher accuracy than the original ConvLSTM [19] on several metrics for predicting future tumor volumes. Compared to the most recent 2D+time deep learning based tumor growth prediction models [10, 11], our new approach works directly in the 4D imaging space and incorporates clinical factors in an end-to-end trainable manner. The method can also predict the tumor cell density and radiodensity. Our experiments are conducted on the largest longitudinal pancreatic tumor dataset (33 patients) to date and demonstrate the validity of the proposed method. The presented ST-ConvLSTM model can potentially enable other applications in 3D/4D medical imaging prediction problems.


  • [1] K. R. Swanson, E. Alvord, and J. Murray, “A quantitative model for differential motility of gliomas in grey and white matter,” Cell Proliferation, vol. 33, no. 5, pp. 317–329, 2000.
  • [2] O. Clatz, M. Sermesant, P.-Y. Bondiau, H. Delingette, S. K. Warfield, G. Malandain, and N. Ayache, “Realistic simulation of the 3D growth of brain tumors in MR images coupling diffusion with biomechanical deformation,” TMI, vol. 24, no. 10, pp. 1334–1346, 2005.
  • [3] C. Hogea, C. Davatzikos, and G. Biros, “An image-driven parameter estimation problem for a reaction–diffusion glioma growth model with mass effects,” Journal of Mathematical Biology, vol. 56, no. 6, pp. 793–825, 2008.
  • [4] B. H. Menze, K. Van Leemput, A. Honkela, E. Konukoglu, M.-A. Weber, N. Ayache, and P. Golland, “A generative approach for image-based modeling of tumor growth,” in IPMI. Springer, 2011, pp. 735–747.
  • [5] Y. Liu, S. Sadowski, A. Weisbrod, E. Kebebew, R. Summers, and J. Yao, “Patient specific tumor growth prediction using multimodal images,” Medical Image Analysis, vol. 18, no. 3, pp. 555–566, 2014.
  • [6] K. C. L. Wong, R. M. Summers, E. Kebebew, and J. Yao, “Pancreatic tumor growth prediction with elastic-growth decomposition, image-derived motion, and FDM-FEM coupling,” TMI, vol. 36, no. 1, pp. 111–123, 2017.
  • [7] T. Roque, L. Risser, V. Kersemans, S. Smart, D. Allen, P. Kinchesh, S. Gilchrist, A. L. Gomes, J. A. Schnabel, and M. A. Chappell, “A DCE-MRI driven 3-D reaction-diffusion model of solid tumor growth,” TMI, vol. 37, no. 3, pp. 724–732, 2018.
  • [8] L. Weizman, L. Ben-Sira, L. Joskowicz, O. Aizenstein, B. Shofty, S. Constantini, and D. Ben-Bashat, “Prediction of brain MR scans in longitudinal tumor follow-up studies,” in MICCAI. Springer, 2012, pp. 179–187.
  • [9] M. Morris, R. Greiner, J. Sander, A. Murtha, and M. Schmidt, “Learning a classification-based glioma growth model using MRI data,” Journal of Computers, vol. 1, no. 7, pp. 21–31, 2006.
  • [10] L. Zhang, L. Lu, R. M. Summers, E. Kebebew, and J. Yao, “Personalized pancreatic tumor growth prediction via group learning,” in MICCAI, 2017, pp. 424–432.
  • [11] ——, “Convolutional invasion and expansion networks for tumor growth prediction,” TMI, vol. 37, no. 2, pp. 638–648, 2018.
  • [12] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “Decomposing motion and content for natural video sequence prediction,” in ICLR, 2017.
  • [13] C. Lu, M. Hirsch, and B. Schölkopf, “Flexible spatio-temporal networks for video prediction,” in CVPR, 2017, pp. 6523–6531.
  • [14] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel networks,” in ICML, 2017.
  • [15] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion GAN for future-flow embedded video prediction,” in ICCV, 2017.
  • [16] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs,” in NIPS, 2017, pp. 879–888.
  • [17] W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection–a new baseline,” in CVPR, 2018, pp. 6536–6545.
  • [18] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using LSTMs,” in ICML, 2015, pp. 843–852.
  • [19] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in NIPS, 2015, pp. 802–810.
  • [20] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in ICLR, 2016.
  • [21] V. Patraucean, A. Handa, and R. Cipolla, “Spatio-temporal video autoencoder with differentiable memory,” in ICLR Workshop, 2016.
  • [22] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS, 2016, pp. 64–72.
  • [23] W. Lotter, G. Kreiman, and D. Cox, “Deep predictive coding networks for video prediction and unsupervised learning,” in ICLR, 2017.
  • [24] R. Villegas, J. Yang, Y. Zou, S. Sohn, X. Lin, and H. Lee, “Learning to generate long-term future via hierarchical prediction,” in ICLR, 2017.
  • [25] P. Luc, N. Neverova, C. Couprie, J. Verbeek, and Y. LeCun, “Predicting deeper into the future of semantic segmentation,” in ICCV, vol. 1, 2017.
  • [26] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala, “Video frame synthesis using deep voxel flow,” in ICCV, 2017, pp. 4473–4481.
  • [27] P. Bhattacharjee and S. Das, “Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks,” in NIPS, 2017, pp. 4268–4277.
  • [28] X. Jin, H. Xiao, X. Shen, J. Yang, Z. Lin, Y. Chen, Z. Jie, J. Feng, and S. Yan, “Predicting scene parsing and motion dynamics in the future,” in NIPS, 2017, pp. 6915–6924.
  • [29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [30] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen, “Combining fully convolutional and recurrent neural networks for 3D biomedical image segmentation,” in NIPS, 2016, pp. 3036–3044.
  • [31] J. Cai, L. Lu, Y. Xie, F. Xing, and L. Yang, “Improving deep pancreas segmentation in CT and MRI images via recurrent neural contextual learning and direct loss function,” in MICCAI, 2017.
  • [32] K.-L. Tseng, Y.-L. Lin, W. Hsu, and C.-Y. Huang, “Joint sequence learning and cross-modality convolution for 3D biomedical segmentation,” in CVPR, 2017, pp. 3739–3746.
  • [33] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformations: application to breast MR images,” TMI, vol. 18, no. 8, pp. 712–721, 1999.
  • [34] P. A. Yushkevich, J. Piven, H. C. Hazlett, R. G. Smith, S. Ho, J. C. Gee, and G. Gerig, “User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability,” NeuroImage, vol. 31, no. 3, pp. 1116–1128, 2006.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [36] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available:
  • [37] A. B. Weisbrod, M. Kitano, F. Thomas, D. Williams, N. Gulati, K. Gesuwan, Y. Liu, D. Venzon, I. Turkbey, P. Choyke et al., “Assessment of tumor growth in pancreatic neuroendocrine tumors in von Hippel-Lindau syndrome,” Journal of the American College of Surgeons, vol. 218, no. 2, pp. 163–169, 2014.
  • [38] X. M. Keutgen, P. Hammel, P. L. Choyke, S. K. Libutti, E. Jonasch, and E. Kebebew, “Evaluation and management of pancreatic lesions in patients with von Hippel-Lindau disease,” Nature Reviews Clinical Oncology, vol. 13, no. 9, pp. 537–549, 2016.