## 1 Introduction

Glioma growth modeling refers to the problem of estimating how the tumor burden in a patient might evolve in the future given the current state of the disease and possibly its history. In the past, this was often approached with so-called reaction-diffusion models, where tumor growth is described by the physical diffusion of an assumed tumor cell density. There is a large body of related literature, so we only refer to [mang_integrated_2020, menze_image-based_2011] as overviews. Newer works not included therein are [subramanian_multiatlas_2020, ezhov_neural_2019], among others. All of these approaches have in common that they depend on the quality of the chosen forward model, which always includes simplifying assumptions when compared to the actual biological processes.

A recent alternative was proposed in [petersen_deep_2019] in the form of fully learned growth models. The authors used a probabilistic U-Net to map a fixed-length input sequence of MRI scans to a distribution of possible spatial tumor configurations in the future. The model thus learns a growth model directly from data and doesn’t rely on explicit assumptions and simplifications with respect to the underlying biological process. At the same time, the approach suffers from several limitations, requiring a fixed number of input observations, requiring fixed time intervals between observations, and being able to make predictions only for a single predefined future time point. In order to overcome all of these limitations, we build upon the Neural Process framework [garnelo_conditional_2018, garnelo_neural_2018], a family of conditional generative models specifically conceived to model continuous-valued stochastic processes. They work by encoding so-called *context* observations into a joint representation and then predicting a desired *target* output from it. In particular, we extend the work from [eslami_neural_2018, kumar_consistent_2018], who apply Neural Processes in the image domain, by introducing a convolutional encoder-decoder structure with representations at multiple scales. The single-scale encoder-decoder setups used in [eslami_neural_2018, kumar_consistent_2018] require extremely large decoders to produce outputs with sufficient spatial resolution. As a result, they require vast amounts of data to be trained successfully, and the authors only demonstrate their approaches on synthetic images. Ours is one of the first Neural Processes to be applied to real data in the image domain^{1}^{1}1*Image domain* refers to the fact that the observations are entire images. The original Neural Processes [garnelo_conditional_2018, garnelo_neural_2018] work on images by treating individual pixels as observations., with [kia_neural_2019] the only other work we’re aware of. To allow our approach to model tumor changes with high spatial resolution, we also introduce an attention mechanism [vaswani_attention_2017] that aggregates the context information at different scales. In contrast to existing work with attention in Neural Processes [kim_attentive_2019, rosenbaum_learning_2018], we design the mechanism so that it can model spatial and temporal dependencies simultaneously.

In summary, this work makes the following contributions: 1.) We propose the first fully learned tumor growth model that can make predictions at arbitrary points on the time axis, conditioned on an arbitrary number of available observations. 2.) In doing so, we propose a Neural Process variant for image-based problems that is efficient enough to work successfully on real-world data. 3.) We show that the proposed approach significantly outperforms prior work on the task of learning-based modeling of glioma growth. We compare our approach with both the model proposed in [petersen_deep_2019] and a naive convolutional Neural Process. 4.) We provide a complete implementation at https://github.com/MIC-DKFZ/deep-glioma-growth.

## 2 Methods

### 2.1 Model

The model we propose is shown in Fig. 1. We have a number of observations (MRIs and corresponding tumor segmentations) available at continuous time values . We call those the *context*, with bold indicating a collection henceforth. We are interested in possible tumor segmentations for future *target* times . Each context observation is encoded separately with an encoder that outputs representations hierarchically at multiple scales: . The second index refers to the scale (the first entry has a resolution of , etc.) and the global representations

describe mean and standard deviation of diagonal normal distributions. These means and standard deviations are summed over all context encodings (bottom path in

Fig. 1) to form a single global distribution. Because it is independent of the target , this allows us to sample growth trajectories that are consistent over time and that can be evaluated at arbitrary new points in time. A sample from the distribution (which is 128-dimensional in our case) along with is fed to the decoder to predict the corresponding segmentation. If only the global representation is used, our model reduces to a Neural Process [garnelo_conditional_2018, garnelo_neural_2018] with a convolutional encoder and decoder. We want our approach to be able to model spatial and temporal dependencies simultaneously, so we aggregate the context information at higher scales using a spatio-temporal dot-product attention mechanism [vaswani_attention_2017], conditioned on the target and pixel locations at the given scale (see inset in Fig. 1):(1) | |||

(2) |

are linear maps from what is usually referred to as *queries* (target locations ), *keys* (available locations ) and *values* (the representations at the *key* locations). We also use to construct and because of the success of self-attention in transformer architectures, e.g. [devlin_bert:_2018]. This means that the available representations are weighted according to the similarity of the corresponding key to the queries. The similarity is given by the dot product in a higher-dimensional space (with dimension in our case). Note that in Fig. 1, each attention block represents so-called multihead attention, meaning 8 parallel blocks of the above. The downside of this spatio-temporal attention is that the matrix product requires GPU memory that scales quadratically with the number of pixels. As a result we only use the full spatio-temporal attention at the two lowest scales (resolutions and ) and resort to temporal attention at higher scales, i.e. the same as above but without a dependence on pixel locations.

Our model is trained with the following optimization objective, where are inputs (MRI scans) while refers to segmentations. During training the context is absorbed into the target set (, etc.), meaning we let the models reconstruct the inputs as well instead of only predicting a future segmentation:

(3) |

where is the global representation predicted by the encoder for the given inputs. This is the Neural Process variant of the well-known variational inference objective, with some modifications: 1.) There is a direct dependence of the reconstruction likelihood on , because the attention mechanisms are deterministic (otherwise the influence would only be via ). 2.) The predictive likelihood is usually modeled as a factorizing categorical likelihood, leading to a cross-entropy loss. We instead use the sum of cross-entropy and Dice losses [drozdzal_importance_2016]. As a result, the likelihood no longer factorizes over pixels, so 3.) we introduce a weight for the KL term, as done for example in [higgins_beta-vae:_2016], which re-balances the two loss terms. We optimized this parameter through line-searching the space of possible values in powers of 10, choosing the one that lead to KL loss values most similar to those obtained during training with only cross-entropy: . Note that the objective for the baseline from [petersen_deep_2019] can be written the same way, but without dependence on

. Models are trained for 300 epochs with Adam

[kingma_adam_2015] and a learning rate ofat batch size 128. We give between 2 and 5 inputs as context and always have a single target point, but the models predict both context and target during training. For all other hyperparameters we refer to the provided implementation.

### 2.2 Data & Evaluation

We use a dataset that consists of MRI scans from 379 glioblastoma patients, stemming from a multi-center study that compared treatment using a combination of chemotherapy (Lomustine) and an angiogenesis inhibitor (Bevacizumab) with treatment using chemotherapy alone [wick_lomustine_2017]. The study found no significant difference between test and control arm in terms of overall survival, so we ignore treatment effects in our work and treat it as an additional source of stochasticity. Pre-processing of the data including annotations (segmentations of edema, enhancing tumor and necrosis) was performed by experienced neuroradiologists as described previously [kickingereder_automated_2019]. For each patient there are between 3 and 13 (mean ) longitudinal scans, consisting of native T1, contrast-enhanced T1 (with gadolinium agent), T2 and FLAIR. All scans for a patient are skull-stripped, registered to the native T1 image of the first available scan, and normalized by mean and standard deviation on a per-channel basis. We resample the original data to isotropic and then extract a region of size centered around the tumor. We find that this offers a good trade-off between resolution and compute requirements. Finally, we work on axial slices of the resulting arrays, making extensive use of data augmentation during training. A full list of transforms is given in the implementation. Note that our model only uses skip connections with a resolution of at most . For the evaluation we perform a random 80/20 train/test split of the data (stratified at the patient level to avoid data leakage). Hyperparameter optimization is done via cross-validation on the training set, for the final results we train all models on the full training set and report scores on the test set, leaving out those instances where the true Dice overlap between the target point and the last context point is 0.

We compare our approach with the one presented in [petersen_deep_2019] using their official implementation and a hyperparameter configuration that matches our model (U-Net layer size and depth, loss, etc.). We further compare with a regular Neural Process [garnelo_neural_2018], which is a special case of our model where all the attention skip-connections are removed (see Fig. 1). Because the approach in [petersen_deep_2019] works with a fixed number of input timesteps, we train 4 separate models (2-5 input timesteps) and evaluate them jointly. We look at three different metrics: 1) The *Test Loss*, which is just the optimization objective (Eq. 3) evaluated on the test set, but only using the test points and not the context reconstructions. It is by definition the optimal measure of how well the models perform the task they are trained to perform. Note that this measure is only meaningful when all models are trained with the same loss, which we do. 2) The *Surprise* is the KL divergence, i.e. the second summand in Eq. 3. It can be interpreted as a measure of how much the model has to adjust its prior belief when presented with the true future observation (the posterior). 3) The *Query Volume Dice* takes 100 random samples from the predicted prior and measures the Dice overlap for the sample that best matches the observed future in terms of whole tumor volume. It is best interpreted from an application perspective, when clinicians are interested in possible spatial growth estimates *conditioned* on a potential volume increase. Note that we are mainly interested in a comparison with other learned growth models, and the chosen metrics are not suitable for a comparison with diffusion-based models. We leave such an evaluation for future work.

Test Loss | Surprise | Query Vol. Dice | ||

Parameters | [1e-2] | [nats] | [1e-2] | |

Neural Process [garnelo_neural_2018] | M | |||

Learned Discrete [petersen_deep_2019] | M | |||

Learned Continuous (ours) | M |

indicate that higher/lower is better, bold marks the best-performing method. Errors represent the standard error of the mean.

*Test Loss*evaluates Eq. 3 on the test set,

*Surprise*measures the distance between prior and posterior distributions.

*Query Volume Dice*draws 100 samples from the prior and evaluates the Dice overlap for the sample that is closest to the true future segmentation in terms of tumor volume. Our model outperforms the baselines in all metrics.

## 3 Results

An exemplary case is shown in Fig. 2, with predictions from all models. We find that our model produces predictions with high spatial resolution. It can reconstruct the context segmentations very accurately, while the Neural Process is unable to create sufficient spatial detail. The mean prediction from our model overestimates how fast the necrosis (yellow in Fig. 2) in the tumor center expands, but it is important to consider that the observed future isn’t necessarily the most likely one, so it is expected to find such differences. More importantly, when we sample from the predicted prior and take the samples that are closest to the observed future in terms of tumor volume, our model matches the observed future closely. The model from [petersen_deep_2019] can produce spatial detail comparable to our model, but it can only make a prediction at a single fixed time relative to the context observations.

The average performance of all methods on the test set is shown in Table 1. We evaluate the *Test Loss* (i.e. Eq. 3, but only on target points, not context reconstructions), the *Surprise*, meaning the KL divergence between prior and posterior, as well as the *Query Volume Dice*, which is the Dice for the sample that best matches the observed future in terms of tumor volume (out of 100 samples from the prior). Our method significantly outperforms the competing approaches in all metrics, with only a little more than half the number of parameters that the model from [petersen_deep_2019] uses, because ours doesn’t require separate prior and posterior networks. Interestingly, the test loss of the Neural Process [garnelo_neural_2018] is actually lower than that of the learned discrete model [petersen_deep_2019], even though its perceived quality is much lower (see Fig. 2). It should be noted that the numerical results are not comparable to those reported in [petersen_deep_2019], because we use a slightly different training objective. There are many test cases with only very small changes between the test timestep and the previous context observation. To understand how the performance depends on changes that take place between those times, we iterate a threshold over their true Dice overlap to evaluate the average *Predictive Dice* and Query Volume Dice on the subset of cases below the given threshold. The result is displayed in Fig. 3, where we also include a predictive threshold, which is the average true overlap below the given threshold. Below that line, a model performs worse than one that simply predicts no change between consecutive times. Ours is the only approach that is always above the predictive threshold, while the others actually fall *below* it at higher cutoffs.

## 4 Discussion

This work proposes a fully learned glioma growth model, first introduced in [petersen_deep_2019] as an alternative to commonly used biological diffusion models [mang_integrated_2020, menze_image-based_2011]. While successful, the approach from [petersen_deep_2019] has a number of practical limitations: it requires a fixed number of context observations, it requires a fixed time interval between consecutive observations and it can only make a prediction one time interval step into the future. Our proposed model overcomes all of those limitations and can be conditioned on arbitrarily many context observations on a continuous time axis. From these context observations, our model predicts a distribution over growth trajectories, where each sample is temporally consistent and can be evaluated at any desired continuous-valued time. Our model also significantly outperforms the one from [petersen_deep_2019] in several metrics, which we demonstrate on a dataset ten times larger than the one used in [petersen_deep_2019]. Our model’s main limitation is the high GPU memory requirement of the spatio-temporal attention mechanism we introduce. This is a problem many attention and transformer architectures suffer from, and ways to make them less resource intensive are actively being researched [kitaev_reformer_2020, wang_linformer_2020]. It’s also the reason why we performed our experiments on two-dimensional slices rather than full 3D volumes. As a result, one should be careful not to draw any premature conclusions with respect to a possible clinical application. While our results look promising, the value of our model for clinical purposes, for example in radiation therapy, must be validated extensively. We leave such efforts for future work. A comparison with diffusion-based models, particularly in a radiation therapy context [le_personalized_2017, lipkova_personalized_2019], is another interesting opportunity for future work. Building on the Neural Process framework [garnelo_conditional_2018, garnelo_neural_2018], our proposed approach constitutes an efficient Neural Process variant for image time series and is, to the best of our knowledge, only the second time Neural Processes have been demonstrated on real data in the image domain [kia_neural_2019]. We believe it can prove useful for both other types of tumor growth as well as any other kind of stochastic time series with image data.

## Acknowledgements

Part of this work was funded by the Helmholtz Imaging Platform (HIP), a platform of the Helmholtz Incubator on Information and Data Science.