1 Introduction
In medical imaging, an important task is to analyze temporal image sequences to understand physiological processes of the human body. Dynamic organs, such as the heart or lungs, are of particular interest to study as detected motion patterns are helpful for the diagnosis and treatment of diseases. Moreover, recovering the motion pattern allows to track anatomical structures, to compensate for motion, to do temporal superresolution and motion simulation.
Motion is typically studied by computing pairwise deformations – the registration of each of the images in a sequence with a target image. The resulting dense deformation fields track moving structures from the beginning to the end of the sequence. Providing an invertible and smooth transformation, diffeomorphic registration algorithms such as the SyN algorithm [1] are especially suited for the registration of sequential images. One difficulty is to acquire temporally smooth deformations that are fundamental for tracking. That is why registration algorithms with a temporal regularizer have been proposed [7, 8]
. In the computer vision community, temporal video superresolution and motion compensation are of related interest
[4].However, while these methods produce accurate dense deformations, they do not aim to extract intrinsic motion parameters crucial for building a comprehensive motion model useful for analysis tasks such as motion classification or simulation. Rohé et al. [9] proposed a parameterization, the Barycentric Subspaces, as a regularizer for cardiac motion tracking. Yang et al. [10] generated a motion prior using manifold learning from lowdimensional shapes.
We propose to learn a probabilistic motion model from image sequences directly. Instead of defining a parameterization explicitly or learning from preprocessed shapes, our model captures relevant motion features in a lowdimensional motion matrix in a generic but datadriven way. This learned latent space can be used to fill gaps of missing frames (motion reconstruction), to predict the next frames in the sequence or to generate an infinite number of new motion patterns given only one image (motion simulation). Motion can be also transported by applying the motion matrix on an image of another subject.
We apply a Bayesian inference method
[5] in conjunction with a temporal regularizer [2] with explicit time dependence to learn the probabilistic motion encoding. To enforce temporal consistency, we introduce the selfsupervised training scheme of temporal dropout sampling, which randomly leaves out input frames while trying to recover the complete motion sequence. The framework is learned in an unsupervised fashion from complete image sequences. Smooth, diffeomorphic and symmetric deformations are ensured by applying an exponentiation layer, spatiotemporal regularization and a symmetric local crosscorrelation metric. Besides motion simulation, the model demonstrates stateoftheart registration results for diffeomorphic tracking of cardiac cineMRI. The main contributions are as follows:
[noitemsep]

An unsupervised probabilistic motion model learned from image sequences

A generative model using explicit timedependent temporal convolutional networks trained with selfsupervised temporal dropout sampling

Demonstration of cardiac motion tracking, simulation, transport and temporal superresolution
2 Methods
The motion observed in an image sequence with frames is typically described by deformation fields between a moving image and the fixed images with . Inspired by the probabilistic deformation model of [6]
based on conditional variational autoencoder (CVAE)
[5], we define a motion model for temporal sequences. The model is conditioned on the moving image and parameterizes the set of diffeomorphisms in a lowdimensional probabilistic space, the motion matrix , where is the size of the deformation encoding per image pair. Each column’s code corresponds to the deformation . To take temporal dependencies into account, is conditioned on all past and future time steps. To learn this temporal regularization directly from data, we apply Temporal Convolutional Networks [2] with explicit time dependence and temporal dropout sampling enforcing the network to fill time steps by looking at given past and future deformations. An illustration of the model is shown in Fig. (a)a.2.0.1 Probabilistic Motion Model
Our motion model consists of three distributions. First, the encoder maps each of the image pairs independently to a latent space denoted by
. Second, as the key component of temporal modeling, these latent vectors
are jointly mapped to the motion matrix by conditioning them in all past and future time steps and on the normalized time : . Finally, the decoder aims to reconstruct the fixed image by warping the moving image with the deformation . This deformation is extracted from the temporally regularized codes and conditioned on the moving image.The distributions , ,
are approximated by three neural networks with trainable parameters
, , . During training, a lower bound on the data likelihood is maximized with respect to a prior distribution of the latent space (cf. CVAE [5]). The prioris assumed to follow a multivariate unit Gaussian distribution with spherical covariance
: . The objective function results in optimizing the expected loglikelihood and the KullbackLeibler (KL) divergence enforcing the posterior distribution to be close to the prior for all time steps:(1) 
Unlike the traditional CVAE model, the temporal regularized code is used in the loglikelihood term instead of the . We model as a symmetric local crosscorrelation Boltzmann distribution with the weighting factor . Encoder and decoder weights are independent of the time
. Their network architecture consists of convolutional and deconvolutional layers with fullyconnected layers for mean and variance predictions in the encoder part
[5]. We use an exponentiation layer for the stationary velocity field parameterization of diffeomorphisms [6], a linear warping layer and diffusionlike regularization with smoothing parameters in spatial and in temporal dimension.2.0.2 Temporal Convolutional Networks with Explicit Time Dependence
Since the parameters of encoder and decoder are independent of time, the temporal conditioning
plays an important role in merging information across different time steps. In our work, this regularization is learned by Temporal Convolutional Networks (TCN). Consisting of multiple 1D convolutional layers with increasing dilation, TCN can handle input sequences of different lengths. TCN have several advantages compared to recurrent neural networks such as a flexible receptive field and more stable gradient computations
[2].The input of the TCN is the sequence of concatenated with the normalized time . Providing the normalized time explicitly, provides the network with information on where each is located in the sequence. This supports the learning of a motion model from data representing the same type of motion with varying sequence lengths. The output of the TCN is the regularized motion matrix
. We use noncausal instead of causal convolutional layers to also take future time steps into account. We follow the standard implementation using zeropadding and skip connections. Each layer contains
filters. A schematic representation of our TCN is shown in Fig. (b)b.2.0.3 Training with Temporal Dropout Sampling
Using Eq. 1 for training could lead to learning the identity transform in the TCN such that deformations of the current time step are independent of past and future time steps. To avoid this and enforce the model to search for temporal dependencies during the training, we introduce the concept of temporal dropout sampling (TDS). In TDS, some of the are sampled from the prior distribution instead of only sampling from the posterior distribution as typical for CVAE. At the time steps the prior has been used for sampling, the model has no knowledge of the target image and is forced to use the temporal connections within the TCN in order to minimize the objective.
More precisely, at each time step , a sample from the prior distribution is selected instead of a posterior sample
using a binary Bernoulli random variable
. All independent Bernoulli random variableshave the success probability
. The latent vector can be defined as:(2) 
Fig. (c)c illustrates the TDS procedure. At test time, for each time step independently, one can either draw from the prior or take the encoder’s prediction.
3 Experiments
We evaluate our motion model on 2D cardiac MRIcine data. First, we demonstrate accurate temporal registration by evaluating motion tracking and compensation of the cardiac sequence, taking the enddiastolic (ED) frame as the moving image . Stabilization, is accomplished by warping all frames to the ED frame. Pairwise registration results are presented for EDES (endsystolic) frame pairs. Second, we present motion transport, motion sampling and reconstruction with a limited number of frames.
3.0.1 Data
We used 334 shortaxis sequences acquired from different hospitals including 150 sequences from the Automatic Cardiac Diagnosis Challenge 2017 (ACDC [3]). The remaining cases were obtained from the EU FP7funded project MDPaedigree, mixing congenital heart diseases with healthy and pathological images from adults. The sequence length varied from 13 to 35 frames. We used the 100 cases from ACDC that contain EDES segmentation information for testing while the remaining sequences were used for training. All slices were resampled with a spacing of 1.5 1.5 mm and cropped to a size of 128 128 pixels.
3.0.2 Implementation Details
The encoder consisted of 4 convolutional layers with strides (2, 2, 2, 1) and the decoder had 3 deconvolutional and 1 convolutional layer before the exponentiation and warping layers (Fig.
(a)a). The TCN consists of four layers with dilation (1, 2, 4, 8) and a kernel size of 3 (cf. Fig. (b)b). The regularization parameters and were set to 3 mm respectively 1.5. The loss weighting factor was chosen empirically as . The deformation encoding size was set to 32. The dropout sampling probabilitywas 0.5. We applied the Adam optimization method with a learning rate of 0.00015 and a batch size of one. We performed data augmentation onthefly by randomly shifting, rotating, scaling and mirroring images. We implemented the model in Tensorflow with Keras. The training time was 15h on a NVIDIA GTX TITAN X GPU.
3.1 Registration: Tracking and Motion Compensation
We compare our model with and without the temporal dropout sampling (Our w/o TDS) with three stateoftheart methods: SyN [1], the learningbased algorithm (LPR, 2D singlescale version [6]) and the bsplinebased 4D algorithm in elastix [7]. In Fig. 2, tracking results are visualized for the test data taking sequences of all slices where segmentation in ED and ES frames are available, resulting in 677 sequences. We report the root mean square error (RMSE), the spatial (Spatial Grad.) and temporal gradient (Temp. Grad) of the displacement fields for evaluating the smoothness of the resulting deformations. Our model shows spatially and temporal smoother deformations. We also report DICE scores and 95%tile Hausdorff distances (HD in mm) on five anatomical structures: myocardium (LVMyo) and epicardium (LV) of the left ventricle, left bloodpool (LVBP), right ventricle (RV) and LV+RV. Note, that DICE scores and HD were evaluated on EDES frame pairs only, while the other metrics were computed on the full sequence. The proposed method showed improved mean DICE scores and smaller mean HD of 84.6%, 6.2mm (w/o TDS: 84.7%, 6.1mm) compared to SyN, LPR, 4DElastix with (82.7%, 7.0mm), (82.1%, 6.6mm) respectively (83.7%, 6.3mm%). Compared to training without temporal dropout sampling, HD and DICE scores show minimal differences. However, only TDS offers consistent motion simulation and temporal interpolation. Also, deformation regularity is improved using TDS.
Visual results for one case are shown on the left of Fig. 3. On the figure’s right side the LV volume curves computed by warping the ED mask are plotted. One can see that the SyN algorithm underestimates big deformations. Both, the volume and the gradient metrics show smoother deformations for both versions of our motion model compared to the SyN, LPR and 4DElastix algorithms.
3.2 Sampling, Sequence Reconstruction and Motion Transport
We then evaluate motion sampling, reconstruction and transport. We extract simulated reconstructed motion patterns if we provide our model with different subsets of images from the original sequence. In the time steps without frames, the motion matrix is created by randomly sampling from the prior distribution as depicted in Fig. (c)c (take same for all slices of one volume). Here, we choose sampling over interpolation in
space to remain an uncertain, a probabilistic, estimation of the interpolated deformations. The left side of Fig.
4 shows LV volume curves and reconstruction errors if every second, every fifth, only the 10th, no frame (sampling) or the first 5 frames are provided besides the moving frame (ED). The LV volume errors are computed on all test sequences by taking the mean absolute differences between sampled and tracking volumes. We compare with linear and cubic interpolation of velocities (extracted from tracking) between given times. One can see that our model’s relative performance in recovering the LV volume increases when fewer frames are provided. Given the first five frames or only one additional frame (10th frame), the model estimates the motion consistently with plausible cardiac motion patterns for the missing time steps. In the cases of providing every second and every fifth frame, our method performs equally good or marginally better. Note, the motion simulation given only the ED frame (sampling) does not overlap with the original motion, which is not intended. Nevertheless, one can see cardiac specific motion patterns such as the plateau phase before the atrial systole.Motion can be transported by taking the motion matrix from one sequence and apply it on the ED frame of another sequence. The right side of Fig. 4 shows the transport of a pathological motion to a healthy subject and vice versa. The resulting simulated motion shows similar heart contractions and motion characteristics as the originating motion while the transported deformations are adapted to the target image without requiring explicit intersubject registration.
4 Conclusions
In this paper, we presented an unsupervised approach for learning a motion model from image sequences. Underlying motion factors are encoded in a lowdimensional probabilistic space, the motion matrix, in which each column represents the deformation between two frames of the sequence.
Our model demonstrated accurate motion tracking and motion reconstruction from missing frames, which can be useful for shorter acquisition times and temporal superresolution. We also showed motion transport and simulation by using only one frame. For future work, we aim to further explore the spatial coherence between slices for 3D applications and the influence of using different training datasets (pathological and nonpathological) on the learned motion matrix.
Disclaimer: This feature is based on research, and is not commercially available. Due to regulatory reasons its future availability cannot be guaranteed.
References
 [1] Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C.: Symmetric diffeomorphic image registration with crosscorrelation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12(1), 26–41 (2008)
 [2] Bai, S., Kolter, J.Z., et al.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018)

[3]
Bernard, O., et al.: Deep learning techniques for automatic MRI cardiac multistructures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging
37(11), 2514–2525 (2018) 
[4]
Caballero, J., Ledig, C., Aitken, A., et al.: Realtime video superresolution with spatiotemporal networks and motion compensation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4778–4787 (2017)

[5]
Kingma, D.P., et al.: Semisupervised learning with deep generative models. In: Advances in Neural Information Processing Systems. pp. 3581–3589 (2014)
 [6] Krebs, J., Delingette, H., Mailhé, B., et al.: Learning a probabilistic model for diffeomorphic registration. IEEE Transactions on Medical Imaging 38 (Feb 2019)
 [7] Metz, C., Klein, S., Schaap, M., van Walsum, T., Niessen, W.J.: Nonrigid registration of dynamic medical imaging data using nd+ t bsplines and a groupwise optimization approach. Medical image analysis 15(2), 238–249 (2011)
 [8] Qin, C., Bai, W., et al.: Joint learning of motion estimation and segmentation for cardiac mr image sequences. In: International Conference on Medical Image Computing and ComputerAssisted Intervention. pp. 472–480. Springer (2018)
 [9] Rohé, M.M., Sermesant, M., Pennec, X.: Lowdimensional representation of cardiac motion using barycentric subspaces: A new groupwise paradigm for estimation, analysis, and reconstruction. Medical image analysis 45, 1–12 (2018)
 [10] Yang, L., Georgescu, B., Zheng, Y., et al.: Prediction based collaborative trackers (pct): A robust and accurate approach toward 3d medical object tracking. IEEE transactions on medical imaging 30(11), 1921–1932 (2011)