Time-Dependent Deep Image Prior for Dynamic MRI
We propose a novel unsupervised deep-learning-based algorithm to solve the inverse problem found in dynamic magnetic resonance imaging (MRI). Our method needs neither prior training nor additional data; in particular, it does not require either electrocardiogram or spokes-reordering in the context of cardiac images. It generalizes to sequences of images the recently introduced deep-image-prior approach. The essence of the proposed algorithm is to proceed in two steps to fit k-space synthetic measurements to sparsely acquired dynamic MRI data. In the first step, we deploy a convolutional neural network (CNN) driven by a sequence of low-dimensional latent variables to generate a dynamic series of MRI images. In the second step, we submit the generated images to a nonuniform fast Fourier transform that represents the forward model of the MRI system. By manipulating the weights of the CNN, we fit our synthetic measurements to the acquired MRI data. The corresponding images from the CNN then provide the output of our system; their evolution through time is driven by controlling the sequence of latent variables whose interpolation gives access to the sub-frame, or even continuous, temporal control of reconstructed dynamic images. We perform experiments on simulated and real cardiac images of a fetus acquired through 5-spoke-based golden-angle measurements. Our results show improvement over the current state-of-the-art.
There are currently three main approaches to accelerate the magnetic resonance imaging (MRI) of a static image. All three rely on a partial sampling of the k-space to reduce the acquisition time. The resulting partial loss of data must then be compensated to maintain the quality of the image. Once compensation is achieved, accelerated methods can accurately capture the motion of fast-moving organs such as the heart.
In parallel MRI (pMRI), the simultaneous use of several hardware coils results in spatial redundancy that enables algorithms to reconstruct clean images [griswold2002generalized, pruessmann1999sense].
In compressed sensing (CS) MRI, the data are assumed to be sparse in certain transform domains [lustig2007sparse, fessler2010model]. This ultimately leads to regularized computational methods that compensate for the partial sampling. Their success suggests that, in particular, a Fourier-based forward model matches well the assumption of sparsity.
In the context of trainable deep artificial neural networks, learning approaches have already achieved fast and accurate reconstructions of partially sampled MRI data [lecun2015deep, krizhevsky2012imagenet]. Similarly to CS MRI, dynamic accelerated reconstructions have also been proposed in the literature [schlemper2018deep, hauptmann2019real, biswas2019dynamic], possibly in combination with pMRI in the learning loop [hammernik2018learning]. These approaches depend on training datasets [sun2016deep, wang2016accelerating, han2018deep, mardani2019deep, jin2017deep, tezcan2018mr].
In the context of dynamic MRI, the approach that consists in the acquisition of a sequence of frames is suboptimal. Instead, it is more efficient to take advantage of the time dependencies between frames to gain additional improvements in terms of temporal resolution [jung2009k, lingala2011accelerated]. For instance, [feng2014golden, feng2016xd] design non-overlapping sampling masks at each frame to restore a dynamic volume—a method that demands far fewer k-space samples than would a sequence of static MRI. Indeed, the CS theory explains the possibility of perfect reconstruction despite a sub-Nyquist sampling [candes2006uncertainty]. The capture of temporal redundancies has also been handled through low-dimensional manifolds [poddar2015dynamic, nakarmi2017ker]. In the specific case of cardiac applications, the overall motion of the heart is expected to be approximately cyclic. This periodicity can be exploited to bring additional gains in terms of temporal resolution, but the length and phase of the cycles must be determined first. This is usually achieved either through electrocardiograms (ECG) or self-gating [yerly2016coronary, chaptinel2017fetal]. Under the restrictive assumption of ideal periodicity, these methods allow one to prefix the cardiac phases and to reorder temporally the acquired frames, effectively producing a stroboscope-inspired analysis of the motion. Motion irregularities are not captured by those methods.
In this paper, we propose an unsupervised learning framework in which a generative network reconstructs fast dynamic MRI from golden-angle radial lines in k-space, also called spokes. To reconstruct a single image, we feed one realization of low-dimensional latent variables to a generative artificial neural network. A nonuniform fast Fourier transform (NuFFT, https://github.com/marchdf/python-nufft) is then applied to the output of the generative network and simulates the MRI measurement process. Inspired by deep image priors [ulyanov2018deep], we fit the simulated measurements to the real measurements. The fit is controlled by adjusting the weights of the generative network until the Euclidean loss between the simulated and real measurements is minimized; this fitting process is referred to as the learning stage.
In the context of dynamic MRI, we extend the fitting process in such a way that the weights of the network are learned by taking simultaneously into consideration the joint collection of all acquisitions, which yields time-independent weights. Time dependence is recovered by controlling the latent variables. Given some temporal interval, we synthesize two independent random realizations of low-dimensional latent variables and associate one to the initial and one to the final bound of the interval. Timestamped contributions to learning are obtained by taking as ground-truth the real measurements acquired over a quasi-instantaneous observation period (say, five spokes), while we let the activation of the generative network be the intermediate realization of the latent variables obtained by linear interpolation of the two latent endpoints. This approach allows us to impose and exploit temporal dependencies in the latent space.
In short, the action of our generative model is to map the manifold of latent variables onto a manifold of dynamic images. Importantly, our approach is purely unsupervised; moreover, priors are imposed only indirectly, arising from the mere structure of a convolutional network. We demonstrate the performance of our generative network by comparing its reconstructions to those obtained from CS algorithms [feng2014golden, feng2016xd].
Deep image priors were introduced in [ulyanov2018deep]. They capitalize on the structure of convolutional artificial neural networks to adjust priors to the data, thus avoiding the limitations and pitfalls of hand-crafted priors. They have been deployed in [yazdanpanah2019non] to build an unsupervised learning scheme for accelerated MRI; but, contrary to ours, the task addressed therein is static. Other researchers have used deep image priors to reconstruct positron emission tomography images, albeit again in a non-dynamic fashion [gong2018pet].
Let $\mathcal{R}\{s\}$ be the Radon transform of the complex-valued continuously defined image $s$, so that
$$\mathcal{R}\{s\}(y,\theta)=\int_{\mathbb{R}}s\bigl(y\,\boldsymbol{\theta}+\tau\,\boldsymbol{\theta}^{\perp}\bigr)\,\mathrm{d}\tau,$$
where $y$ and $\theta$ are the spatial and angular Radon arguments, respectively, and where $\boldsymbol{\theta}=(\cos\theta,\sin\theta)$ is a unit vector in the $\theta$ direction. Moreover, let $\mathcal{F}$ denote the one-dimensional continuous Fourier transform that follows the convention
$$\mathcal{F}\{g\}(\omega)=\int_{\mathbb{R}}g(y)\,\mathrm{e}^{-\mathrm{j}\,\omega y}\,\mathrm{d}y$$
at spatial pulsation $\omega$, otherwise known as a coordinate in k-space. Then, our conceptual model of non-Cartesian MRI is the concatenated linear transform $\mathcal{H}=\mathcal{F}\circ\mathcal{R}$, which maps a two-dimensional static image $s$ onto its measurements $\mathcal{H}\{s\}(\omega,\theta)$ at continuously defined direction $\theta$ and pulsation $\omega$. Mathematically, $\mathcal{H}$ is invertible because so are $\mathcal{F}$ and $\mathcal{R}$. Consequently, provided we know $\mathcal{H}\{s\}(\omega,\theta)$ for every direction $\theta$ and every pulsation $\omega$, we can in principle recover the value of $s(\mathbf{x})$ at every spatial argument $\mathbf{x}$.
Unfortunately, $\mathcal{H}\{s\}$ can be known in practice only at finitely many discrete directions and pulsations. This discretization, which amounts to modeling MRI by a NuFFT, makes the discrete transform an unfaithful surrogate of our conceptual model, in particular because of aliasing concerns, and also because the discrete version is no longer invertible.
Formally, let $\mathbf{s}\in\mathbb{C}^{N}$ be a vectorized version of the samples of $s$ seen as an image of finite size, with $N$ pixels. Likewise, let $\mathbf{m}\in\mathbb{C}^{M}$ be a vectorized version of the samples of the sinogram $\mathcal{R}\{s\}$, with measurements of finite size taken over $M_{\theta}$ orientations and $M_{\omega}$ elements of the k-space, with $M=M_{\theta}M_{\omega}$. Then, by linearity of the transformation we write that
$$\mathbf{m}=\mathbf{H}\,\mathbf{s},$$
where $\mathbf{H}$ is an $M$-rows by $N$-columns matrix that combines discrete Fourier and discrete Radon transforms.
In dynamic MRI, it is acknowledged that the image changes continuously through time. We assume however that the measurements of a spoke (a radial line in k-space, as suggested in Figure 1) are instantaneous and indexed at discrete times $t_{m}=m\,\Delta t$, taken regularly at temporal interval $\Delta t$. The spoke orientations follow the golden-angle strategy
$$\theta(t)=\omega_{\theta}\,t\ (\mathrm{mod}\ \pi),$$
where $\theta(t)$ gives the orientation of a spoke at continuous time $t$, with $\omega_{\theta}$ its angular velocity. The golden-angle specificity is the irrationality condition $\omega_{\theta}\,\Delta t/\pi\notin\mathbb{Q}$, which is approximated by setting $\omega_{\theta}\,\Delta t=111.25^{\circ}$ [feng2014golden]. We finally denote an image frame at time $t_{m}$ as $s_{m}$ and its spatially discretized version of length $N$ as $\mathbf{s}_{m}$. Its associated sinogram is $\mathbf{m}_{m}$. As it turns out, however, it is only natural to set $M_{\theta}=1$ for the discretization of $\mathbf{m}_{m}$ because of our assumption of ideal temporal sampling. Then, $\mathbf{m}_{m}$ is a direct representation of a spoke and has length $M_{\omega}$. Its dependence on $\theta(t_{m})$ is encoded in the time-dependent $M_{\omega}$-rows by $N$-columns system matrix $\mathbf{H}_{m}$, with $\mathbf{m}_{m}=\mathbf{H}_{m}\,\mathbf{s}_{m}$.
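As an illustration, the golden-angle schedule can be sketched in a few lines. The function name and the wrap-around to $[0^{\circ},180^{\circ})$ are our own choices for this sketch, not taken from the paper's code.

```python
import numpy as np

GOLDEN_ANGLE_DEG = 111.25  # rational approximation of the golden angle [feng2014golden]

def spoke_angles(num_spokes):
    """Orientation, in degrees within [0, 180), of each golden-angle spoke."""
    m = np.arange(num_spokes)
    return np.mod(m * GOLDEN_ANGLE_DEG, 180.0)
```

Because the increment is (ideally) an irrational fraction of a half-turn, successive spokes never share an orientation, which is the practical consequence of the irrationality condition above.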
In accelerated dynamic MRI, one acquires $\mathbf{m}_{m}$ for $m\in\{0,\ldots,M_{t}-1\}$ and wants to reconstruct $\mathbf{s}_{m}$ or, possibly, $s(\cdot,t)$ for arbitrary $t$. Clearly, however, a single orientation does not provide sufficient data for the recovery of the two-dimensional $\mathbf{s}_{m}$. To overcome this issue, we assume next that the changes are slow over some small temporal window, so that $s(\cdot,t)\approx s(\cdot,t_{m})$ for all $t$ in the half-open interval $\bigl[t_{m}-\tfrac{N_{s}}{2}\Delta t,\,t_{m}+\tfrac{N_{s}}{2}\Delta t\bigr)$. In practice, the odd number $N_{s}$ corresponds to the number of radial lines used for reconstruction and is related to the temporal resolution of dynamic imaging. We then collect the $N_{s}$ neighboring spokes and concatenate them in the vector $\tilde{\mathbf{m}}_{m}$ of length $N_{s}M_{\omega}$. (In the practical case of finite-time acquisitions, it may happen that not all neighbors are available for some $m$. In this case, which happens near the beginning and the end of the temporal range, we use one-sided nearest spokes.) Through this mechanism, there are spokes that are shared between, say, $\tilde{\mathbf{m}}_{m}$ and $\tilde{\mathbf{m}}_{m+1}$.
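The spoke-sharing window, including the one-sided fallback near the temporal boundaries, can be sketched as follows; the function name and its arguments are illustrative.

```python
import numpy as np

def shared_spoke_indices(m, num_lines, num_spokes):
    """Indices of the `num_lines` (odd) spokes nearest to frame m.

    Near the beginning and the end of the acquisition, where a centered
    window would fall outside the acquired range, the window slides to use
    one-sided nearest spokes instead, as described above.
    """
    assert num_lines % 2 == 1, "an odd number of radial lines is assumed"
    half = num_lines // 2
    start = min(max(m - half, 0), num_spokes - num_lines)
    return np.arange(start, start + num_lines)
```

Consecutive frames then share all but one of their spokes, which is the mechanism that lets the number of reconstructed frames match the number of acquired spokes.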
The dependence of $\tilde{\mathbf{m}}_{m}$ on time is encoded now in the time-dependent $N_{s}M_{\omega}$-rows by $N$-columns matrix $\tilde{\mathbf{H}}_{m}$, so that (4) becomes time-dependent by writing that
$$\tilde{\mathbf{m}}_{m}=\tilde{\mathbf{H}}_{m}\,\mathbf{s}_{m}.$$
Because of the irrationality condition of the golden-angle approach, no direction will ever be measured twice and $\tilde{\mathbf{H}}_{m}$ acquires effective time dependence.
Even with $N_{s}>1$, it is observed that $N_{s}M_{\omega}\ll N$, which makes severely ill-posed the recovery of $\mathbf{s}_{m}$ given $\tilde{\mathbf{m}}_{m}$. To truly resolve this issue, practitioners often choose to regularize the problem over some extended temporal range. From a notational perspective, vectors $\tilde{\mathbf{m}}_{m}$ are concatenated over a large duration of $M_{t}$ frames to build $\tilde{\mathbf{M}}$. Likewise, we write that $\mathbf{S}$ concatenates the frames $\mathbf{s}_{m}$ and that $\tilde{\mathbf{H}}$ collects the matrices $\tilde{\mathbf{H}}_{m}$. The lengths of $\tilde{\mathbf{M}}$ and $\mathbf{S}$ are $M_{t}N_{s}M_{\omega}$ and $M_{t}N$, respectively. The size of $\tilde{\mathbf{H}}$ ensues.
In the context of CS dynamic imaging, the traditional regularization of the forward model (6) is established as a search for the solution
$$\mathbf{S}^{\ast}=\arg\min_{\mathbf{S}}\Bigl(\bigl\|\tilde{\mathbf{H}}\,\mathbf{S}-\tilde{\mathbf{M}}\bigr\|_{2}^{2}+\lambda\,\bigl\|\mathbf{D}\,\mathbf{S}\bigr\|_{1}\Bigr),$$
where $\mathbf{D}$ is a sparsifying transform along the temporal domain. Typically, this transform is a finite-difference operator, used as a surrogate of the conceptual first-order derivative $\partial/\partial t$. The corresponding regularization term encourages temporal dependency and counterbalances the ill-posedness of (6).
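The regularized objective can be sketched numerically as follows. Here a generic matrix stands in for the NuFFT-based forward model, and the sparsifying transform is a plain temporal finite difference; the function name and arguments are our own.

```python
import numpy as np

def cs_objective(H, s, m, lam):
    """Data fidelity plus l1-penalized temporal finite differences.

    `s` holds the frames as rows (time along axis 0), `H` is a generic
    linear forward model applied frame-wise, `m` holds the per-frame
    measurements, and `lam` balances the two terms.
    """
    fidelity = sum(np.linalg.norm(H @ s[t] - m[t]) ** 2 for t in range(s.shape[0]))
    temporal_tv = np.abs(np.diff(s, axis=0)).sum()  # surrogate of d/dt, promoting slow motion
    return fidelity + lam * temporal_tv
```

Minimizing this objective trades off consistency with the acquired spokes against temporal smoothness of the reconstructed sequence.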
By contrast, there exists no explicit regularizer in the context of dynamic MRI reconstruction by traditional neural networks [schlemper2018deep, qin2019convolutional]. In return, image priors are data-driven, imposed by the supervised learning of ground-truth pairs. Letting $f_{\mathbf{w}}$ represent the function that the network implements, where $\mathbf{w}$ gives network parameters such as weights and biases, learning will return the solution
$$\mathbf{w}^{\ast}=\arg\min_{\mathbf{w}}\;\mathbb{E}\Bigl[\bigl\|f_{\mathbf{w}}\bigl(\tilde{\mathbf{H}}^{H}\tilde{\mathbf{M}}\bigr)-\mathbf{S}\bigr\|_{2}^{2}\Bigr].$$
There, $\tilde{\mathbf{H}}^{H}$ is the Hermitian transpose of $\tilde{\mathbf{H}}$. With some abuse of notation, the learning process is represented by the expectation operator $\mathbb{E}$. The set of pairs $(\tilde{\mathbf{M}},\mathbf{S})$ provides the learning data. After learning has converged to some $\mathbf{w}^{\ast}$, reconstructed images are obtained as $f_{\mathbf{w}^{\ast}}(\tilde{\mathbf{H}}^{H}\tilde{\mathbf{M}})$.
Supervised learning cannot be performed in the absence of ground truth; unsupervised learning methods must be used instead. To that effect, generative networks associated with deep image priors have been proposed in [radford2015unsupervised], while a cost function that is appropriate to unsupervised learning has been developed in [ulyanov2018deep]. In this paper, we propose an extension whereby we address the needs of dynamic MRI by performing temporal interpolation of the latent variables. The sketch of our approach is provided in Figure 2.
As an introduction, we address first the static applications of inpainting and denoising. There, the purpose of deep-image-prior algorithms is to reconstruct a clean signal $\mathbf{s}$ given the perturbed measurement $\mathbf{m}$. The generative network $G_{\mathbf{w}}$ with optimal parameter
$$\mathbf{w}^{\ast}=\arg\min_{\mathbf{w}}\bigl\|\mathbf{H}\,G_{\mathbf{w}}(\mathbf{z})-\mathbf{m}\bigr\|_{2}^{2}$$
minimizes a data-fidelity term characterized by the forward model $\mathbf{H}$, which is assumed to be known. The low-dimensional latent variable $\mathbf{z}$ is maintained at a fixed value during the whole optimization process. Often, the minimizer $\mathbf{w}^{\ast}$ is calculated using some stochastic gradient-descent method with random initial parameter $\mathbf{w}_{0}$ [kingma2014adam]. In summary, deep-image-prior methods achieve strong priors from the structure of a generator network and capture advanced image statistics in a purely unsupervised fashion.
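The static fit above can be illustrated with a toy stand-in: a linear "generator" replaces the CNN, a random matrix replaces the forward model, and plain gradient descent replaces Adam. Every shape and constant here is illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

H = rng.standard_normal((8, 16))   # stand-in for the known forward model
m = rng.standard_normal(8)         # "acquired" measurements
z = rng.standard_normal(4)
z /= np.linalg.norm(z)             # fixed latent variable, kept constant throughout
W = rng.standard_normal((16, 4))   # "network weights", randomly initialized

lr = 1e-2
for _ in range(20_000):
    r = H @ (W @ z) - m            # residual between simulated and real measurements
    W -= lr * np.outer(H.T @ r, z) # gradient of the Euclidean loss w.r.t. W
reconstruction = W @ z             # signal estimate read off the generator
```

After the fit, the reconstruction is whatever the generator produces for the fixed latent input; only the weights were ever adjusted.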
Our main contribution in this paper is to use interpolated latent variables as inputs of a generative neural network. We start by obtaining two realizations of random discrete images; these realizations are kept for the entire duration of the procedure. The first realization, denoted by $\mathbf{z}_{0}$, takes the role of the latent variable associated to the beginning of the dynamic MRI sequence. The second realization is denoted by $\mathbf{z}_{1}$ and is associated to the end of the sequence. Intermediate realizations are built as shown in Figure 3 and are also used as latent variables.
Choosing the dimension of the latent variables to be small, we conjecture that the linearly interpolated latent variables span a low-dimensional, smooth manifold. In other words, we interpret our proposal as a way to impose data-driven temporal dependency in the latent space. A generative convolutional neural network will then transfer this low-dimensional manifold to some corresponding low-dimensional manifold in the MRI space, the central tenet of this work being that the frames of dynamic imaging do span a low-dimensional manifold, too.
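The linear interpolation between the two endpoint realizations can be sketched as follows (the function name is our own; in the paper the latents are small random images rather than vectors, but the interpolation acts identically element-wise).

```python
import numpy as np

def interpolated_latents(z_start, z_end, num_frames):
    """Linearly interpolate between the two endpoint latent realizations,
    yielding one latent variable per frame (endpoints included)."""
    alphas = np.linspace(0.0, 1.0, num_frames)
    return np.array([(1 - a) * z_start + a * z_end for a in alphas])
```

Because the interpolation is defined for every value of the interpolation weight, the same construction also provides latent variables at sub-frame, continuously defined times.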
For a single coil, our deep prior minimizes a Euclidean loss and results in the solution
$$\mathbf{w}^{\ast}=\arg\min_{\mathbf{w}}\sum_{m}\bigl\|\tilde{\mathbf{H}}_{m}\,G_{\mathbf{w}}(\mathbf{z}_{m})-\tilde{\mathbf{m}}_{m}\bigr\|_{2}^{2},$$
where $G_{\mathbf{w}}$ represents the deep generative network of parameter $\mathbf{w}$. For $Q$ coils in pMRI, we establish the solution
$$\mathbf{w}^{\ast}=\arg\min_{\mathbf{w}}\sum_{q=1}^{Q}\sum_{m}\bigl\|\tilde{\mathbf{H}}_{m}\bigl(\mathbf{c}_{q}\odot G_{\mathbf{w}}(\mathbf{z}_{m})\bigr)-\tilde{\mathbf{m}}_{m,q}\bigr\|_{2}^{2},$$
where $\mathbf{c}_{q}$ gives the sensitivity map of the $q$th coil, $\odot$ is a pixel-wise multiplication operator in the spatial domain which relates the true magnetization image to the coil sensitivities, and $\tilde{\mathbf{m}}_{m,q}$ concatenates instantaneous acquisitions of spokes for the $q$th coil. Once an optimal $\mathbf{w}^{\ast}$ has been found in either (10) or (11), we can produce the final estimate for all values of $t$, including unobserved intermediate instants if desired, as the output of $G_{\mathbf{w}^{\ast}}$ driven by the interpolated latent variable at time $t$.
[Table I: architecture of the generative network; its columns list the number of filters, the size of each filter (X×Y×C), the zero padding (X×Y), and the size of the output image (X×Y×C).]
We design our generative network as shown in Table I. The network consists of convolutional layers, batch-normalization layers, ReLU activations, and nearest-neighbor interpolations. We apply zero-padding before each convolution to let the size of the output mirror that of the input. At the last layer, no ReLU is used. The input of the network is a small-size random variable generated from the uniform distribution, as explained in Section II-D2. The output has two channels because MRI images take complex values.
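A minimal PyTorch sketch consistent with this description follows. The filter counts, depth, and number of upsampling stages are placeholders rather than the exact values of Table I.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Zero padding of 1 with 3x3 kernels preserves the X-Y size, as described above.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    """Conv + batch-norm + ReLU blocks with nearest-neighbor upsampling;
    the last layer is a plain convolution (no ReLU) with two output
    channels for the real and imaginary parts of the MRI image."""

    def __init__(self, latent_ch=8, width=32, num_upsamplings=3):
        super().__init__()
        layers = [conv_block(latent_ch, width)]
        for _ in range(num_upsamplings):
            layers += [nn.Upsample(scale_factor=2, mode="nearest"),
                       conv_block(width, width)]
        layers += [nn.Conv2d(width, 2, kernel_size=3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

z = torch.rand(1, 8, 16, 16)  # small latent image drawn from the uniform distribution
out = Generator()(z)          # two-channel (complex-valued) image, spatially upsampled
```

Each nearest-neighbor stage doubles the spatial size, so a small latent image grows into a full-size two-channel output.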
All experimental datasets are breath-hold. We use golden-angle radial sparse parallel (GRASP) MRI as a common baseline [feng2014golden]. Spoke-sharing is not applied for GRASP. We assume two-fold upsampling of measurements for every dataset. Therefore, the size of the reconstructed fields of view is half that of the first dimension of the measurements.
Retrospective Simulation A cardiac cine data set was acquired using a 3T whole-body MRI scanner (Siemens; Tim Trio) equipped with a 32-element cardiac coil array. The acquisition sequence was bSSFP and prospective cardiac gating was used. The imaging parameters were as follows: FOV=, acquisition matrix size=, TE/TR=, receiver bandwidth=, and flip angle=. The number of cardiac phases was and the temporal resolution was . We then used NuFFT to implement the forward model in a golden-angle context, resulting in fully sampled Cartesian trajectories. Then, sinograms were obtained as shown in Figure 1. The number of spokes per phase is . The dimension of sinograms is . In this simulation, the cardiac motion is discrete; thus, no spoke-sharing strategy is applied.
Golden-angle Reconstruction of Fetal Cardiac Motion Fetal cardiac MRI data were acquired on a 1.5 T clinical MR scanner (MAGNETOM Aera, Siemens AG, Healthcare Sector, Erlangen, Germany) with an 18-channel body array coil and a 32-channel spine coil for signal reception. Images were acquired with an untriggered continuous 2D bSSFP sequence that was modified to acquire radial readouts with a golden-angle trajectory [chaptinel2017fetal]. The acquisition parameters were: FOV = 260 × 260 , acquisition matrix size = 256 × 256 pixels, slice thickness = 4.0 mm, TE/TR = 1.99/4.1 ms, RF excitation angle = 70°, radial readouts = 1600, acquisition time = 6.7 s, and bandwidth = 1028 Hz/pixel.
We use an Intel i7-7820X (3.60 GHz) CPU and an Nvidia Titan X (Pascal) GPU. PyTorch 1.0.0 on Python 3.6 is used to implement our generative model (we shall provide a link to the repository upon paper acceptance). All experiments are performed in single-batch mode. The input size is . The cost function used to train our generative network is (11). The learning rate is , with Adam [kingma2014adam] as optimizer.
Retrospective Simulation The number of iterations is . The scheduling for the learning rate is multiplier per iterations. We set .
Golden-angle Reconstruction of Fetal Cardiac Motion The number of iterations is . No scheduling for the learning rate is applied. We used cycles of latent variables. We set to obtain the results shown in Figure 6.
We use the regressed signal-to-noise ratio (RSNR) as a quantitative metric in our retrospective simulations. With $\mathbf{s}$ the oracle and $\hat{\mathbf{s}}$ the reconstructed image, RSNR is given by
$$\mathrm{RSNR}=\max_{a\in\mathbb{R}}\;20\log_{10}\!\left(\frac{\|\mathbf{s}\|_{2}}{\|\mathbf{s}-a\,\hat{\mathbf{s}}\|_{2}}\right),$$
where a higher RSNR corresponds to a better reconstruction.
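One common way to compute a regressed SNR is to fit the scale of the reconstruction to the oracle by least squares before measuring the error; the exact regression used in our experiments may include additional terms (e.g., an offset), so this sketch is indicative only.

```python
import numpy as np

def rsnr(oracle, recon):
    """Regressed SNR in dB: the reconstruction is first rescaled by a
    least-squares fit to the oracle, making the metric insensitive to a
    global intensity scaling."""
    s = oracle.ravel()
    x = recon.ravel()
    a = (s @ x) / (x @ x)  # closed-form least-squares scale
    return 20.0 * np.log10(np.linalg.norm(s) / np.linalg.norm(s - a * x))
```

By construction, multiplying the reconstruction by any nonzero constant leaves the RSNR unchanged.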
For real datasets, a CS approach with self-gated signals takes the role of a state-of-the-art baseline [yerly2016coronary, chaptinel2017fetal].
In this experiment, the acquisition process is simulated, which allows us to build the ground truth from a fully sampled k-space. We use spokes for the reconstruction and present the results in Figure 4. There, the bandpass method (BP) corresponds to a zero-filled DFT while GRASP is the baseline against which we compare the performance of our method. We see in Figure 4 (A) that GRASP leads to blurring artifacts, while the residual map discloses the occurrence of errors around the wall of the heart. By contrast, our proposed method gives better results. This is confirmed in Figure 4 (B), where the cardiac motions are captured better by our model than by GRASP. The systolic phase of our reconstruction is well described and very close to the ground-truth, whereas the systolic phase captured by GRASP is too flat.
In this experiment, the acquisition process is real, so that no ground truth is available. We use spokes for most reconstructions ( for GRASP) and present the results in Figure 5. The synthesis of the sequence of latent variables is explained in Section IV-A. The overlapped method (OV) corresponds to frames generated from all spokes, while the reordered method (RD) reconstructs multiple cardiac phases with k-t sparse SENSE after reordering the spokes according to self-gated signals [yerly2016coronary, chaptinel2017fetal]. Self-gating, driven by correlation coefficients between approximately reconstructed images, is necessary because it is impractical to capture the electrocardiogram signal of a fetal heart, which would otherwise allow one to reorder the golden-angle spokes in accordance with their cardiac phase.
In the absence of ground truth, we shall take OV to be the reference image for navigation purposes. However, because OV considers all spokes simultaneously, it is a static image that is of high quality only in regions that are not moving. We see in Figure 5 (A) that BP and GRASP are noisier than RD and our method. Comparing now RD to our approach, we see that ours produces better-resolved features, particularly for the hyper-intense dot-like structures.
In Figure 5 (B), it becomes apparent that BP fails altogether to capture the fetal cardiac beats, while GRASP is less noisy but still mostly fails to reconstruct motions. RD fares better; unfortunately, its reordering process may superpose in the same frame spokes that in fact belong to different cardiac cycles. By contrast, our method reconstructs each frame with data from just a few neighboring spokes, thus avoiding the mingling of different cycles. (For display purposes, we show only one cycle of our cross section. In fact, our reconstructed data have as many frames () as there are spokes, by reason of the spoke-sharing mechanism of Section II-B.) Consequently, we expect our reconstructed systolic phase to capture well the true motion of the heart. The cross section from our method is similar to that of RD, but the motion is smoother in our case, which is the expected behavior of a beating heart.
We provide in Figure 6 (A) our complete reconstructed sequence of cardiac cycles, in the form of a (y-t) cross section. The quasi-periodicity of the cardiac motion is clearly visible along the temporal axis, while motion variations can be discerned from cycle to cycle. In Figure 6 (B), we also explore over the image and latent domains the structure of the manifold of the cardiac motion. The visualization proceeds through a t-SNE embedding [maaten2008visualizing]. We observe that the continuous trajectory for the image manifold is well aligned with the temporal index, while the input latent variables lie on a smooth manifold in the latent space. The latent variables that correspond to Figure 6 (B) have been cut into fourteen chunks; the reason for this is explained in Section IV-A. The importance of smoothness in the latent variables is described in Section IV-B.
Letting the temporal range in Section II-C be such that truly all data in Figure 6 (A) were taken jointly, and interpolating the latent variables between the only two endpoint realizations $\mathbf{z}_{0}$ and $\mathbf{z}_{1}$, we observed that the reconstruction of the fetal cardiac motion took a constant, time-independent value. We surmise that this failure is due to the overly strong presence of non-periodic components in the data. To cope with them, we adapted our scheme slightly and proceeded by piecewise interpolation, the pieces being made of temporal chunks in the latent space. More precisely, we generated fourteen realizations of the latent variables, equi-spaced in time; then, instead of building each intermediate latent variable as a linear combination of $\mathbf{z}_{0}$ and $\mathbf{z}_{1}$, we built it as a linear combination of the two realizations $\mathbf{z}_{k}$ and $\mathbf{z}_{k+1}$ that bracket it, with an appropriate chunk index $k$ that depends on time. Note that, while the latent variables evolve now chunk-wise, the network is still time-independent and trained over all data jointly. The chunk boundaries are made visible in Figure 6 (B).
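The piecewise-linear (chunked) interpolation through several equi-spaced anchor realizations can be sketched as follows; the function name and argument layout are our own.

```python
import numpy as np

def chunked_latents(anchors, num_frames):
    """Piecewise-linear interpolation through equi-spaced anchor realizations
    (rows of `anchors`), generalizing two-endpoint interpolation to several
    temporal chunks."""
    anchors = np.asarray(anchors)
    num_chunks = len(anchors) - 1
    t = np.linspace(0.0, num_chunks, num_frames)    # continuous chunk coordinate
    k = np.minimum(t.astype(int), num_chunks - 1)   # chunk index of each frame
    a = t - k                                       # position inside the chunk
    return (1 - a)[:, None] * anchors[k] + a[:, None] * anchors[k + 1]
```

With two anchors this reduces to the plain interpolation of Figure 3; with fourteen anchors it yields the chunk-wise latent trajectory used for the fetal data.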
To explore the importance of smoothness in the manifold of latent variables, we have trained separately two generative networks, each with a different configuration for the latent variables. On one hand, we have considered independent realizations of random latent variables at each frame, which resulted in an uncorrelated, non-smooth latent spatio-temporal manifold. On the other hand, we have followed the interpolation procedure of Figure 3. After examination of the outcome shown in Figure 7, we conclude that the non-smooth manifold fails to reconstruct dynamic images, while our proposed approach succeeds.
Until now, we have fixed the size of the latent variables as , which corresponds to . In this section instead, we let the size vary and report the resulting RSNR in Table II, where we observe an acceptable quality of reconstruction for latent variables of size ranging between and . For larger sizes, the reconstructions are corrupted by significant artifacts. We expected the smallest size to reflect the fact that time is a one-dimensional variable. Yet, the learning process failed to converge to a desirable solution in this case, the corresponding sequence of reconstructed images taking the constant, time-independent value shown in the last row of Figure 8. In conclusion, this experiment suggests that the chosen size provides a good tradeoff between the dimension of the manifold spanned by the latent variables and the convergence issues inherent to any optimization procedure.
In this section, we explore various scenarios beyond linear interpolation, which we summarize in Figure 8.
Temporal Interpolation In the first scenario, we consider , which corresponds to a fine interpolation of the intermediate latent variables. This gives us access to temporally interpolated images at the output of our generative network.
Temporal Extrapolation In the second scenario, we extrapolate the latent variables outside of the temporal range that was available for learning. This gives distorted images.
New Latent Variables In the third scenario, we make use of random realizations of latent variables that differ from those used while learning. This gives severely perturbed images.
Perturbations In the fourth scenario, we perturb the latent variables by adding uniform noise whose energy amounts to . This setting outputs clean images.
Scalar Latent Variables In the fifth and last scenario, we deploy scalar latent variables. This results in a time-independent, non-moving sequence of images.
In a CS context, the gradient updates of the iterative optimization process require one to allocate enough memory to hold the whole target reconstruction volume. For example, the reconstruction of frames with spatial size would need one to handle data of size , which demands over a gigabyte of memory.
By contrast, our approach is much less memory-hungry. It optimizes the generative network using batches, which requires the simultaneous handling of only those frames that correspond to the batch size. In short, the fact that our proposed approach handles a few 2D images at a time, whereas CS handles an extended (2D + t) sequence, leads to substantial savings, particularly for golden-angle dynamic MRI with many frames. In our approach, we only need to save a generative model; for example, its memory footprint for the spatial size is about half a dozen megabytes. This cost is negligible compared to that of the CS approach.
Several competing methods aim at synthesizing a single cardiac cycle out of spokes collected over several cycles at various cardiac phases. There, a synchronized ECG could in principle allow one to associate a phase to each spoke; however, the deployment of this ancillary ECG would make the practical apparatus more involved, which is often undesirable. Furthermore, there are applications such as fetal cardiology where no ECG can be deployed at all. In traditional ECG-free approaches instead, one deploys self-gating methods for the purpose of phase assignment. They proceed on the basis of either human inspection or heuristic decisions, which makes them arduous, non-reproducible, and prone to errors. (Sometimes, the assignment is no more advanced than a simple manual sorting.) One specific additional difficulty that self-gating methods must deal with originates with the necessary Cartesian-to-polar conversion inherent in radial sampling trajectories, which ultimately results in streaking artifacts that tend to confound phase assignments, particularly those based on visual assessments in the spatial domain.
By contrast, in our proposed approach we relax the hypothesis of ideal periodicity. As presented in Section IV-A, this allows us to take into consideration cycle-to-cycle variations, thus providing access to clinical insights that are not available with traditional accelerated methods. Moreover, our ECG-free approach has the major benefit that no phase assignment is needed, thus providing access to a continuously defined time variable. Being fully automated, our approach is also reproducible. Another advantage is that the streaking artifacts associated to radial trajectories play no role since the reference data used for training in (10) or (11) are the spokes themselves, as opposed to reconstructed images. Finally, when compared to CS, our approach achieves better reconstruction, with a gain of , as seen in Figure 4. It also leads to a simpler optimization task with fewer hyper-parameters. For instance, k-t SENSE requires three interdependent hyper-parameters whose optimal value is found only after some substantial grid-search effort, while the two hyper-parameters of our approach are easier to interpret since they trivially consist of just an initial learning rate, along with a number of iterations.
One major limitation of our proposed approach is its computational complexity. At times, no fewer than iterations are required before convergence is observed. This imposes a large computational burden since every iteration involves both forward and adjoint operations. From a technological perspective, we implemented the forward model in Python 3.6, with PyTorch (v1.0.0) as main library. Unfortunately, NuFFT is currently optimized neither for Python nor for GPU usage, which leads to a marked slowdown of our implementation. For instance, our method spent a whole day letting one GPU (Titan X) process the cardiac dataset presented in Section II-E2, whereas GRASP terminated within ten minutes. This bottleneck could be alleviated by a better integration of the NuFFT library.
In our future work, we want to explore the dynamics of the latent variables. In this paper indeed, we simply took advantage of linear interpolation to build trivial intermediate states of the latent variables; however, refined approaches may better capture the temporal correlation between frames.
We propose to combine unsupervised learning with a generative network model. There, the latent variables are interpolated through time before being input to our generative network. This framework fits well the learning of spatio-temporal manifolds that are smooth temporally; it is purely unsupervised. It is also particularly appropriate in the context of inverse problems where no ground-truth is available. In practice, it results in significant memory savings when compared to compressed-sensing (CS) approaches. Our study shows that our proposed method has the potential to reconstruct dynamic magnetic resonance images (MRI) in the absence of an electrocardiogram signal. Quality-wise, it typically outperforms state-of-the-art CS approaches. To the best of our knowledge, ours is the first approach of unsupervised learning in accelerated dynamic MRI.
The authors would like to thank Prof. Jong Chul Ye at KAIST for providing the bSSFP cardiac MRI k-space data set (retrospective dataset).