I Introduction
Magnetic Resonance Imaging (MRI) is a noninvasive imaging technique which offers excellent spatial resolution and soft tissue contrast and is widely used for clinical diagnosis and research. Dynamic MRI attempts to reveal both spatial and temporal profiles of the underlying anatomy, which has a variety of applications such as cardiovascular imaging and perfusion imaging. However, the acquisition speed is fundamentally limited due to both hardware and physiological constraints as well as the requirement to satisfy the Nyquist sampling rate. Long acquisition times are not only a burden for patients but also make MRI susceptible to motion artefacts.
In order to accelerate MRI acquisition, most approaches consider undersampling the data in space (frequency domain). Due to the violation of the Nyquist sampling theorem, undersampling introduces aliasing artefacts in the image domain. Images can be subsequently reconstructed by solving an optimisation problem that regularises the solution with assumptions on the underlying data, such as smoothness, sparsity or, for the case of dynamic imaging, spatiotemporal redundancy. Past literature has shown that exploiting spatiotemporal redundancy can greatly improve image reconstruction quality compared to compressed sensing (CS) based single frame reconstruction methods [1, 2]. However, the challenges of these optimisation based approaches are the following: firstly, the regularisation functions and their hyperparameters must be carefully selected, which are problemspecific and nontrivial. For example, overimposing sparsity or penalties can lead to cartoonlike/staircase artefacts. Secondly, the reconstruction speeds of these methods are often slow due to requirement to solve iterative algorithms. Proposing a robust iterative algorithm is still an active area of research.
In comparison, deep learning methods are gaining popularity for their accuracy and efficiency. Unlike traditional approaches, the prior information and regularisation are learnt implicitly from data, without having to specify them in the training objective. However, so far only a handful of approaches exist [3, 4] for dynamic reconstruction. Hence, the applicability of deep learning models to this problem is yet to be fully explored. In addition, many proposed deep learning architectures are often generic and are not optimised for specific applications. In particular, a core question for dynamic reconstruction is how to optimally exploit spatiotemporal redundancy. By designing a network architecture and regulating the mechanics of network layers to efficiently learn such spatiotemporal data representation, the network should gain a boost in performances.
In this work, we propose a novel convolutional recurrent neural network (CRNN) method to reconstruct high quality dynamic MR image sequences from undersampled data, termed CRNNMRI. Firstly, we formulate a general optimisation problem for solving accelerated dynamic MRI based on variable splitting and alternate minimisation. We then show how this algorithm can be seen as a network architecture. In particular, the proposed method consists of a CRNN block which acts as the proximal operator and a data consistency layer corresponding to the classical data fidelity term. In addition, the CRNN block employs recurrent connections across each iteration step, allowing reconstruction information to be shared across the multiple iterations of the process. Secondly, we incorporate bidirectional convolutional recurrent units evolving over time to exploit the temporal dependency of the dynamic sequences and effectively propagate the contextual information across time frames of the input. As a consequence, the unique CRNN architecture jointly learns representations in a recurrent fashion evolving over both time sequences as well as iterations of the reconstruction process, effectively combining the benefits of traditional iterative methods and deep learning.
To the best of our knowledge, this is the first work applying RNNs for dynamic MRI reconstruction. The contributions of this work are the following: Firstly, we view the optimisation problem of dynamic data as a recurrent network and describe a novel CRNN architecture which simultaneously incorporates the recurrence existing in both temporal and iteration sequential steps. Secondly, we demonstrate that the proposed method shows promising results and improves upon the current stateoftheart dynamic MR reconstruction methods both in reconstruction accuracy and speed. Finally, we compare our architecture to 3D CNN which does not impose the recurrent structure. We show that the proposed method outperforms the CNN at different undersampling rates and speed, while requiring significantly fewer parameters.
Ii Related Work
One of the main challenges associated with recovering an uncorrupted image is that both the undersampling strategy and apriori knowledge of appropriate properties of the image need to be taken into account. Methods like kt BLAST and kt SENSE [5] take advantage of apriori information about the xf support obtained from the training data set in order to prune a reconstruction to optimally reduce aliasing. An alternative popular approach is to exploit temporal redundancy to unravel from the aliasing by using CS approaches [1, 6] or CS combined with lowrank approaches [2, 7]. The class of methods which employ CS to the MRI reconstruction is termed as CSMRI [8]. They assume that the image to be reconstructed has a sparse representation in a certain transform domain, and they need to balance sparsity in the transform domain against consistency with the acquired undersampled kspace data. For instance, an example of successful methods enforcing sparsity in xf domain is kt FOCUSS [1]. A low rank and sparse reconstruction scheme (kt SLR) [2] introduces nonconvex spectral norms and uses a spatiotemporal total variation norm in recovering the dynamic signal matrix. Dictionary learning approaches were also proposed to train an overcomplete basis of atoms to optimally sparsify spatiotemporal data [6]
. These methods offer great potential for accelerated imaging, however, they often impose strong assumptions on the underlying data, requiring nontrivial manual adjustments of hyperparameters depending on the application. In addition, it has been observed that these methods tend to result in blocky
[9] and unnatural reconstructions, and their reconstruction speed is often slow. Furthermore, these methods are not able to exploit the prior knowledge that can be learnt from the vast number of MRI exams routinely performed, which should be helpful to further guide the reconstruction process.Recently, deep learningbased MR reconstruction has gained popularity due to its promising results for solving inverse and compressed sensing problems. In particular, two paradigms have emerged: the first class of approaches proposes to use convolutional neural networks (CNNs) to learn an endtoend mapping, where architectures such as SRCNN [10] or Unet [11] are often chosen for MR image reconstruction [12, 13, 14, 15]. The second class of approaches attempts to make each stage of iterative optimisation learnable by unrolling the endtoend pipeline into a deep network [9, 16, 17, 18, 19]. For instance, Hammernik et al. [9] introduced a trainable formulation for accelerated parallel imaging (PI) based MRI reconstruction termed variational network, which embedded a CS concept within a deep learning approach. ADMMNet [17] was proposed by reformulating an alternating direction method of multipliers (ADMM) algorithm to a deep network, where each stage of the architecture corresponds to an iteration in the ADMM algorithm. More recently, Schlemper et al. [18] proposed a cascade network which simulated the iterative reconstruction of dictionary learningbased methods and were later extended for dynamic MR reconstructions [3]. Most approaches so far have focused on 2D images, whereas only a few approaches exist for dynamic MR reconstruction [3, 4]. While they show promising results, the optimal architecture, training scheme and configuration spaces are yet to be fully explored.
More recently, several methods on 2D MR image reconstruction were proposed [9, 20, 21], which share similar idea with our proposed method that integrates data fidelity term and regularisation term into a single deep network so that to enable the endtoend training. In contrast to these methods which use shared parameters over iterations, as we will show, our architecture integrates hidden connections over optimisation iterations to propagate learnt representations across both iteration and time, whereas such information is discarded in the other methods. Such proposed architecture enables the information used for the reconstruction at each iteration to be shared across all stages of the reconstruction process, aiming for an iterative algorithm that can fully benefit from information extracted at all processing stages. As to the nature of the proposed RNN units, previous work involving RNNs only updated the hidden state of the recurrent connection with a fixed input [22, 23, 24], while the proposed architecture progressively updates the input as the optimisation iteration increases. In addition, previous work only modelled the recurrence of iteration or time [25] exclusively, whereas the proposed method jointly exploits both dimensions, yielding a unique architecture suitable for the dynamic reconstruction problem.
Iii Convolutional Recurrent Neural Network for MRI reconstruction
Iiia Problem Formulation
Let
denote a sequence of complexvalued MR images to be reconstructed, represented as a vector with
, and let represent the undersampled kspace measurements, where and are width and height of the frame respectively and stands for the number of frames. Our problem is to reconstruct from , which is commonly formulated as an unconstrained optimisation problem of the form:(1) 
Here is an undersampling Fourier encoding matrix, expresses regularisation terms on and allows the adjustment of data fidelity based on the noise level of the acquired measurements . For CS and lowrank based approaches, the regularisation terms often employed are or norms in the sparsifying domain of as well as the rank or nuclear norm of respectively. In general, Eq. 1 is a nonconvex function and hence, the variable splitting technique is usually adopted to decouple the fidelity term and the regularisation term. By introducing an auxiliary variable that is constrained to be equal to , Eq. 1 can be reformulated to minimize the following cost function via the penalty method:
(2) 
where is a penalty parameter. By applying alternate minimisation over and , Eq. 2 can be solved via the following iterative procedures:
(3a)  
(3b) 
where is the zerofilled reconstruction taken as an initialisation and can be seen as an intermediate state of the optimisation process. For MRI reconstruction, Eq. 3b is often regarded as a data consistency (DC) step where we can obtain the following closedform solution [18]:
(4) 
in which
is the full Fourier encoding matrix (a discrete Fourier transform in this case),
is a ratio of regularization parameters from Eq. 4, is an index set of the acquired space samples and is a diagonal matrix. Please refer to [18] for more details of formulating Eq. 4 as a data consistency layer in a neural network. Eq. 3a is the proximal operator of the prior , and instead of explicitly determining the form of the regularisation term, we propose to directly learn the proximal operator by using a convolutional recurrent neural network (CRNN).Previous deep learning approaches such as DeepADMM net [17] and method proposed by Schlemper et al. [18] unroll the traditional optimisation algorithm. Hence, their models learn a sequence of transition to reconstruct the image, where each state transition at stage is an operation such as convolutions independently parameterised by , nonlinearities or a data consistency step. However, since the network implicitly learns some form of proximal operator at each iteration, it may be redundant to individually parameterise each step. In our formulation, we model each optimisation stage as a learnt, recurrent, forward encoding step . The difference is that now we use one model which performs proximal operator, however, it also allows itself to propagate information across iteration, making it adaptable for the changes across the optimisation steps. The detail will be discussed in the following section. The different strategies are illustrated in Fig 1.
IiiB CRNN for MRI reconstruction
RNN is a class of neural networks that makes use of sequential information to process sequences of inputs. They maintain an internal state of the network acting as a "memory", which allows RNNs to naturally lend themselves to the processing of sequential data. Inspired by iterative optimisation schemes of Eq. 3, we propose a novel convolutional RNN (CRNN) network. In the most general scope, our neural encoding model is defined as follows,
(5) 
in which denotes the prediction of the network, is the sequence of undersampled images with length and also the input of the network, is the network function for each iteration of optimisation step, and is the number of iterations. We can compactly represent a single iteration of our network as follows:
(6a)  
(6b) 
where CRNN is a learnable block explained hereafter, DC is the data consistency step treated as a network layer, is the progressive reconstruction of the undersampled image at iteration with , is the intermediate reconstruction image before the DC layer, and is the acquired kspace samples. Note that the variables are analogous to in Eq. 3 respectively. Here, we use CRNN to encode the update step, which can be seen as one step of a gradient descent in the sense of objective minimisation, or a more general approximation function regressing the difference , i.e. the distance required to move to the next state. Moreover, note that in every iteration, CRNN updates its internal state given an input which is discussed shortly. As such, CRNN also allows information to be propagated efficiently across iterations, in contrast to the sequential models using CNNs which collapse the intermediate feature representation to .
In order to exploit the dynamic nature and the temporal redundancy of our data, we further propose to jointly model the recurrence evolving over time for dynamic MRI reconstruction. The proposed CRNNMRI network and CRNN block are shown in Fig. 2(a), in which CRNN block comprised of 5 components:

bidirectional convolutional recurrent units evolving over time and iterations (BCRNNti),

convolutional recurrent units evolving over iterations only (CRNNi),

2D convolutional neural network (CNN),

DC layers.
We introduce details of the components of our network in the following subsections.
IiiB1 CRNNi
As aforementioned, we encapsulate the iterative optimisation procedures explicitly with RNNs. In the CRNNi unit, the iteration step is viewed as the sequential step in the vanilla RNN. If the network is unfolded over the iteration dimension, the network can be illustrated as in Fig. 2(b), where information is propagated between iterations. Here we use to denote the feature representation of our sequence of frames throughout the network. denotes the representation at layer (subscript) and iteration step (superscript). Therefore, at iteration , given the input and the previous iteration’s hidden state , the hidden state at layer of a CRNNi unit can be formulated as:
(7) 
Here represents convolution operation, and represent the filters of inputtohidden convolutions and hiddentohidden recurrent convolutions evolving over iterations respectively, and represents a bias term. Here is the representation of the whole sequence with shape (), where
is the number of channels which is 2 at the input and output but is greater while processing inside the network, and the convolutions are computed on the last two dimensions. The latent features are activated by the rectifier linear unit (ReLU) as a choice of nonlinearity, i.e.
.The CRNNi unit offers several advantages compared to independently unrolling convolutional filters at each stage. Firstly, compared to CNNs where the latent representation from the previous state is not propagated, the hiddentohidden iteration connections in CRNNi units allow contextual spatial information gathered at previous iterations to be passed to the future iterations. This enables the reconstruction step at each iteration to be optimised not only based on the output image but also based on the hidden features from previous iterations, where the hidden connection convolutions can "memorise" the useful features to avoid redundant computation. Secondly, as the iteration number increases, the effective receptive field of a CRNNi unit in the spatial domain also expands whereas CNN resets it at each iteration. This property allows the network to further improve the reconstruction by allowing it to have better contextual support. In addition, since the weight parameters are shared across iterations, it greatly reduces the number of parameters compared to CNNs, potentially offering better generalization properties.
In this work, we use a vanilla RNN [26]
to model the recurrence due to its simplicity. Note this can be naturally generalised to other RNN units, such as long shortterm memory (LSTM) and gated recurrent unit (GRU), which are considered to have better memory properties, although using these units would significantly increase computational complexity.
IiiB2 BCRNNti
Dynamic MR images exhibit high temporal redundancy, which is often exploited as apriori knowledge to regularise the reconstruction. Hence, it is also beneficial for the network to learn the dynamics of sequences. To this extent, we propose a bidirectional convolutional recurrent unit (BCRNNti) to exploit both temporal and iteration dependencies jointly. BCRNNti includes three convolution layers: one on the input which comes into the unit from the previous layer indicated by the green arrows in Fig. 2(c), one on the hidden state from the past and future time frames as shown by the red arrows, and the one on the hidden state from the previous iteration of the unit (blue arrows in Fig. 2(c)). Note that we simultaneously consider temporal dependencies from past and future time frames, and the encoding weights are shared for both directions. The output for the BCRNNti layer is obtained by summing the feature maps learned from both directions. The illustration figure of the unit when it is unfolded over time sequence is shown in Fig. 2(c).
As we need to propagate information along temporal dimensions in this unit, here we introduce an additional index in the notation to represent the variables related with time frame . Here represents feature representations at th layer, time frame , and at iteration , denotes the representations calculated when information is propagated forward inside the BCRNNti unit, and similarly, denotes the one in the backward direction. Therefore, for the formulation of BCRNNti unit, given (1) the current input representation of the th layer at time frame and iteration step , which is the output representation from ()th layer
, (2) the previous iteration’s hidden representation within the same layer
, (3) the hidden representation of the past time frame , and the hidden representation of the future time frame , then the hidden state representation of the current th layer of time frame at iteration , with shape , can be formulated as:(8)  
Similar to the notation in Section IIIB1, represents the filters of recurrent convolutions evolving over time. When and , , that is the th frame of undersampled input data, and when and , , which stands for the th frame of the intermediate reconstruction result from iteration . For , and , they are set to be zero initial hidden states.
The temporal connections of BCRNNti allow information to be propagated across the whole time frames, enabling it to learn the differences and correlations of successive frames. The filter responses of recurrent convolutions evolving over time express dynamic changing biases, which focus on modelling the temporal changes across frames, while the filter responses of recurrent convolutions over iterations focus on learning the spatial refinement across consecutive iteration steps. In addition, we note that learning recurrent layers along the temporal direction is different to using 3D convolution along the space and temporal direction. 3D convolution seeks invariant features across spacetime, hence several layers of 3D convolutions are required before the information from the whole sequence can be propagated to a particular time frame. On the other hand, learning recurrent 2D convolutions enables the model to easily and efficiently propagate the information through time, which also yields fewer parameters and a lower computational cost.
In summary, the set of hidden states for a CRNN block to update at iteration is , for and , where is the total number of layers in the CRNN block and is the total number of time frames.
IiiC Network Learning
Given the training data of inputtarget pairs , the network learning proceeds by minimizing the pixelwise mean squared error (MSE) between the predicted reconstructed MR image and the fully sampled ground truth data:
(9) 
where , , and stands for the number of samples in the training set . Note that the total number of time sequences and iteration steps assumed by the network before performing the reconstruction is a free parameter that must be specified in advance. The network weights were initialised using He initialization [27] and it was trained using the Adam optimiser [28]. During training, gradients were hardclipped to the range of
to mitigate the gradient explosion problem. The network was implemented in Python using Theano and Lasagne libraries.
Iv Experiments
Iva Dataset and Implementation Details
The proposed method was evaluated using a complexvalued MR dataset consisting of 10 fully sampled shortaxis cardiac cine MR scans. Each scan contains a single slice SSFP acquisition with 30 temporal frames, which have a mm field of view and 10 mm thickness. The raw data consists of 32channel data with sampling matrix size , which was then zerofilled to the matrix size . The raw multicoil data was reconstructed using SENSE [29] with no undersampling and retrospective gating. Coil sensitivity maps were normalized to a body coil image and used to produce a single complexvalued reconstructed image. In experiments, the complex valued images were backtransformed to regenerate kspace samples, simulating a fully sampled singlecoil acquisition. The input undersampled image sequences were generated by randomly undersampling the kspace samples using Cartesian undersampling masks, with undersampling patterns adopted from [1]
: for each frame the eight lowest spatial frequencies were acquired, and the sampling probability of
space lines along the phaseencoding direction was determined by a zeromean Gaussian distribution. Note that the undersampling rates are stated with respect to the matrix size of raw data, which is
.The architecture of the proposed network used in the experiment is shown in Fig. 2: each iteration of the CRNN block contains five units: one layer of BCRNNti, followed by three layers of CRNNi units, and followed by a CNN unit. For all CRNNi and BCRNNti units, we used a kernel size and the number of filters was set to for ProposedA and for ProposedB in Table I. The CNN after the CRNNi units contains one convolution layer with and
, which projects the extracted representation back to the image domain which contains complexvalued images expressed using two channels. For all convolutional layers, we used stride
and paddings with half the filter size (rounded down) on both size. The output of the CRNN block is connected to the residual connection, which sums the output of the block with its input. Finally, we used DC layers on top of the CRNN output layers. During training, the iteration step is set to be
, and the time sequence for training is . Note that this architecture is by no means optimal and more layers can be added to increase the ability of our network to better capture the data structures (see Section IVD for comparisons).The evaluation was done via a 3fold cross validation, where for two folds we train on 7 subjects then test on 3 subjects, and for the remaining fold we train on 6 subjects and test on 4 subjects. While the original sequence has size , For the training, we extract patches of size , where is the patch size and the direction of patch extraction corresponds to the frequencyencoding direction. Note that since we only consider Cartesian undersampling, the aliasing occurs only along the phase encoding direction, so patch extraction does not alter the aliasing artefact. Patch extraction as well as data augmentation was performed onthefly, with random affine and elastic transformations on the image data. Undersampling masks were also generated randomly following patterns in [1] for each input. During test time, the network trained on patches is directly applied on the whole sequence of the original image. The minibatch size during the training was set to 1, and we observed that the performance can reach a plateau within backpropagations.
IvB Evaluation Method
We compared the proposed method with the representative algorithms of the CSbased dynamic MRI reconstruction, such as kt FOCUSS [1] and kt SLR [2], and two variants of 3D CNN networks named 3D CNNS and 3D CNN in our experiments. The built baseline 3D CNN networks share the same architecture with the proposed CRNNMRI network but all the recurrent units and 2D CNN units were replaced with 3D convolutional units, that is, in each iteration, the 3D CNN block contain 5 layers of 3D convolutions, one DC layer and a residual connection. Here 3D CNNS refers to network sharing weights across iterations, however, this does not employ the hiddentohidden connection as in the CRNNi unit. The 3D CNNS architecture was chosen so as to make a fair comparison with the proposed model using a comparable number of network parameters. In contrast, 3D CNN refers to the network without weight sharing, in which the network capacity is times of that of 3D CNNS, and approximately 12 times more than that of our first proposed method (ProposedA). For the 3D CNN approaches, the receptive field size is , as the receptive field size is “reset” after each data consistency layer. In contrast, for the proposed method, due to the hidden connections between iterations and bidirectional temporal connections, by tracing the longest path of the convolution layers involved in the forward pass, including both temporal and iterative directions, in theory, the receptive field size is (154 layers of CNNs for the middle frame in a sequence of 30 frames). However, the network still may predominantly relies on local features coming from the partial reconstruction. Nevertheless, the RNN has the ability to exploit the features with larger filter size if needed, which is not the case for 3D CNNs.
Reconstruction results were evaluated based on the following quantitative metrics: MSE, peaktonoiseratio (PSNR), structural similarity index (SSIM) [30] and high frequency error norm (HFEN) [31]. The choice of the these metrics was made to evaluate the reconstruction results with complimentary emphasis. MSE and PSNR were chosen to evaluate the overall accuracy of the reconstruction quality. SSIM put emphasis on image quality perception. HFEN was used to quantify the quality of the fine features and edges in the reconstructions, and here we employed the same filter specification as in [31, 32] with the filter kernel size
pixels and a standard deviation of 1.5 pixels. For PSNR and SSIM, it is the higher the better, while for MSE and HFEN, it is the lower the better.
Method  kt FOCUSS  kt SLR  3D CNNS  3D CNN  ProposedA  ProposedB  
Capacity      338,946  3,389,460  262,020  1,040,132  
MSE  0.592 (0.199)  0.371(0.155)  0.385 (0.124)  0.275 (0.096)  0.261 (0.097)  0.201 (0.074)  
PSNR  32.506 (1.516)  34.632 (1.761)  34.370 (1.526)  35.841 (1.470)  36.096 (1.539)  37.230 (1.559)  
SSIM  0.953 (0.040)  0.970 (0.033)  0.976 (0.008)  0.983 (0.005)  0.985 (0.004)  0.988 (0.003)  
HFEN  0.211 (0.021)  0.161 (0.016)  0.170 (0.009)  0.138 (0.013)  0.131 (0.013)  0.112 (0.010)  
MSE  1.234 (0.801)  0.846 (0.572)  0.929 (0.474)  0.605 (0.324)  0.516 (0.255)  0.405 (0.206)  
PSNR  29.721 (2.339)  31.409 (2.404)  30.838 (2.246)  32.694 (2.179)  33.281 (1.912)  34.379 (2.017)  
SSIM  0.922 (0.043)  0.951 (0.025)  0.950 (0.016)  0.968 (0.010)  0.972 (0.009)  0.979 (0.007)  
HFEN  0.310(0.041)  0.260 (0.034)  0.280 (0.034)  0.215 (0.021)  0.201 (0.025)  0.173 (0.021)  
MSE  1.909 (0.828)  1.237 (0.620)  1.472 (0.733)  0.742 (0.325)  0.688 (0.290)  0.610 (0.300)  
PSNR  27.593 (2.038)  29.577 (2.211)  28.803 (2.151)  31.695 (1.985)  31.986 (1.885)  32.575 (1.987)  
SSIM  0.880 (0.060)  0.924 (0.034)  0.925 (0.022)  0.960 (0.010)  0.964 (0.009)  0.968 (0.011)  
HFEN  0.390 (0.023)  0.327 (0.028)  0.363 (0.041)  0.257 (0.029)  0.248 (0.033)  0.227 (0.030)  
Time  15s  451s  8s  8s  3s  6s 
IvC Results
The comparison results of all methods are reported in Table I, where we evaluated the quantitative metrics, network capacity and reconstruction time. Numbers shown in Table I are mean values of corresponding metrics with standard deviation of different subjects in parenthesis. Bold numbers in Table I indicate the better performance of the proposed methods than the competing ones. Compared with the baseline method (kt FOCUSS and kt SLR), the proposed methods outperform them by a considerable margin at different acceleration rates. When compared with deep learning methods, note that the network capacity of ProposedA is comparable with that of 3D CNNS and the capacity of ProposeB is around one third of that of 3D CNN. Though their capacities are much smaller, both ProposedA and ProposedB outperform 3D CNNS and 3D CNN for all acceleration rates by a large margin, which shows the competitiveness and effectiveness of our method. In addition, we can see a substantial improvement of the reconstruction results on all acceleration rates and in all metrics when the number of network parameters is increased for the proposed method (ProposedB), and therefore we will only show the results from ProposedB in the following. The number of iterations used by the network at test time is set to be the same as the training stage, which is , however, if the iteration number is increased up to , it shows an improvement of 0.324dB on average. Fig. 3 shows the model’s performance varying with the number of iterations at test time. Similarly, visualization results of intermediate steps during the iterations of a reconstruction from 9 undersampling data are shown in Fig. 4, where we can observe the gradual improvement of the reconstruction quality from iteration step 1 to 10, which is consistent with the quantitative results as in Fig. 3.
A comparison of the visualization results of a reconstruction from acceleration is shown in Fig. 5 with the reconstructed images and their corresponding error maps from different reconstruction methods. As one can see, our proposed model (ProposedB) can produce more faithful reconstructions for those parts of the image around the myocardium where there are large temporal changes. This is reflected by the fact that RNNs effectively use a larger receptive field to capture the characteristics of aliasing seen within the anatomy. Their temporal profiles at are shown in Fig. 6. Similarly, one can see that the proposed model has overall much smaller error, faithfully modelling the dynamic data. It could be due to the fact that spatial and temporal features are learned separately in the proposed model while 3D CNN seeks invariant feature learning across space and time.
In terms of speed, the proposed RNNbased reconstruction is faster than the 3D CNN approaches because it only performs convolution along time once per iteration, removing the redundant 3D convolutions which are computationally expensive. Reconstruction time of 3D CNN and the proposed methods reported in Table I were calculated on a GPU GeForce GTX 1080, and the time for kt FOCUSS and kt SLR were calculated on CPU.
IvD Variations of Architecture
In this section we show additional experiments to investigate the variants of the proposed architecture. First, we study the effects of recurrence over iteration and time, separately and jointly. In this study, we performed experiments on data set with undersampling factor 9, and the number of iterations was set to be 2 in order to simplify and speed up the training. Results are shown in Table II, where we present the mean PSNR value via 3fold cross validation. To isolate the effects of both recurrence in the module, we proposed to remove one of the recurrence each time. By removing the recurrence over time, the network architecture degrades to 4 CRNNi + CNN layers, and it doesn’t exploit temporal information in this case. If the recurrence over iterations is removed, the network architecture then becomes BCRNNt + 4 CNN layers, without any hidden connections between iterations. Note that in all architectures, the last CNN layer only has 2 filters, which is used to simply aggregate the latent representation back to image space. Therefore, we employ a simple convolution layer for this. From Table II, it can be observed that by removing any of the recurrent connections, the performance becomes worse compared with the proposed architecture with both recurrence jointly. This indicate that both of these recurrence contribute to the learning of the reconstruction. In particular, it is also been observed that by removing the temporal recurrence, the network’s performance degrades greatly compared with the one removing the iteration recurrence. This can be explained that by removing the temporal recurrence, the problem degrades to a single frame reconstruction, while dynamic reconstruction has been proven to be much better than single frame reconstruction as there exists great temporal redundancies that can be exploited between frames.
Architectures  PSNR (dB) 

4 CRNNi + CNN (only iteration)  21.41 
BCRNNt + 4 CNN (only temporal)  26.62 
BCRNNti + 3 CRNNi + CNN (Proposed)  27.98 
In addition, we performed experiments on some other variants of the architecture, in particular, 4 layers of BCRNNti with one layer of CNN, which has the highest capacity amongst all different combinations. Here we set the number of iterations to be 10. It can be observed that by incorporating temporal recurrent connections over all layers does improve the results over ProposedA due to the more information propagated between frames. However, such design also increases the computations and more significantly, time required for training the network. Considering the trade off between performance and training time as well as the hardware constraints, we chose the particular design proposed. We agree that there could be more versions of the architectures that can lead to better performance and our particular design is by no means optimal. However, here we mainly aim to validate our proposed idea of exploiting both temporal and iterative reconstruction information for the problem, and the proposed architecture is satisfactory to show this.
Architectures  PSNR (dB)  FPT  BPT  Training Time 

4 BCRNNti + CNN  34.18  0.94s  5.97s  96h 
ProposedA  33.28  0.45s  1.39s  38h 
ProposedB  34.38  0.90s  2.59s  58h 
IvE Feature Map Analysis
In this section we study further whether the proposed architecture helps to obtain better feature representations. CRNN (ProposedA), 3DCNN and 3DCNNS all have the subnetworks composed of 5 units/layers with 64 channels for the first four, allowing us to directly compare the th layer of representations of the subnetworks for . From one test subject, we extract the feature representations of the subnetwork across 10 cascades/iterations. By treating each channel as a separate feature map, we obtain 640 feature maps for each layer aggregated across iteration. We use the cosine distance to compute the similarity between these activation maps for . If two feature maps are orthogonal, then and if two feature maps are linearly correlated, then . Geometrically, this supports the interpretation that if the cosine distance is small for all the feature map pairs, then the network is likely to be capturing diverse patterns. The result is summarised in Fig. 7, where the similarity measure is visualised as a matrix, as well as their distributions is plotted for each network.
We can see that for both , the layers from CRNN appears to have geometrically more orthogonal feature maps. One can also observe that in general, layer 1 has higher redundancy compared to layer 4. In particular, the diagonal yellow stripes can be observed for CNNS and CRNN, due to parametersharing for each cascade. This is not observed in 3DCNN, even though many features do have high similarity. In Fig. 8 we show examples of the feature maps from layer 4 (3rd CRNNi for CRNN, 4th convolution layers for 3DCNN and 3DCNNS) at iteration/cascade 10 of each network during the forward pass. We selected 16 feature maps out of 64 by firstly clustering them into 16 groups, and then randomly chose one feature map from each group to show as representative feature maps in Fig. 8
. These feature maps show the activations learned from different networks and is colourcoded (blue corresponds to low activation whereas red corresponds to high activation). We see that CRNN’s features look significantly different from CNN. In particular, one can observe that some are activated by the dynamic region, and some are particularly sensitive to regions around the left and/or right ventricle.
V Discussion
In this work, we have demonstrated that the presented network is capable of producing faithful image reconstructions from highly undersampled data, both in terms of various quantitative metrics as well as inspection of error maps. In contrast to unrolled deep network architectures proposed previously, we modelled the recurrent nature of the optimisation iteration using hidden representations with the ability to retain and propagate information across the optimisation steps. Compared with 3D CNN models, the proposed methods have a much lower network capacity but still have a higher accuracy, reflecting the effectiveness of our architecture. This is due to the ability of the proposed RNN units to increase the receptive field size while iteration steps increase, as well as to efficiently propagate information across the temporal direction. In fact, for accelerated imaging, higher undersampling factors significantly add aliasing to the initial zerofilled reconstruction, making the reconstruction more challenging. This suggests that while the 3D CNN possesses higher modelling capacity owing to its large number of parameters, it may not necessarily be an ideal architecture to perform dynamic MR reconstruction, presumably because the simple CNN is not as efficient as propagating the information across the whole sequence. Besides, for the 3D CNN approaches, it is also observed that it is not able to denoise the background region. This could be explained by the fact that 3D CNN only exploits local information due to the small receptive field size it used, while in contrast, the proposed CRNN improves the denoising of the background region because of its larger receptive field sizes.
Furthermore, when exploring the intermediate feature activations, we observed that the pairwise cosine distances for CRNN were smaller than those for the 3DCNNs. We speculate that this is because CRNN has hidden connections across the iterations allowing it to propagate information better and make the endtoend reconstruction process more dynamic, generating less redundant representations. On a contrary, 3DCNNs needs to rebuild the feature maps at every iteration, which is likely to increase repetitive computations. In addition, qualitatively, the activation map of CRNN showed high sensitivity to anatomical regions/dynamic regions. This is likely due to the fact that CRNN has increased receptive field size as well as temporal units, allowing the network to recognise larger/dynamic objects better. In CNNs, one can also observe that there are features activated by the myocardial regions, however, the activation is more homogeneous across the image, due to smaller receptive field size. This hints that CRNN can better capture high level information.
In this work, we modeled the recurrence using the relatively simple (vanilla) RNN architecture. For the future work, we will explore other recurrent units such as LSTM or GRU. As they are trained to explicitly select what to remember, they may allow the units to better control the flow of information and could reduce the number of iterations required for the network to generate highquality output. Also, incorporating recurrent redundancy in kspace domain into the proposed CRNNMRI network is likely to improve the result, and will form part of our future work. In addition, we have found that the majority of errors between the reconstructed image and the fully sampled image lie at the part where motion exists, indicating that motion exhibits a challenge for such dynamic sequence reconstruction. Thus it will be interesting to explore more efficient ways that can improve the reconstruction quality while faithfully preserving cardiac motion. Additionally, current analysis only considers a single coil setup. In the future, we will also aim at investigating such methods in a scenario where multiple coil data from parallel MR imaging can be used jointly for higher acceleration acquisition.
Vi Conclusion
Inspired by variable splitting and alternate minimisation strategies, we have presented an endtoend deep learning solution, CRNNMRI, for accelerated dynamic MRI reconstruction, with a forward, CRNN block implicitly learning iterative denoising interleaved by data consistency layers to enforce data fidelity. In particular, the CRNN architecture is composed of the proposed novel variants of convolutional recurrent unit which evolves over two dimensions: time and iterations. The proposed network is able to learn both the temporal dependency and the iterative reconstruction process effectively, and outperformed the other competing methods in terms of both reconstruction accuracy and speed for different undersampling rates.
References
 [1] H. Jung, J. C. Ye, and E. Y. Kim, “Improved k–t BLAST and k–t SENSE using FOCUSS,” Physics in medicine and biology, vol. 52, no. 11, p. 3201, 2007.
 [2] S. G. Lingala, Y. Hu, E. Dibella, and M. Jacob, “Accelerated dynamic MRI exploiting sparsity and lowrank structure: Kt SLR,” IEEE Transactions on Medical Imaging, vol. 30, no. 5, pp. 1042–1054, 2011.
 [3] J. Schlemper, J. Caballero, J. V. Hajnal, A. Price, and D. Rueckert, “A deep cascade of convolutional neural networks for dynamic MR image reconstruction,” IEEE Transactions on Medical Imaging, vol. 37, no. 2, 2018.
 [4] D. Batenkov, Y. Romano, and M. Elad, “On the globallocal dichotomy in sparsity modeling,” in Compressed Sensing and its Applications. Springer, 2017, pp. 1–53.
 [5] J. Tsao, P. Boesiger, and K. P. Pruessmann, “kt BLAST and kt SENSE: Dynamic MRI with high frame rate exploiting spatiotemporal correlations,” Magnetic Resonance in Medicine, vol. 50, no. 5, pp. 1031–1042, 2003.
 [6] J. Caballero, A. N. Price, D. Rueckert, and J. V. Hajnal, “Dictionary learning and time sparsity for dynamic MR data reconstruction,” IEEE Transactions on Medical Imaging, vol. 33, no. 4, pp. 979–994, 2014.
 [7] R. Otazo, E. Candès, and D. K. Sodickson, “Lowrank plus sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components,” Magnetic Resonance in Medicine, vol. 73, no. 3, pp. 1125–1136, 2015.
 [8] M. Lustig, D. L. Donoho, J. M. Santos, and J. M. Pauly, “Compressed sensing MRI,” IEEE signal processing magazine, vol. 25, no. 2, pp. 72–82, 2008.
 [9] K. Hammernik, T. Klatzer, E. Kobler, M. P. Recht, D. K. Sodickson, T. Pock, and F. Knoll, “Learning a variational network for reconstruction of accelerated MRI data,” Magnetic resonance in medicine, vol. 79, pp. 3055–3071, 2018.

[10]
C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image superresolution,” in
European Conference on Computer Vision
. Springer, 2014, pp. 184–199.  [11] O. Ronneberger, P. Fischer, and T. Brox, “Unet: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computerassisted intervention. Springer, 2015, pp. 234–241.
 [12] D. Lee, J. Yoo, and J. C. Ye, “Deep artifact learning for compressed sensing and parallel MRI,” arXiv preprint arXiv:1703.01120, 2017.
 [13] Y. S. Han, J. Yoo, and J. C. Ye, “Deep learning with domain adaptation for accelerated projection reconstruction MR,” arXiv preprint arXiv:1703.01135, 2017.
 [14] S. Wang, Z. Su, L. Ying, X. Peng, and D. Liang, “Exploiting deep convolutional neural network for fast magnetic resonance imaging,” in ISMRM 24th Annual Meeting and Exhibition, 2016.
 [15] S. Wang, N. Huang, T. Zhao, Y. Yang, L. Ying, and D. Liang, “1D Partial Fourier Parallel MR imaging with deep convolutional neural network,” in ISMRM 25th Annual Meeting and Exhibition, vol. 47, no. 6, 2017, pp. 2016–2017.
 [16] J. Adler and O. Öktem, “Learned primaldual reconstruction,” IEEE Transactions on Medical Imaging, 2018.
 [17] J. Sun, H. Li, Z. Xu et al., “Deep ADMMNet for compressive sensing MRI,” in Advances in Neural Information Processing Systems, 2016, pp. 10–18.
 [18] J. Schlemper, J. Caballero, J. V. Hajnal, A. Price, and D. Rueckert, “A deep cascade of convolutional neural networks for MR image reconstruction,” in International Conference on Information Processing in Medical Imaging, 2017, pp. 647–658.
 [19] J. Adler and O. Öktem, “Solving illposed inverse problems using iterative deep neural networks,” Inverse Problems, vol. 33, no. 2, p. 124007, 2017.
 [20] H. K. Aggarwal, M. P. Mani, and M. Jacob, “MoDL: Model based deep learning architecture for inverse problems,” arXiv preprint arXiv:1712.02862, 2017.
 [21] M. Mardani, H. Monajemi, V. Papyan, S. Vasanawala, D. Donoho, and J. Pauly, “Recurrent generative adversarial networks for proximal learning and automated compressive image recovery,” arXiv preprint arXiv:1711.10046, 2017.
 [22] K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra, “Draw: A recurrent neural network for image generation,” in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 1462–1471.

[23]
M. Liang and X. Hu, “Recurrent convolutional neural network for object
recognition,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, 2015, pp. 3367–3375.  [24] J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3668–3677.
 [25] Y. Huang, W. Wang, and L. Wang, “Bidirectional recurrent convolutional networks for multiframe superresolution,” in Advances in Neural Information Processing Systems, 2015, pp. 235–243.
 [26] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.

[27]
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing humanlevel performance on imagenet classification,” in
Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.  [28] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in The International Conference on Learning Representations (ICLR), 2015.
 [29] K. P. Pruessmann, M. Weiger, M. B. Scheidegger, P. Boesiger et al., “SENSE: sensitivity encoding for fast MRI,” Magnetic resonance in medicine, vol. 42, no. 5, pp. 952–962, 1999.
 [30] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
 [31] S. Ravishankar and Y. Bresler, “MR image reconstruction from highly undersampled kspace data by dictionary learning,” IEEE transactions on medical imaging, vol. 30, no. 5, pp. 1028–1041, 2011.
 [32] X. Miao, S. G. Lingala, Y. Guo, T. Jao, M. Usman, C. Prieto, and K. S. Nayak, “Accelerated cardiac cine MRI using locally low rank and finite difference constraints,” Magnetic resonance imaging, vol. 34, no. 6, pp. 707–714, 2016.