Owing to its non-invasive nature and superior soft-tissue contrast, magnetic resonance imaging (MRI) is becoming increasingly popular as an adjunct to pregnancy screening [10, 13, 12, 14]. The typical scanning time for a 2D MR stack that covers the whole fetus and placenta varies from 1 to 10 minutes, depending on the MR sequence, field-of-view (FoV) and slice thickness. Such long scanning times inevitably introduce motion artifacts caused by maternal breathing, organ deformation, and fetal movement.
In utero MR image reconstruction. To reconstruct high-quality 3D or 4D placental and in-utero MR images, accurately estimating the respiratory motion is a key step. The current state-of-the-art methods correct the through-plane motion using slice-to-volume registration (SVR) reconstruction pipelines [5, 9, 8, 3, 15]. These methods require three to nine 2D image stacks to be acquired in orthogonal orientations; a region-of-interest (ROI) mask then needs to be generated manually or automatically [3, 15]; finally, the 3D super-resolution (SR) image is generated iteratively based on the optimisation of the SVR results, robust statistics, intensity correction, and the estimated point spread function (PSF). Acquiring multiple 2D MR stacks in different orientations is time-consuming compared with a single-orientation acquisition and inevitably introduces motion artifacts. The registration methods in the SVR pipelines are rigid and cannot correct non-rigid respiratory motion. The SVR pipelines therefore need enough redundant 2D images to reject all the slices that are deformed relative to the reconstructed volume. The overall image reconstruction performance depends on the accuracy of the ROI masking, the registration, and the data redundancy.
Self-supervised learning is generally considered a subset of unsupervised learning, in which the extensive cost of manual annotation is avoided and replaced by supervisory signals or automatically generated labels. Whereas the popular supervised methods train the neural network on paired data and labels, self-supervised methods train on data paired with pseudo-labels, which are generated automatically without any human annotation. Several recent papers have explored the use of the temporal ordering of frames/images as a supervisory signal for complex video analysis [11, 4, 16]. In particular, Wei et al. explored detecting and learning the direction of time for action recognition and video forensics.
Contribution. In this work, we propose a respiratory-motion-resolved 4D (3D+t) reconstruction pipeline for a single-orientation stack of 2D MR slices, based on a bidirectional self-supervised recurrent neural network (RNN) for the identification of breathing states and an efficient modified balanced steady-state free precession (bSSFP) sequence with the SWEEP technique. The method does not require masking or registration. Our experimental results show that the SWEEP MR acquisition in combination with the proposed pipeline enables 4D (3D+t) SR reconstruction of abdominal and in-utero images, and outperforms SVR for 3D reconstruction while using less than 20% of the total slices for each respiratory state. To the best of our knowledge, this is the first successful application of a self-supervised network to image-driven respiratory motion estimation and 4D (3D+t) MR SR reconstruction in the medical imaging community.
2.1 Data acquisition
A stack of MR slices is acquired sequentially using a modified bSSFP SWEEP sequence, which allows fast acquisition of a large number of densely spaced, overlapping slices and thus provides sufficient information for local estimation of respiratory motion. SWEEP continuously shifts the radiofrequency excitation frequency so as to maintain a single stable signal state across a volume, negating the requirement for start-up cycles and resulting in a maximally efficient acquisition for dense slice sampling applications. The acquisition time per slice is 490 ms for the uterus scans and 442 ms for the kidney scans, which freezes nearly all in-plane respiratory motion. The total scan time depends on the total number of slices and ranges from 3 to 10 minutes. The sequence also minimises the effects of fetal motion by minimising the time between the acquisition of neighbouring slices while maintaining a high MR signal. This effectively removes the need for masking, as the data are locally consistent except for the respiratory motion.
2.2 The reconstruction pipeline
The reconstruction pipeline cascades a self-supervised RNN, which estimates the respiratory state of each slice, and a three-layer super-resolution (SR) neural network (SR-net), which reconstructs respiratory-state-specific 3D volumes using the respiratory state classes predicted by the RNN. The overall pipeline is summarised in Fig. 1.
2.3 Self-supervised RNN
Because slices are acquired sequentially with the SWEEP sequence, the respiratory signal is embedded in neighbouring slices along the direction of acquisition time. To separate the slices into different respiratory states, we train a bidirectional self-supervised RNN (SRNN).
We first generate a reference volume by 1D convolution with a Gaussian kernel along the acquisition axis (Z-axis). Intuitively, the reference volume is most similar to the average inhale and exhale states. We then calculate the normalised cross-correlation (NCC) between each slice in the motion-corrupted volume and the reference volume, and identify the average inhale and exhale states as the peaks of the NCC sequence. We then separate those two average states based on their temporal order. The remaining states are assigned linearly based on the distance between the average states. The approximate states determined automatically by this approach are then used to train the bidirectional RNN.
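The reference-volume and NCC steps described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the kernel width `sigma`, the edge padding, and the simple local-maximum rule are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalised 1D Gaussian kernel truncated at about 3 sigma (assumed width)."""
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_along_z(volume, sigma):
    """Smooth a (Z, H, W) volume along the acquisition axis only."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    padded = np.pad(volume, ((r, r), (0, 0), (0, 0)), mode="edge")
    out = np.zeros(volume.shape, dtype=float)
    for i, w in enumerate(k):
        out += w * padded[i:i + volume.shape[0]]
    return out

def ncc(a, b):
    """Normalised cross-correlation between two 2D slices."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def ncc_profile(volume, sigma=5.0):
    """NCC of every slice against the Z-blurred reference volume."""
    ref = blur_along_z(volume, sigma)
    return np.array([ncc(volume[z], ref[z]) for z in range(volume.shape[0])])

def local_maxima(signal):
    """Indices of simple local maxima; candidates for the average states."""
    s = np.asarray(signal)
    return [i for i in range(1, len(s) - 1)
            if s[i] > s[i - 1] and s[i] >= s[i + 1]]
```

Calling `local_maxima(ncc_profile(volume))` returns candidate slice indices for the average states. Assigning the intermediate states between consecutive peaks is left out, since the text specifies only that it is done linearly.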
Consider the input motion-corrupted MRI scan as a sequence of 2D images, where the slice number is equivalent to the arrow of time. To analyse temporal features, we use a bidirectional LSTM network to model the respiratory states that are naturally embedded in the neighbouring slices, so that both past and future events are used for prediction. Each LSTM unit computes a hidden vector sequence, a memory cell and an output vector sequence by iterating over the sequence in both directions, from the first time step to the last and from the last to the first. We built a three-layer bidirectional LSTM and set the total number of classes to 10 in the fully connected layer. In this work, we automatically annotate each slice with a respiratory state and then segment the volume into multiple 20-slice subvolumes with 1 slice overlap. The input of the SRNN is a cosine similarity matrix and the output is the prediction for the last slice.
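As an illustration of how the SRNN input could be prepared, the sketch below cuts a volume into 20-slice windows and computes a slice-by-slice cosine similarity matrix for each window. It rests on several assumptions: the text does not say between which vectors the cosine similarity is taken (flattened slices within one subvolume are assumed here), and the `stride` parameter is a hypothetical way of expressing how consecutive subvolumes are offset.

```python
import numpy as np

def sliding_windows(n_slices, window=20, stride=1):
    """(start, stop) index pairs of consecutive subvolumes.

    `stride` is an assumed parameter: the source only states that the
    subvolumes contain 20 slices each.
    """
    return [(s, s + window) for s in range(0, n_slices - window + 1, stride)]

def cosine_similarity_matrix(subvolume):
    """Pairwise cosine similarity between the flattened slices of a subvolume.

    subvolume: array of shape (n_slices, H, W); returns (n_slices, n_slices).
    """
    flat = subvolume.reshape(subvolume.shape[0], -1).astype(float)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    flat = flat / np.maximum(norms, 1e-12)  # guard against all-zero slices
    return flat @ flat.T

def build_inputs(volume, window=20, stride=1):
    """One similarity matrix per window; the label would be the state of its last slice."""
    return [cosine_similarity_matrix(volume[a:b])
            for a, b in sliding_windows(volume.shape[0], window, stride)]
```

Each matrix produced this way has a fixed size (20 by 20 for the default window), so every row can be fed to a recurrent network as one time step regardless of the in-plane image size.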
2.4 Super-resolution reconstruction (SR-net)
Our method offers, for the first time, a non-example-based SR solution, which uses the PSF as the degradation function and jointly penalises the MSE and TV losses. For each respiratory state, the selected slices are used to perform SR reconstruction. As shown in Fig. 1, we train a four-layer 3D ConvNet with parametric rectified linear units (PReLU), where the loss function is defined as the combination of the reconstruction error and the total variation (TV) regularisation. As proposed previously, we treat the PSF as a 3D Gaussian function with full width at half maximum (FWHM) equal to the slice thickness in the through-plane direction. The reconstruction error is then the difference between the intensity of each voxel in the selected slices and the corresponding intensity simulated from the isotropic super-resolved volume using the PSF. The loss function of SR-net combines this reconstruction error with a TV term, in which a weighting coefficient balances the TV loss in the different orientations and the TV is evaluated over every possible slice of the volume along each dimension.
SR-net takes less than 1 min at test time for a 4D reconstruction of our data, while previously proposed methods that were built to handle randomly oriented PSFs take around 40 mins and 5 hours.
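The loss described above can be illustrated numerically. The sketch below is a simplified NumPy version under explicit assumptions: the PSF is applied only along the through-plane (Z) axis, the TV term is the mean absolute finite difference along each axis, and the names `fwhm_to_sigma`, `blur_z`, `tv_loss` and `sr_loss` are hypothetical. It is not the SR-net training code.

```python
import math
import numpy as np

def fwhm_to_sigma(fwhm):
    """Standard deviation of a Gaussian with the given full width at half maximum."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def blur_z(volume, sigma_vox):
    """Apply a 1D Gaussian PSF along Z to a (Z, H, W) volume (in-plane PSF ignored)."""
    radius = max(1, int(3 * sigma_vox + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma_vox) ** 2)
    k /= k.sum()
    padded = np.pad(volume, ((radius, radius), (0, 0), (0, 0)), mode="edge")
    out = np.zeros(volume.shape, dtype=float)
    for i, w in enumerate(k):
        out += w * padded[i:i + volume.shape[0]]
    return out

def tv_loss(volume, weights=(1.0, 1.0, 1.0)):
    """Weighted anisotropic TV: one weight per axis (Z, Y, X)."""
    return sum(w * np.abs(np.diff(volume, axis=ax)).mean()
               for ax, w in enumerate(weights))

def sr_loss(sr_volume, slices, z_index, sigma_vox, tv_weights):
    """MSE between acquired slices and PSF-simulated slices, plus weighted TV."""
    simulated = blur_z(sr_volume, sigma_vox)[z_index]
    mse = np.mean((simulated - slices) ** 2)
    return mse + tv_loss(sr_volume, tv_weights)
```

With a 4 mm slice thickness and, say, 1 mm isotropic voxels (an assumed grid), `fwhm_to_sigma(4.0)` gives a PSF standard deviation of roughly 1.7 voxels. Raising the first entry of `tv_weights` relative to the others penalises variation along Z more strongly, which matches the stated aim of enforcing smoothness in that direction.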
2.5 Implementation Details
The method was implemented using Python and PyTorch. The networks were trained in two steps: first, the SRNN was trained with 1359 subvolumes from 2 subjects with 8 groups of breathing states; then, the SR-net was trained on 24 3D volumes from 3 subjects. For both SRNN and SR-net training, we use Adam as the optimiser. For the SRNN, learning rates from 0.01 to 1 were tested and the rate was set to 0.1 based on empirical results. To avoid over-fitting, we set the weight decay to 0.01, which adds L2 regularisation of the weights to the optimisation procedure. For SR-net, we set the TV loss weights of the three orientations so as to enforce data smoothness along the Z-axis. We set the total number of epochs to 5000. The total training time is 5 hours.
3.1 Simulated experiment
To validate the classification accuracy of the SRNN, we generated a simulated dataset sampled from 5 different respiratory states. For a real in-utero dataset, we classified slices into motion states using a combination of peak selection and manual input. We then reconstructed the average motion state and registered it to the acquired slices of the other respiratory states. A breathing cycle was then simulated by choosing eight slices from each group with a random starting state. We applied both the peak selection method and the SRNN to the simulated dataset.
Table 1 shows that the SRNN achieved close to 80% accuracy for all five breathing states, while the original peak selection had much lower accuracy. The errors were mainly due to confusion between neighbouring classes or between the average inhale and exhale states.
3.2 Real data reconstructions
MRI data were acquired on a 3T clinical system (Achieva, Philips Healthcare, Best, Netherlands) using a 2D bSSFP sequence with the SWEEP technique. Informed consent was obtained from 2 healthy adult volunteers (kidney) and 10 pregnant volunteers (gestational ages: 23-36 weeks), who were scanned in the supine position with routine blood pressure and pulse oximetry monitoring. For the kidney and uterus acquisitions, respectively, the TR/TE is 5.7/2.8 ms and 7.3/3.6 ms, the sweep rate is 0.37 mm/s and 0.17 mm/s, and the slice thickness is 3 mm and 4 mm.
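To give a feel for how densely the slices overlap, the short calculation below combines the sweep rates and slice thicknesses stated here with the per-slice acquisition times given in Section 2.1 (442 ms for kidney, 490 ms for uterus). It assumes that the slice position advances continuously at the sweep rate and that one slice is acquired per acquisition time; the function names are hypothetical.

```python
def slice_spacing_mm(sweep_rate_mm_per_s, slice_time_ms):
    """Distance the slice position moves during one slice acquisition."""
    return sweep_rate_mm_per_s * slice_time_ms / 1000.0

def slices_per_thickness(thickness_mm, spacing_mm):
    """Approximate number of consecutive slices that cover the same position."""
    return thickness_mm / spacing_mm

# (name, sweep rate in mm/s, time per slice in ms, slice thickness in mm)
protocols = [("kidney", 0.37, 442, 3.0), ("uterus", 0.17, 490, 4.0)]

for name, rate, t_ms, thickness in protocols:
    spacing = slice_spacing_mm(rate, t_ms)
    overlap = slices_per_thickness(thickness, spacing)
    print(f"{name}: spacing {spacing:.3f} mm, about {overlap:.0f} overlapping slices")
```

Under these assumptions the spacing is about 0.16 mm for the kidney scans and about 0.08 mm for the uterus scans, so roughly 18 and 48 consecutive slices, respectively, cover any given position. This redundancy is what makes it plausible to keep only a fraction of the slices for each respiratory state.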
The reconstruction results for the abdominal scan are shown in Fig. 2. For a single-orientation acquisition, motion artifacts are present in the SVR reconstruction in spite of the automatic rejection of misaligned slices (b, e). In contrast, the SR reconstruction of slices selected using our proposed method (c, f) resolved most of the breathing artifacts. The bottom row demonstrates that the proposed SRNN can accurately separate the inhale and exhale respiratory states.
Fig. 3 shows a similar comparison for abdominal MRI of a pregnant patient. The red box highlights an artifact caused by a deep breath: with only one stack of 2D MR images there is no good registration target, and the state-of-the-art SVR method failed in this area of large motion corruption.
For quantitative analysis, we calculated the PSNR and SSIM between the reconstruction and sparsely selected 2D images for three different respiratory groups: the average, inhale and exhale breathing states. We compared the proposed method with two state-of-the-art SVR software packages, SVR and NiftyMic. For a fair comparison, we use the selected 2D images as targets and use the free-form deformation (FFD) based method in the MIRTK package (https://mirtk.github.io/) to register the reconstructions from SVR and NiftyMic to each respiratory group. Our 4D reconstruction results are listed as SRNN0. We then register the average state in SRNN0 to the inhale and exhale states and report the results as SRNN1 in Table 2 and Table 3. The reported values in Table 2 and Table 3 are the averages over the 10 in-utero subjects. The results show that the proposed reconstruction pipeline can generate SR images with high fidelity to the original MRI scan.
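For reference, the two metrics can be computed as in the sketch below. This is a generic NumPy illustration, not the evaluation code used for Table 2 and Table 3: the SSIM here is a single-window (global) variant with the commonly used constants `k1=0.01` and `k2=0.03`, whereas standard SSIM implementations average the index over local windows, and the choice of `data_range` is an assumption.

```python
import numpy as np

def psnr(reference, estimate, data_range=None):
    """Peak signal-to-noise ratio in dB between a reference slice and an estimate."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    if data_range is None:
        data_range = reference.max() - reference.min()
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def global_ssim(x, y, data_range, k1=0.01, k2=0.03):
    """Structural similarity computed over the whole image as one window."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2))
                 / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

In the setting described above, `reference` would be one of the sparsely selected 2D images and `estimate` the corresponding slice resampled from the registered reconstruction.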
4 Discussion and Conclusion
In this paper, we proposed an efficient respiratory-motion-resolved 4D (3D+t) reconstruction pipeline for abdominal and in-utero MRI. We investigated the respiratory information naturally embedded in neighbouring slices and used it to train a bidirectional RNN.
The result is a simple but effective motion correction and SR reconstruction pipeline for abdominal and in-utero MRI. The proposed pipeline can accurately cluster the respiratory motion of the acquired stack of 2D images. The proposed self-supervised RNN utilises the NCC scores between each 2D slice and the Z-axis blurred image. This breathing motion indicator is very helpful for supervising the respiratory state clustering. The SR reconstruction stage further improves the reconstruction performance. Compared with the SVR methods, SR-net is a CNN pipeline that takes less than 1 min for a 4D reconstruction, while the SVR methods take around 40 mins and 5 hours. The PSNR and SSIM comparisons show that, in such single-orientation acquisition scenarios, the proposed pipeline with less than 20% of the sparsely selected slices outperformed the SVR methods using all the slices.
This work was supported by the National Institutes of Health Human Placenta Project [1U01HD087202‐01], by the Wellcome Trust IEH Award, by the Wellcome/EPSRC Centre for Medical Engineering [WT203148/Z/16/Z] and by the National Institute for Health Research (NIHR) Biomedical Research Centre at Guy’s and St Thomas’ NHS Foundation Trust and King’s College London. The authors also thank NVIDIA for the GPU grant.
[1] (2010) Total Generalized Variation. SIAM Journal on Imaging Sciences 3 (3), pp. 492–526.
[2] (2016) Image super-resolution using deep convolutional networks. IEEE Trans Pattern Anal Mach Intell 38 (2), pp. 295–307.
[3] An automated localization, segmentation and reconstruction framework for fetal brain MRI. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 11070 LNCS, pp. 313–320.
[4] Self-supervised video representation learning with odd-one-out networks. pp. 3636–3645.
[5] (2010) Robust super-resolution volume reconstruction from slice acquisitions: Application to fetal brain MRI. IEEE Transactions on Medical Imaging 29 (10), pp. 1739–1758.
[6] (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
[7] (2018) Respiration resolved imaging using continuous steady state multiband excitation with linear frequency sweeps. In ISMRM, Paris, pp. 5–7.
[8] (2015) Fast Volume Reconstruction From Motion Corrupted Stacks of 2D Slices. IEEE Transactions on Medical Imaging 34 (9).
[9] Reconstruction of fetal brain MRI with intensity matching and complete outlier removal. Medical Image Analysis 16 (8), pp. 1550–1564.
[10] (2018) Three-dimensional visualisation of the fetal heart using prenatal MRI with motion corrected slice-volume registration. Lancet 0 (0).
[11] (2015) Learning temporal embeddings for complex video analysis. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4471–4479.
[12] (2018) The use of antenatal fetal magnetic resonance imaging in the assessment of patients at high risk of preterm birth. European Journal of Obstetrics & Gynecology and Reproductive Biology 222, pp. 134–141.
[13] (2017) Magnetic resonance imaging assessment of lung volumes in fetuses at high risk of preterm birth. In BJOG: An International Journal of Obstetrics and Gynaecology, Vol. 124, pp. 24–24.
[14] (2019) Magnetic resonance imaging assessment of lung: body volume ratios in fetuses at high risk of preterm birth. In BJOG: An International Journal of Obstetrics and Gynaecology, Vol. 126, pp. 8–8.
[15] (2019) Fully Automatic 3D Reconstruction of the Placenta and its Peripheral Vasculature in Intrauterine Fetal MRI. Medical Image Analysis.
[16] (2018) Learning and Using the Arrow of Time. pp. 8052–8060.