Image quality assessment (IQA) is a fundamental tool for evaluating MR images [19, 8, 27]. Its main purpose is to determine whether the images are diagnostically reliable and free from artefacts, so as to avoid unreliable diagnoses [5, 18]. The evaluation process is often time-consuming and depends subjectively on the observer who carries it out. Furthermore, the different levels of expertise and experience of the readers (the experts designated to perform the IQA) can lead to assessments that do not match perfectly. Another intrinsic issue of IQA for MR images is the absence of a reference image. No-reference IQA techniques, with and without the support of machine learning and deep learning, have been proposed in recent years for the evaluation of visual image quality [5, 9, 27, 36, 36, 10, 23, 22, 11, 12, 4]. These techniques are able to detect and quantify the level of blurriness or corruption with different levels of accuracy and precision. However, many factors must be taken into consideration when choosing which technique to apply; the most important are [14, 13, 2]: data requirement, as deep learning requires a large dataset, while traditional (non-deep-learning) machine learning techniques can be trained on less data; accuracy, as deep learning typically provides higher accuracy than traditional machine learning; training time, as deep learning takes longer to train than traditional machine learning; and hyperparameter tuning, as deep learning can be tuned in many different ways and it is not always possible to find the best parameters, while traditional machine learning offers limited tuning capabilities. In addition, when traditional machine learning techniques are chosen, the fundamental step of feature extraction must be considered. Although the list of traditional machine learning and deep learning techniques used for regression and classification tasks is constantly updated [32, 25, 35, 24], there is still no gold-standard IQA for MR images.

The aim of this work is to create an automated IQA tool that can detect the presence of motion artefacts and quantify the level of corruption or distortion relative to an “artefact-free” counterpart, based on regression of the structural similarity index (SSIM). The tool has been designed to work for a large variety of MR image contrasts, such as T1-, T2-, PD-, and FLAIR-weighted images, independently of the resolution and orientation of the image considered. Additionally, a contrast augmentation step has been introduced in order to increase the range of variability of the weighting. In practice, when MR images are acquired and contain artefacts, “artefact-free” counterparts are not available for comparison. The SSIM calculation, however, always requires two images (a corrupted image and its motion-artefact-free counterpart). For this reason, in this work, the corrupted images were created artificially using two different algorithms: one implemented by Shaw et al. (available as part of the TorchIO library) and a second algorithm developed in-house. Furthermore, training a neural network model in a fully supervised manner, as in this case, typically requires a large amount of labelled or annotated data. In this research, the regression labels for training were created by comparing the artificially corrupted images against the original artefact-free images using SSIM; the resulting SSIM values were used as the regression labels.
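For reference, the SSIM between two image patches x and y can be written in its commonly used form, following the structural similarity paper listed in the references (2004). The constants and their usual default values are not stated in this work and are given here only as the conventional choice:

```latex
\mathrm{SSIM}(x, y) =
  \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}
       {(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)},
\qquad C_1 = (K_1 L)^2, \quad C_2 = (K_2 L)^2
```

Here \mu_x and \mu_y are the patch means, \sigma_x^2 and \sigma_y^2 the patch variances, \sigma_{xy} the covariance, and L the dynamic range of the intensities; K_1 = 0.01 and K_2 = 0.03 are the conventional defaults. A value of 1 is reached only when the two patches are identical, which is why a lower SSIM can be read as a higher level of corruption.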
The proposed automatic IQA tool relies on residual neural networks (ResNet) [15, 28]. Two versions of ResNet were used, with 18 (ResNet-18) and 101 (ResNet-101) layers. Each model was trained twice, with and without the contrast augmentation step. The following steps are executed during training (Figure 1):
Given a 3D input volume, one random slice (2D image) is selected from one of the possible orientations: axial, sagittal, or coronal. For an anisotropic volume, the slice is selected only along the original acquisition orientation.
If contrast augmentation is enabled, one of the contrast augmentation algorithms is selected at random and applied to the input image.
The image is artificially corrupted using one of the two motion corruption algorithms.
The SSIM is calculated between the 2D input image and the corresponding corrupted one.
The calculated SSIM value and the corrupted image are passed to the chosen model for training.
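The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: `toy_motion_corruption` is a hypothetical stand-in for the two corruption algorithms, gamma correction with an arbitrary range stands in for the set of contrast augmentations, the SSIM is computed over a single global window rather than with a sliding window, and the anisotropic case is not handled.

```python
import numpy as np


def random_slice(volume, rng):
    """Pick one random 2D slice along a randomly chosen axis (isotropic case only)."""
    axis = int(rng.integers(0, 3))
    index = int(rng.integers(0, volume.shape[axis]))
    return np.take(volume, index, axis=axis)


def gamma_contrast(image, gamma):
    """Rescale to [0, 1] and apply gamma correction (one example augmentation)."""
    lo, hi = image.min(), image.max()
    scaled = (image - lo) / (hi - lo + 1e-8)
    return scaled ** gamma


def toy_motion_corruption(image, rng, fraction=0.3, max_shift=4.0):
    """Hypothetical corruption: shift a random subset of k-space lines via phase ramps."""
    kspace = np.fft.fft2(image)
    rows = image.shape[0]
    freqs = np.fft.fftfreq(image.shape[1])
    hit = rng.random(rows) < fraction  # lines "acquired" during motion
    shifts = rng.uniform(-max_shift, max_shift, size=rows) * hit
    kspace = kspace * np.exp(-2j * np.pi * shifts[:, None] * freqs[None, :])
    return np.abs(np.fft.ifft2(kspace))


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """SSIM over one global window (a simplification of the windowed SSIM)."""
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))


def make_training_pair(volume, rng, augment=True):
    """Return (corrupted image, SSIM label) for one randomly selected slice."""
    image = random_slice(volume, rng).astype(np.float64)
    gamma = rng.uniform(0.5, 2.0) if augment else 1.0  # assumed range
    image = gamma_contrast(image, gamma)
    corrupted = toy_motion_corruption(image, rng)
    return corrupted, float(global_ssim(image, corrupted))


rng = np.random.default_rng(0)
volume = rng.random((32, 32, 32))  # synthetic stand-in for an MR volume
corrupted, label = make_training_pair(volume, rng)
```

The network only ever sees `corrupted` and `label`; the clean slice is used solely to compute the label, which is what makes the trained model reference-free at inference time.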
Three datasets (training, validation, and test sets) were used for this work (Table 1). For training, 200 volumes were used, while 50 were used for validation and 50 for testing. The first group of 68 volumes was selected from the public IXI dataset (available at: https://brain-development.org/ixi-dataset/), the second group (Table 1, Site-A) of 114 volumes was acquired with a 3T scanner, the third group (Table 1, Site-B) of 93 volumes was acquired at 7T, and a final group (Table 1, Site-B) of 25 volumes was acquired with different scanners (1.5 and 3T). The volumes from IXI, Site-A, and Site-B were resampled to an isotropic resolution of 1.00 mm.
|Data||Weighting||Volumes||Matrix Size, m(M) x m(M) x m(M)†||Resolution (mm), m(M) x m(M) x m(M)†|
|†: “m” indicates the minimum value while “M” the maximum.|
For testing, a total of 10000 images were generated by repeatedly selecting random slices from the 50 volumes of the test dataset, applying the random orientation selection, the contrast augmentation, and finally the corruption, as performed during the training stage.
To evaluate the performance of the trained models, the predicted SSIM values were first plotted against the ground-truth SSIM values, as shown in Figure 3. Next, the residuals, i.e., the differences between the predicted and the ground-truth SSIM values, were calculated (Figure 4).
The predicted SSIM value of an image can be regarded as a measure of its distortion or corruption level. However, when this approach is applied to a real clinical case, it is challenging to compare this value with a subjective assessment. To get around this problem, the regression task was simplified into a classification one. To this end, three experiments were performed with different numbers of classes: 3, 5, and 10. In every case, the SSIM range [0-1] was divided into equal sub-ranges. For instance, with 3 classes there were three sub-ranges, class-1: [0.00-0.33], class-2: [0.34-0.66], and class-3: [0.67-1.00]. The 5- and 10-class cases were created in the same way.
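The conversion from a predicted SSIM value to a class can be sketched as below. The function name is hypothetical, and the boundary convention (lower edge inclusive, with 1.0 assigned to the last class) is an assumption, since the text gives the sub-ranges only to two decimals.

```python
def ssim_to_class(ssim, n_classes):
    """Map an SSIM value in [0, 1] to a 1-based class index using equal-width bins."""
    if not 0.0 <= ssim <= 1.0:
        raise ValueError("SSIM expected in [0, 1]")
    # min() keeps ssim == 1.0 inside the last class instead of opening an extra bin.
    return min(int(ssim * n_classes), n_classes - 1) + 1


# The same predicted value falls into different classes depending on granularity.
classes_for_045 = [ssim_to_class(0.45, n) for n in (3, 5, 10)]
```

With this convention a higher class index corresponds to a higher SSIM, i.e., to a less corrupted image, matching the ordering of the sub-ranges given above.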
A second dataset, comprising randomly selected images from clinical acquisitions, was also used for testing the trained models. This dataset contained five subjects, each with a different number of scans, as shown in Table 2. In this case, there were no ground-truth reference images; for this reason, the images were also evaluated subjectively by one expert using the following classification scheme: class 1, images of good to high quality that might have minor motion artefacts which do not alter the structures and substructures of the brain (SSIM range between 0.85 and 1.00); class 2, images of sufficient to good quality, which can have motion artefacts that prevent a correct delineation of the brain structures, substructures, or lesions (SSIM range between 0.60 and 0.85); and class 3, images of insufficient quality, for which a re-scan is required (SSIM range between 0.00 and 0.60). Additionally, this dataset contained contrasts not included in the training, such as diffusion-weighted images (DWI).
|Data||Weighting||Volumes||Matrix Size, m(M) x m(M) x m(M)†||Resolution (mm), m(M) x m(M) x m(M)†|
|Subj. 4||T2, FLAIR, DWI||1, 2, 6||144(512) x 144(512) x 20(34)||0.45(1.40) x 0.45(1.40) x 2.00(4.40)|
|Subj. 5||T2, FLAIR, DWI||3, 1, 4||256(640) x 256(640) x 28(42)||0.40(1.09) x 0.40(1.09) x 3.30(6.20)|
|†: “m” indicates the minimum value while “M” the maximum.|
The results for the first part, the regression task, are presented in Figures 3 and 4. Figure 3 shows a scatter plot in which the predicted SSIM values are compared against the ground-truth values, together with the linear fit obtained for each trained model and the distributions of the ground-truth and predicted SSIM values. It thus gives a general comparison across all the trained models and of their qualitative dispersion levels; here, dispersion refers to how much the predicted SSIM values differ from the ground truth. In Figure 4, the scatter plots are instead shown separately for each model, together with the corresponding residual distributions described in Section 2. A normal distribution was fitted to each residual distribution using the Python package SciPy; the resulting mean and standard deviation values are shown in Figure 4. According to this analysis, the model with the smallest standard deviation and the mean value closest to zero was the ResNet-18 trained with contrast augmentation, while the model with the mean value farthest from zero and the largest standard deviation was the ResNet-101 trained without contrast augmentation. The effect of contrast augmentation is clear for both ResNet-18 and ResNet-101: it reduces the standard deviation, which visually correlates with a lower dispersion level in the scatter plots.
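The residual analysis can be illustrated with the short sketch below. It uses only the Python standard library so that it runs without dependencies; the maximum-likelihood normal fit it computes (sample mean and population standard deviation) is the same pair of estimates that `scipy.stats.norm.fit` returns. The sign convention (predicted minus ground truth) and the numbers are assumptions made for illustration, not results from this work.

```python
from statistics import fmean, pstdev


def fit_normal_to_residuals(predicted, ground_truth):
    """Return (mean, standard deviation) of the residuals under a normal MLE fit."""
    # Assumed sign convention: residual = predicted - ground truth.
    residuals = [p - g for p, g in zip(predicted, ground_truth)]
    mu = fmean(residuals)
    sigma = pstdev(residuals, mu)  # population std, i.e. the MLE for a normal fit
    return mu, sigma


# Illustrative SSIM values only.
predicted = [0.80, 0.62, 0.91, 0.45]
ground_truth = [0.78, 0.65, 0.90, 0.50]
mu, sigma = fit_normal_to_residuals(predicted, ground_truth)
```

A mean close to zero indicates no systematic over- or under-estimation of the SSIM, while a smaller standard deviation corresponds to the tighter scatter around the identity line discussed above.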
The results for the classification task are shown in Figure 5 and Table 3. Figure 5 shows the logarithmic confusion matrices obtained for the classification task. The matrices show that all the trained models performed well and in a similar way. In particular, none of the matrices has non-zero elements far from the diagonal; the non-zero off-diagonal elements are confined to the neighbouring classes, as would be expected when the classes are adjacent sub-ranges of the predicted SSIM. Table 3 is complementary to Figure 5. It shows the class-wise, macro-average, and weighted-average precision, recall, and F1-score for all the trained models, together with the accuracy. For all three scenarios (3, 5, and 10 classes, as presented in Section 2), the model with the best results is once again the ResNet-18 trained with contrast augmentation. This model always obtained the highest accuracy: 97%, 95%, and 89% for the 3-, 5-, and 10-class scenarios, respectively. Even though the ResNet-18 with contrast augmentation performed better than the other models, no substantial differences can be discerned from the tabular data. Once again, however, an improvement in performance can be observed when contrast augmentation is applied.
|1 [0.00 - 0.33]||0.94||0.97||0.95||0.93||0.97||0.95||0.93||0.98||0.96||0.97||0.89||0.93||117|
|2 [0.33 - 0.66]||0.95||0.96||0.95||0.97||0.96||0.96||0.94||0.97||0.95||0.98||0.94||0.96||4307|
|3 [0.66 - 1.00]||0.97||0.96||0.97||0.97||0.98||0.97||0.98||0.95||0.96||0.95||0.99||0.97||5576|
|1 [0.00 - 0.20]||0.97||0.91||0.94||0.93||0.79||0.85||0.94||0.97||0.96||0.85||0.88||0.87||33|
|2 [0.20 - 0.40]||0.86||0.89||0.88||0.85||0.90||0.87||0.83||0.91||0.87||0.93||0.77||0.84||262|
|3 [0.40 - 0.60]||0.91||0.92||0.91||0.93||0.92||0.93||0.89||0.94||0.91||0.94||0.90||0.92||2320|
|4 [0.60 - 0.80]||0.94||0.95||0.94||0.95||0.96||0.96||0.94||0.94||0.94||0.94||0.96||0.95||5021|
|5 [0.80 - 1.00]||0.96||0.93||0.95||0.96||0.96||0.96||0.97||0.92||0.95||0.95||0.96||0.96||2364|
|1 [0.00 - 0.10]||1.00||0.50||0.67||1.00||0.62||0.77||1.00||0.62||0.77||1.00||0.75||0.86||8|
|2 [0.10 - 0.20]||0.81||0.88||0.85||0.78||0.72||0.75||0.83||0.96||0.89||0.75||0.84||0.79||25|
|3 [0.20 - 0.30]||0.90||0.90||0.90||0.81||0.84||0.83||0.87||0.89||0.88||0.91||0.79||0.84||62|
|4 [0.30 - 0.40]||0.81||0.84||0.83||0.80||0.85||0.83||0.76||0.85||0.80||0.88||0.71||0.79||200|
|5 [0.40 - 0.50]||0.82||0.86||0.84||0.86||0.87||0.87||0.79||0.87||0.83||0.86||0.83||0.84||689|
|6 [0.50 - 0.60]||0.84||0.84||0.84||0.89||0.87||0.88||0.83||0.86||0.84||0.89||0.84||0.86||1631|
|7 [0.60 - 0.70]||0.86||0.88||0.87||0.89||0.89||0.89||0.85||0.87||0.86||0.88||0.88||0.88||2706|
|8 [0.70 - 0.80]||0.87||0.87||0.87||0.89||0.90||0.89||0.88||0.84||0.86||0.86||0.90||0.88||2315|
|9 [0.80 - 0.90]||0.86||0.88||0.87||0.89||0.92||0.90||0.89||0.85||0.87||0.87||0.91||0.89||1456|
|10 [0.90 - 1.00]||0.97||0.86||0.91||0.97||0.91||0.94||0.96||0.88||0.91||0.95||0.93||0.94||908|
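The quantities reported in the table (precision, recall, F1-score, and support per class, plus the overall accuracy) can be derived from a confusion matrix as sketched below. The reference list points to scikit-learn for Table 3; this dependency-free version is an illustrative equivalent with hypothetical function names and made-up labels, not the code used for the table.

```python
def confusion_matrix(true_classes, predicted_classes, n_classes):
    """Count matrix with rows = true class and columns = predicted class (1-based labels)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t - 1][p - 1] += 1
    return matrix


def per_class_report(matrix):
    """Precision, recall, F1 and support for every class of a count matrix."""
    n = len(matrix)
    report = {}
    for c in range(n):
        tp = matrix[c][c]
        support = sum(matrix[c])                         # images truly in class c
        predicted = sum(matrix[r][c] for r in range(n))  # images assigned to class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c + 1] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": support}
    return report


def accuracy(matrix):
    """Fraction of samples on the diagonal of the count matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Made-up labels in which every error lands in a neighbouring class.
true_classes = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
predicted_classes = [1, 2, 2, 2, 3, 3, 3, 3, 3, 2]
matrix = confusion_matrix(true_classes, predicted_classes, 3)
report = per_class_report(matrix)
overall_accuracy = accuracy(matrix)
```

Dividing each row of `matrix` by its support gives the row-normalised form in which the diagonal entry equals the recall of that class, so an off-diagonal value of 0.50 in a row means that half of the images of that class were assigned to another class.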
The results for the clinical data samples are shown in Figure 6. Here, the SSIM predictions obtained with each model are overlaid with the subjective scores and shown per slice, grouped by subject. As introduced in Section 2, the subjective ratings for the clinical data samples were assigned to class 1, 2, or 3 after a careful visual evaluation. If the prediction of a model falls within the SSIM range of the class assigned by the subjective evaluation, the objective and subjective evaluations agree; if it lies outside that range, the two assessments disagree. The percentage of agreement between the subjective and objective analyses was summarised as mean ± standard deviation; the minimum agreement was obtained by ResNet-101 without contrast augmentation and the maximum by ResNet-101 with contrast augmentation.
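The agreement criterion can be sketched as follows, using the SSIM ranges of the subjective scheme from Section 2 (0.85 to 1.00 for class 1, 0.60 to 0.85 for class 2, and 0.00 to 0.60 for class 3). The function names are hypothetical, the handling of values exactly on a boundary is an assumption, and the example values are illustrative rather than taken from the clinical dataset.

```python
def expert_class_from_ssim(ssim):
    """Map a predicted SSIM to the three-class subjective scheme."""
    if ssim >= 0.85:   # class 1: good to high quality
        return 1
    if ssim >= 0.60:   # class 2: sufficient to good quality
        return 2
    return 3           # class 3: insufficient quality, re-scan required


def agreement_percentage(predicted_ssim, expert_classes):
    """Percentage of images whose predicted SSIM falls in the expert-assigned class."""
    matches = sum(expert_class_from_ssim(s) == c
                  for s, c in zip(predicted_ssim, expert_classes))
    return 100.0 * matches / len(expert_classes)


# Illustrative values: the third image is rated class 2 but predicted as class 3.
predicted_ssim = [0.92, 0.70, 0.55, 0.88]
expert_classes = [1, 2, 2, 1]
agreement = agreement_percentage(predicted_ssim, expert_classes)
```

Computing this percentage once per trained model yields the set of values from which a mean and standard deviation of the agreement can then be taken.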
The performances of the trained models on the regression task were very similar. However, both ResNet-18 and ResNet-101 showed a distinct improvement when coupled with contrast augmentation. Looking at the residual distributions, for both models contrast augmentation brought the mean values closer to zero and decreased the standard deviations. The reduction of the standard deviations is also evident in the scatter plots, where the dispersion level is visibly lower when contrast augmentation is applied.
Considering the classification task, the first notable observation is an approximately linear decrease in accuracy as the number of classes increases (for instance, 97%, 95%, and 89% for the ResNet-18 with contrast augmentation in the 3-, 5-, and 10-class scenarios). This can be explained by the fact that, as the number of classes increases, it becomes more difficult for each model to assign an image to the correct pre-defined range of SSIM values. The confusion matrices confirm this behaviour through the increase in the off-diagonal values. For example, for the ResNet-18 without contrast augmentation, the maximum off-diagonal value is 0.04 (for class-2 and class-3) in the three-class task, while it is 0.50 (for class-1) in the ten-class task. This implies that the ResNet-18 without contrast augmentation, when performing the ten-class classification task, incorrectly classifies 50% of the tested class-1 images. When contrast augmentation is applied, there is an apparent reduction in the number of wrongly classified class-1 images. Although this is the general trend observed in Figure 5, there are also contradictory results: in the five-class task, again comparing ResNet-18 without and with contrast augmentation, there is a net increase in erroneously classified class-1 images.
The final application on clinical data also provided satisfactory results in terms of the agreement rate between the objective and subjective assessments. A direct comparison with the previous three-class classification task is not possible because of the different subjective scheme selected (Section 2). Although there is a visible reduction in performance when the trained models are applied to clinical data, this can be explained by several factors. First, the clinical data sample included types of image data, such as diffusion acquisitions and derived diffusion maps, that were never seen by the models during training. Second, the artificially created motion artefacts did not cover the infinite variety of motion artefacts that can appear in a truly motion-corrupted MR image. A possible improvement can be obtained by introducing new contrasts, resolutions, and orientations into the training set; for example, oblique acquisitions were not considered in this work. In addition, the artificial corruption methods used in this work can be further improved, e.g., by including corruption algorithms based on motion-log information recorded by a tracking device, as commonly used for prospective motion correction. However, this would require the availability of raw MR data, and the computational time needed to de-correct the images, which is slower than that of the current approaches, must also be taken into account. Another point to consider for the subjective assessment is the bias introduced by each expert when evaluating image quality. In this work, the expert’s perception of image quality is emulated with good accuracy, but this perception cannot be considered a standard reference. Although the subjective assessment can be repeated with the help of several experts, there will always be differences between them, e.g., in years of experience or in sensitivity to the presence of motion artefacts in the assessed image. It is also noteworthy that the SSIM ranges defined for the three classes can be redefined following a different scheme. In the scenario explored in this paper, the scheme was defined using the artificially corrupted images and the ground-truth images; this allowed an exact calculation of the SSIM values, which made it simple to define ranges that visually agree with the scheme defined in Section 2.
This research presents an SSIM-regression-based IQA technique using ResNet models, coupled with contrast augmentation to make them robust against changes in image contrast in clinical scenarios. The method predicted the SSIM values of artificially motion-corrupted images with high accuracy, without requiring the ground-truth (motion-free) images. Moreover, the motion classes obtained from the predicted SSIM values were very close to the true ones, with a maximum accuracy of 89% for the ten-class scenario and of 97% for the three-class scenario (Table 3). Considering the complexity of quantifying the level of image degradation due to motion artefacts, and the variability in contrast type, resolution, and other acquisition parameters, the results obtained are very promising. Further evaluations, including multiple subjective evaluations, will be performed on clinical data to judge the clinical applicability of the method and its robustness against changes in real-world scenarios. In addition, further trainings will be carried out with a larger variety of images, including common clinical routine acquisitions such as diffusion-weighted imaging and Time-of-Flight imaging. Furthermore, it would be beneficial to include images acquired at lower magnetic field strengths. Considering the results obtained by the ResNet models in this work, it is reasonable to expect that future work can also target different anatomical regions, focusing, for instance, on abdominal or cardiac imaging. However, the reproduction of realistic-looking motion artefacts plays a key role in the performance of deep learning models trained to serve as a reference-less image quality assessment tool.
- [1] (1971) Mean square error of prediction as a criterion for selecting variables. Technometrics 13 (3), pp. 469–475.
- [2] (2020) Hands-on machine learning with scikit-learn and scientific Python toolkits: a practical guide to implementing supervised and unsupervised machine learning algorithms in Python. Packt Publishing Ltd.
- [3] (2001) Using labeled and unlabeled data for training.
- [4] (2016) Quality control of structural MRI images applied using FreeSurfer - a hands-on workflow to rate motion artifacts. Frontiers in Neuroscience 10, pp. 558.
- [5] (1999) Automatic quality assessment protocol for MRI equipment. Medical Physics 26 (12), pp. 2693–2700.
- [6] (1999) Image lightness rescaling using sigmoidal contrast enhancement functions. Journal of Electronic Imaging 8 (4), pp. 380–393.
- [7] (2020) Retrospective motion correction of MR images using prior-assisted deep learning. arXiv preprint arXiv:2011.14134.
- [8] (2016) Review of medical image quality assessment. Biomedical Signal Processing and Control 27, pp. 145–154.
- [9] (2017) Modified-BRISQUE as no reference image quality assessment for structural MR images. Magnetic Resonance Imaging 43, pp. 74–87.
- [10] (2017) MRIQC: advancing the automatic prediction of image quality in MRI from unseen sites. PLoS ONE 12 (9), pp. e0184661.
- [11] (2018) Automatic detection of motion artifacts on MRI using deep CNN. In 2018 International Workshop on Pattern Recognition in Neuroimaging (PRNI), pp. 1–4.
- [12] (2021) Automatic MR image quality evaluation using a deep CNN: a reference-free method to rate motion artifacts in neuroimaging. Computerized Medical Imaging and Graphics 90, pp. 101897.
- [13] (2019) O’Reilly Media, Inc.
- [14] (2016) Deep learning. MIT Press.
- [15] (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- [16] (2014) Reproduction of motion artifacts for performance analysis of prospective motion correction in MRI. Magnetic Resonance in Medicine 71 (1), pp. 182–190.
- [17] (1995) Machine vision. McGraw-Hill International Editions, New York.
- [18] (2009) The physical basis of spatial distortions in magnetic resonance images.
- [19] (2019) Image quality assessment: a review to full reference indexes. Recent Trends in Communication, Computing, and Electronics, pp. 279–288.
- [20] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [21] (1998) The role of gamma correction in colour image processing. In 9th European Signal Processing Conference (EUSIPCO 1998), pp. 1–4.
- [22] (2018) A machine-learning framework for automatic reference-free quality assessment in MRI. Magnetic Resonance Imaging 53, pp. 134–147.
- [23] (2018) Automated reference-free detection of motion artifacts in magnetic resonance images. Magnetic Resonance Materials in Physics, Biology and Medicine 31 (2), pp. 243–256.
- [24] (2011) The changing science of machine learning. Machine Learning 82 (3), pp. 275–279.
- [25] (2022) Research and application of deep learning in image recognition. In 2022 IEEE 2nd International Conference on Power, Electronics and Computer Applications (ICPECA), pp. 994–999.
- [26] (2020) Diagnostic image quality assessment and classification in medical imaging: opportunities and challenges. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 337–340.
- [27] (2009) Automatic quality assessment in structural brain magnetic resonance imaging. Magnetic Resonance in Medicine 62 (2), pp. 365–372.
- [28] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- [29] (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- [30] (2021) TorchIO: a Python library for efficient loading, preprocessing, augmentation and patch-based sampling of medical images in deep learning. Computer Methods and Programs in Biomedicine 208, pp. 106236.
- [31] (1987) Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing 39 (3), pp. 355–368.
- [32] (2017) Deep convolutional neural networks for image classification: a comprehensive review. Neural Computation 29 (9), pp. 2352–2449.
- [33] (2022) Quantitative evaluation of prospective motion correction in healthy subjects at 7T MRI. Magnetic Resonance in Medicine 87 (2), pp. 646–657.
- [34] (2018) MRI k-space motion artefact augmentation: model robustness and task-specific uncertainty. In International Conference on Medical Imaging with Deep Learning, Full Paper Track.
- [35] (2022) Foundations of machine learning-based clinical prediction modeling: part V - a practical approach to regression problems. In Machine Learning in Clinical Neuroscience, pp. 43–50.
- [36] (2018) Automated image quality evaluation of structural brain magnetic resonance images using deep convolutional neural networks. In 2018 9th Cairo International Biomedical Engineering Conference (CIBEC), pp. 33–36.
- [37] (2020) SciPy 1.0: fundamental algorithms for scientific computing in Python. Nature Methods 17, pp. 261–272.
- [38] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- [39] (2016) Reverse retrospective motion correction. Magnetic Resonance in Medicine 75 (6), pp. 2341–2349.