Magnetic resonance imaging (MRI) is a medical imaging technique used to capture volumetric image sequences of internal soft tissues, such as cardiac muscles. In comparison to X-Ray imaging (XR) and Computer Tomography (CT), MRI provides images with improved structural details via finer spatial resolutions. Cardiac MRI (CMR) focuses on the heart, allowing trained cardiologists to measure heart parameters, for example the mass of the cardiac muscle (myocardium mass), the volumes of blood cavities (atrial and ventricular volumes) and the amount of blood pumped per heartbeat (ejection fraction) (peng2016). Those parameters are used to assess how healthy is the heart, by recognizing early conditions and signs before the onset of infarcts and other complications.
Due to the size and complexity of CMR sequences, complex techniques are required to produce detailed analyses; one of these techniques is deep learning (DL). Many of the tasks and goals related to the cardiac functional analysis – for example segmentation of structures (Bernard2018), estimation of heart parameters (xue2018full), and detection of diseases (khened2017densely) – have benefited from DL methods. For even better results, research in DL has pointed out that models based on convolutional neural networks (CNN) have had a higher efficacy when provided with regions-of-interest (ROI) either explicitly or implicitly (xue2018full). The detection of ROIs, usually named ROI proposal, is a preprocessing step whose goal is to identify the most prominent regions of an image (frame) for discovering clinically-relevant artifacts.
The explicit ROI proposal approaches usually follow a combination of methods, for example: (a) pipelining a segmentation and a regression network; (b) preprocessing the input with a region proposal algorithm (he2015spatial) or with a CNN (wu2020lv); or (c) by using manual cropping (xue2017full). The implicit ROI detectors are added to the DL network abstracted as additional operators and variables; e.g. multi-scale “Inception” (szegedy2015going) and attention (vaswani2017attention)
modules, which benefit from the ROIs to down-weight less-informative neurons and inputs inside the network. Inception modules weight convolutions of different sizes, while attention modules assign a weight to each feature channel. This additional neural-network information processing guides which input features or channels shall have more weight so to improve the accuracy of the output layer.
In this paper we develop a module that highlights regions in the image sequence by analyzing the motion field using a Radial Basis Function (RBF). In our experiments we analyze our method by using the RBF for cropping the input before having it processed by a pretrained segmentation CNN. Our methodology is an innovation in the task of region proposal for CMR analysis; as we demonstrate, we achieved results that justify the use and further investigation of the employed principles. We named it after its working mechanism as VMF – Visual-Motion-Focus. Section 2 presents the theory and related works regarding ROI proposal and motion detection. Section 3 details VMF, presenting its architecture and describing how each step works. Section 4 shows the experiments over VMF, where we demonstrated excellent performance for cropping the ROI, according to metric Recall. Additionally, we also report our results regarding both Dice and Speed-up metrics. In conclusion, this work demonstrates the application of technique VMF in objective tasks and how to achieve such a setting. Section 7 discusses conclusions and future works.
2 Theory and related work
2.1 Cardiac MRI
MRI is the most precise medical imaging technique for examination of the heart structures, it allows the recording of heart images along a complete heartbeat cycle (earls2002cardiac; seraphim2020quantitative). In practice the magnetization signal is triggered by a reference pulse, then captured several times for noise reduction and, finally, reconstructed by inverse FFT. The resulting image is usually visualized in slices along a positional axis: long-axis has a frontal or lateral view of the heart, and short-axis aligns to a cross-sectional plane.
The short-axis view is split in three regions: the base or basal region near the top where blood vessels connect to the heart (slice B); the middle or medial in between (slice M); and the apex or apical region at the bottom tip of the heart (slice A) – refer to Figure 1. The normal human heart has four chambers: right atrium (RA), right ventricle (RV), left atrium (LA) and left ventricle (LV). The atria receive blood and the ventricles pump it out of the heart. Even though all chambers are important, the LV is of special interest because it is the cardiac muscle that does the “heavy lifting” of pumping oxygenated blood from the lungs to the whole body. In the short-axis view, the LV appears as a ring shape, whose thickness and internal volume measurements are essential to estimate the myocardium mass and ejection fraction, respectively.
2.2 Region proposal and Image segmentation
Region proposal is at the core of image analysis; it is responsible for determining the direction to where, in the image, the analytical algorithms should focus at. Region proposal methods range from classical algorithms (statistical features, signal processing, binary morphology, etc.) to very complex models such as modern Convolutional Neural Networks (CNNs). The goal of such algorithms varies from small outputs such as center points or bounding boxes (e.g. quad-tree segmentation, R-CNN and YOLO (spann1985quad; liu1994multiresolution; girshick2014rich; redmon2016you)) to fine contours (e.g. active contour models, region-growing segmentation, U-net and Mask-RCNN (ranganath1995contour; muhlenbruch2006global; he2017mask)).
In many methodologies, the detected ROIs are the starting point of supervised learning with labeled data. However, region-proposal is also organized according to the region under analysis, e.g. image-level, polygon-level, and pixel-level. In this paper we use the nomenclature from(lin2014microsoft): (a) Image classification assigns labels to the whole image (all of its objects); (b) Object identification assigns labels to the bounding box region of the detected objects; (c) Semantic segmentation assigns labels in polygonal or pixel-level regions, producing finer contours; (d) Instance segmentation not only detects the objects’ regions but also each individual instance of a detected label. Thus, in order to clear any confusion, we consider that parameter estimation is a regression task, and that image segmentation is a polygon or pixel-level classification task as mentioned above.
2.3 Image features and Motion analysis
Some diseases may decrease the extension and intensity of heart movements; in such cases, the motion features will have little to no information. A way to address this issue is by combining the static visual features of the image and the dynamic features. The static visual features are computed by classical feature extractors, then are adjusted to detect the heart that, in short-axis CMRs, appears as a gray bubble with dark borders. The dynamic features are computed by motion estimation.
Many categories of image feature extractors are present in the literature (gonzalez2002digital)
, such as statistical moments (mean, standard deviation, skewness, kurtosis), robust statistics (median, quartiles, interquartile range, percentiles, rank, median absolute error), signal processing filters (Gaussian, Sobel, Laplacian, Gabor) and transforms (Hough, FFT, DCT, DWT), among other measures of information distribution (Shannon entropy, Minkowski-Bouligand box-counting dimension, Hausdorf fractal dimension).
Motion estimation refers to a set of techniques to explore the temporal redundancy in sequences of images, with applications in video compression, analysis and processing (Oliveira2006). One classic formulation of motion analysis is optical flow (horn1981determining), which estimates the displacement field of an object by smooth derivatives of brightness, assuming that the brightness of moving objects does not change between frames. Other approaches include statistical and block-based algorithms; change detection by individual or accumulated frame difference; and parametric methods (Oliveira2006). Motion estimation is mainly used for general 2D video coding, camera motion compensation (thanou2015graph; zhang2018fast; bao2019memc), and for medical imaging (stemkens2016image; mohsin2017accelerated; yan2018left; yu2020foal).
2.4 Radial Basis Functions
A Radial Basis Function (RBF) is a real-valued function of the radius (or distance) from an origin to a given point (majdisova2017radial). They have several applications in function approximation and neural networks (broomhead1988radial; buhmann2003radial)
. RBFs can be understood as a flashlight beam on a wall, which is intense at the center and fades sideways. RBF kernels are built by computing the distance of each pixel position from the center, then transforming this distance by a decaying function (e.g. exp, log, Gaussian). Many center functions can be used as the RBF center, such as statistical measures of central tendency (e.g. simple and weighted mean, median, center-of-mass). Examples of distance functions include the standard deviation (also known as Malahanobis distance, or z-score difference), Minkowski (e.g., , ), Jaccard, and Cosine distances.
2.5 Computer methods for CMR analysis
Computer methods for functional analysis of CMR as reviewed by Peng et al. (peng2016)
were organized in three ways: image-driven, model-driven, and by direct estimation. Further subdivisions of the LV segmentation spans five groups: (1) image processing methods such as thresholding, morphology operators, and region-growing; (2) pixel/voxel-based classification by gaussian mixture models, neural networks, k-means, k-nearest neighbors, or support vector machines (SVM); (3) active contours (snakes), deformable models, level sets, and motion tracking; (4) principal or independent component analyses (PCA and ICA) with strong priors from statistical models of the heart anatomy; and (5) direct estimation by, e.g., latent discriminant analysis (LDA) combined with SVM. This work refers to category 2 as it employs a pixel-based classification mediated by a neural network.
2.6 Deep learning region proposal and LV segmentation
Recent approaches for LV segmentation use CNN models (such as U-net) experimented over diverse datasets and methodological combinations. U-net (ronneberger2015u) is a general segmentation model which combines a tower of downscaled-then-upscaled deep representations by concatenation with the input features, instead of addition like FPN (lin2017feature) and LinkNet (chaurasia2017linknet), with success in LV segmentation. The approach introduced by Smistad et al. (smistad2017)
departs from the pretraining of an U-net CNN using a Kalman-filter segmentation; it achieves 86% Dice (epicardium) in ultrasound images – Dice (DSC) will be detailed in Section4.1. For CMR, U-net displayed even better results (89% Dice) when trained from min-cut priors (guo2018cardiac). An U-net architecture with residual blocks and optical flow information (yan2018left) achieved 89%, 95% and 85% Dice in the base, middle and apex regions of the heart respectively. In the work of Wu et al. (wu2020lv), the authors combine a custom CNN for region proposal with an U-net segmentation to achieve 95% Dice. Overall, the combination of region proposal to U-net had good results in particular datasets, but still has room for improvement when the evaluation generalizes across multiple datasets. Different to former works, our methodology, VMF, uses RBFs to propose ROIs that will aid a CNN processing in the task of LV segmentation. As we demonstrate, we achieve superior results over an ample set of experiments.
3 Material and methods
VMF starts with a 4D image input , that is, a sequence of volumes (frames) each one with a time coordinate. Initially, we normalize to produce sequence with frames in a format more adequate for Neural Network processing – see Section 3.1. From , we extract visual features to produce , and motion estimation to produce , detailed in Section 3.2. Next, we apply two sets of weights: the weights related to visual features; and
, the weights related to the motion, or time; combining both features in tensor. Then, as presented in Section 3.3, we compute the center voxel defined by the energy-weighted sum of all the voxels’ coordinates; then we produce a segmentation map by applying a threshold to and extracting the bounds of non-zero voxels; afterwards we compute the scale given by the standard deviation of all the voxels’ distances from center . At this point, refer to Section 3.4, we can apply a Radial Basis Function at center with radius , computing , then we scale the region defined by to the CNN input shape. In Section 3.6, we explain the Neural Network processing, its parameters and training.
3.1 Intensity normalization
We normalized the image intensities between 0 and 1, by subtracting the minimum value and dividing by the range of values, where is a small constant to avoid division-by-zero:
3.2 Visual features and motion estimation
For visual features extraction, we consider the statistical mean and standard deviation obtained with 33 3 kernels. The mean image is computed using the convolution operation with mean kernel , defined as:
where is a 1 3 3 3 matrix-of-ones. That is, is just the arithmetic average of a 3 3 3 matrix arranged for convolution. With kernel M, we compute as:
In turn, the standard deviation image is computed by taking the differences between the image and mean image , then squaring the differences element-wise (Hadamard power), after that by convolving with the mean kernel , and then taking the Hadamard square-root, as follows:
Accordingly, the mean image and the standard deviation image define . Notice that we express the Hadamard powers using the definitions and notations defined in the work of Fallat and Johnson (fallat2007hadamard).
We used the motion estimate function – refer to Equation 5, given by the root-mean-squared differences of intensity along the time coordinate, where is the time interval (or number of frames) in image , and is the Sobel kernel w.r.t. time, instead of the default and spatial Sobel kernels. This function is related to the magnitude of the optical flow (horn1981determining)vector field in each voxel, as follows:
Figure 2 illustrates a sequence of six CMR frames spaced by of the cardiac cycle (upper row in the figure), and the respective absolute derivatives in each point along the time dimension (lower row) as computed with Equation 5. Figure 3a shows the total motion estimate in the sequence, while Figure 3b presents the cumulative energy histogram. By combining both static and motion features and , we obtain:
The weights and are initialized to and respectively for and , i.e., although we emphasize the motion features, we also include the static visual features, which will address the problematic cases when the heart has limited motility, which might be the case for heart complications.
3.3 Center and scale computation
The center of energy , formalized by Equation 7, is the voxel defined by the energy-weighted sum of all the voxel coordinates in dimensions ; that is, every voxel has , for is the number of voxels in the image, as follows:
The segmentation is given by the thresholded image from the previous step, see Figure 4. The scale estimate is the cube root of the volume (in voxels) of the thresholded image considering the values above the 90% percentile:
where the indicator function returns if the condition is true, or otherwise; and
is the quantile function, returning the maximum of the lowest(%) values in .
3.4 Segmentation and localization
The segmentation is derived from thresholded (Equation 8). The localization is found by fitting an RBF to . The radius is the distance from the center to each voxel . The Euclidean distance (-norm ) was used:
The chosen RBF is a Gaussian of the voxel distances:
Both outputs and are illustrated in Fig. 4. According to the framework depicted so far, we can design a focus for many different objects in the images by changing the functions for image feature extraction, motion estimation, center, scale and RBF. It is also possible to detect multiple objects or objects of complex shapes by a mixture of RBF models.
3.5 Crop, scaling and fine CNN segmentation
This step performs the final preparation of the image so that it will fit the CNN input shape; this is necessary because the MRI images may have diverse dimensions. After all the RBF estimation steps executed in the original resolution, the proposed region is cropped then adjusted, followed by an intensity normalization. For CNNs with fixed input dimensions, we rescale images using bicubic spline interpolation. For CNNs with variable input shape, we only adjust the image proportions as requested by the model, e.g. U-net has five max-pooling layers with down-scaling factor 2, which means the input dimensions should be multiples of 2in order to have at least 1 pixel in the encoder output.
3.6 Neural Network ROI proposal
VMF was developed in combination with a 2D U-net CNN (ronneberger2015u) for simplicity, but it is suited to any other convolutional network. The U-net is a general encoder-decoder segmentation architecture with success in LV segmentation, as explained in Section 2.6. The CNN was trained with outputs obtained with VMF
during 30 epochs using an adaptive momentum optimizer, initial learning rate= 0.001, Nesterov = 0.9, -decay = 0.999 and the loss below, which is binary cross-entropy plus DSC score:
The VMF pipeline executes in the whole cardiac volume and, after cropping, the cropped ROI is passed to the segmentation CNN, which in this case is the 2D U-net.
To evaluate the VMF framework we used three CMR datasets specified across the short-axis orientation. The metadata in the datasets include: a) binary masks for the LV and RV; b) physiological parameters such as myocardium mass, ventricle area, volume, ejection fraction, thickness, and dimensions of structures; c) image acquisition parameters such as spatial resolution (mm), temporal resolution (frames per cardiac cycle) and slice gaps (mm). Not all of the datasets encompass the same information; following we provide more details, with a summary presented in Table 1.
LVSC - Cardiac Atlas Project (CAP) 2011 LV Segmentation Challenge (SUINESIAPUTRA2014). We used the 100 patients in the training set, which have LV masks on all the frames;
ACDC - MICCAI 2017 Automated Cardiac Diagnosis Challenge (Bernard2018), training dataset. It was created from clinical data, including sequences of 100 patients from the University Hospital of Dijon (France) over 6 years, it includes RV and LV masks in two frames;
M&Ms - MICCAI 2020 Multi-Centre, Multi-Vendor & Multi-Disease Cardiac Image Segmentation Challenge (saber2020multi). This database was collected from six hospitals in Spain, Canada, and Germany using several MRI scanners (Siemens, GE, Philips and Canon), it includes RV and LV masks in two frames. We used 320 patients for which the labels are publicly available.
In order to evaluate VMF, we analyzed the selected datasets by applying VMF, then compared the results of two CNNs: one base CNN on the raw images without VMF, and another CNN combined with the images processed by VMF. After the execution of VMF and cropping of ROIs, only the slices and frames with label masks of at least 25 pixels are passed to the U-net (i.e. stray markings are discarded). The labels are obtained by subtracting epicardium and endocardium masks, i.e. they contain the myocardium region. The datasets were used individually for training and validation of models, then combined in an all-versus-all scheme.
We compared three indices: (1) Recall, the proportion of the labels that was preserved in the VMF region proposals; it is also known as Sensitivity or True Positive Rate – TPR as defined by Equation 13, which aims to verify if the bounding boxes cover the labels entirely; (2) the Sørensen-Dice Coefficient – DSC as defined by Equation 14
, which is equivalent to the F1-score (average of Precision and Recall) that refers to the segmentation output when not-using vs using methodVMF; (3) speedup, the ratio of the times taken by the segmentation CNN when not-using vs using method VMF – it is defined as ().
Recall ranges from 0.0, when none of the marked voxels are detected; to 1.0, when 100% of the marked voxels are detected. DSC ranges from 0.0, for no intersection; to 1.0, for a perfect match between and . Speedup is a positive ratio with value 1.0 when both methods take the same time; 1.0 when VMF is faster; and 1.0 when VMF is slower.
4.2 Test Setup
The experiments ran in a rack-mounted machine with Intel i7-7700k CPU and NVIDIA Titan-Xp GPU. Our scripts were written in Python, scipy, keras/tensorflow, and matplotlib. The steps were: load an image; compute features; fit RBF; crop and resize both the input image and the label mask; Base CNN prediction on the original image andVMF-CNN prediction on the cropped image; then, evaluate the predicted masks against the labels using the aforementioned metrics.
Table 2 presents the results of VMF region proposals (Recall) and VMF-CNN combination (Dice score). After applying method VMF, the U-net CNN trained on dataset ACDC had the worst Dice scores, with a mean performance decrease of 8.3% when considering all the datasets. On the other hand, the model trained on dataset M&Ms was significantly improved when using the VMF ROI proposal, with a mean performance increase of +7.2 (percent points) considering all the datasets. With dataset LVSC, we observed a lower performance when validating over dataset LVSC (-3.5), but an improvement when validating with datasets ACDC and M&Ms (+4.0 and +2.5, respectively).
|Recall of 98.84%||M&Ms||67.6%||47.4%||-20.2|
|Recall of 99.75%||M&Ms||81.3%||82.2%||+0.9|
|Recall of 99.75%||M&Ms||67.0%||69.5%||+2.5|
|Recall of 99.69%||M&Ms||77.2%||83.0%||+5.8|
We applied the McNemar’s chi-squared test with continuity correction from package statsmodels 0.12.2, whose results asserted that the Base predicitons and predictions after VMF were significantly different, with (see Table 3).
|VMF (T)||VMF (F)|
From a practical perspective we also compared the training speed, which is paramount to larger experiments such as hyper-parameter search. Table 4 shows the training time of the networks without and with method VMF; in all the cases our method improved the training speed by 150%.
Based on our results, the main achievement of method VMF is the ability to identify and crop the ROIs with a very small error (Recall = 99.69%). Furthermore, by cropping the ROIs before the CNN training, we observed that the training process was remarkably accelerated, increasing the training speed by 150%. When considering metric DSC, we observed significant improvements when considering the entire bundle of experiments, as presented in Table 2 but, for dataset ACDC when used for training, the results were not promising.
This inefficiency with ACDC is possibly related to imaging artifacts and standard operating procedures used to label the ACDC images (refer to Figure 5). In general, VMF is able to automatically focus on the correct region and guide the segmentation CNN (Figure 5, P27). However, in some cases the CNN lost track of the LV when fed with a slightly displaced ROI (P32). In the rarest of cases the CNN is unable to detect anything, even when fed with a centered ROI (P91), and this is a known problem when segmenting the cardiac apex. These findings should be investigated in future work.
Most of these issues were subsequently reviewed in M&Ms (saber2020multi) by: a) complete coverage, including papillary muscles, of RV and LV; b) no interpolation of masks at the base; c) guarantees of larger RV surface in end-diastole compared to end-systole, excluding the pulmonary artery; d) sampling data from several hospitals, different patients and different equipment vendors. The outcome of these corrections is clear when comparing the results from training in ACDC and testing M&Ms (-20.2) vs training on M&Ms and testing on ACDC (+6.2). M&Ms is very likely to contain images following the same patterns of ACDC and others, but not the opposite.
Overall, method VMF demonstrated significant improvements in the task of ROI detection; from the results, it became evident that the performance depends on the training dataset and on the network model – image quality, format and intensities should be similar across datasets, and labeling standard operating procedures should be compatible.
In this paper, we proposed method VMF, a novel approach based on convolution operations and on the use of a Radial Basis Function to detect the ROIs in cardiac magnetic resonance images. We validated VMF with a U-net CNN comparing our results to those of the canonical U-net and of the VMF-CNN in three public reference datasets. According to our results, VMF was able to crop 99.69% (Recall metric) of the ROI voxels in all the datasets, being suitable to preprocess the data for CNN segmentation. VMF accelerated the training process by 150%, and also increased metric Sørensen-Dice Coefficient in the majority of our test cases (). Further improvement is possible by extended preprocessing of the training datasets, and change the U-net CNN after the ROI proposal step to a more advanced architecture with fine-tuning to the dataset specifics, specially in cases such as ACDC.
As future work, we intend to expand this pipeline considering possibilities: (1) embed this method as the first layer of a CNN, so that the parameters are adjusted automatically; (2) use other networks in the segmentation step, such as 3D Unet or Feature Pyramid Networks; (3) experiment with more datasets, preprocessing, and augmentation methods; (4) test different feature extractors for the static visual features, such as Gray-Level Co-occurrence Matrices and Gabor filters, other motion estimation methods, and Radial Basis Functions.
We thank Dr. Diego Cardenas and Dr. Thiago Costa for reviewing equations, Nvidia Corporation for donating the GPU used in this work, and USP, FAPESP, CAPES, CNPq, Zerbini Foundation and FoxConn for supporting this research.
Conflict of interest: The authors declare that they have no conflict of interest.
Research data: The datasets used in this research are available upon registration on https://www.ub.edu/mnms/, https://acdc.creatis.insa-lyon.fr/description/databases.html and https://www.cardiacatlas.org/challenges/lv-segmentation-challenge/