
Recurrent Aggregation Learning for Multi-View Echocardiographic Sequences Segmentation

Multi-view echocardiographic sequence segmentation is crucial for clinical diagnosis. However, this task is challenging due to limited labeled data, heavy noise, and large gaps across views. Here we propose a recurrent aggregation learning method to tackle this challenging task. Pyramid ConvBlocks efficiently extract multi-level and multi-scale features. Hierarchical ConvLSTMs then fuse these features and capture spatial-temporal information in multi-level and multi-scale space. We further introduce a double-branch aggregation mechanism for segmentation and classification, which are mutually promoted by deep aggregation of multi-level and multi-scale features: the segmentation branch provides information to guide the classification, while the classification branch affords multi-view regularization to refine the segmentations and further lessen the gaps across views. Our method is built as an end-to-end framework for segmentation and classification. Extensive experiments on our multi-view dataset (9000 labeled images) and the CAMUS dataset (1800 labeled images) show that our method achieves not only superior segmentation and classification accuracy but also prominent temporal stability.



1 Introduction

Multi-view echocardiographic sequence delineation provides important insight for clinical diagnosis. The pattern of cardiac structures and textures associated with deforming tissue can be observed in an echocardiographic sequence, whereas in single frames this information is often missing or incomplete [1]. An echocardiographic sequence also permits the assessment of wall motion and the identification of the end-diastolic (ED) and end-systolic (ES) phases. Cardiologists usually check multi-view echocardiographic sequences in clinical decision-making [2]. The apical-2-chamber (A2C), apical-3-chamber (A3C), and apical-4-chamber (A4C) views are the most commonly used views for left ventricle (LV) functional assessment. Most clinical indexes of the LV (e.g., area, volume, and ejection fraction) are measured in these standard apical views, and segmentation of the LV is generally a prerequisite for such quantitative analysis [3]. In clinical routine, quantitative analysis of the LV still involves careful review and massive manual interpretation by experts, a tedious and time-consuming task. Thus, automatic methods are desired to facilitate this process. However, multi-view echocardiographic sequence segmentation remains challenging, as illustrated in Fig. 1. First, the fuzzy borders, heavy noise, and abundant artifacts of echocardiographic images result in locally missing and incomplete anatomical structures. Second, multi-view heterogeneous data varies in anatomical structure, and image properties differ widely across vendors and centers. Third, in the sequence, artifacts and noise are much more severe, and the motion of the mitral valve, trabeculation, and papillary muscles poses additional interference. Finally, limited labeled data restricts the performance of supervised learning based methods.

Figure 1: Top left: multi-view samples (A2C, A3C, and A4C). Top right: A4C samples across vendors and centers. Bottom row: echocardiographic sequence.

The application scenarios of existing methods are limited and only suitable for specific situations: they mostly focus on a specific view [4], on single frames (i.e., without considering the sequence) [5], or on a single vendor and center [6]. As for sequence segmentation, existing methods try to leverage temporal information by using a deformable model combined with optical flow [7, 8] or by dynamically fine-tuning a pretrained CNN with the first frame's label through to the last frame [9]. The major downsides of these temporal methods are that they are computationally cumbersome and not end-to-end. Limited labeled data and specific application scenarios confine the performance of existing methods and lead to suboptimal solutions.

To achieve a unified model for multi-view echocardiographic sequence segmentation, we propose a recurrent aggregation learning method (RAL). The workflow is depicted in Fig. 2. Pyramid ConvBlocks combined with hierarchical ConvLSTMs capture multi-level and multi-scale spatial-temporal information, giving RAL the ability to harness knowledge across heterogeneous data (multi-view, multi-center, and multi-vendor). We further introduce a double-branch aggregation mechanism for segmentation and classification to lessen the gaps across multi-view data. Different from existing methods, RAL fully exploits long-term spatial-temporal information in an end-to-end manner and does not depend on any deformable model, optical flow, or pretrained segmentation model. RAL accommodates heterogeneous data, generates accurate segmentation results, classifies different views at the same time, and attains prominent temporal stability.

Figure 2: Workflow overview of our method.

2 Method

RAL is built as an end-to-end framework comprising three key components: the feature extraction module, the segmentation branch, and the classification branch (as depicted in Fig. 2). The feature extraction module consists of pyramid dilated dense convolution blocks (ConvBlocks). The segmentation branch contains a hierarchical recurrent architecture of multiple ConvLSTMs [10], while the classification branch involves a series of aggregation, downsampling, and fully connected layers.
Multi-Level and Multi-Scale Feature Extraction. We design a pyramid ConvBlocks architecture in the feature extraction module, which includes 5 ConvBlocks to extract multi-level and multi-scale features. Multi-level information provides the global geometric characteristic of the LV, while multi-scale information helps to strengthen thin and small regions and further refine the boundaries of the LV. Together they lessen the gap across views, vendors, and centers, increasing robustness to image conditions and anatomical structure variations. One ConvBlock contains densely connected dilated convolution layers as shown in Fig. 3, which expand the receptive field while preserving the resolution of the feature maps, whereas the transition layer changes the channels and resolution of the feature maps by convolution and pooling. The feedforward information propagation from preceding layers to layer $\ell$ can be formulated as

$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]),$$

where $x_\ell$ is the output of the $\ell$-th layer and $[x_0, x_1, \ldots, x_{\ell-1}]$ refers to the concatenation of the previous layers' outputs. $H_\ell(\cdot)$ is a composite function of three consecutive operations: batch normalization (BN), rectified linear unit (ReLU), and dilated convolution. The five ConvBlocks generate multi-level and multi-scale features $\{F_t^1, \ldots, F_t^5\}$ for frame $t$ in the sequence.

Pyramid ConvBlocks endow RAL with superior feature extraction ability and LV region detection capacity in multi-level and multi-scale space, further contributing to capturing the global geometric characteristic of the LV and establishing uniform semantic features. Thus RAL can detect and extract the LV accurately and robustly not only from ED and ES frames but also from other frames in the sequence where the boundary is unclear (disturbed by noise and other tissues; see the sequence samples in Fig. 1).
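As an illustration of the dense connectivity pattern above, here is a minimal NumPy sketch of a densely connected block. The composite function is reduced to a ReLU plus a 1x1 projection as a stand-in for the real BN/ReLU/dilated-convolution stack; all names, shapes, and the growth rate are hypothetical, not the paper's configuration.

```python
import numpy as np

def composite_fn(x, out_channels, dilation):
    """Stand-in for H_l: BN -> ReLU -> dilated conv, reduced here to a
    ReLU plus a 1x1 channel projection so the sketch stays dependency-free."""
    x = np.maximum(x, 0.0)                      # ReLU
    c_in = x.shape[0]
    rng = np.random.default_rng(dilation)       # deterministic toy weights
    w = rng.standard_normal((out_channels, c_in)) / np.sqrt(c_in)
    return np.einsum('oc,chw->ohw', w, x)       # 1x1 convolution

def dense_block(x0, num_layers=6, growth=4, dilation=2):
    """Densely connected block: layer l sees the concatenation of all
    preceding outputs [x_0, ..., x_{l-1}] along the channel axis."""
    feats = [x0]
    for l in range(num_layers):
        x_in = np.concatenate(feats, axis=0)    # channel-wise concat
        feats.append(composite_fn(x_in, growth, dilation))
    return np.concatenate(feats, axis=0)

x0 = np.ones((3, 8, 8))                         # (channels, H, W) toy input
out = dense_block(x0)
print(out.shape)                                # channels grow: 3 + 6*4 = 27
```

Note how the spatial size (8, 8) is preserved throughout the block, matching the text's point that dilation expands the receptive field without reducing feature-map resolution.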

Figure 3: Left: Dilated dense convolution block. Right: Hierarchical ConvLSTMs for spatial-temporal modeling.

Recurrent Features Fusion for Spatial-Temporal Modeling. For sequence segmentation, capturing the LV characteristic over time is essential for temporal stability. Recent studies based on LSTMs have shown a great ability to learn sequential information. Inspired by [11, 12], we construct hierarchical ConvLSTMs for long-term spatial-temporal modeling as depicted in Fig. 3. We add recurrence in the temporal domain to generate the prediction for each frame in the sequence, which carries the LV information forward from previous frames to following frames and naturally allows matching between consecutive frames. Additionally, we add recurrence in the spatial domain for multi-level and multi-scale feature fusion, which helps to integrate these features efficiently.

The output of the ConvLSTM at level $l$ and frame $t$ depends on the following variables: (1) the level-$l$, scale-$l$ feature $F_t^l$ from the feature extraction module; (2) the output $O_t^{l-1}$ of the preceding ConvLSTM at the same frame; (3) the output $O_{t-1}^l$ from the ConvLSTM of the previous frame; (4) the hidden state $H_t^{l-1}$ from the preceding ConvLSTM at the same frame, which is the spatial hidden state; (5) the hidden state $H_{t-1}^l$ from the ConvLSTM of the previous frame, which is the temporal hidden state. The information flow can be formulated as

$$(O_t^l, H_t^l) = \mathrm{ConvLSTM}^l\big([\,\mathcal{U}(F_t^l),\ O_t^{l-1},\ O_{t-1}^l\,],\ H_t^{l-1},\ H_{t-1}^l\big),$$

where $\mathcal{U}$ is the bilinear upsampling operator. At each time step, every ConvLSTM accepts the hidden states and encoded spatial-temporal features from the preceding ConvLSTM and the previous frame, together with the corresponding extracted feature from the feature extraction module; it then outputs encoded spatial-temporal features to the next ConvLSTM and the next frame. Finally, predictions are generated by the last ConvLSTM at every frame.
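The recurrence over both levels and frames can be sketched as follows. The cell here is a toy elementwise update, not a real gated ConvLSTM; the sketch only shows which outputs and hidden states feed which cell, with all shapes and weights hypothetical.

```python
import numpy as np

def toy_cell(feat, prev_out_same_frame, prev_out_prev_frame,
             h_spatial, h_temporal):
    """Stand-in for one ConvLSTM cell: it only shows which inputs are
    mixed, not real gating. A real cell applies convolutional LSTM gates."""
    h = 0.2 * (feat + prev_out_same_frame + prev_out_prev_frame
               + h_spatial + h_temporal)
    return np.tanh(h), h                         # (output, hidden state)

def hierarchical_convlstm(features):
    """features[t][l]: feature map of level l at frame t (all the same
    shape here; in the paper, coarser levels are upsampled before fusion).
    Recurrence runs over both levels (spatial) and frames (temporal)."""
    T, L = len(features), len(features[0])
    zero = np.zeros_like(features[0][0])
    outs = [[None] * L for _ in range(T)]
    hids = [[None] * L for _ in range(T)]
    for t in range(T):
        for l in range(L):
            o_prev_l = outs[t][l - 1] if l > 0 else zero  # same frame, preceding level
            o_prev_t = outs[t - 1][l] if t > 0 else zero  # previous frame, same level
            h_sp = hids[t][l - 1] if l > 0 else zero      # spatial hidden state
            h_tp = hids[t - 1][l] if t > 0 else zero      # temporal hidden state
            outs[t][l], hids[t][l] = toy_cell(
                features[t][l], o_prev_l, o_prev_t, h_sp, h_tp)
    return [outs[t][-1] for t in range(T)]       # last cell's output per frame

seq = [[np.full((4, 4), 0.5) for _ in range(3)] for _ in range(5)]  # 5 frames, 3 levels
preds = hierarchical_convlstm(seq)
print(len(preds), preds[0].shape)
```

The nested loop makes the two recurrence directions explicit: the inner loop threads information up the level hierarchy within a frame, and the outer loop carries it forward across frames.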
Double-Branch Aggregation Learning. To further lessen the gaps across multiple views and refine the multi-view segmentation results, we introduce a double-branch aggregation mechanism for simultaneous segmentation and classification of multi-view echocardiographic sequences as depicted in Fig. 2. The feature from the last ConvBlock is sent to the classification branch, where it goes through successive convolution and pooling operators to deeply aggregate with the multi-level and multi-scale spatial-temporal features from the segmentation branch. Finally, the classification result is produced by fully connected layers.

The segmentation branch generates multi-view segmentations while the classification branch discriminates the specific view. They are mutually promoted by deep aggregation of multi-level and multi-scale spatial-temporal features: the segmentation branch provides multi-level and multi-scale spatial-temporal information to guide the classification, while the classification branch affords multi-view discriminative regularization to refine the segmentation results and further lessen the gaps across views. This double-branch aggregation mechanism endows RAL with an outstanding ability to adapt to complex variations of the anatomical structure.

Additionally, we propose an aggregation loss to dynamically facilitate the communication between the segmentation branch and the classification branch as illustrated in Fig. 2. The aggregation loss comprises the segmentation loss and the classification loss. The segmentation loss is a combination of the binary cross-entropy loss and the Dice loss, while the classification loss is the categorical cross-entropy loss. Thus the aggregation loss function can be formulated as

$$\mathcal{L}_{agg} = \lambda_1\big(\mathcal{L}_{bce}(y, \hat{y}) + \mathcal{L}_{dice}(y, \hat{y})\big) + \lambda_2\,\mathcal{L}_{ce}(z_v, \hat{z}_v),$$

where $y$ and $\hat{y}$ denote the ground truth and prediction of segmentation respectively, $z_v$ and $\hat{z}_v$ refer to the ground truth and prediction of classification, and $v$ indicates the type of view. Besides, $\lambda_1$ and $\lambda_2$ are the corresponding balance coefficients, both chosen empirically during the training process.

Our dataset:
  Vendor   Machine    Patients  Sequences  Images
  Philips  EPIQ 7C    60        180        5400
  GE       VIVID E9   20        60         1800
  Philips  IE33       20        60         1800
  Total               100       300        9000
  Views: 100 A2C, 100 A3C, and 100 A4C sequences; split: 240 training / 60 testing (300 total).

CAMUS dataset:
  Vendor: GE; machine: VIVID E95.
  Images: 900 A2C and 900 A4C; split: 1600 training / 200 testing (1800 total).
Table 1: Specifications of our dataset (top) and the CAMUS dataset (bottom).

3 Experiments

Datasets. To validate the effectiveness of RAL, we built a large multi-view echocardiographic sequences dataset acquired with various vendor machines at three centers (The Second People's Hospital of Shenzhen, The Third People's Hospital of Shenzhen, and Peking University First Hospital). We further evaluate RAL on the public CAMUS dataset [6]. Our dataset contains 300 sequences from 3 views, and every sequence includes 30 frames; all 9000 frames were labeled by two experts. Fig. 4 presents A2C, A3C, and A4C sequence samples segmented by RAL and by experts. The CAMUS dataset, acquired from a single vendor and center, only contains manual labels at the ED and ES frames. Table 1 shows the specifications of the two datasets.
Evaluation Metrics. Accuracy, Dice, Mean Absolute Distance (MAD), and Hausdorff Distance (HD) are used to measure the segmentation results. We further evaluate segmentation performance with the ED volume, ES volume, and ejection fraction on the CAMUS dataset, using the output of RAL to compute clinical indices according to standard guidelines [3].
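For reference, here are simplified NumPy versions of the reported geometric metrics: a mask-based Dice coefficient and point-set MAD/Hausdorff distances. This is a sketch under simplifying assumptions (contours as 2D point sets in pixel units); clinical pipelines typically extract contours and convert to millimeters first.

```python
import numpy as np

def dice(a, b):
    """Dice overlap of two binary masks."""
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / (a.sum() + b.sum())

def _directed(p, q):
    """For each point in p, the distance to its nearest point in q."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    return d.min(axis=1)

def mad(p, q):
    """Mean absolute (symmetric mean surface) distance between contours."""
    return 0.5 * (_directed(p, q).mean() + _directed(q, p).mean())

def hausdorff(p, q):
    """Symmetric Hausdorff distance between two contour point sets."""
    return max(_directed(p, q).max(), _directed(q, p).max())

a = np.zeros((8, 8), bool); a[2:6, 2:6] = True   # toy prediction mask
b = np.zeros((8, 8), bool); b[3:7, 2:6] = True   # toy ground-truth mask
print(round(dice(a, b), 3))                      # 0.75: 12 shared pixels of 16+16

pa = np.array([[0.0, 0.0], [1.0, 0.0]])          # two parallel toy contours
pb = np.array([[0.0, 1.0], [1.0, 1.0]])
print(mad(pa, pb), hausdorff(pa, pb))            # both 1.0 here
```

Dice rewards region overlap, MAD averages boundary error, and Hausdorff reports the worst-case boundary error, which is why the paper reports all three alongside pixel Accuracy.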
Implementation Details. All images are resized to a fixed resolution for computational efficiency. We employ Adam with a learning rate of 0.001 as the optimizer. Each of the 5 ConvBlocks uses its own set of dilation rates, and every ConvBlock contains 6 layers. Besides, a dynamic decay mechanism is utilized to reduce the learning rate by monitoring the change of Dice. Ten-fold cross-validation was utilized to provide an unbiased estimation.
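The Dice-monitored learning-rate decay can be sketched as a plateau scheduler. The decay factor, patience, and improvement threshold below are assumed values for illustration, not the ones used in the paper.

```python
class DicePlateauDecay:
    """Halve the learning rate when the validation Dice has not improved
    for `patience` epochs (a sketch of the dynamic decay described above;
    factor, patience, and min_delta are assumed values)."""

    def __init__(self, lr=1e-3, factor=0.5, patience=3, min_delta=1e-4):
        self.lr, self.factor = lr, factor
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad_epochs = -1.0, 0

    def step(self, val_dice):
        if val_dice > self.best + self.min_delta:
            self.best, self.bad_epochs = val_dice, 0   # improvement: reset
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:       # plateau: decay lr
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = DicePlateauDecay(lr=1e-3)
for d in [0.80, 0.85, 0.85, 0.85, 0.85]:               # Dice stalls after epoch 2
    lr = sched.step(d)
print(lr)                                              # decayed once to 5e-4
```

Monitoring Dice directly (rather than the loss) ties the schedule to the metric the segmentation is ultimately judged on.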

Figure 4: The LV contours of multi-view sequences segmented by our method (red) and by experts (green). Ten frames are selected from every sequence to fit the layout. (Top row: A2C; middle row: A3C; bottom row: A4C.)
Configurations       Accuracy     Dice         HD(mm)     MAD(mm)    Classification
full                 0.987±0.005  0.919±0.040  5.87±3.46  2.90±1.49  0.933
w/o classification   0.971±0.008  0.910±0.049  5.99±3.69  3.10±1.66  -
w/o ConvBlock        0.963±0.015  0.907±0.057  6.21±4.95  3.27±1.85  0.867
w/o temporal         0.955±0.019  0.896±0.062  6.64±5.04  3.51±1.93  0.917
w/o spatial          0.968±0.011  0.911±0.054  6.03±4.16  3.08±1.71  0.883
Table 2: Ablation results of our method under different configurations (mean±std).

Ablation Study. We evaluate our method under different configurations to corroborate the necessity of every component in RAL. The classification branch, ConvBlock, spatial modeling, and temporal modeling are removed respectively. Table 2 shows the ablation results: full RAL achieves higher mean values of Accuracy and Dice, lower mean values of HD and MAD, and lower standard deviations on all metrics compared with the other configurations. RAL also achieves the best classification accuracy (0.933). Every single component brings an important improvement for LV segmentation, especially adding recurrence in the temporal domain.

Comparison Study I: Geometrical. We compare RAL with U-Net, ACNN, and U-Net++ on our multi-view sequences dataset. As shown in Table 3, RAL outperforms the other methods on all metrics, achieving the highest mean values of Accuracy (0.987) and Dice (0.919), the lowest mean values of HD (5.87 mm) and MAD (2.90 mm), and significantly lower standard deviations on all metrics. These results demonstrate that RAL attains the best region coverage, the highest contour accuracy, and the minimum distance error when processing multi-view echocardiographic sequences across multiple vendors and centers.

Methods   Accuracy     Dice         HD(mm)     MAD(mm)
RAL       0.987±0.005  0.919±0.040  5.87±3.46  2.90±1.49
U-Net     0.942±0.030  0.883±0.068  8.94±6.87  3.72±1.87
ACNN      0.959±0.013  0.893±0.061  7.70±6.58  3.40±1.57
U-Net++   0.937±0.032  0.880±0.072  9.01±7.14  3.86±2.01
Table 3: Geometrical comparison results on our multi-view echocardiographic sequences dataset (mean±std).
Figure 5: Left: Mean of Accuracy, Dice, HD, and MAD at different frames of the cardiac cycle. Right: Bland–Altman analysis of the ejection fraction calculated from automatic segmentations versus manual labels.

Comparison Study II: Clinical. We compare RAL with U-Net, ACNN, and U-Net++ on the CAMUS dataset to calculate clinical indices. As shown in Table 4, RAL obtains high correlation scores (0.952 for EDV, 0.960 for ESV, and 0.839 for EF), reasonably small biases and standard deviations, and relatively low mean absolute errors (8.8 ml for EDV, 7.1 ml for ESV, and 5.0% for EF). Fig. 5 presents a more intuitive view via the Bland–Altman plot: 94% of the measurements lie within ±1.96 standard deviations of the mean difference. These results reveal the clinical potential of RAL.
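The ejection fraction and the Bland–Altman statistics above follow standard formulas; a small sketch (the measurements below are made-up toy numbers, not values from the study):

```python
import numpy as np

def ejection_fraction(edv, esv):
    """EF (%) from end-diastolic and end-systolic volumes (ml)."""
    return 100.0 * (edv - esv) / edv

def bland_altman(auto, ref):
    """Bias and 1.96-SD limits of agreement between two measurement sets."""
    diff = np.asarray(auto) - np.asarray(ref)
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)                # limits of agreement half-width
    return bias, (bias - loa, bias + loa)

ef = ejection_fraction(120.0, 48.0)              # EDV=120 ml, ESV=48 ml -> ~60%
auto = [58.0, 61.0, 55.0, 63.0]                  # toy automatic EF estimates (%)
ref = [60.0, 60.0, 57.0, 62.0]                   # toy manual references (%)
bias, (lo, hi) = bland_altman(auto, ref)
print(round(ef, 1), round(bias, 2))
```

Points falling inside the (lo, hi) band correspond to the "within ±1.96 standard deviations" agreement reported for Fig. 5.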
Temporal Stability. We compute the mean of Accuracy, Dice, HD, and MAD at different frames of all echocardiographic sequences and then observe the volatility of each metric to assess temporal stability. As shown in Fig. 5, RAL achieves stable mean values on all four metrics over the cardiac cycle, with only moderate fluctuation in the middle of the sequence. This indicates that the spatial-temporal modeling of RAL is effective: RAL achieves not only superior segmentation accuracy but also good coherence across consecutive frames in the sequence.
Limitation. In Fig. 5, from the ED to ES frames, we observe that Accuracy and Dice decay slightly while HD and MAD increase mildly; all metrics remain relatively stable during diastole but show weak recoverability. The sequential process carries errors forward, resulting in an accumulation of temporal errors over the cardiac cycle. Fortunately, the fluctuation is moderate and the worst results are still fairly good. This limitation could be alleviated via bidirectional LSTMs.
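The bidirectional remedy mentioned above can be illustrated with a toy recurrent pass run in both directions and averaged. This is only a sketch of the idea; an actual fix would use bidirectional ConvLSTM layers, and the step function here is a hypothetical stand-in.

```python
import numpy as np

def forward_pass(frames, step):
    """Toy recurrent pass: each prediction mixes the current frame with the
    previous prediction, so errors can accumulate along the sequence."""
    preds, prev = [], np.zeros_like(frames[0])
    for f in frames:
        prev = step(f, prev)
        preds.append(prev)
    return preds

def bidirectional(frames, step):
    """Average forward and backward passes so late frames also receive
    context from the end of the cycle, damping one-way error drift."""
    fwd = forward_pass(frames, step)
    bwd = forward_pass(frames[::-1], step)[::-1]
    return [0.5 * (a + b) for a, b in zip(fwd, bwd)]

step = lambda f, prev: 0.7 * f + 0.3 * prev      # hypothetical recurrent update
frames = [np.full((2, 2), float(t)) for t in range(4)]
out = bidirectional(frames, step)
print(len(out), out[0].shape)
```

Because each frame's final prediction draws on both past and future context, a single early error no longer propagates unchecked to the end of the cardiac cycle.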

Methods         EDV                          ESV                          EF
          corr   bias(ml)     mae(ml)  corr   bias(ml)    mae(ml)  corr   bias(%)   mae(%)
RAL       0.952  -7.5±11.0    8.8      0.960  -3.8±9.2    7.1      0.839  -0.9±6.8  5.0
U-Net     0.954  -6.9±11.8    9.8      0.964  -3.7±9.0    6.8      0.823  -1.0±7.1  5.3
ACNN      0.945  -6.7±12.9    10.8     0.947  -4.0±10.8   8.3      0.799  -0.8±7.5  5.7
U-Net++   0.946  -11.4±12.9   13.2     0.952  -5.7±10.7   8.6      0.789  -1.8±7.7  5.6
Table 4: Clinical comparison results on the CAMUS dataset. (EDV: ED volume; ESV: ES volume; EF: ejection fraction; corr: Pearson correlation; mae: mean absolute error; bias given as mean±std)

4 Conclusion

In this paper, we present a recurrent aggregation learning method to exploit long-term spatial-temporal information for simultaneous segmentation and classification of multi-view echocardiographic sequences. Multi-level and multi-scale features are recurrently aggregated in both the spatial and temporal domains for effective spatial-temporal modeling. A double-branch aggregation mechanism further brings multi-view discriminative regularization to refine the segmentation results. Extensive geometrical and clinical experiments demonstrate that RAL achieves not only superior segmentation and classification accuracy and prominent temporal stability but also high correlations on clinical indices.

Acknowledgment. This work is funded by the Shenzhen Basic Research Program (JCYJ20170818164343304, JCYJ20180507182432303).


  • [1] Huang, X., et al.: Contour tracking in echocardiographic sequences via sparse representation and dictionary learning. Medical image analysis 18(2), 253–271 (2014)
  • [2] Madani, A., et al.: Fast and accurate view classification of echocardiograms using deep learning. npj Digital Medicine 1(1), 6 (2018)
  • [3] Lang, R.M., et al.: Recommendations for cardiac chamber quantification by echocardiography in adults: an update from the american society of echocardiography and the european association of cardiovascular imaging. European Heart Journal-Cardiovascular Imaging 16(3), 233–271 (2015)
  • [4] Carneiro, G., et al.: The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivative-based search methods. IEEE Transactions on Image Processing 21(3), 968–982 (2012)
  • [5] Chen, H., et al.: Iterative multi-domain regularized deep learning for anatomical structure detection and segmentation from ultrasound images. In: MICCAI. pp. 487–495. Springer (2016)
  • [6] Leclerc, S., et al.: Deep learning for segmentation using an open large-scale dataset in 2d echocardiography. IEEE transactions on medical imaging (2019)
  • [7] Pedrosa, J., et al.: Fast and fully automatic left ventricular segmentation and tracking in echocardiography using shape-based b-spline explicit active surfaces. IEEE transactions on medical imaging 36(11), 2287–2296 (2017)
  • [8] Zhang, N., et al.: Deep learning for diagnosis of chronic myocardial infarction on nonenhanced cardiac cine mri. Radiology 291(3), 606–617 (2019)
  • [9] Yu, L., et al.: Segmentation of fetal left ventricle in echocardiographic sequences based on dynamic convolutional neural networks. IEEE Transactions on Biomedical Engineering 64(8), 1886–1895 (2017)
  • [10] Xingjian, S., et al.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In: NIPS. pp. 802–810 (2015)

  • [11] Chen, J., et al.: Multiview two-task recursive attention model for left atrium and atrial scars segmentation. In: MICCAI. pp. 455–463. Springer (2018)

  • [12] Yang, G., et al.: Multiview sequential learning and dilated residual learning for a fully automatic delineation of the left atrium and pulmonary veins from late gadolinium-enhanced cardiac mri images. In: EMBC. pp. 1123–1127. IEEE (2018)