Recently, lung ultrasound (LUS) imaging has increased in popularity for rapid lung monitoring of patients in the intensive care unit (ICU). Particularly for dengue patients, LUS can capture image artefacts such as B-lines that indicate pulmonary abnormalities such as oedema and effusions. B-lines are bright lines extending distally from the surface of the lung, following the direction of propagation of the sound wave (shown in Figure 1). LUS imaging is useful for assessing lung abnormalities through the presence of B-lines. However, these lines become visible at random points of the respiratory cycle and only in the affected area; therefore, manually detecting these artefacts is challenging for inexperienced sonographers, particularly in low- and middle-income countries with a higher prevalence of these diseases, where training opportunities and expertise are scarce.
To provide an automatic solution to the LUS B-line detection problem, recent studies have proposed classification, segmentation and localization of B-line artefacts in individual LUS frames. For example, a convolutional neural network (CNN) followed by a class activation map was proposed by van Sloun et al. [4]. In another study, a single-shot CNN was used to localize B-lines with bounding boxes. Previous work by Kerdegari et al. showed that employing temporal information can improve the B-line detection task in LUS, leveraging a temporal attention mechanism to localize B-line frames within LUS videos.
Furthermore, attention mechanisms have been widely used for spatial localization of lung lesions, particularly in CT and x-ray lung images. For instance, a residual attention U-Net was proposed for multi-class segmentation of CT images and x-ray images. A lesion-attention deep neural network (LA-DNN) was presented to perform two tasks: B-line classification and multi-label attention localization of five lesions. All these studies employed spatial attention mechanisms for lung lesion localization.
LUS images are usually used in a standard Cartesian coordinate representation (i.e., scan-converted). In this representation, B-lines commonly appear densely in the middle of the frustum. Therefore, data preprocessing techniques such as downsampling might cause information loss in the Cartesian representation. Additionally, the radial direction that B-lines follow is known, but this information is not exploited. In this paper, we propose to use a polar representation to, first, reduce information loss when downsampling the data and, second, leverage prior knowledge about line formation by having one dimension aligned with the lines.
To this end, we compare the performance of the temporal attention-based convolutional+LSTM model proposed by Kerdegari et al. when using Cartesian and polar representations. In summary, the contribution of this paper is to investigate the effect of using an LUS polar coordinate representation on B-line detection and localization performance. We also evaluate the effect of different downsampling factors of the LUS video with polar and Cartesian representations for the B-line detection and localization tasks.
2 Model Architecture
This paper employs a model that combines a deep visual feature extractor (a CNN) with a long short-term memory (LSTM) network that can learn the temporal dynamics of videos, and a temporal attention mechanism that learns where in the video to pay more attention. Figure 2 shows the core of our model. The model works by passing each frame of the video through our CNN (the architecture details are shown in Figure 2, right) to produce a fixed-length feature vector representation. The CNN outputs are passed into a bidirectional LSTM network (16 hidden units, tanh activation function) as a recurrent sequence learning model. The LSTM outputs h_t are then passed to the attention network to produce an attention score e_t = W_a h_t for each attended frame, where W_a represents the attention layer weight matrix. From e_t, an importance attention weight α_t is computed for each attended frame via a softmax: α_t = exp(e_t) / Σ_k exp(e_k). To learn which frames of the video to pay attention to, the α_t are multiplied with the LSTM outputs. Finally, the video classification output is generated by averaging the attention-weighted temporal feature vectors over time and passing the result to a fully connected layer.
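As an illustrative sketch (not the trained implementation), the temporal attention step can be written in a few lines of NumPy; the number of frames, the feature dimension, and the single attention weight vector `w` are assumed values here:

```python
import numpy as np

def temporal_attention(h, w):
    """Temporal attention over per-frame LSTM outputs.

    h : (T, d) array of bidirectional-LSTM outputs, one row per frame.
    w : (d,) attention weight vector (learned during training in practice).
    Returns the importance weights and the attention-averaged feature.
    """
    e = h @ w                                    # attention score e_t per frame
    e = e - e.max()                              # numerical stability for softmax
    alpha = np.exp(e) / np.exp(e).sum()          # importance weights, sum to 1
    attended = (alpha[:, None] * h).sum(axis=0)  # weighted temporal average
    return alpha, attended

rng = np.random.default_rng(0)
h = rng.standard_normal((16, 32))                # e.g. 16 frames, 32-dim features
w = rng.standard_normal(32)
alpha, z = temporal_attention(h, w)
```

In the full model, `z` would then be passed to the fully connected classification layer.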
3 Experimental Setup
3.1 Dataset and Preprocessing
The dataset used in the experiments was collected at the Hospital for Tropical Diseases in Ho Chi Minh City, Vietnam. It includes about 5 hours of lung ultrasound video collected from 60 dengue patients. These videos were acquired using a Sonosite M-Turbo machine (Fujifilm Sonosite, Inc., Bothell, WA) with a low-medium frequency (3.5-5 MHz) convex probe. The Kigali ARDS protocol was applied as a standardised operating procedure at 6 points (2 anterior, 2 lateral and 2 posterolateral) on each side of the chest to perform the LUS exams.
The four-second LUS video clips were resized from their original resolution for training and fully anonymised through masking. A qualified sonographer annotated these clips using the VGG annotator tool. During the annotation procedure, each video clip was assigned either a B-line or non-B-line label. Further, B-line frames and B-line regions within the B-line video clips were annotated for later use in the temporal and spatial B-line localization tasks.
3.2 Polar Coordinate Representation
Like other common applications of ultrasound imaging, lung ultrasound images are normally presented in Cartesian coordinates (shown in Figure 3, left). In this representation, the information, particularly B-line artefacts, is concentrated towards the centre of the frustum. Therefore, when we downsample the LUS videos as input to our network, some information is lost. To overcome this limitation, we transform each video clip into its associated polar coordinate representation (shown in Figure 3, right). With the polar coordinate representation, information is expanded along the angle axis of the polar data; therefore, less information is lost when downsampling the data. Additionally, there is little information in the upper left and right corners (black areas) of the Cartesian coordinate representation. As a result, when these areas are removed in the polar coordinate representation, the network can concentrate on the areas of each frame where more useful information exists.
The polar representation is obtained by the following reparameterization, used to resample the Cartesian images into a polar grid using bilinear interpolation:

r = √(x² + y²),  θ = arctan(x / y),

where r is the depth, or radius (distance from the beam source to a pixel location), and θ is the angle measured from the y axis.
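As a sketch of this resampling, the Cartesian-to-polar conversion with bilinear interpolation can be implemented with SciPy; the apex (beam source) position, frustum half-angle, and grid sizes below are illustrative assumptions, not the paper's exact values:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def cartesian_to_polar(img, n_r, n_theta, apex, half_angle_deg):
    """Resample a scan-converted (Cartesian) ultrasound frame onto a
    polar (r, theta) grid using bilinear interpolation.

    img            : (H, W) grayscale frame.
    n_r, n_theta   : output samples along the depth and angle dimensions.
    apex           : (row, col) of the beam source in pixel coordinates.
    half_angle_deg : half-angle of the frustum, measured from vertical.
    """
    r_max = img.shape[0] - 1 - apex[0]           # usable depth in pixels
    r = np.linspace(0.0, r_max, n_r)
    theta = np.deg2rad(np.linspace(-half_angle_deg, half_angle_deg, n_theta))
    R, T = np.meshgrid(r, theta, indexing="ij")  # (n_r, n_theta) grids
    rows = apex[0] + R * np.cos(T)               # depth axis (from y axis)
    cols = apex[1] + R * np.sin(T)               # lateral axis
    return map_coordinates(img, [rows, cols], order=1)  # order=1: bilinear

frame = np.zeros((128, 128), dtype=float)
frame[64:, 60:68] = 1.0                          # a synthetic vertical bright band
polar = cartesian_to_polar(frame, n_r=64, n_theta=64,
                           apex=(0, 64), half_angle_deg=30)
```

Because the synthetic band is radial from the apex, it maps to a constant-angle column in the polar grid, which is exactly the alignment property exploited for B-lines.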
3.3 Implementation Details
Batch normalization was used for both the CNN and LSTM parts of the network. A batch size of 25, dropout of 0.2 and weight regularization were employed. Data augmentation was applied to the training data by adding horizontally-flipped frames. 5-fold cross-validation was used and the network converged after 60 epochs. The class imbalance was addressed by weighting the probability of drawing a sample by its relative class occurrence in the training set.
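The class-balancing scheme above, weighting each sample's draw probability by the inverse of its class frequency, can be sketched as follows (the label list is illustrative):

```python
import numpy as np

def sample_weights(labels):
    """Per-sample draw probabilities inversely proportional to class frequency,
    so each class contributes equally in expectation."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)       # occurrences of each class label
    w = 1.0 / counts[labels]           # rarer class -> larger weight
    return w / w.sum()                 # normalize to a probability distribution

# Toy example: three non-B-line clips (0) and one B-line clip (1).
p = sample_weights([0, 0, 0, 1])       # the lone class-1 sample gets p = 0.5
```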
4 Experiments and Results
To investigate the potential benefit of employing the polar representation and various video resolutions in the B-line detection task, we trained our model with Cartesian and polar representations using several input video resolutions. Furthermore, we reduced the depth size of the polar data to 32 and 16 samples while keeping the number of angular elements at 64 (hence maintaining angular resolution), therefore obtaining video resolutions of 32×64 and 16×64 for training.
An alpha value of 0.05 was selected as the statistical significance threshold. A Shapiro-Wilk test showed that all data were normally distributed.
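As an illustration of this statistical procedure (using made-up per-fold accuracies, not the paper's results), the Shapiro-Wilk normality check followed by a paired t-test can be run with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies for the two representations:
cartesian = np.array([0.80, 0.81, 0.79, 0.82, 0.83])
polar     = np.array([0.83, 0.84, 0.82, 0.85, 0.84])

# Check normality of the paired differences, then run a paired t-test.
diff = polar - cartesian
_, p_norm = stats.shapiro(diff)        # normality of the differences
t, p = stats.ttest_rel(polar, cartesian)
significant = p < 0.05                 # compare against the alpha threshold
```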
Our baseline video resolution achieved the highest performance for both polar and Cartesian representations. A paired t-test revealed that the performance of the polar data (83.5%) is significantly higher than that of the Cartesian data (81%) (t=2.776, p=0.017) in all cases with the same number of pixels. This demonstrates that the model can extract more information from a polar representation. When we decreased the video resolution, the performance dropped compared to the baseline, although the drop was less pronounced in polar images. At the intermediate resolution, a paired t-test showed a significant difference between the Cartesian and polar representations in the B-line detection task (t=1.035, p=0.028). However, this difference is not significant at the lowest resolution (t=-1.104, p=0.165), probably because the downsampling is too aggressive and B-lines become barely distinguishable in any representation. Furthermore, we decreased the depth size of the polar data to 32 and 16 to evaluate the contribution of depth information to B-line detection. Compared to the depth size of 64 in the baseline resolution, the performance decreased significantly for both depth sizes of 32 (t=2.835, p=0.008) and 16 (t=1.503, p=0.018).
Additionally, we investigated the impact of downsampling along scan-lines versus along angles. To do this, we compared two video resolutions that had the same number of pixels. Results showed that the resolution with 64 samples along the angle dimension has significantly higher performance (t=2.43, p=0.03), which indicates that preserving information along the angle dimension helps in this specific task, where artefacts are aligned along constant-angle lines.
We further evaluated B-line temporal localization accuracy using both data representations. We calculated the intersection over union (IoU) of the predicted temporally localized frames with their ground-truth annotations. Results are presented in Table 1. With the polar representation, the model is able to localize B-line frames temporally with higher performance than with the Cartesian representation. Additionally, the attention weights for true B-line frames are higher, and those for non-B-line frames lower, in the polar representation compared to the Cartesian one, further suggesting that the network learns to differentiate B-line and non-B-line frames better in the polar representation.
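The temporal IoU between predicted and annotated B-line frame indices can be computed as follows (the frame indices are illustrative):

```python
def temporal_iou(pred, gt):
    """IoU between predicted and ground-truth sets of B-line frame indices."""
    pred, gt = set(pred), set(gt)
    if not pred and not gt:
        return 1.0                     # both empty: perfect agreement
    return len(pred & gt) / len(pred | gt)

# Example: 3 of 5 distinct frames overlap -> IoU = 3/5.
iou = temporal_iou(pred=[3, 4, 5, 6], gt=[4, 5, 6, 7])
```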
5 Conclusion

This paper investigated the effect of employing an ultrasound polar coordinate representation on the LUS B-line detection and localization tasks. We employed an attention-based convolutional+LSTM model capable of extracting spatial and temporal features from LUS videos and localizing B-line frames using a temporal attention mechanism. We evaluated B-line classification and localization with this architecture using Cartesian and polar coordinate representations at different resolutions. On our LUS video dataset, results showed that the polar representation consistently outperforms the Cartesian one in terms of classification accuracy and temporal localization accuracy.
Our future work will explore a spatiotemporal attention mechanism able to detect B-line artefacts and localize them both spatially and temporally within LUS videos in polar coordinates. B-line spatial localization may help clinicians quantify the severity of the disease. Overall, these findings will assist the management of ICU patients with dengue, particularly in low- and middle-income countries where ultrasound operator expertise is limited.
The VITAL Consortium: OUCRU: Dang Trung Kien, Dong Huu Khanh Trinh, Joseph Donovan, Du Hong Duc, Ronald Geskus, Ho Bich Hai, Ho Quang Chanh, Ho Van Hien, Hoang Minh Tu Van, Huynh Trung Trieu, Evelyne Kestelyn, Lam Minh Yen, Le Nguyen Thanh Nhan, Le Thanh Phuong, Luu Phuoc An, Nguyen Lam Vuong, Nguyen Than Ha Quyen, Nguyen Thanh Ngoc, Nguyen Thi Le Thanh, Nguyen Thi Phuong Dung, Ninh Thi Thanh Van, Pham Thi Lieu, Phan Nguyen Quoc Khanh, Phung Khanh Lam, Phung Tran Huy Nhat, Guy Thwaites, Louise Thwaites, Tran Minh Duc, Trinh Manh Hung, Hugo Turner, Jennifer Ilo Van Nuil, Sophie Yacoub. Hospital for Tropical Diseases, Ho Chi Minh City: Cao Thi Tam, Duong Bich Thuy, Ha Thi Hai Duong, Ho Dang Trung Nghia, Le Buu Chau, Le Ngoc Minh Thu, Le Thi Mai Thao, Luong Thi Hue Tai, Nguyen Hoan Phu, Nguyen Quoc Viet, Nguyen Thanh Nguyen, Nguyen Thanh Phong, Nguyen Thi Kim Anh, Nguyen Van Hao, Nguyen Van Thanh Duoc, Nguyen Van Vinh Chau, Pham Kieu Nguyet Oanh, Phan Tu Qui, Phan Vinh Tho, Truong Thi Phuong Thao. University of Oxford: David Clifton, Mike English, Heloise Greeff, Huiqi Lu, Jacob McKnight, Chris Paton. Imperial College London: Pantellis Georgiou, Bernard Hernandez Perez, Kerri Hill-Cawthorne, Alison Holmes, Stefan Karolcik, Damien Ming, Nicolas Moser, Jesus Rodriguez Manzano. King’s College London: Alberto Gomez, Hamideh Kerdegari, Marc Modat, Reza Razavi. ETH Zurich: Abhilash Guru Dutt, Walter Karlen, Michaela Verling, Elias Wicki. Melbourne University: Linda Denehy, Thomas Rollinson.
-  Gino Soldati, Marcello Demi, and Libertario Demi, “Ultrasound patterns of pulmonary edema,” Annals of Translational Medicine, vol. 7, no. 1, 2019.
-  Christoph Dietrich et al., “Lung B-line artefacts and their use,” Journal of Thoracic Disease, vol. 8, no. 6, p. 1356, 2016.
-  Ruud JG van Sloun and Libertario Demi, “Localizing B-lines in lung ultrasonography by weakly supervised deep learning, in-vivo results,” IEEE JBHI, vol. 24, no. 4, pp. 957–964, 2019.
-  S. Roy et al., “Deep learning for classification and localization of COVID-19 markers in point-of-care lung ultrasound,” IEEE TMI, 2020.
-  S. Kulhare et al., “Ultrasound-based detection of lung abnormalities using single shot detection convolutional neural networks,” in MICCAI-PoCUS, pp. 65–73, 2018.
-  Hamideh Kerdegari et al., “Automatic detection of B-lines in lung ultrasound videos from severe dengue patients,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 989–993, 2021.
-  Xiaocong Chen, Lina Yao, and Yu Zhang, “Residual attention U-Net for automated multi-class segmentation of COVID-19 chest CT images,” arXiv:2004.05645, 2020.
-  Gusztáv Gaál, Balázs Maga, and András Lukács, “Attention U-Net based adversarial architectures for chest x-ray lung segmentation,” arXiv:2003.10304, 2020.
-  Bin Liu, Xiaoxue Gao, Mengshuang He, Fengmao Lv, and Guosheng Yin, “Online COVID-19 diagnosis with chest CT images: Lesion-attention deep neural networks,” medRxiv, 2020.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv:1409.0473, 2014.
-  E. D. Riviello et al., “Hospital incidence and outcomes of the acute respiratory distress syndrome using the Kigali modification of the Berlin definition,” Am. J. Respir. Crit. Care Med., vol. 193, no. 1, pp. 52–59, 2016.
-  Abhishek Dutta and Andrew Zisserman, “The VIA annotation software for images, audio and video,” in ACM Multimedia, 2019.