Fractures at the wrist and elbow are common in children and are usually diagnosed in Emergency Departments (ED) using X-ray examination. This involves radiation exposure, which is highly undesirable considering that fractures are not found in many cases. Ultrasound examination is a much safer alternative that is highly sensitive to cortical disruption and could potentially reveal fractures earlier than X-rays. Various studies have validated the feasibility of using ultrasound examination for diagnosing pediatric distal radial fractures. A meta-analysis of 1204 patients reported 97 percent sensitivity and 95 percent specificity for ultrasound examination.
However, in clinical practice, acquiring high-quality wrist ultrasound scans requires many hours of training, which is challenging in the ED. Scanning is often performed by non-experts, resulting in low-quality images that are hard to interpret. 3D ultrasound (3DUS) addresses this limitation in part by making it easier for novice users to acquire images of adequate quality when compared to 2D ultrasound. It also depicts the fracture more completely by including regions on either side of a fracture, which could be useful for treatment planning of more complex anatomy. However, manual assessment of both 3DUS and 2DUS is still subjective, resulting in high variability between readers. Automated assessment would eliminate such variation and lead to wider use of ultrasound in fracture detection.
The presence of noise artefacts and blurred image boundaries makes automatic interpretation of ultrasound more difficult when compared to MRI, CT and X-ray. Images of bony structures such as the wrist are even harder to interpret due to the effects of beamwidth and shadowing, which significantly alter the pixel intensities around the bone. Compared to intensity-based approaches, instantaneous local phase (LP) generally provides more robust information on the underlying structures in noisy ultrasound images. Local phase information alone might not be sufficient to localize the bone, as there could be similarly echogenic regions in close proximity. LP filtering would also lose some of the useful information in the original B-mode image. Recently, convolutional neural network (CNN) models that fuse LP-filtered images with the original B-mode image have been used for bone segmentation.
A major limitation of supervised models like CNNs is the need for a large amount of labeled data, which is tedious, time-consuming and expensive to obtain since labeling requires medical expertise. Even where labeled data is available, there could be variability between manual annotations, leading to uncertainty in the ground truth. Over recent years, semi-supervised and unsupervised models have gained prominence in image analysis as they significantly reduce the dependence on labeled data. These models exploit spatial and temporal dependencies in the data to generate low-dimensional representations. In video image analysis, unsupervised learning based on automatically detected keypoints offers an explainable framework to learn concise geometric representations [15, 16]. The transporter framework proposed in [15, 16] automatically identifies keypoints in a video sequence by transporting features from a target frame. The loss function minimizes the Mean Square Error (MSE) between the frame reconstructed from the transported feature map and the target image.
In this paper, we propose an unsupervised transporter neural network to detect keypoints from wrist ultrasound images. A key contribution of our approach is the integration of a bone probability map that allows the model to focus on specific regions in the image filtered by local phase at multiple scales. We use a new Acoustic Feature Fusion Convolutional Neural Network (FF-CNN) to generate features relevant to ultrasound based on the bone probability map. This would be the first work to use unsupervised keypoint detection in ultrasound sweeps.
An overview of the proposed neural network architecture is shown in Figure 1. For each frame in the 3DUS video, the Acoustic FF-CNN and KeyNet generate a feature map and a point map, respectively. Using the transportation technique described in [15, 16], features from the feature map of a target frame at its keypoint locations are embedded into the feature map of the other frame of the pair at the keypoint locations given by the point map. A generative model (RefineNet) uses this augmented feature map to reconstruct the target frame.
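The transport step can be illustrated with a small sketch that follows the formulation in the Transporter literature [15, 16]: source features are suppressed wherever either frame has a keypoint, and target features are pasted in at the target keypoints. The function name `transport`, the single-channel nested-list representation and the toy values are illustrative assumptions and not the authors' implementation.

```python
def transport(src_feat, tgt_feat, src_heat, tgt_heat):
    """Transport target features into the source feature map.

    All arguments are 2D lists of floats with identical shape.
    The heat maps are assumed to take values in [0, 1].
    """
    out = []
    for fs_row, ft_row, hs_row, ht_row in zip(src_feat, tgt_feat, src_heat, tgt_heat):
        out.append([
            # Suppress source features at source and target keypoints,
            # then add the target features at the target keypoints.
            (1.0 - hs) * (1.0 - ht) * fs + ht * ft
            for fs, ft, hs, ht in zip(fs_row, ft_row, hs_row, ht_row)
        ])
    return out


# Toy example: one row, source keypoint at column 0, target keypoint at column 2.
src_feat = [[1.0, 1.0, 1.0]]
tgt_feat = [[5.0, 5.0, 5.0]]
src_heat = [[1.0, 0.0, 0.0]]
tgt_heat = [[0.0, 0.0, 1.0]]
transported = transport(src_feat, tgt_feat, src_heat, tgt_heat)
```

In this toy case the source feature is erased at the source keypoint, left unchanged away from any keypoint, and replaced by the target feature at the target keypoint.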
2.1 Time Gain Attenuation (TGA)
We use TGA to pre-process the image and suppress unwanted bright regions, which generally occur due to acoustic reflection. TGA applied to an image can be represented using a depth-dependent decay mask whose value at a given depth depends on an exponential attenuation factor.
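A minimal sketch of a depth-dependent exponential mask is given below. The exact functional form and the way the mask is combined with the image are not recoverable from the text, so the sketch assumes a per-row weight exp(-alpha * d), with depth d normalised to [0, 1], applied by element-wise multiplication. The names `decay_mask`, `apply_mask` and `alpha` are hypothetical.

```python
import math


def decay_mask(num_rows, alpha):
    """Per-row weights exp(-alpha * d), with depth d normalised to [0, 1]."""
    if num_rows == 1:
        return [1.0]
    return [math.exp(-alpha * r / (num_rows - 1)) for r in range(num_rows)]


def apply_mask(image, alpha):
    """Multiply every pixel in a row by that row's weight (assumed combination)."""
    mask = decay_mask(len(image), alpha)
    return [[w * px for px in row] for w, row in zip(mask, image)]


# Toy image with 3 depth rows and 2 columns of constant intensity.
image = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
compensated = apply_mask(image, alpha=2.0)
```

Under this assumed form, a positive `alpha` attenuates deeper rows, while a negative `alpha` reduces the weight of shallow rows relative to deeper ones.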
2.2 Bone Probability Map
We extract local phase information from the TGA-compensated image using a Gabor filter bank with a scaling parameter that can be varied to capture image features at different resolutions.
As shown in Figure 1, the Gabor filter bank analyzes the image in various frequency bands. We use monogenic signal analysis to generate the bone probability map. In monogenic analysis, we model an image using a combination of amplitude and phase.
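As an illustration of the amplitude and phase decomposition, the sketch below computes local amplitude, phase and orientation at a single pixel from a band-passed value and its two Riesz-transform responses, following the standard monogenic signal definition. The function name `monogenic_features` and the scalar inputs are illustrative assumptions rather than the authors' implementation; in practice the three inputs would be filter responses computed over the whole image.

```python
import math


def monogenic_features(f, r1, r2):
    """Local amplitude, phase and orientation from monogenic components.

    f  : band-passed (even) response at one pixel
    r1 : first Riesz component (odd response along one axis)
    r2 : second Riesz component (odd response along the other axis)
    """
    odd = math.hypot(r1, r2)          # magnitude of the odd part
    amplitude = math.hypot(f, odd)    # local energy
    phase = math.atan2(odd, f)        # 0 for ridges, pi/2 for edges
    orientation = math.atan2(r2, r1)  # direction of the odd response
    return amplitude, phase, orientation


amplitude, phase, orientation = monogenic_features(1.0, 1.0, 1.0)
```

The even response can be recovered as `amplitude * math.cos(phase)`, which is the sense in which the image is modelled by amplitude and phase.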
The corresponding local phase filtered image can be written in tensor representation using symmetric features, asymmetric features and instantaneous phase.
The tensors of symmetric and asymmetric features are computed using Hessian, gradient and Laplacian operations.
We calculate three monogenic signals by applying the Riesz transform to the local phase tensor (LPT) image. Using these monogenic signals, we define the local phase, feature symmetry and pixel-wise integrated backscatter map.
The integrated backscatter map is computed along the row direction. Finally, we combine these to generate a bone probability map.
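The integrated backscatter idea can be sketched as a running sum of squared pixel values. The exact definition used here is not recoverable from the text, so the sketch assumes accumulation from the first row downwards within each column, which is a common form in the local phase literature. The function name `integrated_backscatter` is hypothetical.

```python
def integrated_backscatter(image):
    """Cumulative sum of squared pixel values from the top row downwards."""
    num_cols = len(image[0])
    running = [0.0] * num_cols
    result = []
    for row in image:
        # Build a new list each time so earlier rows are not modified.
        running = [acc + px * px for acc, px in zip(running, row)]
        result.append(running)
    return result


ibs = integrated_backscatter([[1.0, 2.0], [3.0, 0.0], [0.0, 1.0]])
```

Each output pixel then summarises how much signal energy lies at or above it, so values stop growing in regions where little energy is returned.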
2.3 Acoustic FF-CNN
The Acoustic FF-CNN combines the multi-channel images generated with the bone probability maps at different scales. Since the input consists of multi-channel, multi-resolution images, the Acoustic FF-CNN can be considered a multi-resolution encoding network.
The KeyNet uses the same input as the FF-CNN and generates a Gaussian heat map of keypoints.
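A Gaussian heat map for a single keypoint can be sketched as follows. The isotropic form, the fixed `sigma` and the function name `gaussian_heatmap` are assumptions made for illustration; the text does not specify how the heat map is parameterised.

```python
import math


def gaussian_heatmap(height, width, center_row, center_col, sigma):
    """Isotropic Gaussian with peak value 1 at (center_row, center_col)."""
    return [
        [
            math.exp(-((r - center_row) ** 2 + (c - center_col) ** 2) / (2.0 * sigma ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]


heatmap = gaussian_heatmap(5, 5, 2.0, 2.0, sigma=1.0)
```

The map equals 1 at the keypoint and falls off smoothly with distance, which keeps the keypoint location differentiable during training.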
2.6 Ultrasound Scanning
We prospectively collected ultrasound scans from 30 children aged less than 17 years presenting at an emergency department with a suspected unilateral wrist fracture. We collected single 2D images, sweeps and 3DUS images from both the affected and unaffected wrists, along with conventional X-ray examination, which was used to obtain the gold standard diagnosis.
Images were acquired on a Philips iU22 machine (using a 13-MHz 13VL5 probe for 3DUS) with the child seated in a neutral position. During examination, the injured wrist was scanned on the dorsal (DS) and volar (VO) surfaces in both the sagittal and axial orientations, resulting in four 3D scans of each wrist (DS sagittal, DS axial, VO sagittal and VO axial). While acquiring the 3DUS image, the sonographer centered the view on the distal end of the radius in the different orientations. Each sweep was of 3.2 seconds duration through a range of +/- to with ultrasound slices of 0.2 mm. As a baseline, the same scanning protocol was followed for the unaffected wrist as well.
2.7 Training Details
The Transporter architecture was implemented in PyTorch v1.7.1 and trained for 100 epochs to detect 10 keypoints, using the Adam optimizer with a learning rate of 0.001 decaying by a factor of 0.95 every 10 epochs, a batch size of 16, a training set of 1024 image pairs and a validation set of 512. The remaining hyperparameters were adopted from prior work. Samples were drawn from grayscale wrist ultrasound videos, resized to 256x256 resolution at 25 FPS, as pairs of frames separated by i=4 frames (refer to Figure 1).
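The learning-rate schedule and frame-pair sampling described above can be sketched as follows. The sketch assumes a step decay (the rate is multiplied by 0.95 once every 10 epochs) and pairs each frame with the frame 4 positions later; the function names `step_decay_lr` and `frame_pairs` are hypothetical.

```python
def step_decay_lr(epoch, base_lr=0.001, gamma=0.95, step=10):
    """Learning rate after `epoch` epochs under an assumed step decay."""
    return base_lr * gamma ** (epoch // step)


def frame_pairs(num_frames, gap=4):
    """Index pairs (t, t + gap) that fit inside a video of `num_frames` frames."""
    return [(t, t + gap) for t in range(num_frames - gap)]


pairs = frame_pairs(256)
final_lr = step_decay_lr(99)
```

In PyTorch, a comparable schedule would typically be set up with `torch.optim.lr_scheduler.StepLR` (`step_size=10`, `gamma=0.95`), although the text does not state which scheduler was used.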
During evaluation, each frame is forward passed through the KeyNet, and each output channel is up-sampled from 64x64 resolution to 256x256 in order to plot the corresponding keypoint on the input frame.
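One simple way to place a keypoint from a low-resolution output channel onto the full-resolution frame is sketched below. Instead of up-sampling the whole channel, it takes the location of the maximum and rescales its coordinates using pixel centres, which is an assumption made to keep the example dependency-free. The function name `keypoint_from_heatmap` is hypothetical.

```python
def keypoint_from_heatmap(heatmap, out_size):
    """Return (row, col) of the heat-map peak in output-frame coordinates."""
    rows, cols = len(heatmap), len(heatmap[0])
    # Location of the strongest response in the low-resolution channel.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    scale_r, scale_c = out_size / rows, out_size / cols
    # Map the centre of the low-resolution pixel to the output grid.
    return ((best_r + 0.5) * scale_r - 0.5, (best_c + 0.5) * scale_c - 0.5)


channel = [[0.0] * 64 for _ in range(64)]
channel[10][20] = 1.0
keypoint = keypoint_from_heatmap(channel, out_size=256)
```

With a 64x64 channel and a 256x256 frame, each low-resolution pixel covers a 4x4 block of the frame, and the keypoint is drawn at the centre of that block.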
We validated our technique on 56 3DUS videos (containing 256 frames each) of the wrist. Fractures were visible in at least one view in 22 / 30 scans. Three human readers with varying years of expertise examined each ultrasound video along with the corresponding X-ray and reported fractures. Ground truth was established based on consensus between the human readers.
3.1 Bone Probability Map
TGA compensation was applied to all frames before generating the bone probability maps (refer to Figure 2C). It can be seen from the figure that the bone probability map accurately captures features of the bone that are involved in the fracture. The location of the fracture as seen in ultrasound, along with the corresponding X-ray, is shown in Figures 2A and 2D.
3.2 Keypoint Detection
In affected as well as unaffected wrists, the neural network correctly identified the top portion of the bone with multiple keypoints (refer to Figure 2). In affected wrists (Figure 2B), the network was able to track keypoints near the fracture. We manually selected a tight rectangular region of interest around the bone, and the neural network correctly identified keypoints within this region in 180 / 250 (72%) ultrasound frames.
3.3 Ablation Studies
In order to determine the optimal parameters for various network components, we performed ablation studies. Specifically, we used various encoder models for FF-CNN and KeyNet, with identical TGA compensation and monogenic analysis for all ablations. The highest accuracy in detecting fractures was obtained using 6 convolutional blocks for FF-CNN, 6 convolutional blocks with a regressor for KeyNet and 6 convolutional blocks for RefineNet. Each convolutional block consisted of a convolutional layer, ReLU activation and batch normalisation.
We described a transporter neural network with components tailored to domain specific features seen in ultrasound. We applied our model on wrist ultrasound which is a challenging use case due to acoustic shadowing near bony structures. Instead of using end-to-end deep learning we introduced features specific to ultrasound as a bone probability map. As a preprocessing step, we also compensated for TGA using an exponentially decaying depth dependent model. As shown in Figure 2, TGA compensation was able to suppress most of the bright patches that were close to the transducer surface and highlight the bony structures located deeper in the image.
Conventional DL models in ultrasound use supervised learning which requires precisely labeled ground truth annotations. Our keypoint detection framework uses unsupervised learning and relies on information from neighbouring frames. Using the bone probability map we ensure that the neural network learns features and keypoints that are relevant in detecting wrist fractures. A key advantage of our framework is the ability to indicate the approximate location of the fracture using keypoints.
Since keypoints are detected automatically we were able to train our model with a relatively small number of videos (N 50). Our technique could be used at nearly real-time with an execution time 1 sec per volume using an NVIDIA V100 GPU.
Although the approach has been explained in the context of wrist ultrasound, it can be extended to other anatomical structures that contain bone. For instance, it can be adapted to analyse ultrasound sequences of the elbow or shoulder to detect fractures or ligament tears. With minimal modification it could also be used for other scans where relevant artifacts can be detected using local phase. For example, lung ultrasound (LUS) images contain horizontal artifacts (A-lines) and vertical artifacts (B-lines) that can be detected using local phase.
As future work we plan to incorporate a classification head in the transporter architecture, which could be trained to detect fractures reported by human readers. We would develop a hybrid loss function that combines MSE and cross entropy for frame reconstruction and binary classification, respectively. Features learned from the generative RefineNet model could be used to initialize the new classification head. There is also the possibility of obtaining even more relevant keypoints for fracture detection by providing the FF-CNN features at the keypoint locations to the classification head, with gradient backpropagation through the keypoint locations.
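One possible shape for such a hybrid loss is sketched below as a weighted sum of a reconstruction MSE and a binary cross-entropy term. The weighting scheme, the `weight` parameter and all function names are hypothetical, since the combination is only proposed as future work and is not specified in the text.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally long lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(prob, label, eps=1e-7):
    """Cross entropy for one predicted fracture probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def hybrid_loss(recon, target, fracture_prob, fracture_label, weight=1.0):
    """Reconstruction MSE plus a weighted classification term (assumed form)."""
    return mse(recon, target) + weight * binary_cross_entropy(fracture_prob, fracture_label)


loss = hybrid_loss([0.0, 1.0], [0.0, 0.0], fracture_prob=0.5, fracture_label=1)
```

The `weight` term would control how strongly the classification objective influences the learned keypoints relative to frame reconstruction.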
Our study has limitations. Firstly, our dataset is limited (N=30) and was collected from a single center, which limits the generalizability of our results. As future work we plan a large-scale multicenter study to validate the AI technique on ultrasound sweeps (which are more common in practice than 3DUS). Another limitation of our tracking technique is that we have not specifically addressed motion artifacts that are commonly seen in ultrasound scans. Although in most cases these artifacts occupy higher frequencies, they could potentially affect the reconstruction of the feature map and result in incorrect keypoints. This could be addressed by associating spatial and temporal saliency features with each keypoint. Lastly, not all fractures are visible in all four views, which could result in variability in ground truth on a per-patient basis. With a larger dataset we would be able to address this limitation by developing models specifically trained on each scan view.
This is the first use of an unsupervised transporter neural network to detect pathology in ultrasound. Our approach can be used in emergency care to assess 3D wrist ultrasound scans and identify relevant keypoints. We incorporated ultrasound-specific features in an unsupervised learning framework and identified clinically relevant features. This automatic technique significantly reduces interobserver variability and would potentially result in more widespread use of ultrasound. Replacing X-ray examination with ultrasound would reduce radiation exposure and improve the overall quality of emergency care.
-  K. W. Nellans, E. Kowalski, and K. C. Chung, “The epidemiology of distal radius fractures,” Hand Clin., vol. 28, no. 2, pp. 113–125, May 2012.
-  A. Slaar, A. Bentohami, J. Kessels, T. S. Bijlsma, B. A. van Dijkman, M. Maas, J. C. H. Wilde, J. C. Goslings, and N. W. L. Schep, “The role of plain radiography in paediatric wrist trauma,” Insights Imaging, vol. 3, no. 5, pp. 513–517, Oct. 2012.
-  S. H. Lee and S. J. Yun, “Diagnostic Performance of Ultrasonography for Detection of Pediatric Elbow Fracture: A Meta-analysis,” Ann. Emerg. Med., vol. 74, no. 4, pp. 493–502, Oct. 2019.
-  A. C. Epema, M. J. B. Spanjer, L. Ras, J. C. Kelder, and M. Sanders, “Point-of-care ultrasound compared with conventional radiographic evaluation in children with suspected distal forearm fractures in the Netherlands: a diagnostic accuracy study,” Emerg. Med. J., vol. 36, no. 10, pp. 613–616, Oct. 2019.
-  R. Rowlands, J. Rippey, S. Tie, and J. Flynn, “Bedside Ultrasound vs X-Ray for the Diagnosis of Forearm Fractures in Children,” J. Emerg. Med., vol. 52, no. 2, pp. 208–215, Feb. 2017.
-  H. Hedelin, C. Tingström, H. Hebelka, and J. Karlsson, “Minimal training sufficient to diagnose pediatric wrist fractures with ultrasound,” Crit. Ultrasound J., vol. 9, no. 1, p. 11, Dec. 2017.
-  I. Galletebeitia Laka, F. Samson, I. Gorostiza, A. Gonzalez, and C. Gonzalez, “The utility of clinical ultrasonography in identifying distal forearm fractures in the pediatric emergency department,” Eur. J. Emerg. Med., vol. 26, no. 2, pp. 118–122, Apr. 2019.
-  D. Douma-den Hamer, M. H. Blanker, M. A. Edens, L. N. Buijteweg, M. F. Boomsma, S. H. van Helden, and G.-J. Mauritz, “Ultrasound for Distal Forearm Fracture: A Systematic Review and Diagnostic Meta-Analysis,” PLoS One, vol. 11, no. 5, p. e0155659, May 2016.
-  E. Mostofi, B. Chahal, D. Zonoobi, A. Hareendranathan, K. P. Roshandeh, S. K. Dulai, and J. L. Jaremko, “Reliability of 2D and 3D ultrasound for infant hip dysplasia in the hands of novice users,” Eur. Radiol., vol. 29, no. 3, pp. 1489–1495, Mar. 2019.
-  I. Hacihaliloglu, “Ultrasound imaging and segmentation of bone surfaces: A review,” Technology, vol. 05, no. 02, pp. 74–80, Jun. 2017.
-  I. Hacihaliloglu, A. Rasoulian, R. N. Rohling, and P. Abolmaesumi, “Local phase tensor features for 3-D ultrasound to statistical shape+pose spine model registration,” IEEE Trans. Med. Imaging, vol. 33, no. 11, pp. 2167–2179, Nov. 2014.
-  A. Z. Alsinan, V. M. Patel, and I. Hacihaliloglu, “Automatic segmentation of bone surfaces from ultrasound using a filter-layer-guided CNN,” Int. J. Comput. Assist. Radiol. Surg., vol. 14, no. 5, pp. 775–783, May 2019.
-  A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural Discrete Representation Learning,” arXiv [cs.LG], 02-Nov-2017.
-  C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
-  T. D. Kulkarni, A. Gupta, C. Ionescu, S. Borgeaud, M. Reynolds, A. Zisserman, and V. Mnih, “Unsupervised learning of object keypoints for perception and control,” Adv. Neural Inf. Process. Syst., vol. 32, pp. 10724–10734, 2019.
-  T. Jakab, A. Gupta, H. Bilen, and A. Vedaldi, “Self-supervised learning of interpretable keypoints from unlabelled videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8787–8797.
-  I. Hacihaliloglu, “Enhancement of bone shadow region using local phase-based ultrasound transmission maps,” Int. J. Comput. Assist. Radiol. Surg., vol. 12, no. 6, pp. 951–960, Jun. 2017.