Ultrasound scanning is an important step in many medical diagnostic and therapeutic workflows due to its well-established safety record, its ability to visualize differences among soft tissues, and its portability [16, 20]. However, ultrasound scanning is labor intensive: a scanning session can take up to 30 minutes. Ultrasound scans are also sonographer dependent, creating relatively high cross-operator variability in accurate anatomical structure identification; novice sonographers demonstrate substantially higher diagnostic error rates than expert sonographers [29]. Ultrasound imaging is primarily used to image soft tissue, which is inherently compressible, creating additional within-subject image variability. Despite these limitations, trained clinicians must carry out accurate and precise ultrasound scanning, as it is critical for the identification of targeted structures and for precise and accurate therapy administration. An automated framework that assists sonographers in detecting and localizing anatomical structures could radically improve reliable across-subject scanning for both novice and expert sonographers.
Object detectors are designed to localize objects and identify their underlying category or class within an image [31, 13]. Before the era of deep learning (DL), most traditional object detection algorithms used handcrafted features that usually did not generalize well to real-life situations. Some traditional methods, such as the Viola-Jones detector [31], Histograms of Oriented Gradients (HOG) [8], and the Deformable Part-based Model (DPM) [11], were nevertheless quite successful. Since the introduction of DL-based object detection algorithms, however, these have outperformed traditional methods on every significant performance metric [13, 25].
In medical imaging, object detection problems have historically been tackled using region-of-interest (ROI) tracking or segmentation-based approaches. For ROI tracking, several methods have been developed, such as block matching [12], where exhaustive-search block matching (ES-BM) is used to track anatomical structures such as arteries across sequential frames [3, 5]; elliptical shape fitting to track and localize arteries and veins [32]; and deep learning methods using networks that compare similarities between frames [4]. Even though these ROI tracking methods have shown great potential in tracking objects in ultrasound scans, their ability to assist sonographers in detecting and localizing target anatomical structures during scanning sessions is hindered by their slow inference speeds [4, 5], by their dependency on operators to identify the target ROI at the beginning of a scanning session, or by both.
Segmentation-based approaches recover a pixel-wise representation of every part of an image that belongs to an object. Often, the goal of such algorithms is to identify the presence of an object in a medical scan, localize it, and estimate its size. These three goals are achievable through object detection algorithms, whose training annotations can be generated orders of magnitude faster than the pixel-wise annotations used by segmentation algorithms.
In this paper, we propose a real-time object detection framework that is designed to autonomously detect, identify, and localize a specific anatomical structure in ultrasound scans. The specific anatomical structure we identify is the cervical Vagus nerve encased within the Carotid sheath. The Carotid-sheath-encapsulated Vagus nerve sits at a variable depth of approximately 1.2 to 2.5 cm, depending on transducer-to-skin pressure (the depth varies between high and low cervical transducer pressure due to jugular vein compression) [18, 21]. The proposed method uses a weakly supervised, modified U-Net convolutional neural network (CNN) as its backbone detection and localization algorithm. It is designed to autonomously assist sonographers in real time and to enhance their ability to detect and track objects of interest during scanning sessions. We show that the proposed method outperforms YOLOv4 [6] and EfficientDet [28], current state-of-the-art real-time object detection methods, in detecting the Vagus nerve.
II Related Work
II-A DL-Based Object Detection
Deep-learning-based object detection methods, and specifically CNN-based methods, are currently the state of the art [25, 13]. These detectors fall into two broad categories: two-stage detectors such as Faster R-CNN [23], and one-stage detectors such as SSD [19] and YOLO [22]. Two-stage detectors, which defined the early success of DL-based methods, are designed for high identification and localization accuracy, while one-stage detectors are designed to be fast and operate in real time at 30+ frames per second (fps). Recently, these real-time methods have achieved state-of-the-art object detection accuracy, performing as well as, or better than, two-stage methods.
However, both one- and two-stage detectors require large training sets. If they are trained on smaller datasets, as is often the case in medical imaging, overfitting and poor generalization can occur.
II-B Object Detection in Medical Imaging
Autonomous object detection in medical imaging has historically been treated as a segmentation problem: pixel-wise annotations of an object's presence within an image are used to train detector algorithms to identify that object. In ultrasound, geometric shape fitting, such as ellipse fitting for the detection of vessels, as well as contextual and textural features, have been used to detect objects through a segmentation-based approach [14, 27].
Most current advances in medical image segmentation are based on DL approaches. One specific architecture, the U-Net [24], uses a contracting path followed by an expansive path, with skip connections between the two paths to preserve the localization of high-resolution features; it has proven to be highly trainable with small training sets, making it very suitable for medical images. Several improvements to the U-Net design, such as additional skip connections and deep supervision, have improved the accuracy of the network at the expense of segmentation time [35].
Even though segmentation approaches are highly efficient in learning from smaller training sets compared with object detectors, they are expensive in their need for detailed annotations produced by highly experienced medical personnel. Hence, a weakly supervised object detection framework that uses a modified U-Net backbone can achieve high accuracy with minimal supervision, reducing the cost of deployment while maintaining the capability to accurately detect and localize target objects.
III Proposed Framework
Our proposed framework to detect, localize, and track a specific target anatomical structure in real time consists of 4 stages, as outlined in Fig. 2. The 1st stage pre-processes the scans, the 2nd stage detects and localizes the target object within a scan, the 3rd stage classifies whether a scan contains the target object, and the 4th and final stage fine-tunes the detection parameters. The framework is designed to have an inference latency of less than 33 ms when operating on a mid-range graphics processing unit (GPU) such as the Nvidia RTX 2080 Ti.
III-A Stage 1: Pre-Processing
In this stage, the frames are prepared for the backbone network in stages 2 and 3. The current ultrasound frame is stacked with the previous two frames as a three-channel tensor of size $H \times W \times 3$, where $H$ and $W$ are the height and width of the scan (frame). Using 3 frames instead of 1 improved the localization accuracy. The frames' intensity values are then normalized to the range $[0, 1]$.
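As a minimal sketch of this pre-processing step, the following assumes the incoming scans are 8-bit grayscale numpy arrays and that normalization scales intensities to $[0, 1]$; the function name and frame-buffer handling are illustrative, not the paper's implementation.

```python
import numpy as np

def preprocess(frames):
    """Stack the current frame with the previous two frames and normalize.

    `frames` holds the three most recent grayscale scans, each an
    (H, W) uint8 array; returns an (H, W, 3) float32 tensor with
    intensities scaled to [0, 1] (an assumed normalization range).
    """
    stacked = np.stack(frames[-3:], axis=-1).astype(np.float32)
    return stacked / 255.0

# Example: three synthetic 4x4 "frames" at different intensity levels
frames = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 128, 255)]
x = preprocess(frames)
```

At inference time the same three-frame buffer would simply slide forward by one frame per new scan.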
During training, we used extensive data augmentation to improve our framework's ability to overcome overfitting and to generalize to data outside the training set [9, 33]. The augmentation pipeline includes geometric transformations and randomized color-space contrast and brightness transformations to account for differences in ultrasound signal energy levels. Most importantly, the pipeline deployed: 1) deformable elastic transformations [26] with random Gaussian kernels to elastically deform the grid of an image, which simulates the elastic differences among soft tissues within and across subjects, and 2) mixtures of input scans to enhance the coverage of the probability space while minimizing the risk function during training, by implementing vicinal risk minimization [7] instead of empirical risk minimization.

While computationally efficient, the empirical risk, defined as

$R_{emp}(f) = \frac{1}{m} \sum_{i=1}^{m} L(f(x_i), y_i),$

only considers the performance of the prediction algorithm $f$ on a finite set of training examples, for a dataset $D = \{(x_i, y_i)\}_{i=1}^{m}$ consisting of $m$ training examples of input ($x_i$) and target ($y_i$) pairs and a loss function $L$. The empirical risk is used to approximate the expected risk, which is the average of the loss function over the joint distribution $P(x, y)$ of inputs and targets; this joint distribution is only known at the training examples and can be approximated by the empirical distribution

$P_{emp}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \delta(x = x_i, y = y_i).$

However, the distribution can instead be approximated by

$P_{\nu}(\tilde{x}, \tilde{y}) = \frac{1}{m} \sum_{i=1}^{m} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i),$

where $\nu$ is a vicinity distribution that computes the probability of finding the virtual input-target pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training input-target pair $(x_i, y_i)$. The virtual input-target pair can be defined as $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ and $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$, where $(x_i, y_i)$ and $(x_j, y_j)$ are two randomly selected input-target pairs from the training set, and $\lambda$ is sampled from a beta distribution, $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$. This approximation offers a more comprehensive representation and coverage of the joint distribution $P(x, y)$. Mixtures of inputs have previously been used in image classification problems to minimize vicinal risk [34]. In our framework, we built and implemented an approach that uses mixtures of inputs to minimize vicinal risk for segmentation-based algorithms.
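The virtual input-target construction above (mixup applied to scans and their box masks) can be sketched as follows; the beta parameter `alpha=0.4` is an assumed illustrative value, not the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.4):
    """Form a virtual input-target pair by convex combination (mixup).

    x_* are input scans and y_* their box masks; both are mixed with
    the same lambda so the target stays consistent with the input.
    `alpha` is an assumed Beta-distribution parameter.
    """
    lam = rng.beta(alpha, alpha)
    x_tilde = lam * x_i + (1.0 - lam) * x_j
    y_tilde = lam * y_i + (1.0 - lam) * y_j
    return x_tilde, y_tilde

# Two toy scan/mask pairs: mixing all-ones with all-zeros yields a
# constant image whose value equals the sampled lambda.
x1, y1 = np.ones((8, 8)), np.ones((8, 8))
x2, y2 = np.zeros((8, 8)), np.zeros((8, 8))
xm, ym = mixup_pair(x1, y1, x2, y2)
```

Because the mask is mixed with the same $\lambda$ as the scan, the segmentation target remains a soft label consistent with the blended input.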
In this stage, the masks used to train the network are also created from bounding box coordinates, as images (tensors) of size $H \times W$ in which pixel values within the bounding box are set to 1 and all others to 0. This mask is used to weakly train the network in stage 2 to detect and localize the presence of target objects within the boundaries of the box.
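A minimal sketch of this weak-label construction, assuming box coordinates are given as pixel indices in (x0, y0, x1, y1) order with exclusive upper bounds (an assumed convention):

```python
import numpy as np

def box_to_mask(h, w, x0, y0, x1, y1):
    """Create a weak-supervision mask of size (h, w): pixels inside
    the bounding box [x0:x1, y0:y1] are set to 1, all others to 0."""
    mask = np.zeros((h, w), dtype=np.float32)
    mask[y0:y1, x0:x1] = 1.0
    return mask

# A 3x3 box inside a 6x6 mask
m = box_to_mask(6, 6, 1, 2, 4, 5)
```

Annotating a single box per frame in this way is what makes the training "weak" relative to full pixel-wise segmentation labels.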
III-B Stage 2: Backbone Detection Network
Our framework's backbone network is based on a modified U-Net architecture, as shown in Fig. 2. The proposed network uses 4 depth levels, as in the standard U-Net, with 2 convolutional layers in each depth level as well as in the bridge of the network. However, in our proposed framework we used 32, 64, 128, 256, and 512 channels in the feature maps at levels 1, 2, 3, 4, and the bridge, respectively; the original design used twice as many feature-map channels at each of these levels. Reducing the number of channels allows the network to operate in real time.
Reducing the size of a neural network usually reduces performance. To cope with this, we incorporated several modifications to improve the performance of the network, such as two-dimensional (2D) dropout layers in addition to the original dropout layers; 2D dropout layers regularize the activations more efficiently when high correlation exists among nearby pixels [30]. We also incorporated batch normalization layers and added a localization-promoting term to the cost function. The original cost function of U-Net is a confidence-promoting loss that computes the binary cross-entropy (BCE) between each pixel of the ground truth and predicted masks. For each element of the predicted mask $\hat{Y}$ with value $\hat{y}_{ij}$ and ground truth value $y_{ij}$ at location $(i, j)$, where $1 \le i \le H$ and $1 \le j \le W$, the BCE cost function can be computed for each training batch as:

$\mathcal{L}_{BCE} = \frac{1}{B} \sum_{b=1}^{B} \frac{1}{N} \sum_{i,j} l\left(\hat{y}_{ij}^{(b)}, y_{ij}^{(b)}\right), \qquad (1)$

where $B$ is the size of the batch, $N$ is the number of elements in the mask and is equal to $H \times W$, $\hat{Y}$ is the predicted mask, $Y$ is the ground truth mask, and $l$ is the loss computed element-wise between the ground truth and predictions, defined as:

$l(\hat{y}, y) = -\left[ w_1\, y \log \sigma(\hat{y}) + w_0\, (1 - y) \log\left(1 - \sigma(\hat{y})\right) \right]. \qquad (2)$

In (2), the weight $w_c$ adjusts the loss function's penalization for class $c$ based on the training-set size imbalance between the classes, and $\sigma$ is the sigmoid function, defined as $\sigma(x) = 1 / (1 + e^{-x})$; it maps the predicted elements into a probability space, where an element is classified as object if $\sigma(\hat{y}) \ge 0.5$ and as background otherwise. The weights $w_c$ and the threshold on $\sigma(\hat{y})$ can be used to influence the precision and recall of the network. The localization-promoting loss term is based on the dice coefficient between the predicted and ground truth masks. The dice coefficient $D$ [36] is defined as:
$D = \frac{2 \left| \hat{Y} \odot Y \right| + \epsilon}{\left| \hat{Y} \right| + \left| Y \right| + \epsilon}, \qquad (3)$

where $\odot$ represents element-wise multiplication and $\epsilon$ is a small constant that prevents division by zero. The dice coefficient loss can then be defined to penalize lower values of $D$, which correspond to lower localization performance, as:

$\mathcal{L}_{Dice} = 1 - D. \qquad (4)$

The overall object detection loss function is defined as:

$\mathcal{L} = \alpha \mathcal{L}_{BCE} + \beta \mathcal{L}_{Dice}, \qquad (5)$

where $\alpha$ and $\beta$ are coefficients that control the contributions of the BCE loss and the Dice loss, respectively, to the overall loss function.
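The weighted BCE term, the dice term, and their combination can be sketched in numpy as below; the class weights and the `alpha`/`beta` coefficients are placeholders, not the values chosen for the paper's training runs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce(logits, target, w1=1.0, w0=1.0, eps=1e-7):
    """Element-wise weighted binary cross-entropy between predicted
    mask logits and the ground-truth mask, averaged over all elements.
    w1/w0 weight the object/background classes (illustrative values)."""
    p = np.clip(sigmoid(logits), eps, 1.0 - eps)
    return -np.mean(w1 * target * np.log(p) + w0 * (1 - target) * np.log(1 - p))

def dice(pred, target, eps=1e-6):
    """Dice coefficient between prediction probabilities and the mask;
    eps keeps the ratio defined when both masks are empty."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def detection_loss(logits, target, alpha=1.0, beta=1.0):
    """Overall loss: alpha * BCE + beta * (1 - Dice)."""
    return alpha * weighted_bce(logits, target) + \
        beta * (1.0 - dice(sigmoid(logits), target))

# A correct prediction should score a much lower loss than an inverted one.
target = np.array([[1.0, 1.0, 0.0, 0.0]])
good = detection_loss(np.array([[8.0, 8.0, -8.0, -8.0]]), target)
bad = detection_loss(np.array([[-8.0, -8.0, 8.0, 8.0]]), target)
```

The dice term rewards overlap between the predicted region and the box mask, which is what promotes localization on top of the per-pixel confidence promoted by the BCE term.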
III-C Stage 3: Classifier
Stage 2 is designed to localize an object within a scan, but it is not optimized to identify the presence of the target object in the scan. Thus, to detect whether the target object is in the scan, we use a classifier optimized for this task, as shown in Fig. 2. The classifier adds two extra layers to the framework and uses the output of the last layer of the bridge in stage 2, which contains 512 feature-map channels, as its input. This input is flattened to a tensor of length 512 using the global average pooling layer proposed as part of ResNet [15]. This is followed by a fully-connected layer and an output layer for the 2 classes, activated by a softmax function; the BCE loss (defined in (1)) is used to train the classifier.
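The global-average-pooling step that feeds the classifier can be sketched as follows; the 12x12 spatial size of the bridge output is an assumed value (a 192x192 input downsampled through 4 levels), not one stated in the paper.

```python
import numpy as np

def global_average_pool(features):
    """Flatten a (C, H, W) feature map into a length-C vector by
    averaging each channel over its spatial dimensions, as in ResNet."""
    return features.mean(axis=(1, 2))

# Assumed bridge output: 512 channels over a 12x12 spatial grid.
bridge = np.ones((512, 12, 12))
vec = global_average_pool(bridge)
```

The resulting length-512 vector is what the two added classifier layers operate on, so the classifier reuses the backbone's features at almost no extra cost.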
III-D Stage 4: Post-Processing
The output mask of the backbone network from stage 2 is of size $H \times W$. After being thresholded by the sigmoid function, the output mask has elements with values between 0.5 and 1, as well as 0; the elements whose value is higher than 0.5 represent the region in which the network believes the target object is located. The average of the locations of these elements, weighted by the confidence (the output of the sigmoid function $\sigma$), is used to estimate the center location of the target object. The center of the target in the x direction can then be estimated as:

$\hat{c}_x = \frac{\sum_{n=1}^{N_t} \sigma(\hat{y}_n)\, x_n}{\sum_{n=1}^{N_t} \sigma(\hat{y}_n)}, \qquad (6)$

where $N_t$ is the number of elements whose confidence is higher than the threshold and $x_n$ is the x-coordinate of element $n$; $\hat{c}_y$ is computed analogously. The weighted standard deviation of these elements' locations is used to estimate the width and height of the target object, and can be defined as:

$s_x = \sqrt{ \frac{\sum_{n=1}^{N_t} \sigma(\hat{y}_n) \left( x_n - \hat{c}_x \right)^2 }{ \sum_{n=1}^{N_t} \sigma(\hat{y}_n) } }, \qquad (7)$

where $s_x$ is the weighted standard deviation in the x direction; $s_y$ can be calculated using (7) by replacing the corresponding variables. The width and height of the bounding box can then be calculated as $w = \gamma_w s_x$ and $h = \gamma_h s_y$, where $\gamma_w$ and $\gamma_h$ are factors learned during the training of the backbone network. The output of the classifier is then combined with the output of the backbone network through a decision logic, such as an "or" or an "and", to decide on the presence of the target object in the scan. Choosing "and" increases precision at the expense of recall, and vice versa. Controlling this decision logic, together with the thresholds on $\sigma$ for the backbone network and classifier, gives sonographers simple real-time controls over the rate of false positives or false negatives.
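The confidence-weighted center and spread estimates above can be sketched in numpy as below; the scale factors `gamma_w`/`gamma_h` stand in for the learned factors and are set to an illustrative value, not the trained ones.

```python
import numpy as np

def box_from_mask(conf, thresh=0.5, gamma_w=2.0, gamma_h=2.0):
    """Estimate the target center and box size from a sigmoid
    confidence map. Elements above `thresh` contribute, weighted by
    their confidence; gamma_w/gamma_h are placeholder scale factors."""
    ys, xs = np.nonzero(conf > thresh)          # the N_t thresholded elements
    w = conf[ys, xs]                            # confidence weights
    cx = np.sum(w * xs) / np.sum(w)             # weighted center, Eq. (6)
    cy = np.sum(w * ys) / np.sum(w)
    sx = np.sqrt(np.sum(w * (xs - cx) ** 2) / np.sum(w))  # weighted std, Eq. (7)
    sy = np.sqrt(np.sum(w * (ys - cy) ** 2) / np.sum(w))
    return cx, cy, gamma_w * sx, gamma_h * sy

# A uniform-confidence blob spanning rows 6..9 and columns 4..7
conf = np.zeros((16, 16))
conf[6:10, 4:8] = 0.9
cx, cy, bw, bh = box_from_mask(conf)
```

With a uniform blob the weighted center lands at the blob's geometric center, and the box size scales with the blob's spatial spread.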
IV Experiments and Results
IV-A Datasets

We evaluated our model on two different ultrasound datasets that were created by researchers at UC San Diego Health and the Jacobs School of Engineering. The scans in the datasets were acquired to image the Vagus nerve in the mid-cervical and upper-cervical regions of the neck. The scans span different fields of view of the neck to create the variety of scans that would be generated by a sonographer looking to image the Vagus nerve within the neck. The two datasets were created using different probes and image reconstruction devices. The 1st dataset was created using a probe and device with high-quality diagnostic capabilities. The 2nd dataset used a probe that has a small footprint to work alongside non-invasive therapeutic and stimulation devices, and is designed to generate scans at a very rapid pace at the expense of quality. The 1st dataset contained 6,368 scans from 3 different subjects, while the 2nd contained 26,313 scans from 5 different subjects. Both datasets contained scans from both the left and right sides of the neck. The Vagus nerve's shape, location, and surrounding anatomical structures vary greatly within and across subjects. Even a slight movement of the probe can make it challenging for sonographers to re-identify the nerve and its location, due to the high variability of neck anatomy visualized with medio-lateral or cephalo-caudal scanning along the cervical neck [1]. In aggregate, nerve detection with variable-anatomy datasets provides a substantial challenge against which we test our proposed method and verify its robustness.
IV-B Implementation and Setup
We conducted 3 experiments to test the performance and robustness of our framework. The 1st experiment was designed to test the accuracy of the proposed method in detecting and tracking the Vagus nerve on the 1st dataset. The 2nd experiment was designed to test the performance of the proposed method on the more challenging 2nd dataset and to compare it to YOLOv4 and EfficientDet, the current state-of-the-art real-time object detectors. In both of these experiments the datasets were divided into individual scans and split in a 64:16:20 ratio for training, validation, and testing, respectively. The 3rd experiment was designed to test and verify the robustness of the framework in accounting for cross-subject variability as well as its ability to generalize to new subjects. Hence, in this experiment, we divided our dataset by subject: the framework was trained on scans from 4 subjects and tested on the 5th subject. Throughout all three experiments, the scans were resized to 256x256 for the 1st dataset and to 192x192 for the 2nd dataset before being supplied to the network in stage 2. The backbone network was optimized using stochastic gradient descent (SGD) with momentum and weight decay. The proposed work was implemented in PyTorch and our implementation is available in a GitHub repository [2].
| Method | Avg. Precision | Avg. Recall |
| --- | --- | --- |
| **Experiment 1 - Dataset 1** | | |
| Ours - Single Frame | 94.4% | 97.2% |
| **Experiment 2 - Dataset 2** | | |
| Ours - Single Frame | 90.89% | 96.01% |
| Ours - Three Frames | 92.67% | 97.29% |
| EfficientDet - d3 | 91.93% | 96.35% |
| **Experiment 3 - Dataset 2** | | |
| Ours - Single Frame | 93.5% | 91.9% |
| Ours - Three Frames | 95.1% | 93.4% |
IV-C Evaluation and Results
To evaluate the accuracy and robustness of the proposed framework, we used the average precision and recall of localization, where a detection is considered a true positive when a certain localization threshold is met; otherwise, the detection is considered a false positive. This is the main metric used to evaluate object detection algorithms [10]. For the 1st and 2nd experiments, the localization threshold is based on the intersection-over-union (IoU) metric. For the 3rd experiment, the threshold is based on a physical distance of 2.5 mm from the center of the nerve, which is equal to the radius of a typical Vagus nerve.
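The two matching criteria described above, IoU for the first two experiments and a center-distance test for the third, can be sketched as follows; the 2.5 mm limit comes from the text, while the coordinate convention (x0, y0, x1, y1) is an assumption.

```python
import math

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0])
    y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2])
    y1 = min(box_a[3], box_b[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def within_distance(center_pred, center_gt, limit_mm=2.5):
    """Distance-based criterion: true positive if the predicted center
    lies within `limit_mm` of the ground-truth nerve center."""
    return math.dist(center_pred, center_gt) <= limit_mm

# Two unit-overlap boxes: intersection 1, union 7
v = iou((0, 0, 2, 2), (1, 1, 3, 3))
```

A prediction passing the chosen criterion counts as a true positive; one failing it counts as a false positive.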
Table I summarizes the performance of the proposed framework for all 3 experiments. It can be seen that the proposed framework was able to identify and localize the Vagus nerve in both datasets with high precision and recall. For the more challenging 2nd dataset, the results of the 2nd experiment show that the proposed method outperforms YOLOv4 and EfficientDet-d3. For the 3rd experiment, where the 2nd dataset was divided by subject (4 subjects for training and 1 subject for testing), the proposed method was still able to generalize to subjects it did not see during training and achieved high localization precision and recall. This was not the case for YOLOv4 and EfficientDet, whose precision and recall dropped substantially. This loss of accuracy can mainly be attributed to these methods' need for large training datasets with rich visual features to train their backbone and detection networks [6, 28]. For the identification of target frames, the performance metrics for the classifier in stage 3 are 93.01% precision and 86.25% recall.
Fig. 3 shows an example scan from each dataset with the ground truth and predicted bounding boxes. To verify the robustness of the proposed framework, we conducted two additional experiments to analyze its performance on new subjects while being trained on smaller subsets of the original dataset. The results of these two experiments are shown in Fig. 1. In the first experiment, we trained the framework on 1, 2, 3, and 4 subjects, then tested on a 5th subject, and repeated this analysis twice for two different test subjects. We then used three subjects for training and two subjects for testing, randomly sampled scans from the training set, and created training subsets of sizes 500, 800, 1,100, and 1,400. As observed in Fig. 1, the proposed framework maintains a high level of consistency and accuracy even when trained on smaller training sets. The framework produces high localization precision, where more than 95% of the true positive predictions are located within 1.5 mm of the ground truth in both the lateral and axial directions, as shown in the heat map and histograms of the true positive detections' offsets from the ground truth in Fig. 4.
V Conclusion

We presented a weakly trained, segmentation-based deep learning framework for real-time object detection and localization in ultrasound scans, and tested its performance on detecting the Vagus nerve with an inference time of less than 33 ms. The framework uses masks with bounding boxes enclosing the Vagus nerve as targets for the segmentation backbone network. We demonstrated that it can detect and localize the Vagus nerve successfully with a limited number of training examples and without the need for the time-consuming and expensive pixel-wise annotations required for segmentation tasks.
References

- [1] (2019) Diagnostic ultrasound: head and neck. W. B. Saunders.
- [2] TRBG/vagus-nerve-u-net. GitHub repository.
- [3] (2014) Two-dimensional speckle tracking using zero phase crossing with Riesz transform. In Proceedings of Meetings on Acoustics, Vol. 22.
- [4] (2020) Deep learning based motion tracking of ultrasound image sequences. In 2020 IEEE International Ultrasonics Symposium (IUS), pp. 1-4.
- [5] (2020) Faster search algorithm for speckle tracking in ultrasound images. In 2020 42nd Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2142-2146.
- [6] (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934.
- [7] (2000) Vicinal risk minimization. In Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS'00), pp. 395-401.
- [8] (2005) Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 886-893.
- [9] (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27, pp. 766-774.
- [10] (2010) The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88(2), pp. 303-338.
- [11] (2008) A discriminatively trained, multiscale, deformable part model. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8.
- [12] (2000) Matching techniques to compute image motion. Image and Vision Computing 18(3), pp. 247-260.
- [13] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587.
- [14] (2007) Real-time vessel segmentation and tracking for ultrasound imaging applications. IEEE Transactions on Medical Imaging 26(8), pp. 1079-1090.
- [15] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.
- [16] (2015) Ultrasound in radiology: from anatomic, functional, molecular imaging to drug delivery and image-guided therapy. Investigative Radiology 50(9), p. 657.
- [17] (2018) Liver lesion detection from weakly-labeled multi-phase CT volumes with a grouped single shot multibox detector. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 693-701.
- [18] (2016) Noninvasive transcutaneous vagus nerve stimulation decreases whole blood culture-derived cytokines and chemokines: a randomized, blinded, healthy control pilot trial. Neuromodulation: Technology at the Neural Interface 19(3), pp. 283-290.
- [19] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21-37.
- [20] (2012) Overview of therapeutic ultrasound applications and safety considerations. Journal of Ultrasound in Medicine 31(4), pp. 623-634.
- [21] (2018) High-resolution multi-scale computational model for non-invasive cervical vagus nerve stimulation. Neuromodulation: Technology at the Neural Interface 21(3), pp. 261-268.
- [22] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779-788.
- [23] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91-99.
- [24] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234-241.
- [25] (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada.
- [26] (2003) Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, Vol. 2, pp. 958-958.
- [27] (2015) Real-time automatic artery segmentation, reconstruction and registration for ultrasound-guided regional anaesthesia of the femoral nerve. IEEE Transactions on Medical Imaging 35(3), pp. 752-761.
- [28] (2020) EfficientDet: scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
- [29] (2006) The examiner's ultrasound experience has a significant impact on the detection rate of congenital heart defects at the second-trimester fetal examination. Ultrasound in Obstetrics and Gynecology 28(1), pp. 8-14.
- [30] (2015) Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 648-656.
- [31] (2001) Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1.
- [32] (2009) Fully automated common carotid artery and internal jugular vein identification and tracking using B-mode ultrasound. IEEE Transactions on Biomedical Engineering 56(6), pp. 1691-1699.
- [33] (2017) The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit 11.
- [34] (2018) mixup: beyond empirical risk minimization. In International Conference on Learning Representations.
- [35] (2018) UNet++: a nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3-11.
- [36] (1994) Morphometric analysis of white matter lesions in MR images: method and validation. IEEE Transactions on Medical Imaging 13(4), pp. 716-724.