Ultrasound tongue imaging provides a non-invasive means of assessing tongue position and movement during speech production. However, speckle noise and irrelevant high-contrast edges often degrade the usability of ultrasound images by obscuring the tongue surface. Consequently, extracting tongue contours from ultrasound images remains a non-trivial task.
In linguistic and clinical phonetics, extracting tongue contours is usually the first step in analyzing ultrasound images, but this process is time-consuming. For phoneticians and speech scientists, tongue contours offer direct visualization and measurement of certain articulatory processes. In the past decade, various methods for semi-automatic or automatic tongue contour extraction have been proposed to facilitate the analysis of ultrasound data, notably Active Contour (Snake) based methods [2, 3, 4], graph based methods, and neural network based methods [6, 7, 8, 9, 10, 11]. Both Snake based and graph based methods are mostly semi-automatic and still require manual initialization, but techniques such as automatic initialization or particle filtering can steer these algorithms towards more automatic segmentation. Neural network based methods are promising for fully automatic segmentation. Prior works utilized deep neural networks [6, 12, 13]; recently, fully convolutional neural networks such as variants of the U-Net have been adapted to segment tongue contours [9, 10, 11, 15].
Studies comparing some of the publicly available methods show that semi-automatic or automatic tracing can approximate human annotations under some conditions [16, 17], but these tools require either extensive human intervention, relevant technical knowledge, or proprietary software, which considerably limits their usage. Few studies explore the generalizability of contour tracking methods across speakers and different ultrasound machines.
In this paper, we extend previous works on U-Net based models [9, 10] by implementing a new tool for automatic tongue contour extraction using both U-Net and Dense U-Net. We systematically tested the performance of these models with different test datasets. The results show that, while both U-Net and Dense U-Net can achieve high accuracy in automatic tracking, the loss function and data augmentation have a larger impact on actual tracking performance. In this task, the deeper Dense U-Net does not necessarily outperform the shallower U-Net if not properly trained. Most importantly, given that a fully automatic tool for contour tracking is not publicly available, we fill this gap by releasing a new open source tool to facilitate the otherwise time-consuming process of contour tracking in speech production research.
In our approach, we first train a convolutional neural network to segment the brightest edge corresponding to the tongue tissue-air interface from a noisy ultrasound image, and then derive a tongue surface curve through post-processing of the segmented image. The source code, pre-trained models and some of the test data are available at https://github.com/lingjzhu/mtracker.github.io.
For the baseline, we adopted the U-Net architecture, a variant of the Fully Convolutional Neural Network (FCNN) widely used in medical image segmentation. The typical U-Net architecture consists of a downsampling path with repeated convolutional blocks and max-pooling layers, and an upsampling path with deconvolution layers and convolutional blocks (see Fig. 1). U-Net also introduced the skip-connection, which concatenates feature maps in the downsampling path with feature maps in the upsampling path to enable the reuse of low-level features in higher layers. We used the following settings for the U-Net model. Each convolutional block has the following components: 3×3 conv + ReLU + 2×2 max pool. Each de-convolutional block in the upsampling path has 2×2 up-conv + 3×3 conv + ReLU + 3×3 conv + ReLU. All convolution operations use a stride of one and zero padding. The number of feature maps doubled after each convolutional block, following the sequence (32, 64, 128, 256, 512), and halved after each de-convolutional block, following the sequence (256, 128, 64, 32). The final layer was a 1×1 conv layer with sigmoid activation.
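To make the layer arithmetic concrete, the following sketch traces the spatial size and channel count through the U-Net described above. It is only an illustration under stated assumptions: the function name `unet_shapes` is hypothetical, padding is assumed to preserve the spatial size, and max pooling is assumed to follow only the first four of the five downsampling blocks, so that four 2×2 up-convolutions restore the input resolution.

```python
def unet_shapes(size=128, down=(32, 64, 128, 256, 512), up=(256, 128, 64, 32)):
    """Trace (stage, spatial size, channels) through a U-Net with size-preserving convs."""
    trace = []
    skips = []
    # Downsampling path: conv block, then 2x2 max pool.
    # Assumption: no pooling after the last (bottleneck) block.
    for i, channels in enumerate(down):
        trace.append((f"down{i + 1}", size, channels))
        if i < len(down) - 1:
            skips.append((size, channels))  # kept for the skip-connection
            size //= 2
    # Upsampling path: 2x2 up-conv doubles the size, then the matching
    # downsampling feature maps are concatenated before the conv layers.
    for i, channels in enumerate(up):
        size *= 2
        skip_size, _ = skips.pop()
        assert skip_size == size, "skip-connection sizes must match"
        trace.append((f"up{i + 1}", size, channels))
    # Final 1x1 conv with sigmoid yields a single-channel heatmap.
    trace.append(("output", size, 1))
    return trace


for stage in unet_shapes():
    print(stage)
```

For a 128×128 input, the bottleneck under these assumptions is 8×8 with 512 feature maps, and the output heatmap is again 128×128.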
2.2 DenseNet and Dense U-Net
The Dense Convolutional Network (DenseNet) is a network architecture that has been shown to be effective in many computer vision tasks, outperforming some of the classic architectures such as ResNet. Dense U-Net is an adapted network architecture that fuses DenseNet and U-Net, thereby adapting DenseNet to segmentation tasks at the pixel level [19, 20]. It combines the two by introducing a symmetric upsampling path and long-range skip-connections to enable the reuse of low-level features.
In this study, we adopted the standard DenseNet-121 architecture as the downsampling path by removing its top classification layer, leaving only the dense blocks and transition layers. Each dense block has repeated convolutional blocks consisting of batch normalization (BN) + ReLU + 1×1 conv + BN + ReLU + 3×3 conv with a growth rate of 32, that is, the number of feature maps produced by each convolutional block. There are 6, 12, 24 and 16 convolutional blocks in the four dense blocks respectively. Within a dense block, the input feature maps feed into the sequence of operations mentioned above, which produces the output feature maps. The input and output feature maps are then concatenated to become the input for the next sequence of operations. Each transition layer consists of BN + ReLU + 1×1 conv + 2×2 average pool.
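The effect of dense concatenation on the number of feature maps can be illustrated with a short calculation. The block sizes and growth rate come from the description above; the 64 initial feature maps and the halving of channels in each transition layer are taken from the standard DenseNet-121 configuration and are assumptions here, as is the function name.

```python
def densenet_channels(block_layers=(6, 12, 24, 16), growth_rate=32,
                      init_channels=64, compression=0.5):
    """Return the number of feature maps at the output of each dense block."""
    channels = init_channels
    outputs = []
    for i, n_layers in enumerate(block_layers):
        # Every convolutional block adds `growth_rate` maps, which are
        # concatenated with everything that came before it.
        channels += n_layers * growth_rate
        outputs.append(channels)
        # Transition layer (assumed): the 1x1 conv compresses the channels
        # before average pooling; there is none after the last block.
        if i < len(block_layers) - 1:
            channels = int(channels * compression)
    return outputs


print(densenet_channels())  # [256, 512, 1024, 1024]
```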
In the upsampling path, de-convolutional layers are used to increase the image size, and skip-connecting the corresponding dense blocks with later layers allows us to reuse the feature maps, as in U-Net. Each de-convolutional block has a 2×2 de-convolutional layer and a dense block. Each dense block in the upsampling path has a single convolutional block (BN + ReLU + 1×1 conv + BN + ReLU + 3×3 conv) with growth rates of 16, 24, 12, 6 and 6 respectively. As de-convolutional layers can also perform feature extraction alongside upsampling, each dense block only has a single convolutional sub-block. Finally, the output layer is a 1×1 conv layer with sigmoid activation, which is used to resize and scale the feature maps to a single-channel grayscale image.
2.3 Loss functions
One of the main challenges in this task was the extreme class imbalance between tongue-related pixels and the irrelevant background pixels. On average, the relevant pixels corresponding to a tongue-shape annotation (a ‘mask’) comprise only 2% of the total pixels. Different loss functions have been proposed to address this problem. The Dice Similarity Coefficient (DSC) only penalizes the mismatch between the predicted white pixels (representing the tongue region) and the white edge in the mask, while excluding all background pixels and noise during the optimization process. Thus, the learning task can be formulated as minimizing the following loss function:

$$ L_{Dice} = 1 - \frac{2|P \cap G| + \epsilon}{|P| + |G| + \epsilon} = 1 - \frac{2\sum_{i} p_i g_i + \epsilon}{\sum_{i} p_i + \sum_{i} g_i + \epsilon} \tag{1} $$

where $p_i$ is the softmax output between 0 and 1, and $g_i = 1$ when $i$ is in the ground truth contour and 0 otherwise. $P$ represents the predicted tongue region given by the CNN and $G$ the ground truth. A smoothing factor of $\epsilon$, which was set to 1 here, was added to the numerator and the denominator to make the loss function smooth and to avoid zero division. Compared with the weighted crossentropy, the DSC can generate a slim tongue spline, but it also tends to force the model to generate probability values close to either 0 or 1, leading to overconfidence. The generated heatmaps are highly binarized, which is not a good reflection of the probabilistic encoding of the original masks.
Another way to counterbalance the disparity between the two classes is to use the weighted binary crossentropy loss. Assigning too large a weight to the minority class (contour) may cause the model to overpredict the minority class, oversmoothing the predicted tongue shape. Given the standard crossentropy loss,

$$ L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\left[ g_i \log p_i + (1 - g_i)\log(1 - p_i) \right] \tag{2} $$

where $p_i$ is the predicted probability and $g_i$ the ground truth label of pixel $i$, and $N$ is the number of pixels, the class weighted crossentropy (Eq. 3) assigns different weights $w_1$ and $w_0$ to the two categories by setting the weights to be the inverse of the ratios of the two categories respectively.

$$ L_{WC} = -\frac{1}{N}\sum_{i=1}^{N}\left[ w_1\, g_i \log p_i + w_0\, (1 - g_i)\log(1 - p_i) \right] \tag{3} $$

The compound loss (Eq. 4) is the weighted sum of the Dice loss and the standard crossentropy loss, with the weight $\lambda$ being a hyperparameter that can be tuned. The standard crossentropy functions as a regularizer to control the overconfidence induced by the DSC, forcing the model to generate a more graded probabilistic heatmap.

$$ L_{Compound} = L_{Dice} + \lambda L_{CE} \tag{4} $$

By adjusting $\lambda$, we can tune the predicted heatmap. We set the value of $\lambda$ for the current task based on pilot experiments with validation data. In order to assess the effect of these loss functions, we systematically compared the performance of three loss functions, namely the Dice loss, the weighted crossentropy (WC) and the compound loss.
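The three loss functions compared in this section can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, the smoothing constant of 1 follows the text, and the compound form `dice + lam * ce` with `lam=0.5` is only an assumed reading of the weighted sum, since the weight actually used is not reproduced here.

```python
import numpy as np


def dice_loss(p, g, eps=1.0):
    """Dice loss with a smoothing constant in numerator and denominator."""
    intersection = np.sum(p * g)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(p) + np.sum(g) + eps)


def cross_entropy(p, g, w_fg=1.0, w_bg=1.0, tiny=1e-7):
    """Binary crossentropy; class weights other than 1 give the weighted variant."""
    p = np.clip(p, tiny, 1.0 - tiny)  # avoid log(0)
    return -np.mean(w_fg * g * np.log(p) + w_bg * (1.0 - g) * np.log(1.0 - p))


def inverse_frequency_weights(g):
    """Weights set to the inverse of each class's share of the pixels."""
    fg_ratio = np.mean(g)
    return 1.0 / fg_ratio, 1.0 / (1.0 - fg_ratio)


def compound_loss(p, g, lam=0.5):
    """Dice loss regularized by standard crossentropy (lam is a placeholder)."""
    return dice_loss(p, g) + lam * cross_entropy(p, g)


# Toy mask in which 2% of the pixels belong to the contour.
g = np.zeros(100)
g[:2] = 1.0
w_fg, w_bg = inverse_frequency_weights(g)
print(w_fg, round(w_bg, 4))  # 50.0 1.0204
```

With a 2% foreground ratio, the contour class receives a weight roughly 49 times larger than the background, which is consistent with the tendency toward thick, overpredicted contours noted for the weighted crossentropy.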
Midsagittal ultrasound data were collected as MPEG video at 60 frames per second, using a Zonare Z.One Ultrasound Unit, operating at 4MHz and 70Hz scan rate with a P4-1C transducer. Tongue shape curves were annotated with Mark Tiede’s GetContours package for MATLAB (https://github.com/mktiede/GetContours), generating a 100-point spline for each curve from human-specified anchor points. Annotators were trained to mark the bottom edge of the white reflectance signal corresponding to the tongue surface. Our data consisted of 35160 human-annotated ultrasound frames from 11 American English speakers producing vowel and vowel-lateral syllable nuclei in CʌlC and CʌC pairs (e.g. ‘bulk’ and ‘buck’), collected for another project.
The data were split into training, validation and test sets through random partitioning, comprising 45%, 5% and 50% of the total data respectively. All models were trained only on the training dataset. In order to test the generalizability of our model to multiple machines and configurations, we used three additional test datasets, listed below. Except for the NS test data, these test sets were manually annotated by the first author. All images were scaled to 128×128 pixels.
The NS test data consisted of 3926 frames from two additional American English speakers (one male and one female) reading ‘The North Wind and the Sun’, collected using the same equipment and settings as the training data, but annotated in its entirety by each of three trained annotators.
The Ultrax test data consisted of 793 ultrasound images collected from a male typically developing child and a female child with a speech disorder.
The UltraSpeech test data consisted primarily of 241 ultrasound frames from two French sentences, each read by a different male French speaker.
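The random 45%/5%/50% partition of the 35160 annotated frames described above can be sketched as below. The function name and the fixed seed are assumptions for illustration; the resulting test partition of 17580 frames agrees with the size of the same-speaker test set.

```python
import numpy as np


def split_indices(n_frames, fractions=(0.45, 0.05, 0.50), seed=0):
    """Randomly partition frame indices into training, validation and test sets."""
    rng = np.random.default_rng(seed)  # seed is arbitrary, for reproducibility only
    order = rng.permutation(n_frames)
    n_train = int(round(fractions[0] * n_frames))
    n_val = int(round(fractions[1] * n_frames))
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]  # the remainder, about 50%
    return train, val, test


train, val, test = split_indices(35160)
print(len(train), len(val), len(test))  # 15822 1758 17580
```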
Each human annotation is represented by 100 pairs of Cartesian x-y coordinates. Each tongue-shape annotation was generated as a probability heatmap of the same size as the original ultrasound image (‘mask’). Given a sequence of x-y coordinates $(x_k, y_k)$, $k = 1, \dots, 100$, the Gaussian kernel in Eq. 5 was used to map the human-created 100-point tongue contour data into the mask.

$$ M(x, y) = \sum_{k} \exp\left( -\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2} \right) \tag{5} $$

$M(x, y)$ indicates the pixel intensity at point (x,y), representing the probability of each pixel being part of the tongue contour. Thus, pixels closer to the actual tongue surface coordinates are assigned higher probabilities, while all other pixels gradually diminish to 0 as they lie further away from the contour. The key is to treat each point as the center of a Gaussian distribution and to create a distribution over it on the mask. The distributions for all points are then summed and normalized to [0,1]. In the actual implementation, the default $\sigma$ in this study is set to 4, and values below 0.4 were thresholded to only retain pixels with high probabilities.
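The mask generation step can be sketched in NumPy as follows. This is an illustrative reading of the procedure, not the released code: the function name is hypothetical, the value 4 is interpreted as the standard deviation of the kernel, and normalization is assumed to be a min-max rescaling.

```python
import numpy as np


def contour_to_mask(points, size=128, sigma=4.0, threshold=0.4):
    """Render contour points as a probability heatmap of shape (size, size)."""
    ys, xs = np.mgrid[0:size, 0:size]
    mask = np.zeros((size, size), dtype=float)
    # Treat every annotated point as the centre of a Gaussian and sum them up.
    for x_k, y_k in points:
        mask += np.exp(-((xs - x_k) ** 2 + (ys - y_k) ** 2) / (2.0 * sigma ** 2))
    # Rescale to [0, 1] (assumed min-max normalization).
    mask = (mask - mask.min()) / (mask.max() - mask.min())
    # Keep only high-probability pixels.
    mask[mask < threshold] = 0.0
    return mask


# Toy contour: 100 points on a horizontal line at row 64.
points = [(x, 64.0) for x in np.linspace(20, 100, 100)]
mask = contour_to_mask(points)
print(mask.shape, mask.max())
```

Pixels on the toy contour receive values close to 1, while pixels more than a few multiples of `sigma` away are set to exactly 0 by the threshold.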
The training data were divided into multiple mini-batches, each with a size of 32 images. We used the Adam optimizer with a learning rate of 0.0001, and the model was trained for 30 epochs. The training process took approximately 2 hours using an NVIDIA Tesla K40 GPU in the University of Michigan’s FLUX computing cluster. The model that achieved the lowest validation loss was retained as the final model.
For each new image fed into the model, the output is a probability heatmap having the same size as the input image, with the intensity of each pixel again corresponding to the probability that the pixel is part of the tongue. A 50% threshold is then applied to the image to filter out unlikely predictions. Then a skeletonization algorithm is used to reduce the white edge to a single-pixel-wide representation. It is then interpolated and smoothed using ‘UnivariateSpline’ in the SciPy package with the default settings. The resulting output is a 100-point Cartesian coordinate representation of the predicted tongue shape.
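The post-processing chain can be approximated with the simplified NumPy-only sketch below. It deliberately swaps in simpler techniques than those named above: the centre row of the thresholded pixels in each column stands in for skeletonization, and linear interpolation with `np.interp` stands in for the smoothing spline. The function name is hypothetical, and the sketch assumes the contour has a single height per image column.

```python
import numpy as np


def heatmap_to_contour(heatmap, threshold=0.5, n_points=100):
    """Reduce a predicted heatmap to an (n_points, 2) array of x-y coordinates."""
    binary = heatmap >= threshold  # drop unlikely predictions
    cols = np.where(binary.any(axis=0))[0]
    if cols.size < 2:
        return None  # nothing usable was predicted
    rows = np.arange(heatmap.shape[0])[:, None]
    # Stand-in for skeletonization: one centre row per occupied column.
    centre = (rows * binary).sum(axis=0)[cols] / binary.sum(axis=0)[cols]
    # Stand-in for the smoothing spline: resample to n_points by linear interpolation.
    x_new = np.linspace(cols[0], cols[-1], n_points)
    y_new = np.interp(x_new, cols, centre)
    return np.column_stack([x_new, y_new])


# Toy heatmap: a three-pixel-thick horizontal band centred on row 60.
heatmap = np.zeros((128, 128))
heatmap[59:62, 20:101] = 0.9
contour = heatmap_to_contour(heatmap)
print(contour.shape, contour[0], contour[-1])
```

A column-wise reduction of this kind cannot represent contours that fold back on themselves, which is one reason a true skeletonization step is preferable in practice.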
The metric for evaluating error relative to human annotation is the Mean Sum of Distance (MSD), which permits the comparison of two curves without requiring point-wise alignment. The MSD between two sequences U and V can be computed as the average distance between a given point and its nearest point in the other sequence:

$$ MSD(U, V) = \frac{1}{2n}\left( \sum_{i=1}^{n} \min_{j} \lVert v_j - u_i \rVert + \sum_{j=1}^{n} \min_{i} \lVert u_i - v_j \rVert \right) \tag{6} $$

where $n$ is the number of points in each sequence, and $u_i$ and $v_j$ are pairs of x-y coordinates from the two sequences U and V under comparison.
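A direct NumPy implementation of the MSD, as described above, might look like the following. The function name is hypothetical, and the sketch divides by the total number of points in both curves, which equals 2n when both contain n points.

```python
import numpy as np


def mean_sum_of_distances(u, v):
    """Symmetric mean nearest-neighbour distance between two point sequences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    # Pairwise Euclidean distances, shape (len(u), len(v)).
    d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
    # Nearest point in V for every point of U, and vice versa.
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(u) + len(v))


# Two parallel toy contours that are 2 pixels apart.
u = np.column_stack([np.arange(10.0), np.zeros(10)])
v = np.column_stack([np.arange(10.0), np.full(10, 2.0)])
print(mean_sum_of_distances(u, v))  # 2.0
```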
5.1 Same-speaker evaluation
Table 1 below displays the average MSD between our model and the human annotators on all 17580 test ultrasound images. (We attempted to run comparisons with prior splining algorithms, but we were unable to find an appropriate set of hyperparameters for TongueTrack for our dataset, and were unable to run AutoTrace because of deprecated dependencies.) With the exception of the models trained with the weighted crossentropy loss, all models performed almost equally well, achieving an MSD as small as 0.85mm (about 3.5px). The low tracking error is likely due to the fact that these test images were from the same group of speakers as the training data.
|Model||MSD|
|D UNet-WC||4.35 (2.06)|
|D UNet-Dice||3.25 (1.96)|
|D UNet-Compound||3.79 (2.20)|
Table 1: Mean and (standard deviation) of Mean Sum of Distance (in pixels, 1 pixel ≈ 0.25mm) for the 17580 frame test dataset.
5.2 Cross-speaker evaluation
Table 2 shows the model performance on the NS test dataset relative to each human annotator. The average human-to-human difference (measured using MSD) was around 0.7mm (an estimated 2.79px), whereas the average human-to-CNN difference was around twice that (1.25mm, 5px). This is still good performance, particularly given that the NS test set contains two additional speakers and more diverse tongue shapes corresponding to fluid speech. The Dense U-Net with compound loss shows slightly better performance relative to the other models but, given the large variance resulting from outliers, it cannot be concluded that model architecture matters in this task.
The results also demonstrate that weighted crossentropy is not a suitable loss function for the current task, as models trained with weighted crossentropy lagged considerably behind the other models. In the weighted crossentropy loss, the “tongue contour” class is given a much higher weight, so the model tends to predict as much of the “tongue contour” class as possible to minimize the loss, resulting in thick predicted contours. In contrast, the Dice loss might be more effective in dealing with class imbalance. The downside of the Dice loss is that it biases the model to predict the “tongue contour” class with a probability that is uniformly 1, leading to overconfidence. In the compound loss (Eq. 4), the standard crossentropy term acts like a regularization term. As the Dice loss only assesses the intersection between the “tongue contour” class in the prediction and the ground truth, the crossentropy term puts some weight on the “background” pixels in the masks, resulting in a more graded prediction. In practice, the compound loss reduced outlier predictions more than the Dice loss did.
|Annotator / Model||A||B||C|
|A||0 (0)||2.33 (1.57)||2.83 (1.85)|
|B||2.33 (1.57)||0 (0)||3.21 (2.21)|
|C||2.83 (1.85)||3.21 (2.21)||0 (0)|
|UNet-WC||6.65 (2.92)||6.44 (2.74)||7.25 (3.24)|
|UNet-Dice||5.70 (2.68)||5.33 (2.37)||6.09 (2.87)|
|UNet-Compound||5.31 (2.60)||4.93 (2.25)||5.64 (2.76)|
|D UNet-WC||7.74 (3.27)||7.48 (3.03)||8.25 (3.45)|
|D UNet-Dice||5.15 (2.54)||4.77 (2.19)||5.65 (2.68)|
|D UNet-Compound||5.01 (2.52)||4.58 (2.12)||5.33 (2.63)|
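As a quick check of the averages quoted in this section, the pairwise values in the table above can be combined as follows. The conversion factor of 0.25mm per pixel is the approximate scale given with Table 1; the variable names are illustrative only.

```python
PX_TO_MM = 0.25  # approximate scale: 1 pixel is about 0.25 mm

# MSD in pixels between pairs of human annotators (upper triangle of the table).
human_pairs = {("A", "B"): 2.33, ("A", "C"): 2.83, ("B", "C"): 3.21}
# MSD in pixels between the Dense U-Net (compound loss) and each annotator.
dense_unet_compound = {"A": 5.01, "B": 4.58, "C": 5.33}

human_px = sum(human_pairs.values()) / len(human_pairs)
model_px = sum(dense_unet_compound.values()) / len(dense_unet_compound)

print(round(human_px, 2), round(human_px * PX_TO_MM, 2))  # 2.79 0.7
print(round(model_px, 2), round(model_px * PX_TO_MM, 2))  # 4.97 1.24
```

The human-to-human average reproduces the 2.79px (about 0.7mm) figure, and the best model's average of roughly 5px (about 1.25mm) is a little under twice that.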
5.3 Data augmentation and training data size
We retrained the models while varying the training data size incrementally from 1%, 5%, 10%, 30% and 50% to 100%. Data augmentation was applied to generate more diverse training data, including horizontal flipping, rotation within a limited range of angles, zooming, and horizontal and vertical shifting. Fig. 4 demonstrates that, despite some minor fluctuations, MSD tends to decrease with more training data. Models with data augmentation outperformed models without augmentation by a small margin when the original training data size was small, but the improvement brought by augmentation disappeared when the model was trained with more training data. Notably, Dense U-Net consistently showed slight improvements over U-Net when trained on the original training data. However, after training with augmented data, U-Net and Dense U-Net tended to have very similar performance, even approximating the best performance with as little as 30% of the entire training data (around 5000 frames). With data augmentation, the difference in performance resulting from the difference in architecture was greatly diminished. Even with only 1% of the training data (about 160 frames), the model can achieve reasonable accuracy with data augmentation. This highlights the importance of data augmentation in the current task.
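Two of the augmentation operations listed above, horizontal flipping and shifting, can be sketched in NumPy as below; rotation and zooming are omitted to keep the example dependency-free. The function names, the flip probability and the maximum shift of 10 pixels are assumptions for illustration. The important point is that the image and its mask must receive exactly the same transformation.

```python
import numpy as np


def shift(arr, dy, dx):
    """Shift a 2D array by (dy, dx) pixels, filling exposed borders with zeros."""
    out = np.zeros_like(arr)
    h, w = arr.shape
    src = arr[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def augment_pair(image, mask, rng, max_shift=10, flip_prob=0.5):
    """Apply the same random flip and shift to an image and its mask."""
    if rng.random() < flip_prob:
        image, mask = image[:, ::-1], mask[:, ::-1]  # horizontal flip
    dy, dx = (int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
    return shift(image, dy, dx), shift(mask, dy, dx)


rng = np.random.default_rng(0)
image = rng.random((128, 128))
mask = (image > 0.98).astype(float)
aug_image, aug_mask = augment_pair(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```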
5.4 Image size
Image size also affects the model performance, as shown in Table 3. Though both models produced comparable MSD with an image size of 32×32 or 224×224, Dense U-Net gives the lowest MSD when the image size is 64×64. These results show that images with more detail do not necessarily result in improved performance.
|Model||32×32||64×64||224×224|
|UNet-Compound||5.81 (2.86)||5.94 (3.05)||5.66 (2.66)|
|D UNet-Compound||5.60 (2.93)||5.03 (2.52)||5.53 (2.78)|
5.5 Testing images from different machines
To test the generalizability of our model, we examined the model performance with the additional Ultrax and UltraSpeech datasets, the results of which are displayed in Table 4. All models predicted the tongue shapes in the Ultrax test set well, with similar performance. The UltraSpeech test set posed a bigger challenge to these models because its noise distributions are quite different from those in the training data, but the Dense U-Net with compound loss proved to be stable across datasets, as it produced low MSD in both test sets. However, these results should also be interpreted with caution, as the accuracy is sensitive to the selection of the ROI. We used the same ROI for all models, but the results may vary if a different ROI is selected. This again highlights the difficulty of cross-domain prediction. A potential solution to cross-domain prediction is transfer learning.
|Model||Ultrax||UltraSpeech|
|UNet-Dice||5.52 (1.65)||7.92 (4.74)|
|UNet-Compound||6.10 (3.36)||8.25 (4.83)|
|D UNet-Dice||6.35 (2.40)||6.68 (3.46)|
|D UNet-Compound||5.71 (1.66)||5.72 (2.88)|
6 Error analysis
As the CNN is trained to identify the white edges directly corresponding to the tongue surface, additional or missing white edges due to bad image quality or speaker physiology can lead to failures in identifying parts of the tongue surface. In the absence of prior knowledge of plausible tongue shapes, the model will sometimes generate tracking errors when the white edge becomes blurry or interrupted. Similarly, bright edges in the image background are likely to be recognized as part of the tongue; tongue contours generated from image frames with these edges will likely suffer from implausible curvatures as interpolation in post-processing attempts to connect these regions. There are some potential solutions to these problems, including incorporating temporal constraints on tongue contour variations across frames, adding a smoothness constraint that penalizes discontinuity of tongue contours, or introducing a strong prior probability of possible tongue locations. In data processing, these issues can also be mitigated by tuning the parameters in post-processing to match the needs of the specific dataset, and remaining errors can be addressed through manual correction (even then, the workload is considerably reduced relative to manually labeling all frames).
In this study, we present a new open source tool for fully automated tongue contour extraction based on U-Net and Dense U-Net models. The implemented models were tested extensively on multiple test datasets. Though both models can perform automatic contour tracking with comparable accuracy, the Dense U-Net architecture seems more generalizable across datasets, whereas U-Net has faster extraction speed. Our evaluation results show that the choice of loss function and data augmentation have a larger effect on model performance than simply stacking more layers. Crucially, unlike many prior solutions, our tool requires minimal human intervention to obtain point-by-point splines. The average speed is 63 frames per second for U-Net and 29 frames per second for Dense U-Net on a consumer-grade laptop with an Intel i5-8600K processor and an NVIDIA 1070Ti GPU. The automatic contour extraction performed by our tool can potentially reduce the burden of time-consuming manual annotation in phonetic and clinical research.
We are grateful to Patrice Speeter Beddor, Andries Coetzee, Thomas Hueber and the UltraSuite research group for making available their ultrasound data. The data from Beddor and Coetzee were collected for a different project supported by NSF grant BCS-1348150.
-  M. Stone, “A guide to analysing tongue motion from ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 455–501, Jan. 2005.
-  M. Li, C. Kambhamettu, and M. Stone, “Automatic contour tracking in ultrasound images,” Clinical Linguistics & Phonetics, vol. 19, no. 6-7, pp. 545–554, Jan. 2005.
-  K. Xu, Y. Yang, M. Stone, A. Jaumard-Hakoun, C. Leboullenger, G. Dreyfus, P. Roussel, and B. Denby, “Robust contour tracking in ultrasound tongue image sequences,” Clinical Linguistics & Phonetics, vol. 30, no. 3-5, pp. 313–327, May 2016.
-  C. Laporte and L. Ménard, “Multi-hypothesis tracking of the tongue surface in ultrasound video recordings of normal and impaired speech,” Medical image analysis, vol. 44, pp. 98–114, 2018.
-  L. Tang and G. Hamarneh, “Graph-based tracking of the tongue contour in ultrasound sequences with adaptive temporal regularization,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, Jun. 2010, pp. 154–161.
-  A. Jaumard-Hakoun, K. Xu, P. Roussel-Ragot, G. Dreyfus, and B. Denby, “Tongue contour extraction from ultrasound images based on deep neural network,” arXiv:1605.05912 [cs], May 2016, arXiv: 1605.05912.
-  J. Berry and I. Fasel, “Dynamics of tongue gestures extracted automatically from ultrasound,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2011, pp. 557–560.
-  D. Fabre, T. Hueber, F. Bocquelet, and P. Badin, “Tongue Tracking in Ultrasound Images using EigenTongue Decomposition and Artificial Neural Networks,” in 16th Annual Conference of the International Speech Communication Association (Interspeech 2015), Dresden, Germany, Sep. 2015.
-  S. Wen, “Automatic tongue contour segmentation using deep learning,” Master’s thesis, Université d’Ottawa/University of Ottawa, 2018.
-  J. Zhu, W. Styler, and I. C. Calloway, “Automatic tongue contour extraction in ultrasound images with convolutional neural networks,” The Journal of the Acoustical Society of America, vol. 143, no. 3, pp. 1966–1966, 2018.
-  M. H. Mozaffari and W.-S. Lee, “Bownet: Dilated convolution neural network for ultrasound tongue contour extraction,” arXiv preprint arXiv:1906.04232, 2019.
-  J. Berry, D. Archangeli, and I. Fasel, “Automatic classification of tongue gestures in ultrasound images,” in Proceedings of 12th Conference on Laboratory Phonology, 2010.
-  D. Fabre, T. Hueber, L. Girin, X. Alameda-Pineda, and P. Badin, “Automatic animation of an articulatory tongue model from ultrasound images of the vocal tract,” Speech Communication, vol. 93, pp. 63–75, 2017.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  M. H. Mozaffari and W.-S. Lee, “Transfer learning for ultrasound tongue contour extraction with different domains,” arXiv preprint arXiv:1906.04301, 2019.
-  K. Xu, T. Gabor Csapo, P. Roussel, and B. Denby, “A comparative study on the contour tracking algorithms in ultrasound tongue images with automatic re-initialization,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. EL154–EL160, May 2016.
-  T. G. Csapo and S. M. Lulich, “Error analysis of extracted tongue contours from 2d ultrasound images,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks.” in CVPR, vol. 1, no. 2, 2017, p. 3.
-  X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P.-A. Heng, “H-denseunet: Hybrid densely connected unet for liver and tumor segmentation from ct volumes,” IEEE Transactions on Medical Imaging, 2018.
-  S. Guan, A. Khan, S. Sikdar, and P. V. Chitnis, “Fully dense unet for 2d sparse photoacoustic tomography artifact removal,” arXiv preprint arXiv:1808.10848, 2018.
-  F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 565–571.
-  S. Xie and Z. Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403.
-  M. Tiede and D. Whalen, “Getcontours: An interactive tongue surface extraction tool,” Proceedings of Ultrafest VII, 2015.
-  A. Eshky, M. S. Ribeiro, J. Cleland, K. Richmond, Z. Roxburgh, J. M. Scobbie, and A. A. Wrench, “Ultrasuite: A repository of ultrasound and acoustic data from child speech therapy sessions,” in INTERSPEECH 2018: Proceedings of the 19th Annual Conference of the International Speech Communication Association (ISCA), 2-6 September 2018, Hyderabad, India. International Speech Communication Association, 2018.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  T. Zhang and C. Y. Suen, “A fast parallel algorithm for thinning digital patterns,” Communications of the ACM, vol. 27, no. 3, pp. 236–239, 1984.
-  L. Tang, T. Bressmann, and G. Hamarneh, “Tongue contour tracking in dynamic ultrasound via higher-order MRFs and efficient fusion moves,” Medical Image Analysis, vol. 16, no. 8, pp. 1503–1520, Dec. 2012.
-  I. Fasel and J. Berry, “Deep Belief Networks for Real-Time Extraction of Tongue Contours from Ultrasound During Speech,” in 2010 20th International Conference on Pattern Recognition, Aug. 2010, pp. 1493–1496.