Ultrasound (US) is a non-invasive diagnostic modality that uses low-energy US waves to capture tissue characteristics. US image segmentation has been used in diagnosis as well as in image-guided interventions. Despite their wide application, US images are heavily contaminated with speckle noise, which makes accurate segmentation of different structures challenging. US segmentation has therefore attracted a growing number of research groups, leading to important achievements using both traditional machine learning techniques and, more recently, deep learning algorithms [1, 2].
Recent deep learning techniques have shown promising results on various data analysis tasks, including classification, segmentation, and automatic speech recognition, owing to their ability to learn efficient and compact data representations. Despite this success, training deep learning architectures requires a large amount of data. In US imaging, data acquisition and interpretation by clinical experts are very expensive. Two research avenues are often pursued to address this issue. First, one can use an architecture designed with this limitation in mind. The U-Net architecture, built upon the fully convolutional network, is the best-known architecture for biomedical image segmentation and combines several convolutional, max-pooling, and upsampling layers. The application of U-Net to simulated US images has recently been proposed. Second, various data augmentation strategies for medical images have been proposed to increase the size of the training data [7, 8]. However, to the best of our knowledge, no previous work trains a deep neural network on simulated US data and tests it on real data. In addition, no previous work compares the utility of envelope data and B-mode images. We therefore address the following two questions in this work:
Can a network be trained on simulation data and tested on real data?
Which data provides the best segmentation results, envelope data or B-mode images?
Radio-frequency (RF) data is the raw data generated by beamforming the channel data. Because it contains very high frequencies, RF data is not suitable for visualization, and its envelope is extracted instead. Since the envelope data has a very high dynamic range, it undergoes logarithmic compression; the result, usually called a B-mode US image, is suitable for display. Although logarithmic compression leads to better visualization of US data, pixels with higher values are compressed and a substantial amount of information is lost in the process. In this work, we compare the segmentation results obtained from envelope data and from B-mode images.
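The RF-to-envelope-to-B-mode chain described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the processing used in this work: the envelope is taken as the magnitude of the analytic signal (a common choice, assumed here), and the 60 dB display dynamic range, the function names, and the synthetic Gaussian pulse are all hypothetical. Only the 3.5 MHz center frequency and 100 MHz sampling frequency are taken from the simulation settings.

```python
import numpy as np


def analytic_signal(rf):
    """FFT-based analytic signal along axis 0 (axial samples) of a 2-D array."""
    n = rf.shape[0]
    spectrum = np.fft.fft(rf, axis=0)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep DC
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep Nyquist bin
        weights[1:n // 2] = 2.0           # double positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights[:, None], axis=0)


def rf_to_envelope(rf):
    """Envelope data: magnitude of the analytic signal of each RF line."""
    return np.abs(analytic_signal(rf))


def log_compress(envelope, dynamic_range_db=60.0):
    """Log-compress an envelope image to a B-mode image scaled to [0, 1].

    dynamic_range_db is an assumed display setting, not a value from the paper.
    """
    env = envelope / envelope.max()
    floor = 10.0 ** (-dynamic_range_db / 20.0)
    db = 20.0 * np.log10(np.maximum(env, floor))
    return (db + dynamic_range_db) / dynamic_range_db


# Synthetic single RF line: a Gaussian pulse modulating a 3.5 MHz carrier.
fs = 100e6
f0 = 3.5e6
t = np.arange(2048) / fs
pulse = np.exp(-((t - t.mean()) ** 2) / (2 * (1e-6) ** 2))
rf = (pulse * np.cos(2 * np.pi * f0 * t))[:, None]

envelope = rf_to_envelope(rf)   # recovers the slowly varying pulse shape
bmode = log_compress(envelope)  # compressed version suitable for display
```

In this sketch, everything below the assumed dynamic range is clipped to zero and large envelope values are squeezed together near one, which illustrates the kind of information loss that motivates comparing envelope data against B-mode images.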
Herein, we propose a novel strategy based on the use of simulated US images as the training set. We use the U-Net architecture to learn the segmentation masks of the simulated US images. The trained network is then used to predict the segmentation masks of tissue-mimicking phantom data. The outline of this study is as follows. Section II describes the U-Net architecture and its parameters, as well as the simulated and phantom data used for training and testing, respectively. Section III presents the segmentation masks predicted by our proposed strategy along with a quantitative comparison of the results. We conclude the paper in Section IV with some directions for future research.
II-A U-Net Architecture
The U-Net architecture consists of two paths, a contracting path and an expansive path. The contracting path (left path in Fig. 1) consists of several repetitions of two convolution (conv) layers and one max-pooling (max pool) layer with the kernel sizes of and , respectively. The expansive path (right path in Fig. 1) comprises repetitions of a concatenation of the features extracted from the corresponding layers in the contracting path (copy and crop), two convolution layers, and one upsampling (up-conv) layer. In this path, the kernel for the convolution layers is the same as in the contracting path and for the upsampling layer. The final layer has one convolution kernel.
Activation functions in the convolution layers are set to rectified linear units (ReLU), except in the last layer, which uses Softmax. We pose the segmentation problem as a pixel-wise classification, which leads to a three-class classification for our dataset. In the last layer of our architecture, three convolution kernels are used, and the loss function is set to categorical cross-entropy. The learning rate, optimizer, and weight initializer are set to 0.00001, Adam, and He-normal, respectively. For the remaining parameters, we follow the initial parameters proposed in . U-Net is trained for 100 epochs with a batch size of 8 so that the data fits in memory. The code implementing U-Net is written in Python (version 3.6) using Keras with a TensorFlow backend. An NVIDIA Titan Xp GPU with 12 GB of memory on Ubuntu 16.04 LTS is used for training and testing.
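To make the output layer and loss concrete, the following NumPy sketch evaluates a pixel-wise Softmax over three class channels and the corresponding categorical cross-entropy. It is a stand-alone illustration under stated assumptions, not the Keras code used in this work: the function names, the tiny 4 x 4 image, the random logits, and the clipping constant `eps` are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Softmax over the last (class) axis, computed independently per pixel."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def categorical_cross_entropy(one_hot, probs, eps=1e-7):
    """Mean over pixels of -sum_c y_c * log(p_c); eps avoids log(0)."""
    probs = np.clip(probs, eps, 1.0)
    return float(-(one_hot * np.log(probs)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
n_classes = 3                                   # three-class problem, as in the text
logits = rng.normal(size=(4, 4, n_classes))     # stand-in for the last-layer output
labels = rng.integers(0, n_classes, size=(4, 4))
one_hot = np.eye(n_classes)[labels]             # ground truth, one channel per class

probs = softmax(logits)
loss = categorical_cross_entropy(one_hot, probs)
predicted_mask = probs.argmax(axis=-1)          # most probable class per pixel
```

A network that outputs equal probabilities for all three classes gives a loss of ln 3 (about 1.10), which is a useful reference point when reading training curves for a three-class problem.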
II-B Simulation Data
Simulated RF data is generated using the publicly available ultrasound simulation software Field_II [12, 13], run on MATLAB release 2018, followed by envelope extraction. The virtual ultrasound transducer parameters are initialized as outlined in Table I. We randomly distribute scatterers in all phantoms to ensure each has on average 4 scatterers. The simulated images randomly contain hyperechoic lesions (i.e., tissues with higher echogenicity) and anechoic lesions (i.e., tissues with lower echogenicity). In hyperechoic lesions, the scatterer intensities are larger than those of the background by a factor drawn as a random integer between 1 and 10. The lesions are placed between -20 mm and +20 mm in the lateral direction and between 30 mm and 90 mm in the axial direction. Lesion shapes are circles or ellipses with random sizes. The radii of the circles are between 1 mm and 3 mm, and the semi-major and semi-minor axes of the ellipses are between 5 mm and 9 mm and between 1 mm and 5 mm, respectively. In total, 700 images are simulated. We then split the data into training, validation, and test sets containing 60%, 15%, and 25% of the images, yielding 420, 105, and 175 images, respectively. Envelope and B-mode images with initial size of are then resized to , mirrored (with the same mirroring factor as described in ) to size . The intensity range is normalized to the range 0-1 before the images are fed to the U-Net. As mentioned earlier, a simulated image may contain both hyperechoic and anechoic lesions. Therefore, including the background, three classes must be distinguished. As such, the ground truth of a simulated image is in the size of .
| Property Name | Property Value |
| Number of RF lines | 50 |
| Start depth of simulation data | 30 mm from the transducer surface |
| Depth of simulation data | 90 mm from the transducer surface |
| Lateral distance of simulation data | 40 mm (from -20 mm to 20 mm) |
| Speed of sound | 1540 m/s |
| Center frequency | 3.5 MHz |
| Sampling frequency | 100 MHz |
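The data preparation steps for the simulated images (splitting, intensity normalization, mirroring, and a three-channel ground truth) can be sketched as follows. This is a minimal sketch under several assumptions that the text does not confirm: the split is taken to be a random shuffle, mirroring is implemented with NumPy's `reflect` padding, the pad width is arbitrary, and the class ids (0 for background, 1 for hyperechoic, 2 for anechoic) and all function names are hypothetical.

```python
import numpy as np

N_CLASSES = 3  # background, hyperechoic, anechoic (id assignment is hypothetical)


def split_indices(n_images, fractions=(0.60, 0.15, 0.25), seed=0):
    """Shuffle image indices and cut them into train/validation/test parts."""
    order = np.random.default_rng(seed).permutation(n_images)
    n_train = int(round(fractions[0] * n_images))
    n_val = int(round(fractions[1] * n_images))
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])


def normalize(image):
    """Rescale intensities linearly to the range [0, 1]."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)


def mirror_pad(image, pad):
    """Extend the image on every side by reflecting it about its borders."""
    return np.pad(image, pad, mode="reflect")


def to_one_hot(label_map, n_classes=N_CLASSES):
    """Turn an integer label map (H, W) into a ground truth of shape (H, W, C)."""
    return np.eye(n_classes, dtype=np.float32)[label_map]


train_idx, val_idx, test_idx = split_indices(700)
print(len(train_idx), len(val_idx), len(test_idx))  # 420 105 175

image = np.arange(12, dtype=float).reshape(3, 4)   # toy stand-in for an image
padded = mirror_pad(normalize(image), pad=2)       # pad width is arbitrary here
labels = np.array([[0, 1], [2, 0]])                # toy label map
one_hot = to_one_hot(labels)
```

With 700 images, the 60%, 15%, and 25% fractions reproduce the 420, 105, and 175 images reported above.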
II-C Tissue-Mimicking Phantom Data
US data is acquired from a CIRS Multi-Purpose Multi-Tissue ultrasound phantom with an Alpinion E-Cube system (Bothell, WA) using the L3-12H transducer at a center frequency of 10 MHz and a sampling rate of 40 MHz. Table II and Fig. 2 show the transducer and phantom details, respectively. The phantom contains different types of circular lesions at different depths. In this work, a total of 6 phantom images with a depth of 40 mm are acquired from different locations of the phantom.
| Property Name | Property Value |
| Number of RF lines | 384 |
| Start depth of simulation data | 4 mm from the transducer surface |
| Depth of simulation data | 40 mm from the transducer surface |
| Lateral distance of simulation data | 40 mm (from -20 mm to 20 mm) |
| Speed of sound | 1540 m/s |
| Center frequency | 10 MHz |
| Sampling frequency | 40 MHz |
II-D Quantitative Evaluation
To evaluate the performance of the network, we report two metrics that compare the predicted mask with the ground truth mask: the Dice Similarity Coefficient (DSC) and score. In terms of pixel counts, the DSC takes its standard form

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN},$$

where $TP$, $FP$, and $FN$ are the numbers of true positive, false positive, and false negative pixels, respectively. For the simulation data, the images are simulated from predefined lesion locations, which serve as the ground truth. For the phantom data, however, the ground truth is obtained manually using the ImageJ software.
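As an illustration of how such a score is computed from a pair of label maps, the sketch below evaluates the standard DSC, 2TP / (2TP + FP + FN), for one class at a time. The function name, the toy masks, and the convention of returning 1.0 when a class is absent from both masks are assumptions made here; how per-class scores are aggregated in this work is not specified in the text.

```python
import numpy as np


def dice_per_class(truth, pred, class_id):
    """Standard Dice Similarity Coefficient for one class of an integer label map."""
    t = truth == class_id
    p = pred == class_id
    tp = np.sum(t & p)    # pixels of this class in both masks
    fp = np.sum(~t & p)   # predicted as this class but not in the ground truth
    fn = np.sum(t & ~p)   # in the ground truth but missed by the prediction
    denom = 2 * tp + fp + fn
    if denom == 0:
        return 1.0        # class absent from both masks (assumed convention)
    return float(2.0 * tp / denom)


# Toy example with hypothetical class ids: 0, 1, and 2.
truth = np.array([[0, 1, 1],
                  [0, 2, 2]])
pred = np.array([[0, 1, 0],
                 [0, 2, 2]])

scores = [dice_per_class(truth, pred, c) for c in range(3)]
```

In this toy case one pixel of class 1 is predicted as class 0, so class 1 has TP = 1, FN = 1, and FP = 0 (DSC = 2/3), class 0 has TP = 2 and FP = 1 (DSC = 0.8), and class 2 is segmented perfectly (DSC = 1).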
U-Net is trained separately on the simulated envelope data and on the simulated B-mode images, using simulation data only, yielding two sets of trained weights. The trained weights are then used to test on simulated data (different from the training data) and on real phantom data, yielding predicted segmentation masks.
To provide a comprehensive comparison, this section illustrates the masks predicted by U-Net when trained on simulated envelope and B-mode images and tested on both simulated and phantom data. Furthermore, we report the DSC and scores for simulated and phantom data.
III-A Predicted Masks for Simulation Data
Fig. 3 shows an example of simulated envelope data, the corresponding B-mode image, the ground truth mask, and the predicted masks. In this particular example, the simulated data contains six lesions: five hyperechoic and one anechoic. It is important to highlight that four of the lesions are located on the borders and are therefore only partly contained in the image. The predicted masks provide clearer boundaries of all aforementioned lesions compared to the ground truth mask.
Mean and standard deviation of the DSC and scores for the masks predicted by the networks trained on envelope data and on B-mode images are summarized in Table III.
The mean of evaluation scores for the predicted masks from the network trained on envelope and B-mode image are and for , and and for score, respectively. The high values of both scores indicate that U-Net has a promising structure for segmenting ultrasound images and is capable of learning the intrinsic features of the simulated data. Furthermore, they show that the network can learn mappings from the domain of envelope data or B-mode images to pixel-level segmentation masks.
III-B Predicted Masks for Phantom Data
An example of the envelope data and B-mode image of our tissue-mimicking phantom is shown in Fig. 4. Figs. 4 (d) and (e) show the results of training our network on envelope and B-mode images of simulated data and testing on the phantom data. In this particular example, the phantom data contains three lesions. In all predicted masks, the anechoic lesion (dark cyst), which is the most clearly visible, is segmented successfully. The mask derived from envelope data outperforms the B-mode mask because it also detects a small portion of a hyperechoic lesion. The anechoic lesion predicted from the B-mode image is shown in white instead of grey because only one class has been detected.
It is important to highlight that the network has not seen any real images and is trained entirely on simulation data. Two conclusions can be drawn from this observation. First, the Field_II simulation model is similar enough to real experiments to be used for training deep learning techniques. Second, the network does not suffer from overfitting and has learned an efficient representation of US images.
Table IV presents the quantitative evaluation for phantom data. The mean of evaluation scores for the predicted masks from the network trained on envelope data and B-mode image are and for , and and for , respectively.
In this work, we demonstrated the feasibility of training a deep learning architecture on simulated US data and testing it on real tissue-mimicking phantom data. Our work therefore offers simulated data as an alternative when real training data is limited. Moreover, training on envelope data outperforms training on B-mode images. For future investigations, we aim to test our strategy on in-vivo data. We will further examine whether a network trained on simulation data can be fine-tuned using a few real experiments.
We would like to thank NVIDIA for donation of the Titan Xp GPU.
-  J. A. Noble and D. Boukerroui, “Ultrasound image segmentation: a survey,” IEEE Transactions on medical imaging, vol. 25, no. 8, pp. 987–1010, 2006.
-  O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol. 29, no. 6, pp. 82–97, 2012.
-  A. A. Nair, T. D. Tran, A. Reiter, and M. A. L. Bell, “A deep learning based alternative to beamforming ultrasound images,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 3359–3363.
-  S. Pereira, A. Pinto, V. Alves, and C. A. Silva, “Brain tumor segmentation using convolutional neural networks in mri images,” IEEE transactions on medical imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
-  H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image synthesis for data augmentation and anonymization using generative adversarial networks,” in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2018, pp. 1–11.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
-  J. A. Jensen, “Field: A program for simulating ultrasound systems,” in 10th Nordicbaltic Conference on Biomedical Imaging, Vol. 4, Supplement 1, Part 1: 351–353. Citeseer, 1996.
-  J. A. Jensen and N. B. Svendsen, “Calculation of pressure fields from arbitrarily shaped, apodized, and excited ultrasound transducers,” IEEE transactions on ultrasonics, ferroelectrics, and frequency control, vol. 39, no. 2, pp. 262–267, 1992.
-  C. A. Schneider, W. S. Rasband, and K. W. Eliceiri, “Nih image to imagej: 25 years of image analysis,” Nature methods, vol. 9, no. 7, p. 671, 2012.