Breast cancer has been reported as one of the leading causes of death among women worldwide. Although digital mammography is an effective modality for breast cancer detection, it has limitations in detecting dense lesions, which are similar to dense tissues, and it uses ionizing radiation. Therefore, ultrasound (US) imaging, as a safe and versatile screening and diagnostic modality, plays an important role in this regard. However, because US images are contaminated with speckle noise, they have low resolution and poor contrast between the target tissue and the background; thus, their segmentation remains a challenging task. Researchers have utilized recent state-of-the-art deep learning techniques to overcome the limitations of manual segmentation. Despite the success of deep learning techniques in computer vision tasks, their performance depends on the size of the input data, which is limited, especially for medical US images. The collection and annotation of US images require considerable effort and time, which motivates the need for a deep learning-based strategy that can be trained on as few annotated data as possible.
The U-Net architecture, one of the most well-known networks for segmentation, is built upon the fully convolutional network. It involves several convolutional, max-pooling, and up-sampling layers. To cope with the limited data available for training U-Net, researchers have proposed various strategies based on data augmentation and transfer learning [4, 6, 8]. Data augmentation cannot truly capture the characteristics of the real data when very limited data is available. To this end, we propose a methodology based on transfer learning which utilizes US simulation data and natural images as auxiliary datasets. The goal is to enhance the segmentation results when only a few images are available. In our work, we first pre-train the U-Net with US simulated and natural images separately, and then fine-tune the network with only a small portion of the available in vivo images. We demonstrate an improvement in segmentation results when only a small number of images is available.
In deep learning approaches, the improvement in results depends on the amount of training data; such techniques perform better when a larger amount of training data is available. In medical images, especially in US images, annotating a sufficient number of training images is expensive. We therefore take advantage of simulated US data as well as natural images as auxiliary datasets for pre-training U-Net in our proposed workflow. The workflow consists of three avenues, as shown in Fig. 1. In the first avenue, U-Net is trained using only a subset of the in vivo dataset. In the second avenue, U-Net is first pre-trained on the simulated data, and then fine-tuned using the same subset of the in vivo dataset that was used in the first avenue. The last avenue is similar to the second, with the difference that natural images are used for pre-training. Section 2.5 describes each avenue in more detail.
2.1 In vivo Data
The in vivo dataset includes 163 breast B-mode US images with lesions and a mean image size of . The images and the corresponding lesion delineations are publicly available upon request. The breast lesions of interest are generally hypoechoic (i.e. tissues with lower echogenicity), that is, darker than the surrounding tissue. Only of the total number of in vivo images were used as training and validation datasets and the remaining were set as the testing dataset. The training dataset was selected to be 4 times larger than the validation dataset, yielding , , and images for the training, validation, and testing datasets, respectively.
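The split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_counts` and the parameter `train_val_fraction` are hypothetical, and the fraction actually used in the paper is not reproduced here, so it must be supplied by the caller.

```python
def split_counts(n_total, train_val_fraction, train_to_val_ratio=4):
    """Return (n_train, n_val, n_test) image counts.

    train_val_fraction is a hypothetical placeholder for the share of
    images reserved for training plus validation; the rest is the test set.
    """
    if not 0.0 < train_val_fraction <= 1.0:
        raise ValueError("train_val_fraction must be in (0, 1]")
    n_train_val = int(n_total * train_val_fraction)
    # The training set is (approximately) train_to_val_ratio times larger
    # than the validation set; integer division can leave a small remainder
    # that is assigned to the training set.
    n_val = n_train_val // (train_to_val_ratio + 1)
    n_train = n_train_val - n_val
    n_test = n_total - n_train_val
    return n_train, n_val, n_test
```

Calling `split_counts(163, f)` for a chosen fraction `f` always returns three counts that sum to the 163 available images.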
2.2 Simulation Data
To simulate B-mode images, Field_II, a publicly available MATLAB-based US simulation software, was used. The number of RF lines, centre frequency, sampling frequency, and speed of sound were set to , , , and , respectively. In our simulation phantom, the surface started at from the transducer surface and the axial, lateral, and elevational distances were initiated as , , and , respectively. The scatterers were randomly distributed in our virtual phantom such that each of phantom had on average scatterers, to allow for fast ultrasound simulation. Each simulated image randomly contained hyperechoic lesions (i.e. tissues with higher echogenicity, that is, brighter than the surrounding tissue), hypoechoic lesions, or both at the same time, so that the network could better learn the various possible textures of US images. The intensities of hyperechoic lesions were set times higher than the background, where was an integer in the range of ; for hypoechoic lesions, the intensities were set times the background, where was a random variable between and . The locations of the lesions were randomly selected, with circular or ellipsoidal shapes. A total of 700 images were simulated and split into training, validation, and testing sets with splitting factors of , , and of the total number of images, yielding , , and images, respectively. It is worth mentioning that, as the in vivo data consisted of hypoechoic lesions, in the masks of the simulated data only the pixels inside the hypoechoic lesions were set to , and the remaining pixels were set to . Therefore, some simulated images had no segmented lesions in their masks.
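The mask construction described above, in which only hypoechoic lesions are labelled, can be sketched as follows. This is an illustrative sketch under stated assumptions: the lesion dictionaries, the function names, and the label values (1 for foreground, 0 for background) are hypothetical choices, not taken from the paper or from Field_II.

```python
def ellipse_contains(row, col, center, radii):
    """Return True if pixel (row, col) lies inside an axis-aligned ellipse."""
    cy, cx = center
    ry, rx = radii
    return ((row - cy) / ry) ** 2 + ((col - cx) / rx) ** 2 <= 1.0


def build_mask(height, width, lesions, foreground=1, background=0):
    """Build a segmentation mask that marks hypoechoic lesions only.

    Each lesion is a dict with keys "kind", "center" and "radii"
    (a hypothetical layout). Hyperechoic lesions are deliberately skipped,
    so an image containing only hyperechoic lesions gets an empty mask.
    """
    mask = [[background] * width for _ in range(height)]
    for lesion in lesions:
        if lesion["kind"] != "hypoechoic":
            continue
        for r in range(height):
            for c in range(width):
                if ellipse_contains(r, c, lesion["center"], lesion["radii"]):
                    mask[r][c] = foreground
    return mask
```

A circular lesion is the special case in which both radii are equal.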
2.3 Natural Data
The natural images are publicly available at . The dataset consists of images of salient objects with their annotations. In our work, the dataset was split into training, validation, and testing sets with splitting factors of , , and of the total number of images, yielding , , and images, respectively.
2.4 U-Net Architecture
The previously proposed U-Net structure utilizes several conv-block, max-pooling, up-sampling, and skip connection layers, as illustrated in Fig. 2. Each conv-block consists of two repeated convolution layers; in the contraction and expansion paths, the conv-blocks are followed by max-pooling and up-sampling layers, respectively. In this work, the kernel sizes of the convolution, max-pooling, and up-sampling layers were set to 3×3, 2×2, and 2×2, respectively. As a pre-processing step, all the images were resized to , mirrored with the mirroring factor of pixels, yielding images with size , and normalized to the range of . Thus, the size of the input and output data was and , respectively, where indicates the number of images in each batch. The activation and loss functions, optimizer, learning rate, number of epochs, batch size, weight initializer, and kernel regularizer were initialized as stated in Table 1. The Dice score is defined as , where and are the ground truth and predicted masks, respectively.
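The mirroring step of the pre-processing can be sketched as follows. This is a minimal illustration, assuming reflection about the border pixel (the edge pixel itself is not repeated); the paper does not state which variant was used, and the function names and the `pad` parameter are hypothetical.

```python
def reflect_index(i, n):
    """Map an out-of-range index back into [0, n) by mirroring at the borders."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def mirror_pad(image, pad):
    """Pad a 2D image (list of rows) by `pad` mirrored pixels on every side.

    The output has size (height + 2 * pad) x (width + 2 * pad).
    """
    h, w = len(image), len(image[0])
    if pad >= h or pad >= w:
        raise ValueError("pad must be smaller than both image dimensions")
    return [
        [image[reflect_index(r, h)][reflect_index(c, w)]
         for c in range(-pad, w + pad)]
        for r in range(-pad, h + pad)
    ]
```

Mirroring gives the border pixels a plausible neighbourhood, so convolutions near the image edge see realistic context rather than zeros.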
As previously mentioned, in this work we propose three avenues to study the impact of simulated and natural images as auxiliary datasets for US segmentation (see Fig. 1). The following paragraphs explain each avenue in detail:
2.5.1 Avenue 1: Train U-Net on in vivo data from scratch
In the first avenue, the U-Net structure with the above-mentioned parameters was trained on in vivo images from scratch using and images as training and validation sets, respectively, and was tested on images. We refer to this trained network as Pt_invivo. Due to the small number of training data, we used -fold cross-validation to reduce variation in performance. Prior to each optimization iteration, we performed "on-the-fly" augmentation by applying random height-shift, width-shift, and zooming.
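The shift part of the on-the-fly augmentation can be sketched as follows. This is an illustrative sketch only: zooming is omitted for brevity, and the function names, the `max_shift` parameter, and the zero fill value are hypothetical choices rather than the settings used in the paper.

```python
import random


def shift_2d(image, dy, dx, fill=0):
    """Shift a 2D image by dy rows and dx columns, filling exposed pixels."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            src_r, src_c = r - dy, c - dx
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = image[src_r][src_c]
    return out


def random_shift_pair(image, mask, max_shift, rng=random):
    """Apply the same random height/width shift to an image and its mask."""
    dy = rng.randint(-max_shift, max_shift)
    dx = rng.randint(-max_shift, max_shift)
    return shift_2d(image, dy, dx), shift_2d(mask, dy, dx)
```

The image and its mask must receive the same transform; otherwise the lesion labels would no longer line up with the pixels they describe. Drawing a fresh shift before each iteration means the network rarely sees the exact same sample twice.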
2.5.2 Avenue 2: Train U-Net with US simulated images and fine-tune with in vivo data
In this avenue, U-Net was first trained using and simulation images as its training and validation sets, respectively. Similar to the first avenue, the U-Net was initialized using the parameters listed in Table 1. For simplicity, we refer to the U-Net trained on simulated data as Pt_sim. Afterwards, the contraction path of Pt_sim was fine-tuned on the in vivo training and validation sets based on the parameters in Table 1, except that the weights were initialized using the Pt_sim weights. We refer to the fine-tuned network as Ft_sim_invivo, which was tested on the in vivo test set. -fold cross-validation and "on-the-fly" augmentation were used for fine-tuning our Ft_sim_invivo network.
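One way to read the fine-tuning step is that all weights start from the pre-trained network while only the contraction-path layers are updated. The framework-free sketch below illustrates that reading; it is an assumption about the procedure, and the layer-name prefix `contract_`, the weight dictionary layout, and the function name are all hypothetical.

```python
import copy


def prepare_fine_tuning(pretrained_weights, trainable_prefix="contract_"):
    """Initialise a network from pre-trained weights and pick trainable layers.

    pretrained_weights maps layer names to weight lists (hypothetical layout).
    Returns a deep copy of the weights, so the pre-trained network is left
    untouched, and a dict flagging which layers are updated in fine-tuning.
    """
    weights = copy.deepcopy(pretrained_weights)
    trainable = {name: name.startswith(trainable_prefix) for name in weights}
    return weights, trainable
```

In a deep learning framework, the same idea is usually expressed by loading the saved weights and then marking the expansion-path layers as non-trainable before compiling the model.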
2.5.3 Avenue 3: Train U-Net with natural images and fine-tune with in vivo data
In this avenue, similar to the one described above, U-Net was first pre-trained and then fine-tuned on the in vivo data. However, for pre-training the network we used and natural images as training and validation sets, respectively. For simplicity, the U-Net pre-trained with natural images is referred to as Pt_nat and the network fine-tuned from Pt_nat is referred to as Ft_nat_invivo. -fold cross-validation and "on-the-fly" augmentation were used in the fine-tuning step.
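The cross-validation used in the avenues above can be sketched as follows. This is a minimal illustration in which the number of folds `k` is left as a parameter, since the value used in the paper is not reproduced here; the function name and the interleaved fold assignment are hypothetical choices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation.

    Samples are assigned to folds in an interleaved fashion; each fold
    serves as the validation set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

Each pair would index into the pool of in vivo images reserved for training and validation, while the test set stays fixed across folds.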
3.1 Evaluation Criteria
In this work, we used the Dice Similarity Coefficient () as our evaluation criterion. It is worth noting that we also used function, and then compared with the ground truth masks. As our dataset was unbalanced (i.e. the number of background pixels was higher than the number of lesion pixels), we only report the scores of the foreground (i.e. lesion) masks, ignoring the score of the background.
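The foreground-only score described above can be sketched using the standard definition of the Dice coefficient: twice the number of pixels labelled as lesion in both masks, divided by the total number of lesion pixels in the two masks. The function name, the foreground label of 1, and the convention of returning 1.0 when both masks are empty are assumptions made for this sketch.

```python
def dice_score(truth, pred, foreground=1):
    """Dice similarity between two 2D masks, computed on the foreground only."""
    t = [v == foreground for row in truth for v in row]
    p = [v == foreground for row in pred for v in row]
    if len(t) != len(p):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(t, p) if a and b)
    total = sum(t) + sum(p)
    if total == 0:
        # Both masks are empty: treated as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / total
```

Because background pixels never enter the numerator or the denominator, a large and easily predicted background cannot inflate the score.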
3.2 Experimental Results
Table 2 presents the scores of the predicted masks derived from the Pt_invivo, Pt_sim, Ft_sim_invivo, Pt_nat, and Ft_nat_invivo networks for both the training and testing in vivo sets. The score for the test set increases when we fine-tune the pre-trained network, regardless of which type of images was used during pre-training. Therefore, pre-training the network performs better than training from scratch with limited training data. It is worth mentioning that we used number of natural images and number of simulated images during pre-training. However, when we decreased the number of natural images in from to in order to be equal to the number of simulated images, the score was reduced from to as shown in Table 2 (Pt_nat420 and Ft_nat420_invivo refer to the repetition of the experiment using natural images). As a result, when the same number of images is available from both datasets, simulated data is preferable to natural images as the auxiliary dataset for pre-training. Figure 3 shows examples of the predicted masks with their scores.
We had natural images in which hours were needed to train the Ft_nat_invivo network. For training on simulation, hours, and for training/fine-tuning on in vivo, minutes were needed. As more annotations become available, the U-Net is better trained, but more time is needed for the pre-training step.
|Network Name||Train in vivo||Test in vivo|
Table 2: Mean and standard deviation of scores for predicted masks of in vivo train and test sets over -fold cross-validation
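The per-network summary reported in the table can be computed from the per-fold scores as sketched below. This is an illustrative sketch: the function name is hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption, since the paper does not state which was used.

```python
import statistics


def summarize_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    if len(scores) < 2:
        raise ValueError("at least two fold scores are required")
    return statistics.mean(scores), statistics.stdev(scores)
```

Replacing `statistics.stdev` with `statistics.pstdev` gives the population standard deviation instead.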
4 Conclusion and future work
In this work, we showed that pre-training the network performs better than training it from scratch, especially when the number of annotations is limited. We proposed the use of simulated US images as the auxiliary dataset for pre-training the network. In addition, we confirmed that natural images can also be considered as the auxiliary dataset; however, thousands of them are required for optimum results, which led to hours of pre-training. Therefore, we conclude that US simulation images are the preferred auxiliary dataset for pre-training the network. In future work, we will validate our strategy on different types of in vivo datasets and segmentation applications, such as prostate cancer and muscle segmentation.
This research was funded by the Richard and Edith Strauss Foundation and by NSERC Discovery Grant RGPIN 04136. The authors would like to thank NVIDIA for donating the GPU.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- (1996) Field: a program for simulating ultrasound systems. In 10th Nordic-Baltic Conference on Biomedical Imaging, Vol. 4, Supplement 1, Part 1, pp. 351–353.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2019) Deep learning in medical ultrasound analysis: a review. Engineering.
- Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- (2016) Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Transactions on Medical Imaging 35 (5), pp. 1240–1251.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2018) Medical image synthesis for data augmentation and anonymization using generative adversarial networks. In International Workshop on Simulation and Synthesis in Medical Imaging, pp. 1–11.
- What is and what is not a salient object? Learning salient object detector by ensembling linear exemplar regressors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4142–4150.
- (2017) Automated breast ultrasound lesions detection using convolutional neural networks. IEEE Journal of Biomedical and Health Informatics 22 (4), pp. 1218–1226.