2D ultrasound is a widely used imaging modality for the routine clinical diagnosis of breast lesions because it is real-time, low-cost, and non-invasive. However, ultrasound imaging depends more heavily on the sonographer’s skill than other commonly used techniques such as mammography. Furthermore, interpreting ultrasound images requires an experienced and well-trained sonographer because of their complexity and the presence of speckle noise and artifacts. Thus, Computer-Aided Diagnosis (CAD) could help minimize the influence of the operator-dependent nature of ultrasound imaging and assist sonographers in lesion detection. However, detecting breast lesions in individual frames of an ultrasound video is quite challenging for the following reasons. First, the lesion boundary is blurry in some frames, which confuses the detector (see Fig. 1a). Second, the high similarity between lesions and the background of soft tissues and shadow artifacts results in failed detections (see Fig. 1b). Moreover, only 2D image annotations are available because sonographers commonly save and label only the key frames. Without supervision information in video, training a video detection network becomes difficult.
There are several existing methods for video detection. Methods based on correlation filtering, such as SiameseFC [1] and kernelized correlation filters [3], process the current frame with correlation filters that depend on the detection in the previous frame. If the detection in the previous frame is wrong, the features of the non-target area propagated from the previous frame will affect the prediction for the current frame. In addition, there is no video annotation in our dataset to support a supervised SiameseFC method. Image-based detection methods such as YOLOV3 [7] and RetinaNet [6] can be trained with labeled 2D images and then applied to the video data frame by frame. However, without considering the temporal relationship of breast lesions, detection on independent frames is vulnerable to blurry boundaries and interference from the background, which often results in wrong detections. Studies such as [10, 9] analyze the temporal relationship by aggregating CNN features of the current frame with features extracted from FlowNet [2, 4], which predicts the x-y flow fields. The original FlowNet can effectively analyze the temporal relationship but suffers from the time-consuming traditional warping operation and from the gap between natural images and medical images, since FlowNet leverages pre-trained weights from a natural image dataset.
In this work, we propose an end-to-end semi-supervised breast lesion detection network based on temporal coherence, which we denote as STCD (i.e., semi-supervised temporal coherent detection) for convenience. STCD includes four parts, as shown in Fig. 2: the proposed MotionNet for calculating the difference between two frames, the DecisionNet for judging whether the current frame is a key frame, the proposed WarpNet for transforming key-frame features to the current frame, and the pre-trained RetinaNet [6] backbone for extracting CNN features and detecting lesions.
The proposed network makes three contributions: 1) STCD offers a semi-supervised solution to train a network with unlabeled video data and a different set of labeled images. 2) STCD improves breast lesion detection accuracy by automatically analyzing the temporal relationship across frames (see Fig. 1c). 3) Because key frames are sparsely distributed in the video, most of the video frames are processed by the proposed efficient MotionNet and WarpNet, which helps the network achieve real-time performance.
2.1 STCD Network
Since breast ultrasound video is commonly acquired along a certain structural direction at a relatively consistent pace, changes across frames are gradual and coherent most of the time. A frame that differs from its adjacent frames within a specific time period is denoted as a key frame in this paper. Key frames in a sequence usually contain important information about the entire video. Note that the first frame in a video is always taken as a key frame.
As shown in Fig. 2, we extract the features of the current key frame with the backbone network pre-trained on labeled images. Given this key frame and an input frame, MotionNet generates features that encode the motion difference between the two frames. The key-frame scheduling mechanism, in cooperation with DecisionNet, then classifies the input frame as a key frame or a non-key frame. DecisionNet predicts a score which indicates the consistency between the key frame and the input frame. If the score is larger than a pre-defined threshold, the input frame is classified as the next key frame and is processed by the backbone network to extract features. Otherwise, the input frame is treated as a non-key frame, and the current key-frame features are combined with the MotionNet features to form its feature representation, since the input frame is similar to the key frame.
In a video, the sparse and distinctive key frames are processed by a large backbone with excellent detection performance to extract features. For frames similar to the most recent key frame, the lightweight MotionNet and WarpNet generate features from the key-frame features. In this way, STCD performs adaptive temporal coherence analysis and significantly increases the running speed.
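The scheduling logic of this section can be sketched as follows. This is a minimal illustration of the control flow only, not the authors' implementation: `backbone`, `motion_net`, `decision_net`, `warp_net`, and `detect_head` are hypothetical callables standing in for the trained sub-networks, and `threshold` stands for the pre-defined decision threshold.

```python
def run_stcd(frames, backbone, motion_net, decision_net, warp_net,
             detect_head, threshold):
    """Process a video frame by frame with adaptive key-frame scheduling.

    All network arguments are hypothetical callables; only the branching
    between key frames and non-key frames follows the text.
    """
    results = []
    key_frame, key_feats = None, None
    for frame in frames:
        motion_feats = None
        if key_frame is None:
            # The first frame of a video is always taken as a key frame.
            is_key = True
        else:
            # Lightweight path: encode the motion between key frame and input.
            motion_feats = motion_net(key_frame, frame)
            score = decision_net(motion_feats)
            is_key = score > threshold
        if is_key:
            # Expensive path: run the large backbone and update the key frame.
            key_frame, key_feats = frame, backbone(frame)
            feats = key_feats
        else:
            # Cheap path: transform key-frame features to the current frame.
            feats = warp_net(key_feats, motion_feats)
        results.append((is_key, detect_head(feats)))
    return results
```

With a higher `threshold`, fewer frames take the backbone path, which is the source of the speed gain; the trade-off against accuracy depends on how well the warped features approximate the backbone features.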
2.2 Semi-supervised Training Strategy
As shown in Fig. 3, we adopt a semi-supervised training strategy in order to utilize unlabeled videos and labeled images at the same time. To prepare training frame pairs for the STCD network, each composed of a key frame and a non-key frame, we randomly select a certain proportion of the frames as key frames. Each of these key frames is paired with a non-key frame chosen from its 20 adjacent frames.
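The pair construction can be sketched as below. This is a hedged illustration under stated assumptions: the function name, `key_ratio`, and `seed` are hypothetical, and "20 adjacent frames" is interpreted here as the 20 frames that follow the key frame, which the text does not specify.

```python
import random


def sample_training_pairs(num_frames, key_ratio, window=20, seed=0):
    """Return (key_index, non_key_index) pairs for one unlabeled video.

    Assumption: the partner frame is drawn uniformly from the `window`
    frames following the key frame; the text only says "adjacent".
    """
    rng = random.Random(seed)
    num_keys = max(1, int(num_frames * key_ratio))
    key_indices = sorted(rng.sample(range(num_frames), num_keys))
    pairs = []
    for k in key_indices:
        # Candidate partners: the next `window` frames that exist in the video.
        candidates = range(k + 1, min(k + window, num_frames - 1) + 1)
        if len(candidates) == 0:
            continue  # the last frame has no following frame to pair with
        pairs.append((k, rng.choice(candidates)))
    return pairs


pairs = sample_training_pairs(num_frames=150, key_ratio=0.5)
```

Each pair needs no annotation, which is what allows the unlabeled sequences to contribute to training.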
During the training phase, we first train the detection network with the labeled image dataset. Then, we transfer the backbone of the detection network to serve as the key-frame feature extractor, without updating its weights. MotionNet, WarpNet, and DecisionNet are initialized with random parameters. At the beginning, the features provided by MotionNet are not representative of breast lesions, so the DecisionNet, which relies on MotionNet’s features, cannot be trained effectively. For this reason, we first train the MotionNet and WarpNet branches with a warm-up strategy for several epochs, and then train DecisionNet jointly with the other parts until STCD converges.
We also design two loss functions: a feature loss, which measures the correlation between the features generated by MotionNet and WarpNet and the features extracted by the backbone trained on the labeled images, and a decision loss, which enables the adaptive key-frame scheduling mechanism. The correlation score between the two feature maps is computed over all pixels of the feature maps. The feature loss is computed from the mean of the correlation scores in a batch, which forces the features extracted by MotionNet and WarpNet to be close to the features extracted by the backbone. The decision loss enforces the network’s output to be close to the correlation score.
2.3 MotionNet and WarpNet
As an important addition to STCD, MotionNet and WarpNet are mainly used to extract the features of non-key frames. In a video, most of the frames are non-key frames, so their detection accuracy and speed significantly affect the detection results for the entire video. We design the MotionNet to calculate the motion between two frames. Considering the limited training data and the need for efficient inference, we reduce the number of output channels of each layer in the original FlowNetS [2] structure to 1/4. With fewer parameters, MotionNet is less prone to over-fitting and runs faster. In contrast, traditional feature transformation methods [9, 10] transform the entire feature map pixel by pixel using the displacement information. Such a process is time-consuming because of the large number of feature map pixels.
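The effect of the channel reduction on model size can be estimated with a short calculation. Because the parameter count of a convolutional layer scales with the product of its input and output channels, cutting every layer's output channels to 1/4 shrinks the inner layers to roughly 1/16 of their original size. The sketch below uses a hypothetical four-layer encoder; the channel widths, kernel sizes, and the 6-channel input (two stacked 3-channel frames) are illustrative assumptions, not the actual MotionNet configuration.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of one 2D convolution layer."""
    return in_ch * out_ch * kernel * kernel + out_ch


def encoder_params(channels, kernels, in_ch):
    """Total parameters of a plain stack of convolution layers."""
    total = 0
    for out_ch, k in zip(channels, kernels):
        total += conv_params(in_ch, out_ch, k)
        in_ch = out_ch  # the next layer consumes this layer's output
    return total


# Hypothetical widths and kernel sizes, chosen only for illustration.
full_channels = [64, 128, 256, 512]
kernels = [7, 5, 5, 3]
slim_channels = [c // 4 for c in full_channels]

full = encoder_params(full_channels, kernels, in_ch=6)
slim = encoder_params(slim_channels, kernels, in_ch=6)
ratio = slim / full  # close to 1/16 for this stack
```

The first layer shrinks only by a factor of 4 because its input channel count is fixed, which is why the overall ratio is slightly above 1/16.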
To address these shortcomings, we design a lightweight WarpNet to perform the transformation of the feature map. First, the input feature map is adaptively resized to the same size as the target feature map via a 3×3 convolution operation. Subsequently, a weighted combination is performed by a 1×1 convolution operation. The lightweight WarpNet transforms the feature map in a learnable non-linear manner, which improves both accuracy and efficiency.
3.1 Datasets and Training
We collected 5,608 labeled breast lesion ultrasound images, 80 unlabeled sequences, and 10 labeled sequences from 90 patients, where each video sequence contains 105 to 188 frames. The labeled ultrasound images are used to train detection networks such as RetinaNet [6] and YOLOV3 [7]. The labeled images and the 80 unlabeled sequences form a hybrid dataset used to train STCD, and the 10 labeled sequences serve as the test set. We select frames of the unlabeled training sequences as key frames to form 59,962 training pairs. We train STCD with the Adam [5] optimizer. In the first 10 epochs, we set the learning rates of DecisionNet, MotionNet, and WarpNet to 0, 0.001, and 0.001, respectively, and the learning rates are reduced by 10% after each epoch. After 10 epochs, we change the learning rate of DecisionNet to 0.001 to turn it on. We conduct experiments on both CPU (Intel Xeon E5-2680) and GPU (Nvidia Tesla P40).
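The warm-up schedule can be written as a small function. This is a sketch under stated assumptions: the 10% reduction is applied multiplicatively to the learning rates, the decay of MotionNet and WarpNet continues after the warm-up, and DecisionNet starts its own decay from 0.001 once it is turned on. None of these details is confirmed by the text, and the function name and parameters are hypothetical.

```python
def learning_rates(epoch, base_lr=0.001, decay=0.9, warmup_epochs=10):
    """Return per-module learning rates for a 0-indexed epoch.

    Assumptions: multiplicative 10% decay per epoch; DecisionNet is frozen
    (learning rate 0) during warm-up and then starts from base_lr.
    """
    lr = base_lr * decay ** epoch
    if epoch < warmup_epochs:
        decision_lr = 0.0  # DecisionNet is switched off during warm-up
    else:
        decision_lr = base_lr * decay ** (epoch - warmup_epochs)
    return {"DecisionNet": decision_lr, "MotionNet": lr, "WarpNet": lr}


schedule = [learning_rates(e) for e in range(12)]
```

Setting the DecisionNet learning rate to 0 keeps its randomly initialized weights untouched until the MotionNet features have become informative.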
3.2 Effectiveness of STCD
In this section, we design several experiments to evaluate the effectiveness and efficiency of STCD with different backbone networks, both quantitatively (see Table 1) and qualitatively (see Fig. 4). As shown in Table 1, STCD with different backbones improves accuracy by between 1.5% and 5.4%, GPU speed by between 32% and 68%, and CPU speed by between 69% and 151%. Together with the qualitative results shown in Fig. 4, this shows that STCD brings in more historical structural information about the target through temporal coherence analysis than frame-based detection does. Moreover, as the depth of the network increases, the accuracy and efficiency improvements from STCD increase. This shows that the features extracted from the key frames by a larger network are more effective, which further improves the detection performance on non-key frames. Therefore, we select RetinaNet with ResNet-152 as the detection backbone in our proposed STCD.
| Method | Backbone | mAP (%) | GPU Runtime (ms) | CPU Runtime (ms) |
To further evaluate the impact of the volume of unlabeled data on performance, we trained STCD with different amounts of unlabeled data (see Fig. 5). As the amount of unlabeled data increases, the detection accuracy gradually increases. This demonstrates that the unlabeled video data brings useful information for video detection through the semi-supervised mechanism introduced by STCD.
3.3 Effectiveness of WarpNet and MotionNet
We design three sets of comparative experiments: the proposed MotionNet with WarpNet, MotionNet with the traditional warp transformation, and FlowNet with the traditional warp transformation. As shown in Fig. 6, across different decision thresholds for key frames, both MotionNet and WarpNet improve accuracy compared to FlowNet and the warp operation.
This shows that MotionNet and WarpNet, which are trained as part of an end-to-end network, are more adaptive. Furthermore, FlowNet relies heavily on weights pre-trained on the Flying Chairs natural image dataset [2], which differs from medical images. Compared to FlowNet and the untrainable warp operation, MotionNet and WarpNet can capture more breast lesion information during semi-supervised training and thereby improve detection accuracy. Since WarpNet can adaptively transform features with a limited number of CNN parameters, it brings nearly a 1x speed increase over the previous warp operation. The compressed MotionNet also achieves a speed increase of approximately 40% compared to FlowNet.
3.4 Ablation Study
To further explore the impact of different key-frame scheduling methods and MotionNet architectures, two sets of experiments are conducted: 1) we compare our proposed key-frame scheduling method based on DecisionNet with the fixed key-frame selection strategy (Fixed), the key-frame selection strategy based on the mean squared error between two grey images (Grey Correlation), and the strategy based on the optical-flow feature map between two images (Flow-guided Correlation); and 2) we compare our MotionNet with MotionNet-b, based on FlowNetC [2], and MotionNet-c, based on FlowNet2 [4]. Both MotionNet-b and MotionNet-c are compressed by halving the number of channels in each layer. For training efficiency, both experiments are performed on RetinaNet with ResNet-101.
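For reference, the two simplest baselines in the first comparison can be sketched as follows. This is an illustrative reading of the Fixed and Grey Correlation strategies, not the compared implementations: the parameters `interval` and `threshold` are hypothetical, and grey images are represented as nested lists of pixel intensities.

```python
def mse(img_a, img_b):
    """Mean squared error between two equally sized grey images."""
    total, count = 0.0, 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def fixed_schedule(num_frames, interval):
    """Fixed strategy: every `interval`-th frame is a key frame."""
    return [i % interval == 0 for i in range(num_frames)]


def grey_mse_schedule(frames, threshold):
    """Grey Correlation strategy: start a new key frame when the pixel-level
    MSE to the current key frame exceeds `threshold`."""
    flags, key = [], None
    for frame in frames:
        is_key = key is None or mse(key, frame) > threshold
        if is_key:
            key = frame
        flags.append(is_key)
    return flags
```

Both baselines decide from the frame index or raw pixel intensities alone, whereas the DecisionNet-based scheduling operates on learned motion features.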
As shown in Fig. 7a, our approach surpasses the other methods in both accuracy and speed. In particular, our method achieves a 23% increase in accuracy compared to the Fixed strategy and a 61% increase in speed compared to the Grey Correlation approach. This demonstrates that our method can adaptively find the target difference between two frames, resulting in better yet fewer key frames and thereby improving video detection accuracy and efficiency. Because our proposed method introduces the detection supervision information extracted by the pre-trained detection backbone, it leverages more spatially specific information, which can improve detection accuracy.
In addition, as shown in Fig. 7b, the proposed MotionNet, with fewer convolutional parameters, is slightly more accurate and faster than MotionNet-b. Compared with MotionNet-c, it achieves about a 1.3% increase in accuracy and a 50% increase in speed. MotionNet-c contains a large number of repeated modules, so its number of network parameters increases dramatically. With limited training data, this easily causes over-fitting and performance degradation.
In this work, we proposed the STCD network to detect breast lesions in ultrasound videos. By exploiting the coherence between adjacent frames, STCD can improve detection accuracy with the assistance of information from historical frames. The semi-supervised learning approach introduced more video information and improved detection accuracy. Meanwhile, runtime efficiency was also improved significantly.
- [1] Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.: Fully-convolutional siamese networks for object tracking. In: ECCV. pp. 850–865. Springer (2016)
- [2] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV. pp. 2758–2766 (2015)
- [3] Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37(3), 583–596 (2015)
- [4] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: CVPR. pp. 2462–2470 (2017)
- [5] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [6] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV. pp. 2980–2988 (2017)
- [7] Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
- [8] Xu, Y.S., Yang, H.K., Fu, T.J., Lee, C.Y.: Dynamic video segmentation network. In: CVPR (2018)
- [9] Zhu, X., Dai, J., Yuan, L., Wei, Y.: Towards high performance video object detection. In: CVPR. pp. 7210–7218 (2018)
- [10] Zhu, X., Xiong, Y., Dai, J., Yuan, L., Wei, Y.: Deep feature flow for video recognition. In: CVPR. pp. 2349–2358 (2017)