In recent years, there has been significant progress in minimally invasive robotic surgery and computer-assisted microsurgery. Semantic segmentation of surgical instruments plays a crucial role in assisted surgery. It can accurately locate the surgical instrument and estimate its pose, which is essential for surgical robot control [Allan et al.2019]. Furthermore, the mask generated by semantic segmentation supports numerous ways of assisting surgery, such as real-time surgical reminders, objective assessment of surgical skills, surgical report generation and surgical workflow optimization [Sarikaya et al.2017]. These applications can improve the safety of surgery and reduce the workload of doctors.
Compared with common object segmentation, complex surgical scenes make accurate instrument segmentation more challenging. The first difficulty is the large illumination variation caused by different light angles and occlusions. As shown in Figure 1 (a), surgical instruments tend to appear whiter under specular reflection, while shadows make both instruments and background appear black. These problems seriously affect the visual appearance of surgical instruments, such as color and texture, which impedes stable identification of the instrument. The second difficulty is the large scale variation caused by continuous movement and view changes, which leads to different shapes and scales for the same instrument. For example, in Figure 1 (b), the incision knife has the shape of a triangle when the scale is small and the shape of a polygon when the scale is large. This issue makes instrument segmentation more challenging.
Recently, a series of methods have been proposed for the semantic segmentation of surgical instruments. RAUNet [Ni et al.2019] designed an attention module to fuse multi-level feature maps and emphasize the target region. A hybrid CNN-RNN method [Attia et al.2017] introduced a Recurrent Neural Network to capture global contexts and expand the receptive field. MF-TAPNet [Jin et al.2019] adopted optical flow as a temporal prior to provide a reliable indication of the instrument location and shape for accurate segmentation. ToolNet-C was combined with kinematic pose information to obtain an accurate silhouette mask [Qin et al.2019]. However, most of these works focus on expanding the receptive field and capturing shape priors, while failing to address the scale variation and illumination variation issues.
To address the issues mentioned above, we reconsider the features affected by them. Illumination variation affects the color and texture appearance, which makes instruments harder to identify. Considering that a surgical instrument is spatially continuous, we can infer the target region from its neighboring pixels based on semantic dependency and global context. To this end, a bilinear attention module (BAM) is proposed, which is based on bilinear pooling to model semantic dependencies and aggregate global contexts. Bilinear pooling can capture second-order feature statistics to encode complex semantic dependencies, helping to improve feature representations. Furthermore, attention features generated by bilinear pooling are adaptively distributed to each location, so that every pixel has access to global contexts. In this way, semantic features in reflective or shaded areas can be inferred based on semantic dependencies and global contexts, which deals with the illumination variation.
Besides, the scale variation changes the shape and size of surgical instruments. Thus, we propose an adaptive receptive field module (ARF) to select and merge receptive fields of different scales adaptively. By doing so, we can cover various scales and make predictions more reliable. Specifically, ARF includes two branches. The former learns semantic relationships among channels, and the latter aggregates multi-scale features. Channel-wise semantic relationships are applied to select feature maps with appropriate sizes. Since kernels with the same size have different receptive fields on feature maps with various sizes, this module can select appropriate receptive fields for instruments at different scales by selecting specific feature maps, adapting to the scale variation. Moreover, dense connections across scales are introduced to propagate multi-scale features, which can cover a larger scale range.
Based on the above analysis, the bilinear attention network with adaptive receptive field, named BARNet, is proposed. The contributions of this work are as follows:
- We propose the bilinear attention module to model semantic dependencies and aggregate global contexts for inferring the semantic features in challenging regions.
- We design the adaptive receptive field module to select the appropriate receptive field adaptively, adapting to the scale variation of instruments.
- The proposed network achieves state-of-the-art performance of 97.47% mIOU on Cata7 and takes first place on EndoVis 2017, surpassing the second-place method by 10.10% mIOU.
2 Related Work
Attention mechanisms are widely used in semantic segmentation tasks. Some attention models extract attention features based on first-order operations such as global average pooling and convolution pooling. For instance, SENet [Hu et al.2018] applied global average pooling to capture global contexts and model semantic dependencies between channels. AGRNet [Zhang et al.2018] utilized convolution pooling to generate attention features. Besides, some works applied second-order models to encode complex semantic dependencies [Chen et al.2018b]. For example, A2Net [Chen et al.2018b] was based on bilinear pooling to capture second-order statistics and model semantic dependencies. The bilinear attention networks [Kim et al.2018] learned bilinear attention distributions on top of the low-rank bilinear pooling technique. These works suggest that bilinear models can capture abundant semantic relationships to improve feature representation. Different from the above methods, we design a novel encoder-decoder architecture to aggregate and distribute attention features, which can achieve better results.
2.2 Adaptive Receptive Field and Pyramid Features
Pyramid features play a critical role in the segmentation of multi-scale targets. A range of methods improved feature representation by aggregating multi-scale features. For example, PSPNet [Zhao et al.2017] adopted pyramid pooling to extract multi-scale features. DeepLabV2 [Chen et al.2018a] made use of dilated convolutions with different dilation rates to obtain multi-scale features. Feature pyramid network [Lin et al.2017b] laterally propagated multi-scale features to build feature maps with rich semantic information at all scales. However, these methods only combine multi-scale features and fail to select appropriate scales for a specific task. SKNet utilized kernels of different sizes to generate feature maps with different receptive fields and filter them [Li et al.2019], which provided a solution for the adaptive selection of receptive fields. Different from the above methods, we directly select multi-scale features instead of kernels of different sizes and do not need to generate new features, which reduces computational costs. Besides, multi-scale features are generated by dense connections across scales, which can cover more scale ranges and boost the reuse of multi-scale features [Yang et al.2018].
Illumination variation changes the color and texture of instruments. Scale variation changes the size and shape of instruments. To address these issues, a bilinear attention network with adaptive receptive field is proposed. It captures semantic dependencies and global contexts to infer semantic features in challenging areas, and it selects the receptive field adaptively to adapt to the scale variation. The overall network architecture is shown in Figure 2. The bilinear attention module models semantic dependencies in small-scale feature maps and generates attention features. The adaptive receptive field module aggregates multi-scale features and selects the appropriate receptive field for instruments at different scales. Besides, cross-scale dense connections are introduced to propagate multi-scale features, which can also improve information flow and boost the reuse of features. Since the input of the adaptive receptive field module should be multi-scale features, we do not use this module when the feature map is small. A Residual Network pre-trained on ImageNet is adopted as the backbone network.
3.2 Bilinear Attention Module
Illumination variation changes the color and texture of surgical instruments. The network cannot rely on these features to identify surgical instruments, which makes them difficult to segment. To solve this issue, the bilinear attention module is proposed to model semantic dependencies and capture global contexts. It is based on bilinear pooling to capture second-order statistics [Lin and Maji2017], which helps to boost the distinction between different semantic features. Thus, the bilinear attention module can encode more complex dependencies. Besides, a decoder is designed to distribute global contexts to each location adaptively. The bilinear attention module is shown in Figure 3 and includes three parts: encoding, normalization, and decoding.
The first step is to model semantic dependencies and capture global contexts. Bilinear pooling is utilized to achieve this goal. It calculates the outer product of the feature vector pair to capture second-order statistics and generate an attention map. Each attention map represents the features of a pixel. These attention maps are concatenated together to encode spatial semantic dependencies. Then, sum pooling is performed to generate the global descriptor, as illustrated in Eq.(2). Semantic features at all locations are encoded into each element of the global descriptor, so that each element carries global information. In this way, the bilinear attention module encodes semantic dependencies and aggregates global contexts.
where denotes the sum pooling. and . In this work, we set .
Then, the global attention map is normalized to further improve its feature representation. The element-wise signed square-root and normalization are performed to normalize it, which can improve performance in practice. Also, these operations are piecewise differentiable, which can be applied to end-to-end training [Lin and Maji2017].
The last step is to distribute attention features to each pixel of the input feature map, so that the semantic features of each pixel are calibrated by global information. The input feature map is reshaped into a matrix. By applying matrix multiplication, attention features are distributed to each location of the input feature map. The attention feature map is then obtained, as shown in Eq.(4).
The bilinear attention module gives each pixel of the feature maps access to global contexts. Semantic features in reflective or shaded areas can be inferred based on global contexts and semantic dependencies, which deals with the illumination variation issue. Besides, it encodes semantic dependencies in the form of second-order statistics, which boosts the distinction between various semantics and improves feature representation. Furthermore, this module only performs matrix operations and does not add any parameters, so it can be easily inserted into other networks.
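The three steps above (encoding, normalization, decoding) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the feature vector pair is each pixel's feature vector paired with itself, that the normalization is an L2 normalization as in [Lin and Maji2017], and the function and variable names are hypothetical.

```python
import numpy as np

def bilinear_attention(feature_map, eps=1e-12):
    """Parameter-free bilinear attention sketch for a (C, H, W) feature map."""
    c, h, w = feature_map.shape
    # One row per pixel: x has shape (H*W, C).
    x = feature_map.reshape(c, h * w).T
    # Encoding: the sum over pixels of the outer products x_i x_i^T equals x^T x,
    # giving a (C, C) global descriptor (assumes self-pairing of feature vectors).
    descriptor = x.T @ x
    # Normalization: element-wise signed square root, then L2 normalization
    # (the L2 choice is an assumption).
    descriptor = np.sign(descriptor) * np.sqrt(np.abs(descriptor))
    descriptor = descriptor / (np.linalg.norm(descriptor) + eps)
    # Decoding: one matrix product distributes the descriptor to every pixel.
    attended = x @ descriptor                     # (H*W, C)
    return attended.T.reshape(c, h, w)

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 3, 5))         # toy map: 4 channels, 3x5 pixels
out = bilinear_attention(features)
print(out.shape)
```

The sketch contains no learnable weights, which is consistent with the statement that the module adds no parameters.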
3.3 Adaptive Receptive Field Module
Since the surgical instrument is continually moving during the surgery, its shape and scale are constantly changing. Adaptive receptive fields can help the network adapt to scale variation and learn more detailed features. Thus, the adaptive receptive field module is proposed to aggregate multi-scale features and select the appropriate receptive field for instruments at different scales. The global average pooling is introduced to model the semantic relationship between channels and generate the weight vector. The weight vector can highlight feature maps which have an appropriate scale. Since kernels with the same size have different receptive fields on feature maps with different scales, by selecting feature maps in a specific size, the receptive field of subsequent convolutions can be determined. In this way, the adaptive receptive field module can select the receptive field adaptively according to the semantic relationship between channels.
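The claim that equally sized kernels have different receptive fields on feature maps of different scales can be made concrete with the standard receptive-field recurrence, in which a k×k convolution widens the receptive field by (k − 1) times the cumulative stride of its input. The strides below are hypothetical values for illustration and are not taken from the network described here.

```python
def receptive_field_growth(kernel_size, cumulative_stride):
    """Input pixels added to the receptive field by one convolution: (k - 1) * stride."""
    return (kernel_size - 1) * cumulative_stride

# Hypothetical cumulative strides of four feature-map scales.
strides = [4, 8, 16, 32]
growth = {s: receptive_field_growth(3, s) for s in strides}
for s, g in growth.items():
    print(f"stride {s:2d}: a 3x3 convolution widens the receptive field by {g} input pixels")
```

Under these assumptions, emphasizing the coarser maps gives the subsequent convolution a wider view of the input, and emphasizing the finer maps gives it a narrower one.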
The adaptive receptive field module is illustrated in Figure 4. Take an input feature map with five scales as an example. First, a 1×1 convolution is performed on the low-scale feature maps to adjust their channel dimension to a reduced size, which helps the aggregation of multi-scale features and reduces computational costs. This reduced channel dimension can be selected according to the complexity of the network. The channels of the maximum-scale feature map are not compressed, to preserve semantic features as much as possible. In this paper, we set the reduced channel dimension to 8. Then, multi-scale features are fed into two branches: one models the semantic relationship between channels, and the other aggregates multi-scale features.
Specifically, in the first branch, multi-scale features are fed into global average pooling to generate vectors that encode semantic dependencies [Hu et al.2018]. These vectors are concatenated together and go through two convolution layers to further extract the semantic relationship. In this way, we obtain the weight vector which represents the degree of semantic responses for different feature maps. In the second branch, multi-scale features are aggregated by upsampling and concatenation, generating the pyramid feature. The pyramid feature is multiplied by the weight vector to select the feature map with a larger response. In this way, it can adjust the receptive field of subsequent convolutions adaptively.
Here, the pyramid feature is multiplied by the attention vector using a broadcast Hadamard product.
To build the pyramid feature, each low-scale feature map passes through the 1×1 convolution and the upsampling, and the resulting feature maps are concatenated.
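The two branches can be sketched as follows. This is a simplified illustration under several assumptions: the low-scale maps are taken as already reduced to 8 channels, the two convolution layers of the first branch are written as matrix products with random weights, a ReLU and a sigmoid are assumed as activations, and nearest-neighbour upsampling stands in for whatever upsampling is actually used. All names are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def adaptive_receptive_field(features, w1, w2):
    """features: list of (C_i, H_i, W_i) maps, largest first, each half the previous size."""
    target_h = features[0].shape[1]
    # Branch 1: global average pooling per map, concatenated into one channel vector.
    pooled = np.concatenate([f.mean(axis=(1, 2)) for f in features])
    # Two convolutions on a 1x1 map reduce to two matrix products.
    hidden = np.maximum(w1 @ pooled, 0.0)             # ReLU (assumed)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate (assumed)
    # Branch 2: upsample every map to the largest resolution and concatenate.
    pyramid = np.concatenate(
        [upsample_nearest(f, target_h // f.shape[1]) for f in features], axis=0)
    # Broadcast Hadamard product: one weight per channel of the pyramid feature.
    return pyramid * weights[:, None, None]

rng = np.random.default_rng(0)
# The largest map keeps its channels (16 here, hypothetical); smaller maps use 8.
features = [rng.standard_normal((16, 32, 32)),
            rng.standard_normal((8, 16, 16)),
            rng.standard_normal((8, 8, 8))]
total = sum(f.shape[0] for f in features)
w1 = rng.standard_normal((total // 4, total)) * 0.1
w2 = rng.standard_normal((total, total // 4)) * 0.1
out = adaptive_receptive_field(features, w1, w2)
print(out.shape)
```

Because every channel in the pyramid comes from one specific scale, a large weight on a channel amounts to favouring that scale for the convolutions that follow.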
3.4 Loss Function
To address the class imbalance issue, we use a hybrid loss consisting of cross-entropy and Dice [Ni et al.2019]. Cross-entropy is often used for classification tasks. However, it is easily affected by the class imbalance issue, which leads to poor training results. Thus, Dice loss is introduced to address this problem. Dice evaluates the similarity between the ground truth and the output, which has no relation to the ratio of foreground pixels to background pixels. It is not affected by the class imbalance issue. This hybrid loss merges cross-entropy and Dice in a new way to effectively utilize their excellent characteristics, which is shown in Eq.(7).
The hybrid loss consists of a cross-entropy term and a Dice loss term, together with a weight used to balance the two. This weight is set to 0.2 for best training results.
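As an illustration of how a cross-entropy term and a Dice term can be balanced by a single weight, the sketch below uses one plausible form, (1 − α)·H − α·log(D), for a binary mask. This specific form is an assumption made for the example and is not confirmed by the text above; only the weight value 0.2 comes from the paper. The function name and the toy data are hypothetical.

```python
import math

def hybrid_loss(probs, targets, alpha=0.2, eps=1e-7):
    """Weighted cross-entropy plus log-Dice for a flat binary mask (assumed form)."""
    n = len(probs)
    # Mean binary cross-entropy H over all pixels.
    cross_entropy = -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)) / n
    # Soft Dice coefficient D between predicted probabilities and ground truth.
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    # Assumed combination: (1 - alpha) * H - alpha * log(D).
    return (1 - alpha) * cross_entropy - alpha * math.log(dice)

targets = [1, 1, 0, 0]
good = hybrid_loss([0.9, 0.8, 0.1, 0.2], targets)
poor = hybrid_loss([0.4, 0.3, 0.6, 0.7], targets)
print(good < poor)
```

Since the Dice coefficient depends on the overlap between prediction and ground truth and not on the number of background pixels, the second term keeps contributing a useful signal even when the foreground is small.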
A cataract surgical instrument dataset, called Cata7, is used to evaluate our network. This dataset contains seven cataract surgery videos. To reduce redundancy, each video is downsampled from 30 fps to 1 fps. The resolution of each image is 1920×1080 pixels. The entire dataset contains 2500 images, 1800 of which are used for training, and the rest are used for testing. There are 10 cataract surgical instruments in this dataset.
The EndoVis 2017 dataset is from the 2017 MICCAI EndoVis Robotic Instrument Segmentation Challenge. This dataset is based on endoscopic surgery. All videos are acquired by a da Vinci Xi robot. It contains 3000 images with a resolution of 1280×1024 pixels, of which 1800 images are used for training and 1200 images for testing. There are 7 types of surgical instruments in EndoVis 2017.
4.2 Implementation Details
All experiments are implemented on two Nvidia Titan X GPUs. The Residual Network pre-trained on ImageNet is used as the encoder. Adam with a batch size of 8 is used to train our network. The learning rate is dynamically adjusted during training: the Cata7 and EndoVis 2017 datasets each use their own initial learning rate, and the learning rate is multiplied by 0.8 every 30 iterations. Due to limited computing resources, each image in Cata7 is resized to 960×544 pixels, and images in EndoVis 2017 are resized to 640×512 pixels. Data augmentation is performed to increase sample diversity. The selected samples are randomly rotated, shifted and flipped, and 800 images are obtained by data augmentation. To objectively evaluate our model, Dice and Intersection-over-Union (IoU) are selected as the evaluation metrics.
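The two evaluation metrics can be computed from a predicted mask and a ground-truth mask as in the sketch below. It handles a single binary class on flat 0/1 sequences; averaging over instrument classes to obtain mean IoU and mean Dice is left out, and the function name and toy masks are hypothetical.

```python
def iou_and_dice(pred, truth):
    """IoU and Dice for two binary masks given as flat sequences of 0/1."""
    intersection = sum(p and t for p, t in zip(pred, truth))
    pred_area, truth_area = sum(pred), sum(truth)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty: treat this as a perfect match.
        return 1.0, 1.0
    iou = intersection / union
    dice = 2 * intersection / (pred_area + truth_area)
    return iou, dice

pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
iou, dice = iou_and_dice(pred, truth)
print(iou, dice)
```

For a single mask pair the two metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU on the same prediction.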
4.3 Ablation Study Based on Cata7
4.3.1 Ablation Study for Bilinear Attention Module
Bilinear attention module (BAM) is introduced to capture the second-order statistics and model long-range semantic dependencies. To verify its performance, some experiments are performed, as shown in Table 1.
BARNet without BAM and ARF is used as the basic network. Compared with the basic network, the network using BAM has achieved an increase of 4.66 mean IOU and 2.69 mean Dice. When using ARF, employing BAM brings a 1.19 increase on mean IOU and 0.62 increase on mean Dice. It should be noted that BAM does not add any parameters, as shown in Table 1. These experiments demonstrate that BAM can significantly improve network performance without adding any parameters.
To further analyze the performance of the bilinear attention module, we visualize feature maps of its inputs and outputs in Figure 5. Compared with input feature maps, output feature maps of the bilinear attention module highlight the region containing the instrument, indicating that BAM effectively models semantic dependencies and improves feature representation. Also, we visualize segmentation results of the network without BAM, which are shown in Figure 6 (e). The segmentation in these results is incomplete, and part of the surgical instrument is identified as background. The network employing BAM achieves better results, with masks that are relatively complete and closely match the ground truth.
4.3.2 Ablation Study for Adaptive Receptive Field Module
Adaptive Receptive Field Module (ARF) is designed to adaptively select the receptive field, adapting to the scale variation of instruments. A range of experiments are set up to verify its excellent performance.
As shown in Table 1, the network using ARF achieves an increase of 4.97 mean IOU and 2.94 mean Dice compared to the basic network. When using BAM, employing ARF brings a 1.50 increase on mean IOU and a 0.87 increase on mean Dice. Furthermore, ARF only adds 0.1M parameters, which account for only 0.46% of the basic network. To give a more intuitive result, we visualize some results of the network without ARF, which are shown in Figure 6 (f). The network without ARF has poorer segmentation performance than BARNet, indicating the effectiveness of ARF. The above results suggest that ARF can significantly improve segmentation accuracy with very few parameters.
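The gains reported in the two ablation studies can be cross-checked against each other. The arithmetic below assumes that all gains are absolute differences in percentage points and that the network with both BAM and ARF is the BARNet model reported with 97.47 mean IOU and 98.68 mean Dice on Cata7; the baseline values are implied by this arithmetic and are not quoted from Table 1.

```latex
\begin{aligned}
\Delta_{\text{mIOU}}  &= 4.66 + 1.50 = 4.97 + 1.19 = 6.16, \\
\Delta_{\text{mDice}} &= 2.69 + 0.87 = 2.94 + 0.62 = 3.56, \\
\text{implied baseline mIOU}  &= 97.47 - 6.16 = 91.31, \\
\text{implied baseline mDice} &= 98.68 - 3.56 = 95.12.
\end{aligned}
```

Adding BAM first and ARF second gives the same total gain as adding ARF first and BAM second, so the four reported increments are mutually consistent.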
|Method|mean Dice (%)|mean IOU (%)|Params|
|U-Net [Ronneberger et al.2015]|86.83|78.21|7.85M|
|RefineNet [Lin et al.2017a]|93.53|88.41|25.75M|
|LinkNet [Chaurasia and Culurciello2017]|94.63|91.31|21.81M|
|TernausNet [Iglovikov and Shvets2018]|96.24|92.98|25.36M|
|RAUNet [Ni et al.2019]|97.71|95.62|22.02M|

|Dataset 1|Dataset 2|Dataset 3|Dataset 4|Dataset 5|Dataset 6|Dataset 7|Dataset 8|Dataset 9|Dataset 10|mIOU|
4.4 Comparison with state-of-the-art on Cata7
A series of comparative experiments are performed to evaluate the performance of BARNet. BARNet achieves state-of-the-art performance of 97.47 mean IOU and 98.68 mean Dice, exceeding the second-ranking method by 1.85 on mean IOU and 0.97 on mean Dice. All other compared methods score lower than BARNet on both metrics.
To further evaluate the segmentation performance of the proposed method for each type of surgical instrument, the confusion matrix for pixel classification is shown in Figure 7. We find that our method achieves excellent performance on every type of instrument. In particular, the proposed method outperforms other methods by a significant margin on the primary incision knife (I1), lens hook (I6) and bonn forceps (I10). The surface of these surgical instruments is prone to specular reflections due to their special material, which makes them more difficult to segment. The bilinear attention module can model complex semantic dependencies and infer the semantic features in specular reflection and shadow regions, addressing the illumination variation issue. Thus, our network achieves better performance on these three instruments.
4.5 The Results on EndoVis 2017
To further verify the performance of BARNet, it is evaluated on the EndoVis 2017 dataset [Allan et al.2019]. The test set consists of 10 video sequences. Each sequence contains specific surgical instruments. Datasets 1-8 each contain 75 images, and datasets 9 and 10 each contain 300 images. The test results are reported in Table 3. TernausNet [Iglovikov and Shvets2018], ToolNet [García-Peraza-Herrera et al.2017] and SegNet [Badrinarayanan et al.2017] are evaluated on EndoVis 2017. The test results of other methods are from the MICCAI EndoVis challenge 2017 [Allan et al.2019].
BARNet achieves 64.30 mean IOU, which outperforms existing methods. The second-ranking method, TernausNet, achieves 54.20 mean IOU. Compared with this method, our network achieves a gain of 10.10 on mean IOU. The remaining methods score lower still. Furthermore, BARNet achieves the best results in 7 video sequences and takes second place in the other three video sequences. These results show that BARNet achieves state-of-the-art performance on this dataset.
To give intuitive results, the segmentation results of BARNet are visualized in Figure 8. We find multiple specular reflections and shadow areas in the figure. Besides, the scale and shape of surgical instruments are significantly varied. Despite these challenges, BARNet can still accurately segment surgical instruments, whose segmentation results are basically consistent with the ground truth.
In this paper, the bilinear attention network with adaptive receptive field (BARNet) is proposed for surgical instrument segmentation. Bilinear attention module captures global contexts and second-order statistics to improve feature representation. Adaptive receptive field selects feature maps with specific sizes to choose appropriate receptive fields. A series of ablation experiments prove that the bilinear attention module and adaptive receptive field module contribute to improving network performance. Moreover, BARNet achieves state-of-the-art performance on both Cata7 and EndoVis 2017.
- [Allan et al.2019] M. Allan, A. Shvets, T. Kurmann, Z. Zhang, R. Duggal, Y.H. Su, N. Rieke, I. Laina, N. Kalavakonda, S. Bodenstedt, L.C. Garcia-Peraza-Herrera, W. Li, V. Iglovikov, H. Luo, J. Yang, D. Stoyanov, L. Maier-Hein, S. Speidel, and M. Azizian. 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426, 2019.
- [Attia et al.2017] Mohamed Attia, Mohammed Hossny, Saeid Nahavandi, and Hamed Asadi. Surgical tool segmentation using a hybrid deep cnn-rnn auto encoder-decoder. In 2017 IEEE International Conference on Systems, Man, and Cybernetics, pages 3373–3378, 2017.
- [Badrinarayanan et al.2017] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
- [Chaurasia and Culurciello2017] Abhishek Chaurasia and Eugenio Culurciello. Linknet: Exploiting encoder representations for efficient semantic segmentation. In IEEE Visual Communications and Image Processing, pages 1–4, 2017.
- [Chen et al.2018a] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.
- [Chen et al.2018b] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-nets: Double attention networks. In Advances in Neural Information Processing Systems 31, pages 352–361. 2018.
- [García-Peraza-Herrera et al.2017] Luis C García-Peraza-Herrera, Wenqi Li, Lucas Fidon, Caspar Gruijthuijsen, Alain Devreker, George Attilakos, Jan Deprest, Emmanuel Vander Poorten, Danail Stoyanov, Tom Vercauteren, et al. Toolnet: Holistically-nested real-time segmentation of robotic surgical tools. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5717–5722, 2017.
- [Hu et al.2018] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7132–7141, June 2018.
- [Iglovikov and Shvets2018] Vladimir Iglovikov and Alexey Shvets. Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv preprint arXiv:1801.05746, 2018.
- [Jin et al.2019] Yueming Jin, Keyun Cheng, Qi Dou, and Pheng-Ann Heng. Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In Medical Image Computing and Computer Assisted Intervention, pages 440–448, 2019.
- [Kim et al.2018] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018.
- [Li et al.2019] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- [Lin and Maji2017] Tsung-Yu Lin and Subhransu Maji. Improved bilinear pooling with cnns. In Proceedings of the British Machine Vision Conference, pages 117.1–117.12, 2017.
- [Lin et al.2017a] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5168–5177, July 2017.
- [Lin et al.2017b] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- [Ni et al.2019] Zhen-Liang Ni, Gui-Bin Bian, Xiao-Hu Zhou, Zeng-Guang Hou, Xiao-Liang Xie, Chen Wang, Yan-Jie Zhou, Rui-Qi Li, and Zhen Li. Raunet: Residual attention u-net for semantic segmentation of cataract surgical instruments. arXiv preprint arXiv:1909.10360, 2019.
- [Qin et al.2019] F. Qin, Y. Li, Y. Su, D. Xu, and B. Hannaford. Surgical instrument segmentation for endoscopic vision with data fusion of cnn prediction and kinematic pose. In 2019 International Conference on Robotics and Automation (ICRA), pages 9821–9827, May 2019.
- [Ronneberger et al.2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241, 2015.
- [Sarikaya et al.2017] D. Sarikaya, J. J. Corso, and K. A. Guru. Detection and localization of robotic tools in robot-assisted surgery videos using deep neural networks for region proposal and detection. IEEE Transactions on Medical Imaging, 36(7):1542–1549, July 2017.
- [Yang et al.2018] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3684–3692, June 2018.
- [Zhang et al.2018] X. Zhang, T. Wang, J. Qi, H. Lu, and G. Wang. Progressive attention guided recurrent network for salient object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 714–722, 2018.
- [Zhao et al.2017] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 6230–6239, July 2017.