Histogram of Oriented Gradients Meet Deep Learning: A Novel Multi-task Deep Network for Medical Image Semantic Segmentation

by Binod Bhattarai, et al.

We present a novel deep multi-task learning method for medical image segmentation. Existing multi-task methods demand ground truth annotations for both the primary and auxiliary tasks. In contrast, we propose to generate the pseudo-labels of an auxiliary task in an unsupervised manner. To generate the pseudo-labels, we leverage Histogram of Oriented Gradients (HOGs), one of the most widely used and powerful hand-crafted features for detection. Using the ground truth semantic segmentation masks for the primary task together with the pseudo-labels for the auxiliary task, we learn the parameters of the deep network to jointly minimise the losses of both tasks. We applied our method to two powerful and widely used semantic segmentation networks, UNet and U2Net, training each in a multi-task setup. To validate our hypothesis, we performed experiments on two different medical image segmentation data sets. Extensive quantitative and qualitative results show that our method consistently improves performance over its single-task counterpart. Moreover, our method is the winner of the FetReg EndoVis Sub-challenge on Semantic Segmentation organised in conjunction with MICCAI 2021.




1 Introduction

Figure 1: An input image (left), its ground truth semantic segmentation map for the primary task (middle), and the Histogram of Oriented Gradients of the input image (right). In the HOG map, we can observe the boundary between the organs and the instruments that belong to different semantic categories. Zoom in for a better view.

Medical image segmentation [18, 19, 31, 24] is an important and active research problem. Semantic segmentation is increasingly used in biomedical applications such as computer-assisted diagnosis [41], robotic surgery [5], and radiotherapy planning and follow-up [20]. As a result, research interest in this domain has grown rapidly. There are several types of semantic segmentation problems in medical imaging. Broadly, the existing tasks can be grouped into four major categories: organ segmentation [13], robotic-instrument segmentation [21, 32], vessel segmentation [7], and cellular and sub-cellular segmentation [28].

After the seminal work of [15] on large-scale image classification using deep convolutional neural networks, the use of deep architectures has not been limited to computer vision [33, 36, 11]; it is equally popular in medical image analysis [35, 16]. With deep learning algorithms, the accuracy of computer vision tasks such as classification, segmentation, and detection has improved significantly [27]. A similar trend has been observed in medical image analysis [2]. This performance gain comes at the cost of many annotated examples (e.g. ImageNet consists of 1M annotated examples). Deep learning algorithms are data-hungry and can demand millions of training examples. Collecting annotated data is, in general, time-consuming and expensive, and it requires experts. In medical imaging the difficulty is twofold: annotations must come from highly trained experts, e.g. radiologists reading MRI or CT scans, and growing privacy concerns make it difficult to obtain even unlabelled examples.


Sharing parameters between a main task and auxiliary tasks [3] has long been a popular way to improve the generalisation of a model trained on a fixed number of examples. Mask R-CNN [12], one of the most popular networks in recent years, shares parameters between its detection and segmentation networks. Similarly, [37] proposed to predict contours as an auxiliary task while training a network for semantic segmentation as the primary task. The major drawback of these methods is the need for annotated examples for both the primary and the auxiliary tasks. Collecting such a heterogeneously labelled training set is even more challenging in the medical image domain.

To tackle the problem of collecting training examples with a heterogeneous set of labels, we propose instead to generate pseudo-labels for the auxiliary task from hand-crafted features. Because hand-crafted features can be extracted in an unsupervised manner, pseudo-labels for an auxiliary task can be generated easily for any type of image. To this end, we leverage the Histogram of Oriented Gradients (HOGs) [6]. Separating the parts of organs and surgical instruments that belong to a common category from unrelated ones plays a significant role in segmenting them accurately, and an auxiliary task that focuses on this aspect helps the network learn a robust representation for semantic segmentation. We therefore chose HOGs, which are carefully designed, state-of-the-art hand-crafted features for object detection [6]. However, any other type of hand-crafted feature can be used in our pipeline to extract the pseudo-labels. Figure 1 shows the HOG map of eye anatomy and a surgical instrument; in it, the HOG map demarcates the surgical instrument from the eye anatomy. Once we extract the HOGs, we treat them as annotations of the auxiliary task and the ground truth semantic map as annotations of the primary task. We extended two popular architectures for semantic segmentation, UNet [29] and U2Net [25], to minimise the losses of both the auxiliary and primary tasks and to train the network in a multi-task manner.

The use of image feature representations as pseudo-labels is growing. Recently, [8] trained a deep network to predict Bags of Visual Words (BoWs) for image classification. Unlike ours, this method relies on learned features extracted from a network trained to minimise an image rotation angle loss. In medical imaging, organs such as the eye bulb, pupils, and colon are either hollow and cylindrical or rotationally invariant, so that pipeline is not directly applicable. In addition, they trained their method to minimise the objective function of a single task, whereas we train our pipeline in a multi-task setup. We summarise our contributions in the following points:

  • We investigated the Histogram of Oriented Gradients to generate pseudo-labels of images and exploited these representations as labels of an auxiliary task.

  • We extended existing semantic segmentation networks to train in a multi-task framework.

  • We applied our method on two challenging medical semantic segmentation data sets. Our extensive experiments demonstrate that our pipeline consistently outperforms the counter-part single task networks.

2 Related Works

Our work falls into the categories of deep multi-task learning with pseudo-labels and self-supervised learning. In this Section, we summarise some of the important past works closely related to our method.

Deep Multi-task Learning for Semantic Segmentation: UNet [29] is one of the earliest and most widely used deep architectures for medical image segmentation. It is a supervised architecture and can handle only semantic maps as ground-truth annotations. A work on pancreas segmentation [30] trains a deep architecture in a multi-stage manner: it predicts a bounding box to localise the pancreas, followed by fine-tuned semantic segmentation. Unlike our approach, this method uses ground truth annotations at both stages. In contrast, we rely on HOGs computed without supervision and train the model to minimise the losses jointly. Another work, on brain lesion segmentation [14], employs a 3D Convolutional Neural Network with a fully connected Conditional Random Field. Similarly, [17] employ self co-attention to improve anatomy segmentation in whole breast ultrasound. However, these methods consider only semantic segmentation maps as ground truth. A recent work on tumour segmentation in 3D breast ultrasound images [42] proposed to train a CNN in a multi-task fashion. [38] modified the UNet architecture to jointly minimise the segmentation and classification losses in ultrasound images. [39] trained a multi-stage multi-task learning framework for breast tumour segmentation in ultrasound images. [34] learns the network parameters to minimise the losses for skin lesion detection, classification, and segmentation. [4] trained a multi-task CNN for semantic segmentation and image-level glaucoma classification. Another work, on histopathology image analysis [26], trained a multi-task network for nucleus classification and segmentation. All of these methods need ground truth annotations for both the main task (semantic segmentation) and the auxiliary tasks, whereas we have annotations only for the primary task and generate pseudo-labels for the auxiliary task.

Self-supervised Learning: In self-supervised learning, the annotations for the pre-text tasks are generated in an unsupervised manner. In general, the parameters of a CNN are first learned to minimise the loss of the pre-text task and then fine-tuned for the downstream tasks. Several ways of generating annotations for pre-text tasks have been investigated in past years. These include image rotation angle [9], colorization [40], image-patch context [22], in-painting [22], etc. These methods mostly pivot on geometric transformations of the images. Which kind of pre-text task is most useful for the end task is still an open research problem. Recently, [8] proposed to learn representations by predicting visual Bags of Words (BoWs). This method, the closest to ours, relies on visual features to generate the pseudo-labels. As mentioned before, it computes BoWs from visual representations extracted from a model trained to minimise an image rotation angle loss. This approach is therefore not directly applicable to our applications, as most of the organs involved, such as the eye and the eye bulb, have rotationally invariant shapes. Unlike most self-supervised pipelines, we propose to minimise the losses of the end task and the pre-text task jointly.

3 Proposed Method

In this Section, we present our pipeline in detail. We start with the description of HOGs followed by the generation of pseudo-labels for the auxiliary task. Afterwards, we explain our approach to extend a single-task semantic segmentation network to a multi-task network. Finally, we explain the overall objectives.

Let $\mathcal{X}$ denote the input image space and $\mathcal{Y}$ the output semantic map space. Our goal is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a given set of training examples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$. In the training set $\mathcal{D}$, $N$ is the total number of training examples, $x_i \in \mathcal{X}$ is an image of size $W \times H \times C$, and $y_i \in \mathcal{Y}$ is its semantic map, where $W$, $H$, and $C$ are the width, height, and number of channels of an image, respectively. Our contribution lies in generating extra annotations of the images in an unsupervised way and in extending a single-task semantic segmentation network to train in a multi-task manner, which improves the performance of semantic segmentation. We make use of HOGs to extract the pseudo-annotations of an image.

3.1 Histogram of Oriented Gradients as Pseudo Labels

HOGs [6] were among the most powerful hand-crafted features in computer vision and medical image analysis, especially for detection, before the advent of data-driven feature extractors such as AlexNet [15], ResNet [11], and UNet [29]. In this paper, we use HOGs for a novel purpose: extracting pseudo-labels for images. To compute HOGs from an image, we first crop and resize the image to the desired width $W$ and height $H$. We then divide the image into non-overlapping patches of width $w$ and height $h$, giving $(W/w) \times (H/h)$ patches in total. For each patch, we apply 1-D discrete derivative masks centred on a pixel in both the horizontal and vertical directions, where $D_x = [-1, 0, 1]$ and $D_y = [-1, 0, 1]^{\top}$ are the horizontal and vertical filtering kernels, respectively. We apply these filters to every pixel of every image patch, as shown in Figure 3.

Figure 2: This diagram shows the overall proposed framework. In the Figure, the main network corresponds to semantic segmentation network (e.g. U2Net), while the auxiliary network is our contribution to extend the single task network to a multi-task network. Training examples in triplet, i.e. input image, ground truth semantic map and pseudo-label computed from HOGs, are fed into the network and train the network jointly.

After applying the kernels centred on every pixel, we compute the histogram of gradients for each patch and append the histograms together. With $g_x$ and $g_y$ the horizontal and vertical filter responses at a pixel, the gradient orientation is computed as $\theta = \arctan(g_y / g_x)$, and each gradient is assigned to the nearest bin. The histogram has $B$ bins, with angles ranging from 0 to 180 degrees. The magnitude of the gradient is computed as $\|g\| = \sqrt{g_x^2 + g_y^2}$. This magnitude determines how much a gradient contributes to the frequency of its bin. In this manner, we estimate the histogram of oriented gradients in every patch. The number of bins and the number of patches determine the dimension of the HOGs and are hyper-parameters in our study; we examine them in Section 4. We concatenate the HOGs of all the patches of an image, and the resulting representation is the pseudo-label $z$ of the image. We add the pseudo-labels to the given training set. The training set with pseudo-labels thus becomes $\{(x_i, y_i, z_i)\}_{i=1}^{N}$, which we use to train the semantic segmentation network in a multi-task setup.
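The pseudo-label generation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `hog_pseudo_label`, the grayscale list-of-rows input, the default of 9 bins, the border handling, and the magnitude-weighted voting without any block normalisation are all choices made for the sketch.

```python
import math

def hog_pseudo_label(image, patch_h, patch_w, n_bins=9):
    """Concatenate per-patch orientation histograms of a 2-D grayscale image.

    image: list of rows (lists of floats). Its height and width are assumed
    to be multiples of patch_h and patch_w (crop or resize beforehand).
    """
    height, width = len(image), len(image[0])
    bin_width = 180.0 / n_bins
    descriptor = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            hist = [0.0] * n_bins
            for r in range(top, top + patch_h):
                for c in range(left, left + patch_w):
                    # Centred 1-D derivative masks [-1, 0, 1]; border pixels
                    # reuse the nearest valid neighbour (an assumption).
                    gx = image[r][min(c + 1, width - 1)] - image[r][max(c - 1, 0)]
                    gy = image[min(r + 1, height - 1)][c] - image[max(r - 1, 0)][c]
                    magnitude = math.hypot(gx, gy)
                    # Unsigned orientation folded into [0, 180) degrees.
                    angle = math.degrees(math.atan2(gy, gx)) % 180.0
                    # Vote into the bin containing the angle, weighted by magnitude.
                    hist[int(angle // bin_width) % n_bins] += magnitude
            descriptor.extend(hist)
    return descriptor

# A 4x4 image with a vertical edge, split into four 2x2 patches.
toy_image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
pseudo_label = hog_pseudo_label(toy_image, patch_h=2, patch_w=2)
print(len(pseudo_label))  # 4 patches x 9 bins = 36 values
```

The descriptor length is the number of patches times the number of bins, which is why both quantities act as hyper-parameters of the pseudo-label dimension.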

3.2 Multi-task semantic segmentation with pseudo labels

For an input image $x$ with ground truth semantic segmentation map $y$ and pseudo-label $z$, we train a semantic segmentation network in a multi-task learning fashion. The primary task is to predict the semantic map and the secondary task is to regress the Histogram of Oriented Gradients (HOGs). We use the categorical cross-entropy loss to predict the semantic map and the mean squared error to predict the HOGs. As mentioned before, UNet and U2Net are two of the most popular and powerful semantic segmentation networks in medical imaging. However, these networks were originally designed to support semantic maps as the only ground truth, so they cannot readily handle our heterogeneously labelled training examples. To enable them to handle pseudo-labels and to share parameters between the tasks, we add a regression unit with two convolutional layers and a fully connected layer to every layer on the decoder side of U2Net, as shown in Figure 2. On UNet, we add only one such unit, at the bottleneck, because UNet has relatively fewer parameters than U2Net. In Figure 2, the left-hand block depicts the U2Net architecture and the right-hand block shows the regression units we introduce. The regression units learn parameters to predict the HOGs correctly. Like UNet, U2Net is an hourglass architecture, but each of its layers itself consists of a UNet. We learn the parameters of the whole architecture to minimise the joint objective given in Equation 1.

Figure 3: Diagram showing the pipeline to extract the Histogram of Oriented Gradients (HOGs). Zoom in for better view.
| Input Shape | Operations |
|---|---|
| | Conv(3,3,1), ReLU(), MaxPool2d(2,2) |
| (3, ) | Conv(3,3,1), ReLU(), MaxPool2d(2,2) |
| (3, ) | Flatten() |
| () | Linear(504) |

Table 1: Architecture of the Auxiliary Task Network to Regress HOGs.

$$\mathcal{L} = \lambda_1 \mathcal{L}_{seg} + \lambda_2 \mathcal{L}_{hog} \qquad (1)$$

In Equation 1, $\mathcal{L}_{seg}$ is the primary task loss, i.e. the cross-entropy loss for predicting the ground truth mask correctly, whereas $\mathcal{L}_{hog}$ is the secondary task loss for predicting the HOGs of the input image, i.e. the mean squared error between the predicted and ground truth HOGs. $\lambda_1$ and $\lambda_2$ are two hyper-parameters that weight the contribution of each loss so that the model parameters generalise best to unseen data for semantic segmentation. We tune these hyper-parameters by cross-validation on the validation set. The details are in Section 4.
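As a rough illustration of how the two losses in Equation 1 combine, the sketch below computes a weighted sum of a per-pixel categorical cross-entropy and a mean squared error on the HOG vector. It is written in plain Python for readability rather than in the PyTorch framework used for the experiments; the function names, the argument names `lambda_1` and `lambda_2`, and the use of a single HOG prediction (rather than one per regression unit) are assumptions of the sketch.

```python
import math

def cross_entropy(logits, target):
    """Categorical cross-entropy for one pixel: -log softmax(logits)[target]."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]

def mean_squared_error(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def joint_loss(pixel_logits, pixel_labels, hog_pred, hog_target,
               lambda_1=1.0, lambda_2=1.0):
    """Weighted sum of the segmentation loss and the HOG regression loss."""
    seg = sum(cross_entropy(l, y) for l, y in zip(pixel_logits, pixel_labels))
    seg /= len(pixel_labels)  # average over pixels
    hog = mean_squared_error(hog_pred, hog_target)
    return lambda_1 * seg + lambda_2 * hog

# Two pixels with two classes each, and a 2-dimensional HOG target.
loss = joint_loss(
    pixel_logits=[[0.0, 0.0], [0.0, 0.0]],
    pixel_labels=[0, 1],
    hog_pred=[1.0, 3.0],
    hog_target=[0.0, 1.0],
)
print(loss)
```

Setting `lambda_2` to zero recovers the single-task segmentation loss, which is the baseline the multi-task variant is compared against.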

4 Experiments

Data sets: We evaluated our method on two publicly available, challenging data sets with diverse characteristics. The CaDIS data set [10] was released at MICCAI 2020 as part of one of the EndoVis challenges. It consists of 25 surgical videos. Each video frame is annotated broadly into eye anatomies, surgical instruments, and miscellaneous categories. Based on the granularity of the segments, [10] organised the challenge into three tasks. Task 1 consists of 8 different segments: four eye anatomies, three miscellaneous objects, and one instrument category. In Task 2, the instrument category is further split into nine classes, resulting in 17 different categories. Finally, in Task 3, the handles of the surgical instruments are annotated at a finer granularity, resulting in 25 different categories to segment. There are 3,550 annotated frames in the train set, 534 in the validation set, and 586 in the test set.

The other data set on which we evaluated our method is Robotic Instrument Segmentation [1], which has been publicly available for research since the MICCAI 2017 challenge. The main task on this data set is to segment surgical instruments from the background. Based on the granularity with which the parts of the surgical instruments are segmented, three tasks were designed in the challenge. Task 1 is to segment the instruments as a whole from the background. Task 2 is to segment the instruments into wrist, jaw, and shaft and to distinguish them from the background. Finally, Task 3 further segments the instrument into seven parts and separates it from the background. There are 10 different folds of videos in total. Following the evaluation protocol described in [1], we report performance on folds 9 and 10 and train on the rest of the videos.

Baseline Architectures: We took two representative architectures for semantic segmentation, UNet [29] and U2Net [25], and applied our method to both. Since our method is generic, it can easily be extended to other architectures. UNet is one of the most widely used architectures in medical image segmentation. It is a lightweight architecture consisting of an encoder and a decoder. The encoder consists of convolutional and pooling layers that map high-dimensional images into a low-dimensional latent space. The decoder takes the latent representation of the image and learns the parameters to predict the correct semantic maps. There are skip connections from encoder layers to decoder layers.

U2Net is a more recently proposed architecture with state-of-the-art performance on multiple computer vision semantic segmentation benchmarks. Like UNet, it is an hourglass architecture with skip connections between the encoder and decoder layers. Unlike UNet, U2Net contains UNet-like structures in every layer of the encoder and decoder, and is therefore also known as a UNet inside a UNet. As a result, it has far more learnable parameters than UNet.

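Evaluation Metrics: We used mean Intersection over Union (mIoU) to compare quantitative performance. For a given class, with $P$ the set of pixels predicted as that class and $G$ the set of pixels labelled as that class in the ground truth, Intersection over Union (IoU) is computed as follows:

$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$

mIoU is the mean of the IoU over classes. In addition to this, we also present an extensive qualitative analysis to make the comparisons.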
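For the mIoU metric used here, a small plain-Python sketch may help make the computation concrete. The function name `mean_iou`, the flat-list input format, and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions of this sketch; the challenge evaluation protocols may aggregate over images or classes differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes.

    pred, target: flat lists of integer class indices of equal length.
    Classes absent from both pred and target are skipped (an assumption).
    """
    ious = []
    for k in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0

# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
print(mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2))
```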

Implementation Details: We implemented our algorithms in the PyTorch framework. For optimisation, we employ the Adam optimiser. We set the initial learning rate to 2e-4 and scale it by a factor of 0.5 every 50k iterations. We train our algorithms for 150k iterations and validate every 1k iterations. We save the checkpoint that performs best on the validation set and report its performance on the test set.
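The learning-rate schedule described above reads as a step decay, which can be written down as a one-line function. This is an interpretation of the text, not the authors' training code; the function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=2e-4, factor=0.5, step=50_000):
    """Step decay: multiply the base rate by `factor` once every `step` iterations."""
    return base_lr * factor ** (iteration // step)

# Learning rate at the start of each 50k-iteration stage of a 150k-iteration run.
for it in (0, 50_000, 100_000):
    print(it, learning_rate(it))
```

In PyTorch, the same behaviour would typically be obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=50000` and `gamma=0.5`, provided the scheduler is stepped once per iteration rather than once per epoch.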

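| Task | # Classes | Validation mIOU: MICCAI’21 | Validation mIOU: U2Net | Validation mIOU: +HOG (Ours) | Test mIOU: MICCAI’21 | Test mIOU: U2Net | Test mIOU: +HOG (Ours) |
|---|---|---|---|---|---|---|---|
| 1 | 8 | 86.7 | 84.9 | 85.5 | 83.7 | 80.2 | 81.4 |
| 2 | 18 | 72.7 | 83.8 | 84.1 | 70.6 | 77.8 | 80.2 |
| 3 | 26 | 66.6 | 82.1 | 83.0 | 59.2 | 78.2 | 78.4 |

Table 2: Summary of quantitative performance comparison on CaDIS data set.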
| Task | # Classes | Video 9 mIOU: MICCAI’17 | Video 9 mIOU: U2Net | Video 9 mIOU: +HOG (Ours) | Video 10 mIOU: MICCAI’17 | Video 10 mIOU: U2Net | Video 10 mIOU: +HOG (Ours) |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 87.7 | 94.2 | 95.6 | 91.7 | 96.0 | 96.2 |
| 2 | 4 | 73.6 | 70.8 | 75.8 | 80.7 | 84.1 | 84.4 |
| 3 | 8 | 35.7 | 57.9 | 65.4 | 79.1 | 89.4 | 91.3 |

Table 3: Summary of quantitative performance comparison on Robotic Instrument Segmentation data set.

Hyper-parameter Selection: There are two critical sets of hyper-parameters in our proposed pipeline. The first is the pair of weights of the primary loss ($\lambda_1$) and the secondary loss ($\lambda_2$) in Equation 1. The second is the dimension of the HOGs. We estimated the values of these hyper-parameters by cross-validation on the validation set. Table 4 summarises the cross-validation for weighting the contributions of the two losses on the CaDIS validation set. We observed that giving the losses equal weight yields the best performance, and we observed a similar trend on the other benchmark. This outcome also highlights the significance of the proposed auxiliary loss in our pipeline. We set both $\lambda_1$ and $\lambda_2$ to 1 in the rest of the experiments. Similarly, Figure 5 shows the performance on the CaDIS validation set as the dimension of the HOGs varies. The highest performance is obtained with a dimension of 502, which we use for the rest of the experiments.

| Weight of losses | | mIOU |
|---|---|---|
| 0.01 | 1.0 | 81.2 |
| 0.1 | 1.0 | 82.1 |
| 1.0 | 1.0 | 82.3 |
| 1.0 | 0.1 | 81.7 |
| 1.0 | 0.01 | 82.0 |

Table 4: Ablation study on weights of losses.
Figure 4: A performance comparison with varying sizes of training data.
Figure 5: A performance comparison with varying dimensions of HOGs.
Figure 6: Qualitative comparison between the proposed method and its counterpart architecture, U2Net, on three different tasks. The first two rows show examples from Task 1, the middle two rows from Task 2, and the last two rows from Task 3.
Figure 7: Qualitative comparison between before and after applying our method on U2Net in the Task 1 of robotic instrument segmentation challenge held in MICCAI 2017.
Figure 8: Qualitative comparison between before and after applying our method on U2Net in the Task 2 of robotic instrument segmentation challenge held in MICCAI 2017.
Figure 9: Qualitative comparison between before and after applying our method on U2Net in the Task 3 of robotic instrument segmentation challenge held in MICCAI 2017.

Quantitative Evaluations: Here, we present the outcomes of our extensive experiments on two different data sets: CaDIS and Robotic Instrument Segmentation. As mentioned before, each benchmark consists of three tasks, giving six tasks across the two data sets. We applied our method to two popular baseline architectures: UNet and U2Net. We evaluate empirical performance using the mean Intersection over Union (mIOU).

Compared to U2Net, UNet is more efficient but less accurate. We evaluate both architectures on CaDIS Task 2, which we chose for its good trade-off between granularity and the number of training examples per category. In this experiment, UNet and U2Net obtained 81.9% and 83.75% mIOU, respectively. We also took different proportions of the training examples and compared the performance of UNet with and without the auxiliary task of predicting HOGs. Figure 4 summarises these experiments. Our technique for extending UNet to a multi-task network improves performance consistently, which shows that our method generalises across varying sizes of training set. For the experiments on the remaining tasks from both data sets, we chose U2Net as our baseline architecture, as its performance is clearly superior to that of UNet.

Table 2 summarises the performance on the three tasks of the CaDIS data set. We compare our performance with the winner of the MICCAI 2021 challenge and with U2Net. From the Table, we can see that our method consistently outperforms U2Net on both the validation set and the test set. Out of 6 different scenarios, our method obtained the highest mIOU in 4 cases, lagging slightly behind the winner of the MICCAI’21 challenge on Task 1. Compared to Task 1, on Task 2 and Task 3 the mIOU of the winning method of MICCAI’21 dropped by a large margin (-20%). In contrast, our method shows only a slight drop in performance (-2.0%). This shows the robustness of the proposed pipeline to an increase in the granularity of the segmentation tasks.

Similarly, Table 3 details the performance comparison on Robotic Instrument Segmentation. We followed the evaluation protocol presented in the challenge paper and compared our performance with that of the winning model. In every task, our method obtained the highest mIOU, surpassing both the winning team and our baseline U2Net. As the granularity of the segmentation task increases, the mIOU of the winning method drops by up to -50%, whereas the drop for our method is at most -30.2%. This is further evidence that our method is more robust than contemporary methods.

Qualitative Evaluations: We did not limit our experiments to quantitative evaluations. To better understand how our method improves the performance of an existing architecture such as U2Net, we performed an extensive qualitative analysis. Figure 6 shows qualitative comparisons for Task 1, Task 2, and Task 3 on the CaDIS data set. The bounding boxes mark some representative regions on the eye and the surgical instrument where U2Net fails but our method segments correctly. From these regions, we can see that the ability of HOGs to identify the boundaries of organs and tools plays a crucial role in correctly segmenting the organs and the semantic parts of the surgical tools.

Similarly, Figure 7, Figure 8, and Figure 9 show qualitative comparisons for Task 1, Task 2, and Task 3 on robotic instrument segmentation. In these analyses, we observe trends similar to those seen on the CaDIS data set. U2Net struggles considerably in boundary regions, and our method enables correct segmentation in such regions. The red bounding boxes in the Figures mark cases where the baseline fails, whereas the green bounding boxes show the corrections made by our method.

5 Conclusions and Future Works

In this paper, we present a novel multi-task deep learning framework for medical image segmentation. We generate the annotations of the auxiliary task in an unsupervised manner, using the Histogram of Oriented Gradients of each image as its label. We train the deep network to jointly minimise the losses of both the primary task, semantic segmentation, and the auxiliary task. From our extensive qualitative and quantitative experiments on two challenging medical image segmentation benchmarks, we observe that the proposed pipeline outperforms its single-task counterpart. In future work, we plan to explore higher-order statistics of hand-crafted features, such as Fisher Vectors, as image annotations for training the multi-task deep semantic network.

6 Acknowledgement

This project is funded by the EndoMapper project by Horizon 2020 FET (GA 863146). For the purpose of open access, the author has applied a CC BY public copyright licence to any author accepted manuscript version arising from this submission.


  • [1] M. Allan, A. Shvets, T. Kurmann, Z. Zhang, R. Duggal, Y. Su, N. Rieke, I. Laina, N. Kalavakonda, S. Bodenstedt, et al. (2019) 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426. Cited by: §4.
  • [2] S. M. Anwar, M. Majid, A. Qayyum, M. Awais, M. Alnowami, and M. K. Khan (2018) Medical image analysis using convolutional neural networks: a review. Journal of medical systems 42 (11), pp. 1–13. Cited by: §1.
  • [3] R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: §1.
  • [4] A. Chakravarty and J. Sivswamy (2018) A deep learning based joint segmentation and classification framework for glaucoma assesment in retinal color fundus images. arXiv preprint arXiv:1808.01355. Cited by: §2.
  • [5] E. Colleoni, P. Edwards, and D. Stoyanov (2020) Synthetic and real inputs for tool segmentation in robotic surgery. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 700–710. Cited by: §1.
  • [6] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In CVPR, Cited by: §1, §3.1.
  • [7] M. M. Fraz, P. Remagnino, A. Hoppe, B. Uyyanonvara, A. R. Rudnicka, C. G. Owen, and S. A. Barman (2012) Blood vessel segmentation methodologies in retinal images–a survey. Computer methods and programs in biomedicine 108 (1), pp. 407–433. Cited by: §1.
  • [8] S. Gidaris, A. Bursuc, N. Komodakis, P. Pérez, and M. Cord (2020) Learning representations by predicting bags of visual words. In CVPR, pp. 6928–6938. Cited by: §1, §2.
  • [9] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §2.
  • [10] M. Grammatikopoulou, E. Flouty, A. Kadkhodamohammadi, G. Quellec, A. Chow, J. Nehme, I. Luengo, and D. Stoyanov (2021) CaDIS: cataract dataset for surgical rgb-image segmentation. Medical Image Analysis. Cited by: §4.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §3.1.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §1.
  • [13] P. Hu, F. Wu, J. Peng, Y. Bao, F. Chen, and D. Kong (2017) Automatic abdominal multi-organ segmentation using deep convolutional neural network and time-implicit level sets. International journal of computer assisted radiology and surgery 12 (3), pp. 399–411. Cited by: §1.
  • [14] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker (2017) Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis 36, pp. 61–78. Cited by: §2.
  • [15] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NIPS, Cited by: §1, §3.1.
  • [16] J. Lee, S. Jun, Y. Cho, H. Lee, G. B. Kim, J. B. Seo, and N. Kim (2017) Deep learning in medical imaging: general overview. Korean journal of radiology 18 (4), pp. 570–584. Cited by: §1.
  • [17] B. Lei, S. Huang, H. Li, R. Li, C. Bian, Y. Chou, J. Qin, P. Zhou, X. Gong, and J. Cheng (2020) Self-co-attention neural network for anatomy segmentation in whole breast ultrasound. Medical image analysis 64, pp. 101753. Cited by: §2.
  • [18] T. Lei, R. Wang, Y. Wan, X. Du, H. Meng, and A. K. Nandi (2020) Medical image segmentation using deep learning: a survey. arXiv e-prints. Cited by: §1.
  • [19] F. Milletari, N. Navab, and S. Ahmadi (2016) V-Net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §1.
  • [20] T. Nemoto, N. Futakami, M. Yagi, E. Kunieda, T. Akiba, A. Takeda, and N. Shigematsu (2020) Simple low-cost approaches to semantic segmentation in radiation therapy planning for prostate cancer using deep learning with non-contrast planning CT images. Physica Medica 78, pp. 93–100. Cited by: §1.
  • [21] D. Pakhomov, V. Premachandran, M. Allan, M. Azizian, and N. Navab (2019) Deep residual learning for instrument segmentation in robotic surgery. In International Workshop on Machine Learning in Medical Imaging, pp. 566–573. Cited by: §1.
  • [22] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. In CVPR. Cited by: §2.
  • [23] J. Peng, P. Wang, C. Desrosiers, and M. Pedersoli (2021) Self-paced contrastive learning for semi-supervised medical image segmentation with meta-labels. In NeurIPS. Cited by: §1.
  • [24] D. L. Pham, C. Xu, and J. L. Prince (2000) Current methods in medical image segmentation. Annual review of biomedical engineering 2 (1), pp. 315–337. Cited by: §1.
  • [25] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, and M. Jagersand (2020) U2-Net: going deeper with nested U-structure for salient object detection. Pattern Recognition 106, pp. 107404. Cited by: §1, §4.
  • [26] H. Qu, G. Riedlinger, P. Wu, Q. Huang, J. Yi, S. De, and D. Metaxas (2019) Joint segmentation and fine-grained classification of nuclei in histopathology images. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 900–904. Cited by: §2.
  • [27] W. Rawat and Z. Wang (2017) Deep convolutional neural networks for image classification: a comprehensive review. Neural computation 29 (9), pp. 2352–2449. Cited by: §1.
  • [28] A. Rizk, G. Paul, P. Incardona, M. Bugarski, M. Mansouri, A. Niemann, U. Ziegler, P. Berger, and I. F. Sbalzarini (2014) Segmentation and quantification of subcellular structures in fluorescence microscopy images using Squassh. Nature protocols 9 (3), pp. 586–596. Cited by: §1.
  • [29] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI. Cited by: §1, §2, §3.1, §4.
  • [30] H. R. Roth, L. Lu, N. Lay, A. P. Harrison, A. Farag, A. Sohn, and R. M. Summers (2018) Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Medical Image Analysis 45, pp. 94–107. Cited by: §2.
  • [31] N. Sharma and L. M. Aggarwal (2010) Automated medical image segmentation techniques. Journal of medical physics/Association of Medical Physicists of India 35 (1), pp. 3. Cited by: §1.
  • [32] A. A. Shvets, A. Rakhlin, A. A. Kalinin, and V. I. Iglovikov (2018) Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 624–628. Cited by: §1.
  • [33] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR. Cited by: §1.
  • [34] L. Song, J. Lin, Z. J. Wang, and H. Wang (2020) An end-to-end multi-task deep learning framework for skin lesion analysis. IEEE journal of biomedical and health informatics. Cited by: §2.
  • [35] K. Suzuki (2017) Overview of deep learning in medical imaging. Radiological physics and technology 10 (3), pp. 257–273. Cited by: §1.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR. Cited by: §1.
  • [37] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler (2019) Gated-SCNN: gated shape CNNs for semantic segmentation. In ICCV. Cited by: §1.
  • [38] P. Wang, V. M. Patel, and I. Hacihaliloglu (2018) Simultaneous segmentation and classification of bone surfaces from ultrasound using a multi-feature guided CNN. In MICCAI. Cited by: §2.
  • [39] X. Xie, F. Shi, J. Niu, and X. Tang (2018) Breast ultrasound image classification and segmentation using convolutional neural networks. In Pacific Rim Conference on Multimedia. Cited by: §2.
  • [40] R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV. Cited by: §2.
  • [41] Y. Zhao, D. Xue, Y. Wang, R. Zhang, B. Sun, Y. Cai, H. Feng, Y. Cai, and J. Xu (2019) Computer-assisted diagnosis of early esophageal squamous cell carcinoma using narrow-band imaging magnifying endoscopy. Endoscopy 51 (04), pp. 333–341. Cited by: §1.
  • [42] Y. Zhou, H. Chen, Y. Li, Q. Liu, X. Xu, S. Wang, P. Yap, and D. Shen (2021) Multi-task learning for segmentation and classification of tumors in 3D automated breast ultrasound images. Medical Image Analysis 70, pp. 101918. Cited by: §2.