1 Introduction
Skin cancer is one of the most common types of cancer, and early diagnosis is critical for effective treatment [10]. In recent years, computer-aided diagnosis based on dermoscopy images has been widely researched to complement human assessment. Skin lesion segmentation is the task of separating lesion pixels from the background. Segmentation is nontrivial due to the significant variance in lesion shape, color, and texture. Nevertheless, segmentation remains a common precursor to automatic diagnosis, as it ensures that subsequent analysis (e.g., classification) concentrates on the skin lesion itself and discards irrelevant regions.
Since the emergence of fully convolutional networks (FCNs) for semantic image segmentation [5], FCN-based methods have become increasingly popular in medical image segmentation. In particular, U-Net [9] leveraged the encoder-decoder architecture and applied skip connections to merge low-level and high-level convolutional features, so that finer details can be preserved. FCN and U-Net have become the most common baseline models, on which many proposed variants for skin lesion segmentation are based. Venkatesh et al. [14] and Ibtehaz et al. [3] modified U-Net, designing more complex residual connections within each block of the encoders and decoders. Yuan et al. [17] and Mirikharaji et al. [6] introduced, respectively, a Jaccard-distance-based loss and a star-shape loss to refine the segmentation results of baseline models employing the cross-entropy (CE) loss. Oktay et al. [7] proposed an attention gate to filter the features propagated through the skip connections of U-Net.
Despite the success of the aforementioned FCN-based methods, they all assume that reliable ground truth annotations are abundant, which is not always the case in practice, not only because collecting pixel-level annotations is time-consuming, but also because human annotations are inherently noisy. Further, annotations suffer from inter- and intra-observer variation even among experts, as the boundary of a lesion is often ambiguous. On the other hand, since the high capacity of deep neural networks (DNNs) enables them to memorize a random labeling of training data [18], DNNs are potentially prone to overfitting to noisy labels. Therefore, treating the annotations as completely accurate and reliable may lead to biased models with weak generalization ability. This motivates the need for models that are more robust to label noise.
Previous works on learning a deep classification model from noisy labels can be categorized into two groups. First, various methods were proposed to model the label noise jointly with learning a discriminative neural network. For example, probabilistic graphical models were used to discover the relation between data, clean labels, and noisy labels, with the clean labels treated as latent variables related to the observed noisy labels [16, 12]. Sukhbaatar et al. [11] and Goldberger et al. [2] incorporated an additional layer in the network dedicated to learning the noise distribution. Veit et al. [13] proposed a multi-task network that learns a mapping from noisy to clean annotations as well as a classifier fine-tuned on the clean set and the full dataset with reduced noise.
Instead of learning the noise model, the second group of methods concentrates on reweighting the loss function. Jiang et al. [4] utilized a long short-term memory (LSTM) network to predict sample weights given a sequence of their cost values. Wang et al. [15] designed an iterative learning approach composed of a noisy-label detection module and a discriminative feature learning module, combined with a reweighting module on the softmax loss to emphasize learning from clean labels and reduce the influence of noisy labels. Recently, a more elaborate reweighting method based on a meta-learning algorithm was proposed to assign weights to classification samples based on their gradient directions [8]. A small set of clean data is leveraged in this reweighting strategy to evaluate the noisy samples' gradient directions and assign larger weights to samples whose gradients are closer to those of the clean dataset.
In this work, we aim to extend the idea of example reweighting, explored previously for the classification problem, to the task of pixel-level segmentation. We propose the first method to learn a set of spatially adaptive weight maps associated with training skin images and adjust the contribution of each pixel in the optimization of the deep network. Inspired by Ren et al. [8], importance weights are assigned to pixels based on the pixel-wise loss gradient directions. A meta-learning step is integrated into every training iteration to approximate the optimal weight maps of the current batch based on the CE loss on a small set of skin lesion images annotated by experts. Learning the deep skin lesion segmentation network and the spatially adaptive weight maps is performed in an end-to-end manner. Our experiments show how efficient leveraging of a small clean dataset makes a deep segmentation network robust to annotation noise.
2 Methodology
Our goal is to leverage a combination of a small set of expensive expert-level annotations and a large set of unreliable noisy annotations, acquired from, e.g., novice dermatologists or crowdsourcing platforms, in the learning of a fully convolutional segmentation network.
FCN’s average loss.
In the fully supervised setting, with the assumption that high-quality clean annotations are available for a large dataset of $N$ images and their corresponding pixel-wise segmentation maps, the parameters $\theta$ of a fully convolutional segmentation network are learned by minimizing the negative log-likelihood of the generated segmentation probability maps in the cost function $\mathcal{L}(\theta)$:

$$\mathcal{L}(\theta) = -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j\in\Omega_i}\log p(y_{ij}\mid x_{ij};\theta), \qquad (1)$$

where $M$ is the number of pixels in an image, $\Omega_i$ is the pixel space of image $i$, $x_{ij}$ and $y_{ij}$ refer, in order, to pixel $j$ of image $i$ and its ground truth label, and $p(y_{ij}\mid x_{ij};\theta)$ is the predicted probability. As the same level of trust is assumed in the pixel-level annotations of this clean training data, the final value of the loss function is averaged equally over all pixels of the training images.
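As a minimal illustration (a sketch with assumed array shapes, not the paper's implementation), the equally averaged negative log-likelihood in (1) amounts to:

```python
import numpy as np

def average_ce_loss(p_true):
    """Eq. (1)-style loss: negative log-likelihood averaged equally over
    every pixel of every image. `p_true` is an (N, M) array holding, for
    each of N images and M pixels, the probability the network assigns
    to that pixel's ground-truth label."""
    return -np.mean(np.log(p_true))

# A perfectly uncertain prediction (p = 0.5 everywhere) gives loss log 2.
loss = average_ce_loss(np.full((2, 4), 0.5))
```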
FCN’s weighted loss. As opposed to the fully supervised setting, when the presence of noise in most training annotations is inevitable and only a limited amount of data can be verified by human experts, our training data comprise two sets: $\mathcal{D}_c$ with verified clean labels and $\mathcal{D}_n$ with unverified noisy labels. We also assume that $|\mathcal{D}_c| \ll |\mathcal{D}_n|$. Correspondingly, we have two losses, $\mathcal{L}_c$ and $\mathcal{L}_w$. Whereas $\mathcal{L}_c$ weights all pixels equally, $\mathcal{L}_w$ penalizes the log-likelihood of the predicted pixel probabilities weighted according to the amount of noise:

$$\mathcal{L}_c(\theta) = -\frac{1}{|\mathcal{D}_c|\,M}\sum_{i\in\mathcal{D}_c}\sum_{j\in\Omega_i}\log p(y_{ij}\mid x_{ij};\theta), \qquad (2)$$

$$\mathcal{L}_w(\theta;\mathcal{W}) = -\sum_{i\in\mathcal{D}_n}\sum_{j\in\Omega_i} w_{ij}\,\log p(y_{ij}\mid x_{ij};\theta), \qquad (3)$$

where $w_{ij}$ is the weight associated with pixel $j$ of image $i$. All the weights of the pixels of image $i$ are collected in a spatially adaptive weight map $W_i$, and the weight maps associated with all noisy training images are collected in $\mathcal{W} = \{W_i\}_{i\in\mathcal{D}_n}$.
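As an illustrative sketch (array shapes are assumed; this is not the paper's code), the weighted loss in (3) differs from the plain average only in the per-pixel weights:

```python
import numpy as np

def weighted_ce_loss(p_true, weights):
    """Eq. (3)-style loss: per-pixel weighted negative log-likelihood over
    the noisy set. `weights` has the same shape as `p_true`; a zero entry
    removes that pixel's (possibly noisy) annotation from the objective."""
    return -np.sum(weights * np.log(p_true))

p = np.full((2, 4), 0.5)          # uncertain predictions: 2 images, 4 pixels
w = np.full((2, 4), 1.0 / 8.0)    # uniform weights summing to one
loss_uniform = weighted_ce_loss(p, w)       # reduces to the plain average
loss_ignored = weighted_ce_loss(p, 0 * w)   # zero weights ignore every pixel
```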
Model optimization. The noise-robust network parameters are found by optimizing the weighted objective function (as opposed to the equal weighting in (1)) on the noisy annotated data $\mathcal{D}_n$, as follows:

$$\theta^{*}(\mathcal{W}) = \arg\min_{\theta}\ \mathcal{L}_w(\theta;\mathcal{W}). \qquad (4)$$
Optimal spatially adaptive weights. The optimal value of the unknown weight maps $\mathcal{W}$ is obtained by minimizing the expectation of the negative log-likelihood in the meta-objective function over the clean training data $\mathcal{D}_c$:

$$\mathcal{W}^{*} = \arg\min_{\mathcal{W},\,\mathcal{W}\ge 0}\ \mathcal{L}_c\big(\theta^{*}(\mathcal{W})\big). \qquad (5)$$
Efficient meta-training. Solving (5) to optimize the spatially adaptive weight maps for each update step of the network parameters in (4) is inefficient. Instead, an online meta-learning approach is utilized to approximate $\mathcal{W}^{*}$ for every gradient descent step involved in optimizing (4). At every update step $t$ of (4), we pass a mini-batch of noisy data forward through the network and then compute one gradient descent step toward the minimization of $\mathcal{L}_w$:

$$\hat{\theta}_{t+1}(\mathcal{W}) = \theta_t - \alpha\,\nabla_{\theta}\,\mathcal{L}_w(\theta;\mathcal{W})\big|_{\theta=\theta_t}, \qquad (6)$$

where $\alpha$ is the gradient descent learning rate and the initial spatial weight maps $\mathcal{W}$ are set to zero. Next, a mini-batch of clean data is fed forward through the network with parameters $\hat{\theta}_{t+1}$, and the gradient of $\mathcal{L}_c$ with respect to the current batch weight maps is computed. We then take a single step toward the minimization of $\mathcal{L}_c$, as per (5), and pass the output to a rectifier function as follows:
$$\widetilde{\mathcal{W}} = -\beta\,\nabla_{\mathcal{W}}\,\mathcal{L}_c\big(\hat{\theta}_{t+1}(\mathcal{W})\big)\big|_{\mathcal{W}=0}, \qquad (7)$$

$$\overline{w}_{ij} = \max(\widetilde{w}_{ij},\,0), \qquad (8)$$

where $\beta$ is a gradient descent learning rate. Following the convention of averaging the loss over the samples of a mini-batch when training a deep network, we normalize the learned weight maps such that $\sum_{i}\sum_{j}\overline{w}_{ij} = 1$.
Equations (7) and (8) clarify how the learned weight maps prevent penalizing pixels whose gradient direction is dissimilar to the direction of the gradient on the clean data. A negative element $\widetilde{w}_{ij}$ (associated with pixel $j$ of image $i$) implies a positive gradient in (7), meaning that increasing the weight $w_{ij}$ assigned to pixel $j$ increases the loss value on the clean data. By rectifying the values of $\widetilde{\mathcal{W}}$ in (8), we assign zero weight to such pixels and prevent penalizing them in the loss function. In addition, the rectifier function keeps the loss non-negative (cf. (3)) and results in more stable optimization.
Once the spatially adaptive weight maps are learned, a final backward pass minimizes the reweighted objective function and updates the network parameters from $\theta_t$ to $\theta_{t+1}$:

$$\theta_{t+1} = \theta_t - \alpha\,\nabla_{\theta}\,\mathcal{L}_w(\theta;\overline{\mathcal{W}})\big|_{\theta=\theta_t}. \qquad (9)$$
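To make the meta-step concrete, the following toy sketch instantiates the logic of (6)-(9) for a per-pixel logistic model with hand-derived gradients; the actual method uses a U-Net and automatic differentiation, so everything below (model, shapes, learning rates) is an illustrative assumption. Because the weight maps are initialized to zero, the lookahead parameters in (6) equal the current parameters, and the rectified weight of each noisy pixel reduces to the (rectified) alignment between its loss gradient and the clean mini-batch gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_grads(theta, X, Y):
    """Per-pixel gradient of -log p(y|x; theta) for a logistic pixel model.
    X: (n, d) pixel features, Y: (n,) binary labels. Returns an (n, d)
    array with one gradient row per pixel."""
    p = sigmoid(X @ theta)            # predicted foreground probability
    return (p - Y)[:, None] * X       # d(-log p)/dtheta = (p - y) * x

def reweighting_step(theta, Xn, Yn, Xc, Yc, alpha=0.1, beta=1.0):
    """One online meta-step in the spirit of Eqs. (6)-(9). With the weight
    maps initialised to zero, the lookahead parameters of Eq. (6) coincide
    with theta, and the chain rule reduces Eq. (7) to a dot product between
    each noisy pixel's gradient and the mean clean gradient; Eq. (8)
    rectifies, and Eq. (9) applies the weighted update."""
    gn = pixel_grads(theta, Xn, Yn)               # noisy per-pixel gradients
    gc = pixel_grads(theta, Xc, Yc).mean(axis=0)  # mean clean gradient
    w_tilde = beta * alpha * (gn @ gc)            # Eq. (7): gradient alignment
    w_bar = np.maximum(w_tilde, 0.0)              # Eq. (8): rectifier
    if w_bar.sum() > 0:
        w_bar = w_bar / w_bar.sum()               # normalise weights to sum to 1
    theta_next = theta - alpha * (w_bar[:, None] * gn).sum(axis=0)  # Eq. (9)
    return w_bar, theta_next

# Toy example: one correctly and one incorrectly annotated pixel.
theta = np.zeros(1)
Xn, Yn = np.array([[1.0], [1.0]]), np.array([1.0, 0.0])  # second label flipped
Xc, Yc = np.array([[1.0]]), np.array([1.0])              # verified clean pixel
w, theta_next = reweighting_step(theta, Xn, Yn, Xc, Yc)
```

The mislabeled pixel's gradient points away from the clean gradient, so its weight is rectified to zero and it contributes nothing to the parameter update.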
3 Experiments and Discussion
Data description. We validated our spatially adaptive reweighting approach on data provided by the International Skin Imaging Collaboration (ISIC) in 2017 [1]. The data consist of training, validation, and test images with their corresponding segmentation masks. The same split of validation and test data is used for setting the hyperparameters and reporting the final results. We resized all images to a fixed resolution and normalized each RGB channel with the per-channel mean and standard deviation of the training data.
To create noisy ground truth annotations, we consider a lesion boundary as a closed polygon and simplify it by reducing its number of vertices: less important vertices are discarded first, where the importance of each vertex is proportional to the acuteness of the angle formed by the two adjacent polygon line segments and their lengths. 7-vertex, 3-vertex, and axis-aligned-vertex polygons are generated to represent different levels of annotation noise in our experiments. To simulate an unsupervised setting, as an extreme level of noise, we automatically generated segmentation maps that cover the whole image (excluding a thin band around the image perimeter). Fig. 1 shows a sample lesion image and its associated ground truth as well as the generated noisy annotations.
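The vertex-dropping scheme above can be sketched with a Visvalingam-Whyatt-style criterion, which scores each vertex by the area of the triangle formed with its two neighbours; this area grows with both the sharpness of the turn and the lengths of the adjacent segments, so it approximates (but may not exactly match) the importance measure used in the paper:

```python
import numpy as np

def triangle_area(a, b, c):
    """Area of triangle (a, b, c); larger for sharper turns and longer
    adjacent segments, so it serves as a vertex-importance score."""
    return 0.5 * abs((b[0]-a[0])*(c[1]-a[1]) - (c[0]-a[0])*(b[1]-a[1]))

def simplify_polygon(vertices, k):
    """Greedily drop the least important vertex of a closed polygon until
    only k vertices remain (a stand-in for the paper's angle-and-length
    importance measure)."""
    verts = [tuple(v) for v in vertices]
    while len(verts) > k:
        n = len(verts)
        scores = [triangle_area(verts[i-1], verts[i], verts[(i+1) % n])
                  for i in range(n)]
        verts.pop(int(np.argmin(scores)))   # discard least important vertex
    return verts

# A collinear vertex on a square's edge is dropped first.
simplified = simplify_polygon([(0, 0), (1, 0), (2, 0), (2, 2), (0, 2)], 4)
```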
Implementation.
We use the PyTorch framework to implement the deep reweighting network. We adopt the fully convolutional U-Net architecture [9], initialized with a random Gaussian distribution. We use stochastic gradient descent with momentum and weight decay to learn the network parameters $\theta$ from scratch as well as the spatial weight maps $\mathcal{W}$, over mini-batches of noisy and clean data. The initial learning rates $\alpha$ and $\beta$ are reduced whenever the validation performance stops improving. Training the deep reweighting network took three days on our GPU.
Reweighting vs. fine-tuning. A popular way of training a deep network when a small set of clean data and a large set of noisy data are available is to pre-train the network on the noisy dataset and then fine-tune it on the clean dataset. Considering this fine-tuning approach as our baseline, by learning spatially adaptive weight maps we expect to leverage clean annotations more effectively and achieve improved performance. To study the performance of the fine-tuning and reweighting approaches, we start with images annotated by simplified polygons and gradually replace some of the noisy annotations with expert-level clean annotations, i.e., increase $|\mathcal{D}_c|$. We report the segmentation Dice score on the test set in Fig. 2. The first (leftmost) point on the fine-tuning curve indicates the result of U-Net when all annotations are noisy, and the last point corresponds to a fully supervised U-Net. When all annotations are either clean or noisy, training the reweighting network is not applicable. Comparing the fine-tuning and reweighting results, we observe a consistent improvement in the test Dice score when the proposed reweighting algorithm is deployed. In particular, a bigger boost in performance when the clean set is smaller signifies our method's ability to effectively utilize even a handful of clean samples.
Size of the clean dataset. Fig. 2 also shows the effect of the clean data size, $|\mathcal{D}_c|$, on the reweighting network performance. Our results show that leveraging even a small number of clean annotations in the reweighting algorithm improves the test Dice score in comparison to training U-Net on all-noisy annotations, and that reweighting achieves a test Dice score (80%) almost equal to that of the fully supervised approach. With only 100 clean image annotations, the reweighting method outperforms the fully supervised model trained with 2000 clean annotations. Incrementing $|\mathcal{D}_c|$ from 50 to 1990, the reweighting approach improves the test Dice score only slightly, questioning whether such a small increase in accuracy is worth the 40-fold increase in annotation effort. Outperforming the supervised setting using the reweighting algorithm suggests that the adaptive loss reweighting strategy acts as a regularizer and improves the generalization ability of the deep network.
Robustness to noise. In our next experiment, we examine how the level of noise in the training data affects the performance of the reweighting deep network in comparison to fine-tuning. We utilized four sets of noisy annotations: (i) 7-vertex, (ii) 3-vertex, and (iii) axis-aligned-vertex simplified polygons as segmentation maps; and (iv) unsupervised coarse segmentation masks, where each set corresponds to a level of annotation noise (Fig. 1). Fixing the sizes of the clean and noisy sets, the segmentation Dice scores on the test images for the reweighting and fine-tuning approaches are reported in Table 1. We observe that deploying the proposed reweighting algorithm with 3-vertex annotations outperforms learning from accurate delineations without reweighting. Also, increasing the level of noise from 7-vertex to 3-vertex polygon masks in the noisy data results in just a 1% Dice score drop when deploying reweighting, compared to a 3% drop with fine-tuning.
noise type  fine-tuning  proposed reweighting
A  no noise (fully supervised)  78.63%  not applicable
B  7-vertex  76.12%  80.72%
C  axis-aligned-vertex  75.04%  80.29%
D  3-vertex  73.02%  79.45%
E  maximal (unsupervised)  70.45%  73.55%
Qualitative results. To examine the spatially adaptive weights more closely, for some sample images we overlay the learned weight maps, at training iterations 1K and 100K, over the incorrectly annotated pixel masks (Fig. 3). To avoid overfitting to annotation noise, we expect the meta-learning step to assign zero weights to noisy pixels (the white pixels in Fig. 3(d)). Fig. 3(e,f) confirms that the model consistently learns to assign zero (or very close to zero) weights to noisily annotated pixels (cyan pixels), which ultimately results in the predicted segmentation maps in Fig. 3(c) that, qualitatively, closely resemble the unseen expert-delineated contours shown in Fig. 3(a).
4 Conclusion
By learning spatially adaptive maps to perform pixel-wise weighting of a segmentation loss, we were able to effectively leverage a limited amount of cleanly annotated data in training a deep segmentation network that is robust to annotation noise. We demonstrated, on a skin lesion image dataset, that our method can greatly reduce the requirement for careful labeling of images without sacrificing segmentation accuracy. Our reweighting segmentation network is trained end-to-end, can be combined with any segmentation network architecture, and does not require any additional hyperparameter tuning.
Acknowledgments. We thank NVIDIA Corporation for the donation of Titan X GPUs used in this research and Compute Canada for partial funding.
References
 [1] Codella et al.: Skin lesion analysis toward melanoma detection: A challenge at the 2017 ISBI. arXiv:1710.05006 (2017)
 [2] Goldberger, J., Ben-Reuven, E.: Training deep neural networks using a noise adaptation layer. In: ICLR (2017)
 [3] Ibtehaz et al.: MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation. arXiv preprint arXiv:1902.04049 (2019)
 [4] Jiang et al.: MentorNet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055 (2017)
 [5] Long et al.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR. pp. 3431–3440 (2015)
 [6] Mirikharaji, Z., Hamarneh, G.: Star shape prior in fully convolutional networks for skin lesion segmentation. In: MICCAI. pp. 737–745 (2018)
 [7] Oktay et al.: Attention U-Net: learning where to look for the pancreas. In: MIDL (2018)
 [8] Ren et al.: Learning to reweight examples for robust deep learning. In: ICML (2018)
 [9] Ronneberger et al.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241 (2015)
 [10] Siegel et al.: Cancer statistics. CA: a cancer journal for clinicians 67(1), 7–30 (2017)
 [11] Sukhbaatar et al.: Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080 (2014)
 [12] Vahdat, A.: Toward robustness against label noise in training deep discriminative neural networks. In: NIPS. pp. 5596–5605 (2017)
 [13] Veit et al.: Learning from noisy large-scale datasets with minimal supervision. In: IEEE CVPR. pp. 839–847 (2017)
 [14] Venkatesh et al.: A deep residual architecture for skin lesion segmentation. In: OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 277–284 (2018)
 [15] Wang et al.: Iterative learning with open-set noisy labels. In: IEEE CVPR. pp. 8688–8696 (2018)
 [16] Xiao et al.: Learning from massive noisy labeled data for image classification. In: IEEE CVPR. pp. 2691–2699 (2015)
 [17] Yuan et al.: Automatic skin lesion segmentation using deep fully convolutional networks with Jaccard distance. IEEE TMI 36(9), 1876–1886 (2017)
 [18] Zhang et al.: Understanding deep learning requires rethinking generalization. In: ICLR (2017)