Learning to Segment Skin Lesions from Noisy Annotations

06/10/2019 ∙ by Zahra Mirikharaji, et al. ∙ Simon Fraser University

Deep convolutional neural networks have driven substantial advancements in the automatic understanding of images. Requiring a large collection of images and their associated annotations is one of the main bottlenecks limiting the adoption of deep networks. In the task of medical image segmentation, requiring pixel-level semantic annotations performed by human experts exacerbates this difficulty. This paper proposes a new framework to train a fully convolutional segmentation network from a large set of cheap unreliable annotations and a small set of expert-level clean annotations. We propose a spatially adaptive reweighting approach to treat clean and noisy pixel-level annotations commensurately in the loss function. We deploy a meta-learning approach to assign higher importance to pixels whose loss gradient direction is closer to that of clean data. Our experiments on training the network using segmentation ground truth corrupted with different levels of annotation noise show how spatial reweighting improves the robustness of deep networks to noisy annotations.




1 Introduction

Skin cancer is one of the most common types of cancer, and early diagnosis is critical for effective treatment [10]. In recent years, computer-aided diagnosis based on dermoscopy images has been widely researched to complement human assessment. Skin lesion segmentation is the task of separating lesion pixels from background. Segmentation is a nontrivial task due to the significant variance in shape, color, texture, etc. Nevertheless, segmentation remains a common precursor step for automatic diagnosis as it ensures subsequent analysis (e.g. classification) concentrates on the skin lesion itself and discards irrelevant regions.

Since the emergence of fully convolutional networks (FCN) for semantic image segmentation [5], FCN-based methods have been increasingly popular in medical image segmentation. In particular, U-Net [9] leveraged the encoder-decoder architecture and applied skip-connections to merge low-level and high-level convolutional features, so that more refined details can be preserved. FCN and U-Net have become the most common baseline models, on which many proposed variants for skin lesion segmentation were based. Venkatesh et al. [14] and Ibtehaz et al. [3] modified U-Net, designing more complex residual connections within each block of the encoders and the decoders. Yuan et al. [17] and Mirikharaji et al. [6] introduced, in order, a Jaccard distance based loss function and a star-shape loss function to refine the segmentation results of the baseline models employing cross-entropy (CE) loss. Oktay et al. [7] proposed an attention gate to filter the features propagated through the skip connections of U-Net.

Despite the success of the aforementioned FCN-based methods, they all assume that reliable ground truth annotations are abundant, which is not always the case in practice, not only because collecting pixel-level annotations is time-consuming, but also because human annotations are inherently noisy. Further, annotations suffer from inter/intra-observer variation even among experts, as the boundary of the lesion is often ambiguous. On the other hand, as the high capacity of deep neural networks (DNN) enables them to memorize a random labeling of training data [18], DNNs are prone to overfitting to noisy labels. Therefore, treating the annotations as completely accurate and reliable may lead to biased models with weak generalization ability. This motivates the need for constructing models that are more robust to label noise.

Previous works on learning a deep classification model from noisy labels can be categorized into two groups. Firstly, various methods were proposed to model the label noise, together with learning a discriminative neural network. For example, probabilistic graphical models were used to discover the relation between data, clean labels and noisy labels, with the clean labels treated as latent variables related to the observed noisy labels [16, 12]. Sukhbaatar et al. [11] and Goldberger et al. [2] incorporated an additional layer in the network dedicated to learning the noise distribution. Veit et al. [13] proposed a multi-task network to learn a mapping from noisy to clean annotations as well as learning a classifier fine-tuned on the clean set and the full dataset with reduced noise.

Instead of learning the noise model, the second group of methods concentrates on reweighting the loss function. Jiang et al. [4] utilized a long short-term memory (LSTM) network to predict sample weights given a sequence of their cost values. Wang et al. [15] designed an iterative learning approach composed of a noisy label detection module and a discriminative feature learning module, combined with a reweighting module on the softmax loss to emphasize the learning from clean labels and reduce the influence of noisy labels. Recently, a more elaborate reweighting method based on a meta-learning algorithm was proposed to assign weights to classification samples based on their gradient direction [8]. A small set of clean data is leveraged in this reweighting strategy to evaluate the gradient directions of the noisy samples and assign larger weights to samples whose gradient is closer to that of the clean dataset.

In this work, we aim to extend the idea of example reweighting explored previously for the classification problem to the task of pixel-level segmentation. We propose the first method to learn a set of spatially adaptive weight maps associated with training skin images and adjust the contribution of each pixel in the optimization of the deep network. Inspired by Ren et al. [8], the importance weights are assigned to pixels based on the pixel-wise loss gradient directions. A meta-learning approach is integrated at every training iteration to approximate the optimal weight maps of the current batch based on the CE loss on a small set of skin lesion images annotated by experts. Learning the deep skin lesion segmentation network and the spatially adaptive weight maps is performed in an end-to-end manner. Our experiments show how efficiently leveraging a small clean dataset makes a deep segmentation network robust to annotation noise.

2 Methodology

Our goal is to leverage a combination of a small set of expensive expert level annotations as well as a large set of unreliable noisy annotations, acquired from, e.g., novice dermatologists or crowdsourcing platforms, into the learning of a fully convolutional segmentation network.

FCN’s average loss.  In the setting of supervised learning, with the assumption of the availability of high-quality clean annotations for a large dataset of $N$ images and their corresponding pixel-wise segmentation maps, $\mathcal{D} = \{(X^{(i)}, Y^{(i)});\ i = 1, \dots, N\}$, the parameters $\theta$ of a fully convolutional segmentation network are learned by minimizing the negative log-likelihood of the generated segmentation probability maps in the cost function

$$\mathcal{L}(X, Y; \theta) = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{P} \sum_{p \in \Omega_i} \log \mathbb{P}\big(y_p \mid x_p; \theta\big), \tag{1}$$

where $P$ is the number of pixels in an image, $\Omega_i$ is the pixel space of image $i$, $x_p$ and $y_p$ refer, in order, to the image pixel $p$ and its ground truth label, and $\mathbb{P}$ is the predicted probability. As the same level of trust is assumed for all pixel-level annotations of this clean training data, the final value of the loss function is averaged equally over all pixels of the training images.

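FCN’s weighted loss.  As opposed to the fully supervised setting, when the presence of noise in most training data annotations is inevitable while only a limited amount of data can be verified by human experts, our training data comprise two sets: $\mathcal{D}_c = \{(X_c^{(i)}, Y_c^{(i)});\ i = 1, \dots, K\}$ with verified clean labels and $\mathcal{D}_n = \{(X_n^{(i)}, Y_n^{(i)});\ i = 1, \dots, M\}$ with unverified noisy labels. We also assume that $K \ll M$. Correspondingly, we have two losses, $\mathcal{L}_c$ and $\mathcal{L}_n$. Whereas $\mathcal{L}_c$ has equal weighting, $\mathcal{L}_n$ penalizes a log-likelihood of the predicted pixel probabilities but weighted based on the amount of noise:

$$\mathcal{L}_c(X_c, Y_c; \theta) = -\frac{1}{K} \sum_{i=1}^{K} \frac{1}{P} \sum_{p \in \Omega_i} \log \mathbb{P}\big(y_p \mid x_p; \theta\big), \tag{2}$$

$$\mathcal{L}_n(X_n, Y_n; \theta, W) = -\sum_{i=1}^{M} \sum_{p \in \Omega_i} w_p^{(i)} \log \mathbb{P}\big(y_p \mid x_p; \theta\big), \tag{3}$$

where $w_p^{(i)}$ is the weight associated with pixel $p$ of image $i$. All the weights of the pixels of image $i$ are collected in a spatially adaptive weight map $W^{(i)} = \{w_1^{(i)}, \dots, w_P^{(i)}\}$, and the weight maps associated with all noisy training images $X_n$ are collected in $W = \{W^{(1)}, \dots, W^{(M)}\}$.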
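To make the difference between the equally weighted loss and the spatially weighted loss concrete, the following is a minimal sketch for a single image with binary (lesion vs. background) labels. The function names and the flat-list image representation are illustrative assumptions, not part of the original implementation.

```python
import math

def average_ce(probs, labels):
    """Equally weighted per-pixel cross-entropy for one image, as in Eqs. (1)-(2).

    probs[p] is the predicted lesion probability of pixel p, labels[p] is 0 or 1.
    """
    total = 0.0
    for prob, label in zip(probs, labels):
        p_true = prob if label == 1 else 1.0 - prob
        total += -math.log(p_true)
    return total / len(probs)

def weighted_ce(probs, labels, weights):
    """Spatially weighted cross-entropy for one image, as in Eq. (3)."""
    total = 0.0
    for prob, label, w in zip(probs, labels, weights):
        p_true = prob if label == 1 else 1.0 - prob
        total += -w * math.log(p_true)
    return total

probs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
uniform = weighted_ce(probs, labels, [1 / 3] * 3)         # equals the average loss
masked = weighted_ce(probs, labels, [0.5, 0.5, 0.0])      # third pixel is ignored
```

With uniform weights that sum to one, the weighted loss reduces to the average loss; a zero weight removes a pixel's contribution entirely, which is how an untrusted label can be switched off.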

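Model optimization.  The deep noise-robust network parameters $\theta$ are now found by optimizing the weighted objective function $\mathcal{L}_n$ (as opposed to equal weighting in (1)) on the noisy annotated data $\mathcal{D}_n$, as follows:

$$\theta^{*}(W) = \arg\min_{\theta} \mathcal{L}_n(X_n, Y_n; \theta, W). \tag{4}$$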


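Optimal spatially adaptive weights.  The optimal value of the unknown parameters $W$ is achieved by minimizing the expectation of negative log-likelihoods in the meta-objective function $\mathcal{L}_c$ over the clean training data $\mathcal{D}_c$:

$$W^{*} = \arg\min_{W,\ W \ge 0} \mathcal{L}_c\big(X_c, Y_c; \theta^{*}(W)\big). \tag{5}$$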


Efficient meta-training.  Solving (5) to optimize the spatially adaptive weight maps for each update step of the network parameters in (4) is inefficient. Instead, an online meta-learning approach is utilized to approximate $W$ for every gradient descent step involved in optimizing (4). At every update step $t$ of (4), we pass a mini-batch of noisy data forward through the network and then compute one gradient descent step toward the minimization of $\mathcal{L}_n$:

$$\hat{\theta} = \theta^{(t)} - \alpha \nabla_{\theta} \mathcal{L}_n\big(X_n, Y_n; \theta^{(t)}, W_0\big), \tag{6}$$

where $\alpha$ is the gradient descent learning rate and $W_0$ is the initial set of spatial weight maps, set to zero. Next, a mini-batch of clean data is fed forward through the network with parameters $\hat{\theta}$ and the gradient of $\mathcal{L}_c$ with respect to the current batch weight maps $W_0$ is computed. We then take a single step toward the minimization of $\mathcal{L}_c$, as per (5), and pass the output to a rectifier function as follows:

$$U = W_0 - \beta \nabla_{W_0} \mathcal{L}_c\big(X_c, Y_c; \hat{\theta}\big), \tag{7}$$

$$\hat{W} = \max(U, 0), \tag{8}$$

where $\beta$ is a gradient descent learning rate and the maximum is taken element-wise. Following the average loss over mini-batch samples in training a deep network, we normalize the learned weight maps such that $\sum_{i} \sum_{p} \hat{w}_p^{(i)} = 1$.

Equations (7) and (8) clarify how the learned weight maps prevent penalizing the pixels whose gradient direction is not similar to the direction of the gradient on the clean data. A negative element $u_p^{(i)}$ in $U$ (associated with pixel $p$ of image $i$) implies a positive gradient in (7), meaning that increasing the weight assigned to pixel $p$, $w_p^{(i)}$, increases the loss value on clean data. So by rectifying the values of $U$ in (8), we assign zero weight to pixel $p$ and prevent penalizing it in the loss function. In addition, the rectifier makes the loss non-negative (cf. (3)) and results in more stable optimization.
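The link between the sign of an element of $U$ and gradient similarity can be spelled out with the chain rule. The short derivation below is a sketch under two assumptions: the look-ahead step is the plain gradient step described above, and the per-pixel loss is written as $\ell_p^{(i)}(\theta) = -\log \mathbb{P}(y_p \mid x_p; \theta)$, a shorthand introduced here for illustration.

```latex
% Look-ahead parameters as a function of the weights of the noisy batch
\hat{\theta}(W) = \theta^{(t)} - \alpha \sum_{i} \sum_{p} w_p^{(i)} \,
                  \nabla_{\theta}\, \ell_p^{(i)}\big(\theta^{(t)}\big)

% Chain rule: sensitivity of the clean loss to a single pixel weight
\frac{\partial \mathcal{L}_c\big(\hat{\theta}(W)\big)}{\partial w_p^{(i)}}
  = \nabla_{\theta} \mathcal{L}_c\big(\hat{\theta}\big)^{\top}
    \frac{\partial \hat{\theta}}{\partial w_p^{(i)}}
  = -\alpha \, \nabla_{\theta} \mathcal{L}_c\big(\hat{\theta}\big)^{\top}
    \nabla_{\theta}\, \ell_p^{(i)}\big(\theta^{(t)}\big)

% Evaluated at W = W_0 = 0, where \hat{\theta} = \theta^{(t)}
u_p^{(i)} = 0 - \beta \,
  \frac{\partial \mathcal{L}_c}{\partial w_p^{(i)}}\bigg|_{W = W_0}
  = \alpha \beta \, \nabla_{\theta} \mathcal{L}_c\big(\theta^{(t)}\big)^{\top}
    \nabla_{\theta}\, \ell_p^{(i)}\big(\theta^{(t)}\big)
```

Under these assumptions, $u_p^{(i)}$ is proportional to the inner product between the clean-batch gradient and the gradient of the loss at pixel $p$: it is negative exactly when the two point in opposing directions, which is the case that the rectifier maps to a zero weight.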

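Once the learning of the spatially adaptive weight maps is performed, a final backward pass is needed to minimize the reweighted objective function and update the network parameters from $\theta^{(t)}$ to $\theta^{(t+1)}$:

$$\theta^{(t+1)} = \theta^{(t)} - \alpha \nabla_{\theta} \mathcal{L}_n\big(X_n, Y_n; \theta^{(t)}, \hat{W}\big). \tag{9}$$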
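One full training iteration (look-ahead step, weight estimation, rectification, normalization, and the final reweighted update) can be sketched on a toy model. The example below is not the U-Net implementation: it assumes a hypothetical per-pixel logistic model $\mathbb{P}(y = 1 \mid x) = \sigma(a x + b)$ with a scalar pixel intensity $x$, and it uses the fact that, for zero initial weights and a single gradient look-ahead step, the chain rule gives $u_p = \alpha \beta \langle \nabla_\theta \mathcal{L}_c, \nabla_\theta \ell_p \rangle$, so no automatic differentiation is needed. All function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pixel_grad(theta, x, y):
    """Gradient of the per-pixel cross-entropy w.r.t. theta = (a, b)."""
    a, b = theta
    err = sigmoid(a * x + b) - y
    return (err * x, err)

def reweighted_step(theta, noisy, clean, alpha, beta):
    """One iteration on toy data; noisy and clean are lists of (x, y) pixels."""
    # Per-pixel gradients on the noisy mini-batch at the current parameters.
    grads = [pixel_grad(theta, x, y) for x, y in noisy]
    # With zero initial weights the look-ahead step (6) leaves theta unchanged,
    # so the clean-batch gradient is evaluated at theta itself.
    clean_grads = [pixel_grad(theta, x, y) for x, y in clean]
    gc = (sum(g[0] for g in clean_grads) / len(clean),
          sum(g[1] for g in clean_grads) / len(clean))
    # Step (7) in closed form: u_p = alpha * beta * <clean gradient, pixel gradient>.
    u = [alpha * beta * (gc[0] * g[0] + gc[1] * g[1]) for g in grads]
    # Rectifier (8), followed by normalization so that the weights sum to one.
    w = [max(v, 0.0) for v in u]
    total = sum(w)
    if total > 0.0:
        w = [v / total for v in w]
    # Final reweighted update (9).
    a = theta[0] - alpha * sum(wi * g[0] for wi, g in zip(w, grads))
    b = theta[1] - alpha * sum(wi * g[1] for wi, g in zip(w, grads))
    return (a, b), w

clean = [(1.0, 1), (-1.0, 0)]              # bright pixels are lesion
noisy = [(2.0, 1), (-2.0, 0), (2.0, 0)]    # the third pixel is mislabelled
theta, weights = reweighted_step((0.0, 0.0), noisy, clean, alpha=0.1, beta=1.0)
```

In this toy batch the mislabelled pixel has a gradient that opposes the clean-batch gradient, so it receives zero weight and the update is driven only by the two consistent pixels.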


3 Experiments and Discussion

Data description.  We validated our spatially adaptive reweighting approach on data provided by the International Skin Imaging Collaboration (ISIC) in 2017 [1]. The data consists of training, validation and test images with their corresponding segmentation masks. The same split of validation and test data is used for setting the hyper-parameters and reporting the final results. We resized all images to a fixed size and normalized each RGB channel with the per-channel mean and standard deviation of the training data.
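The per-channel normalization can be sketched as follows. This is a minimal illustration with assumed names, in which an image is a list of (R, G, B) pixel tuples and the statistics are the population mean and standard deviation over all training pixels; whether the original pipeline used population or sample statistics is not stated.

```python
import math

def channel_stats(images):
    """Per-channel mean and population std over all pixels of all training images."""
    n = sum(len(img) for img in images)
    means = [sum(px[c] for img in images for px in img) / n for c in range(3)]
    stds = [
        math.sqrt(sum((px[c] - means[c]) ** 2 for img in images for px in img) / n)
        for c in range(3)
    ]
    return means, stds

def normalize(image, means, stds):
    """Subtract the training mean and divide by the training std, per channel."""
    # A zero std (constant channel) falls back to 1.0 to avoid division by zero.
    return [
        tuple((px[c] - means[c]) / (stds[c] or 1.0) for c in range(3))
        for px in image
    ]

train_images = [[(0.0, 0.0, 0.0), (2.0, 4.0, 6.0)]]
means, stds = channel_stats(train_images)
normalized = normalize(train_images[0], means, stds)
```

The same training-set statistics would be applied unchanged to validation and test images.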

To create noisy ground truth annotations, we consider a lesion boundary as a closed polygon and simplify it by reducing its number of vertices: less important vertices are discarded first, where the importance of each vertex is proportional to the acuteness of the angle formed by the two adjacent polygon line segments and their length. We generate 7-vertex, 3-vertex and axis-aligned-vertex polygons to represent different levels of annotation noise in our experiments. To simulate an unsupervised setting, as an extreme level of noise, we automatically generated segmentation maps that cover the whole image (excluding a thin band around the image perimeter). Fig. 1 shows a sample lesion image and its associated ground truth as well as the generated noisy annotations.
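The vertex-reduction idea can be sketched as a greedy loop that repeatedly removes the least important vertex. The exact importance measure used for the experiments is not specified beyond depending on the angle and the lengths of the adjacent segments, so this sketch substitutes the area of the triangle spanned by a vertex and its two neighbours (a Visvalingam-Whyatt style score, which also depends on both quantities). The function names are illustrative.

```python
def triangle_area(a, b, c):
    """Area of the triangle formed by vertex b and its neighbours a and c."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def simplify_polygon(vertices, n_keep):
    """Greedily drop the least important vertex until n_keep vertices remain."""
    if n_keep < 3:
        raise ValueError("a polygon needs at least 3 vertices")
    pts = list(vertices)
    while len(pts) > n_keep:
        n = len(pts)
        # Stand-in importance score: small area means a nearly redundant vertex.
        scores = [
            triangle_area(pts[i - 1], pts[i], pts[(i + 1) % n]) for i in range(n)
        ]
        del pts[scores.index(min(scores))]
    return pts

# A square whose edges carry redundant midpoints.
boundary = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
coarse = simplify_polygon(boundary, 4)
```

Rasterizing the simplified polygon back into a mask would then give the noisy segmentation map; fewer retained vertices yield a coarser, noisier annotation.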

Figure 1: A skin image and its clean and various noisy segmentation maps.


  We use the PyTorch framework to implement the deep reweighting network. We adopt the architecture of the fully convolutional network U-Net, initialized by a random Gaussian distribution. We use the stochastic gradient descent algorithm for learning the network parameters from scratch as well as the spatial weight maps over mini-batches of noisy and clean data. We set the same initial learning rate for both $\alpha$ and $\beta$ and reduce it by a constant factor when the validation performance stops improving. We also use momentum and weight decay. Training the deep reweighting network took three days on our GPU.

Reweighting vs. Fine-tuning.  One popular way of training a deep network when a small set of clean data as well as a large set of noisy data are available is to pre-train the network on the noisy dataset and then fine-tune it using the clean dataset. Considering this fine-tuning approach as our baseline, by learning spatially adaptive weight maps we expect to leverage clean annotations more effectively and achieve improved performance. To study the performance of the fine-tuning and reweighting approaches, we start with images annotated by simplified polygons and gradually replace some of the noisy annotations with expert-level clean annotations, i.e., increase the clean set size $K$. We report the segmentation Dice score on the test set in Fig. 2. The first (leftmost) point on the fine-tuning curve indicates the result of U-Net when all annotations are noisy and the last point corresponds to a fully-supervised U-Net. When all annotations are either clean or noisy, training the reweighting network is not applicable. Comparing the fine-tuning and reweighting results, we observe a consistent improvement in the test Dice score when the proposed reweighting algorithm is deployed. In particular, the larger improvement when the clean set is smaller signifies our method’s ability to effectively utilize even a handful of clean samples.
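For reference, the Dice score between a predicted binary mask and a ground truth mask is twice the overlap divided by the sum of the two mask sizes. The sketch below uses flat lists of 0/1 values; returning 1.0 when both masks are empty is an assumed convention, not something stated in the text.

```python
def dice_score(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for binary masks."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / size

pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```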

Figure 2: Test Dice score comparison for fine-tuning and reweighting models.

Size of the clean dataset.  Fig. 2 shows the effect of the clean data size, $K$, on the reweighting network performance. Our results show that leveraging even a small number of clean annotations in the reweighting algorithm improves the test Dice score in comparison to training U-Net on all noisy annotations. Also, with a limited number of clean annotations, the reweighting algorithm achieves a test Dice score (80%) almost equal to that of the fully supervised approach. With only 100 clean image annotations, the reweighting method outperforms the fully-supervised approach trained with 2000 clean annotations. Incrementing $K$ from 50 to 1990, the reweighting approach improves the test Dice score only slightly, questioning whether such an increase in accuracy is worth the 40-fold increase in annotation effort. Outperforming the supervised setting using the reweighting algorithm suggests that the adaptive loss reweighting strategy works like a regularizer and improves the generalization ability of the deep network.

Robustness to noise.  In our next experiment, we examine how the level of noise in the training data affects the performance of the reweighting deep network in comparison to fine-tuning. We utilized four sets of segmentation maps: (i) 7-vertex, (ii) 3-vertex and (iii) axis-aligned-vertex simplified polygons, and (iv) unsupervised coarse segmentation masks, where each set corresponds to a level of annotation noise (Fig. 1). With the sizes of the clean and noisy sets, $K$ and $M$, held fixed, the segmentation Dice scores of the test images for the reweighting and fine-tuning approaches are reported in Table 1. We observe that deploying the proposed reweighting algorithm for 3-vertex annotations outperforms learning from accurate delineations without reweighting. Also, increasing the level of noise, from 7-vertex to 3-vertex polygon masks in the noisy data, results in just a 1% Dice score drop when deploying reweighting compared to a 3% drop with fine-tuning.

     noise type                      fine-tuning   proposed reweighting
 A   no noise (fully-supervised)     78.63%        not applicable
 B   7-vertex                        76.12%        80.72%
 C   axis-aligned-vertex             75.04%        80.29%
 D   3-vertex                        73.02%        79.45%
 E   maximal (unsupervised)          70.45%        73.55%
Table 1: Dice score using fine-tuning and reweighting methods for various noise levels.
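The comparisons drawn from Table 1 can be checked with a few lines of arithmetic. The snippet below only re-uses the Dice scores reported in rows B to E of the table; the variable names are illustrative.

```python
# (fine-tuning, reweighting) test Dice scores in percent, rows B to E of Table 1.
table1 = {
    "B": (76.12, 80.72),
    "C": (75.04, 80.29),
    "D": (73.02, 79.45),
    "E": (70.45, 73.55),
}
fully_supervised = 78.63  # row A, fine-tuning column

# Gain of reweighting over fine-tuning at each noise level, in Dice points.
gains = {row: round(rw - ft, 2) for row, (ft, rw) in table1.items()}

# Drop in Dice when moving from row B to the noisier row D, per method.
drop_finetune = round(table1["B"][0] - table1["D"][0], 2)
drop_reweight = round(table1["B"][1] - table1["D"][1], 2)

# Rows where reweighting on noisy data beats the fully supervised baseline.
beats_supervised = [row for row, (_, rw) in table1.items() if rw > fully_supervised]
```

The gain is between roughly 3 and 6.5 Dice points at every noise level, and the drop from row B to row D is about 1.3 points with reweighting versus 3.1 points with fine-tuning, in line with the 1% and 3% figures quoted in the text.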

Qualitative results.  To examine the spatially adaptive weights more closely, for some sample images, we overlay the learned weight maps, at training iterations 1K and 100K, over the mask of incorrectly annotated pixels (Fig. 3). To avoid overfitting to annotation noise, we expect the meta-learning step to assign zero weights to noisy pixels (the white pixels in Fig. 3-(d)). Fig. 3-(e,f) confirms that the model consistently learns to assign zero (or very close to zero) weights to noisy annotated pixels (cyan pixels), which ultimately results in the prediction of the segmentation maps in Fig. 3-(c) that, qualitatively, closely resemble the unseen expert-delineated contours shown in Fig. 3-(a).

Figure 3: (a) Sample skin images and expert lesion delineations (thin black contour), (b) noisy ground truth, (c) network output, (d) the erroneously labelled pixels (i.e. noisy pixels) and learned weight maps in iterations (e) 1K and (f) 100K overlaid over the noisy pixel masks using the following coloring scheme: Noisy pixels are rendered via the blue channel: mislabelled pixels are blue, and weights via the green channel: the lower the weight the greener the rendering. The cyan color is produced when mixing green and blue, i.e. when low weights (green) are assigned to mislabelled pixels (blue). Note how the cyan very closely matches (d), i.e. mislabelled pixels are ca. null-weighted.

4 Conclusion

By learning a spatially-adaptive map to perform pixel-wise weighting of a segmentation loss, we were able to effectively leverage a limited amount of cleanly annotated data in training a deep segmentation network that is robust to annotation noise. We demonstrated, on a skin lesion image dataset, that our method can greatly reduce the requirement for careful labelling of images without sacrificing segmentation accuracy. Our reweighting segmentation network is trained end-to-end, can be combined with any segmentation network architecture, and does not require any additional hyper-parameter tuning.

Acknowledgments. We thank NVIDIA Corporation for the donation of Titan X GPUs used in this research and Compute Canada for partial funding.