
Weakly Supervised Segmentation with Multi-scale Adversarial Attention Gates

Large, fine-grained image segmentation datasets, annotated at pixel-level, are difficult to obtain, particularly in medical imaging, where annotations also require expert knowledge. Weakly-supervised learning can train models by relying on weaker forms of annotation, such as scribbles. Here, we learn to segment using scribble annotations in an adversarial game. With unpaired segmentation masks, we train a multi-scale GAN to generate realistic segmentation masks at multiple resolutions, while we use scribbles to learn the correct position in the image. Central to the model's success is a novel attention gating mechanism, which we condition with adversarial signals to act as a shape prior, resulting in better object localization at multiple scales. We evaluated our model on several medical (ACDC, LVSC, CHAOS) and non-medical (PPSS) datasets, and we report performance levels matching those achieved by models trained with fully annotated segmentation masks. We also demonstrate extensions in a variety of settings: semi-supervised learning; combining multiple scribble sources (a crowdsourcing scenario) and multi-task learning (combining scribble and mask supervision). We will release expert-made scribble annotations for the ACDC dataset, and the code used for the experiments, at






I Introduction

Convolutional Neural Networks (CNNs) have obtained impressive results in computer vision. However, their ability to generalize to new examples is strongly dependent on the amount of training data, which limits their applicability when annotations are scarce. As a result, considerable effort has been placed in exploiting semi-supervised and weakly-supervised strategies. For semantic segmentation, semi-supervised learning (SSL) aims to use unlabeled images, which are generally easier to collect, together with some fully annotated image-segmentation pairs [chapelle2009semi, cheplygina2019not]. However, the information inside the unlabeled data can improve CNNs only under specific assumptions [chapelle2009semi], and SSL still requires representative image-segmentation pairs to be available.

Alternatively, weakly-supervised approaches [zhou2019prior, khoreva2017simple, can2018learning, souly2017semi] attempt to train models relying only on weak annotations (e.g., image-level labels, sparse pixel annotations, or noisy annotations [tajbakhsh2020embracing]), which should be considerably easier to obtain. Thus, building large-scale annotated datasets becomes feasible and the generalization capability of the model per annotation effort can dramatically increase: e.g., 15 times more bounding boxes can be annotated within the same time compared to segmentation masks [lin2014microsoft]. Among weak annotations, scribbles are of particular interest for medical image segmentation, because they are easier to generate and well suited for annotating nested structures [can2018learning]. Unfortunately, weak annotations do not provide as strong a supervisory signal as fine-grained per-pixel segmentation masks, and training the models is harder. For this reason, improved training strategies would enable remarkable gains in training models with weaker annotations.

Figure 1: In an adversarial game, our model learns to generate segmentation masks that look realistic at multiple scales and overlap with the available scribble annotations. Loopy arrows in the figure, on the segmentor, represent the proposed attention gates, which under adversarial conditioning suppress irrelevant information in the extracted feature maps.

I-A Overview of the proposed approach

In this paper, we introduce a novel training strategy in the context of weakly supervised learning for multi-part segmentation. We train a model for semantic segmentation using scribbles, shaping the training procedure as an adversarial game [goodfellow2014generative] between a conditional mask generator (the segmentor) and a discriminator. We obtain segmentation performance comparable to when training the segmentor with full segmentation masks. We demonstrate this for the segmentation of the heart, abdominal organs, and human pose parts.

The distinctive feature of our approach is that we use adversarial feedback at all scales, coupling the generator with a multi-scale discriminator. Unlike other multi-scale GANs [denton2015deep, karras2017progressive], our generator includes customized attention gates, i.e. modules that automatically produce soft region proposals in the feature maps, highlighting the salient information inside them. Unlike the attention gates presented in [schlemper2019attention], ours are conditioned by adversarial signals, which enforce stronger object localization in the image.

The discriminator, acting as a learned shape prior, is trained on a set of segmentation masks obtained from a different data source (we simulate a realistic clinical setting, e.g., mixing MR datasets acquired with the same plane on different scanners, or using different acquisition protocols), and is thus unpaired. We drive the segmentor to generate accurate segmentations from the input images, while satisfying the multi-scale shape prior learned by the discriminator. We encourage a tight multi-level interaction between segmentor and discriminator by introducing Adversarial Attention Gating, an effective attention strategy that, subject to adversarial conditioning, i) encourages the segmentor to predict masks satisfying multi-resolution shape priors; and ii) forces the segmentor to train deeper layers better. Finally, we also penalize the segmentor when it predicts segmentations that do not overlap with the available scribbles, pushing it to learn the correct mapping from images to label maps.

We summarize the contributions of the paper as follows (we also release our code and the expert-made scribble annotations for ACDC data):

  • We use scribble annotations to learn semantic segmentation during a multi-scale adversarial game.

  • We introduce Adversarial Attention Gates (AAGs): effective prior-driven attention gates that force the segmentor to localize objects in the image. Subject to adversarial gradients, AAGs also encourage a better training of deeper layers in the segmentor.

  • We obtain state-of-the-art performance on several popular medical datasets (ACDC [bernard2018deep], LVSC [suinesiaputra2014collaborative] and CHAOS [chaos]) and computer vision data (PPSS [luo2013pedestrian]).

  • We investigate diverse learning scenarios, such as: learning from different extents of weak annotations (i.e., semi-supervised learning); learning from multiple scribbles per image (and thus simulating a crowdsourcing setting); and finally learning also with few strong supervision pairs of segmentation masks and images (i.e., multi-task learning).

II Related Work

A large body of research has aimed at developing learning algorithms that rely less on high-quality annotations [cheplygina2019not, tajbakhsh2020embracing]. Below, we briefly review recent weakly supervised methods that use scribbles to learn image segmentation. Then, we discuss the advantages of our adversarial setup compared to other multi-scale GANs. Finally, we discuss the difference between the attention gates that are an integral part of our segmentor and other canonical attention modules.

II-A Learning from Scribbles

Scribbles are sparse annotations that have been successfully used for semantic segmentation, reporting near full-supervision accuracy in computer vision and medical image analysis. However, scribbles lack information on the object structure, and they are limited by the uncertainty of unlabelled pixels, which makes training CNNs harder, especially in boundary regions [lin2016scribblesup]. For this reason, many approaches have tried to expand scribble annotations by assigning the same class to pixels with similar intensity and nearby position [lin2016scribblesup, ji2019scribble]. At first, these approaches relabel the training set propagating annotations from the scribbles to the adjacent pixels using graph-based methods. Then, they train a CNN on the new label maps. A recent variant has been introduced by Can et al. [can2018learning], who suggest estimating the class of unlabelled pixels via a learned two-step procedure. At first, they train a CNN directly with scribbles; then, they relabel the training set by refining the CNN predictions with Conditional Random Fields (CRF); finally, they retrain the CNN on the new annotations.

The major limitation of the aforementioned approaches is relying on dataset relabeling, which can be time-consuming and is prone to errors that can be propagated to the models during training. Thus, many authors [can2018learning, tang2018regularized] have investigated alternatives that avoid this step, post-processing the model predictions with CRF [chen2017deeplab] or introducing CRF as a trainable layer [zheng2015conditional]. Tang et al. [tang2018regularized] have also demonstrated the possibility to substitute the CRF-based refining step, directly training a segmentor with a CRF-based loss regularizer.

Similarly, here we propose a method that avoids the data relabeling step. We train our model to directly learn a mapping from images to segmentation masks, and we remove expensive CRF-based post-processing. We cope with unlabelled regions of the image by introducing a multi-scale adversarial loss which, differently from the loss introduced by Tang et al. [tang2018regularized], does not rely on CRF, and can handle both long-range and short-range inconsistencies in the predicted masks. (In semantic segmentation, there has been considerable interest in introducing shape priors to improve the training of CNNs [clough2019topological, dalca2018anatomical, kervadec2019constrained, kuo2019shapemask, chartsias2019disentangled]. For brevity, we won’t discuss these methods here. In this paper, we will focus on a particular kind of shape prior, learned by a multi-scale GAN.)

II-B Multi-scale GANs

Multi-scale GANs have been used in many generative tasks [denton2015deep, karras2017progressive, karnewar2020msg] for their ability to learn global and local dependencies inside an image. Herein, we use the generator as a segmentor, which we train to generate realistic segmentation masks at multiple resolutions. Recently, other methods have introduced multi-scale adversarial losses for semantic segmentation. For example, Xue et al. [xue2018segan] propose to use the discriminator as a critic, measuring the $\ell_1$-distance between real and fake inputs in feature space, at multiple resolution levels. In particular, pairs of real and fake inputs consist of the Hadamard product between an image and the associated ground truth or predicted segmentation mask, respectively. Luo et al. [luo2018macro] have also tried to distinguish real from fake input pairs at multiple scales, suggesting the use of two separate discriminators (one working at high, one at low resolution) to distinguish the concatenation of the image with the associated ground truth or predicted segmentation, respectively.

Unfortunately, these approaches rely on image-segmentation pairs to train the discriminator. Thus, training the segmentor with unlabelled, or weakly annotated data is not possible. Instead, we train a discriminator using only masks, making the model suitable for semi- and weakly-supervised learning.

Finally, while previous approaches use multi-scale GANs with strong annotations, this is, to the best of our knowledge, the first work to explore their use in weakly-supervised learning. Furthermore, we alter the canonical interplay between discriminator and segmentor to improve object localization in the image, which we obtain with a novel adversarial conditioning of the attention maps learned by the segmentor.

II-C Attention Gates

Due to the ability to suppress irrelevant and ambiguous information, attention gates have become an integral part of many sequence modeling [vaswani2017attention] and image classification [Jetley2018] frameworks. Recently, they have also been successfully employed in semantic segmentation [schlemper2019attention, oktay2018attention, wang2018deep, sinha2020multi], along with the claim that gating helps to localize objects in the image. However, standard approaches don’t incorporate any explicit constraint in the learned attention maps, which are generally predicted by the neural network autonomously. On the contrary, we show that conditioning the attention maps to be semantic, i.e., able to localize and distinguish separate objects in the image, considerably boosts the segmentation performance. Herein, we introduce a novel attention module named Adversarial Attention Gate (AAG), whose learning is conditioned by a discriminator.

III Proposed Approach

In this section, we first describe the adopted notation, and then we present a general overview of the proposed method. Finally, we detail model architectures and training objectives.


For the remainder, we will use italic lowercase letters to denote scalars $s$. Two-dimensional images (matrices) will be denoted with bold lowercase letters, as $\mathbf{x} \in \mathbb{R}^{n \times m}$, where $n, m \in \mathbb{N}$ are scalars denoting dimensions. Tensors $T \in \mathbb{R}^{n_1 \times \dots \times n_k}$ are denoted as uppercase letters, where $n_i \in \mathbb{N}$. Finally, capital Greek letters will denote functions $\Phi(\cdot)$.

We will assume a weakly supervised setting, where we have access to: i) image-scribble pairs $(\mathbf{x}, \mathbf{y}_s)$, $\mathbf{x}$ being the image and $\mathbf{y}_s$ the associated scribble; ii) unlabelled images $\mathbf{x}$; and iii) a set of segmentation masks $\mathbf{y}$ unrelated to any of the images. (In Section V-D, we will also investigate a mixed setting, where we additionally have: iv) pairs of image-segmentation masks $(\mathbf{x}, \mathbf{y})$.)

Figure 2: Model architectures. Top: segmentor and discriminator interact at multiple scales. Bottom: implementation details of the convolutional blocks. The yellow background in the bottom left indicates the Adversarial Attention Gate (AAG).

III-A Method Overview

We formulate the training of a CNN with weak supervision (i.e., scribbles) as an adversarial game. Particularly, we use an adversarial discriminator to learn a multi-resolution shape prior, and we enforce a mask generator, or segmentor, to satisfy it, supported by the purposely designed adversarial attention gates. Critically, AAGs localize the objects to segment at multiple resolution levels and suppress noisy activations in the remaining parts of the image (see Fig. 2).

In detail, we jointly train a multi-scale segmentor $\Sigma(\cdot)$ and a multi-scale adversarial discriminator $\Delta(\cdot)$. The segmentor $\Sigma$ is trained with supervision to predict segmentation masks that overlap with the scribble annotations, when available. Meanwhile, the discriminator $\Delta$ learns to distinguish real segmentation masks from those (fake) predicted by the segmentor (i.e., $\mathbf{y}$ vs. $\tilde{\mathbf{y}} = \Sigma(\mathbf{x})$) [goodfellow2014generative], at multiple scales. We model both $\Sigma$ and $\Delta$ as CNNs.

III-B Architectures


We modify a UNet [ronneberger2015u] to include AAG modules in the decoder and to allow collaborative training between segmentor and discriminator at multiple scales (Fig. 2). We leave the UNet encoder as in the original framework, which lets us extract feature maps at multiple depth levels and propagate them to the decoder via skip connections and concatenation [ronneberger2015u]. Instead, we alter the decoder such that, for every depth level $d$, after the two convolutional layers, an AAG first produces an attention map as the probabilistic prediction of a classifier (detailed below), then uses it to filter out activations from the input feature map. Particularly, we use convolutional layers whose filters span the $k$ input channels, and produce the feature map $M^{(d)}$. Then, the AAG classifier uses $M^{(d)}$ to predict a segmentation $\tilde{\mathbf{y}}^{(d)}$ at the given resolution level $d$. As a classifier, we use a convolutional layer with $c$ filters (where $c$ is the number of possible classes, including the background). We do not apply any argmax operation on its prediction; instead, we use a pixel-wise softmax to give a probabilistic interpretation of the output. As a result, every pixel is associated with a probability of belonging to each considered class, which is important to have smoother gradients on the learned attention maps. We then slice the predicted array, removing the channel associated with the background, and we use the multi-channel soft segmentation: i) as input to the discriminator at the same depth level; and ii) to produce an attention map, obtained by summing up the remaining channels into a 2D probabilistic map $\boldsymbol{\alpha}^{(d)}$, localizing object positions in the image (Fig. 2). To force the segmentor to use $\boldsymbol{\alpha}^{(d)}$, we multiply the extracted features $M^{(d)}$ with $\boldsymbol{\alpha}^{(d)}$ using the Hadamard product (gating process). The resulting feature maps are upsampled to the next resolution level via nearest-neighbor interpolation. After each convolutional layer, we use batch normalization [ioffe2015batch] and a ReLU activation function.
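The gating step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the channel-last layout, and the choice of channel 0 as background are assumptions made only for illustration.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def adversarial_attention_gate(features, class_logits, background_index=0):
    """Gate a decoder feature map with a soft attention map.

    features:     (H, W, k) feature map produced by the decoder convolutions.
    class_logits: (H, W, c) classifier output before the softmax.
    background_index is an assumption: the text does not say which
    channel holds the background class.
    """
    probs = softmax(class_logits, axis=-1)                     # (H, W, c)
    foreground = np.delete(probs, background_index, axis=-1)   # (H, W, c-1)
    attention = foreground.sum(axis=-1, keepdims=True)         # (H, W, 1)
    gated = features * attention          # Hadamard product, broadcast over k
    # `foreground` is the soft segmentation that would be sent to the
    # discriminator at the same depth level.
    return gated, foreground, attention[..., 0]
```

Because the softmax channels sum to one, the attention map equals one minus the background probability, so it stays in [0, 1] and suppresses features wherever the classifier predicts background.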


We design an encoding architecture receiving real or fake inputs at multiple scales. This allows a multi-level interaction between the segmentor $\Sigma$ and the discriminator $\Delta$, and the direct propagation of adversarial gradients into the AAGs. We refer to this multi-level interaction as Adversarial Deep Supervision (ADS), as it regularizes the output of AAG classifiers similarly to deep supervision, but using adversarial gradients.

The real samples consist of expert-made segmentations, which we supply at full or downsampled resolution at multiple discriminator depths, while fake samples are the multi-scale predictions of the segmentor. In both cases, the lower-resolution inputs are supplied to the discriminator by simply concatenating them to the feature maps it extracts at each depth (Fig. 2, right).

The discriminator is a convolutional encoder, adapted from [chartsias2019disentangled]. At every depth level, we first process and downsample the feature maps using a convolutional layer with a stride of 2. The number of filters follows that of the segmentor encoder (e.g. 32, 64, 128, 256, 512). We also use spectral normalization [miyato2018spectral] to improve the layer training. The obtained feature maps are then compressed with a second convolutional layer using 12 filters. Both layers use tanh activations.

To improve the learning process and avoid overfitting, we make the adversarial game harder for the discriminator, using label noise [salimans2016improved] and instance noise [sonderby2016amortised]. In particular, we obtain label noise by a random flip of the discriminator labels (real vs. fake) with a 10% probability, while we apply instance noise as Gaussian noise with zero mean and standard deviation of 0.2, which we add to the highest resolution input.
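As a rough illustration of the two noise mechanisms, the sketch below flips discriminator labels with probability 0.1 and adds zero-mean Gaussian noise with standard deviation 0.2. The 0/1 label encoding and the function names are assumptions; the text does not state how labels are encoded.

```python
import numpy as np

def flip_labels(labels, flip_prob=0.1, rng=None):
    """Label noise: randomly swap real (1) and fake (0) discriminator labels.

    The 0/1 encoding is an assumption made for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels, dtype=float)
    flip = rng.random(labels.shape) < flip_prob
    return np.where(flip, 1.0 - labels, labels)

def add_instance_noise(mask, std=0.2, rng=None):
    """Instance noise: add zero-mean Gaussian noise to the
    highest-resolution discriminator input."""
    rng = np.random.default_rng() if rng is None else rng
    return mask + rng.normal(loc=0.0, scale=std, size=mask.shape)
```

Both functions accept an optional generator so that experiments can be made reproducible by passing a seeded `np.random.default_rng`.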

Lastly, we compute the final prediction of the discriminator using a fully connected layer with scalar output.

III-C Loss Functions and Training Details

We train the model by minimizing supervised and adversarial objectives. In particular, we consider both contributions when scribble annotations are available for the input image, and only the latter when we are using unlabeled data.

Supervised Cost

When scribbles are available, we train the segmentor to minimize a pixel-wise classification cost on the annotated pixels of the image-scribble pair $(\mathbf{x}, \mathbf{y}_s)$, while, most importantly, we don’t propagate any loss gradient through the unlabeled pixels. Crucially, we use the pixel-wise cross-entropy because it is shape-independent, and, to resolve the class imbalance problem, we multiply the per-class loss contribution by a scaling factor that accounts for the class cardinality. We can write the supervised cost as:

$$\mathcal{L}_{WPCE} = \mathbb{1}(\mathbf{y}_s) \cdot \Big[ -\sum_{i=1}^{c} w_i \, \mathbf{y}_s[i] \, \log \tilde{\mathbf{y}}[i] \Big] \qquad (1)$$

where $i$ refers to each class, $c$ is the number of classes, and $\tilde{\mathbf{y}}$ is the segmentor prediction. We choose the class scaling factor $w_i$ based on $n_i$, the number of pixels with label $i$ within $\mathbf{y}_s$, and $n_{tot}$, the total number of annotated pixels. To avoid loss contribution on unlabeled pixels, we multiply the result by the masking function $\mathbb{1}(\mathbf{y}_s)$, which returns 1 for annotated pixels, 0 otherwise. A similar formulation, termed Partial Cross-Entropy (PCE) loss, was suggested in [tang2018normalized], but without the class balancing. Thus, we term our formulation Weighted-PCE (WPCE).
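A minimal NumPy sketch of a weighted partial cross-entropy is given below for illustration. Two details are assumptions rather than facts taken from the text: the class weight is set to `1 - n_i / n_tot` (one simple choice that down-weights frequent classes), and the loss is averaged over the annotated pixels.

```python
import numpy as np

def weighted_partial_cross_entropy(probs, scribble, eps=1e-7):
    """Cross-entropy restricted to annotated pixels, with class balancing.

    probs:    (H, W, c) softmax output of the segmentor.
    scribble: (H, W, c) one-hot labels on annotated pixels, all zeros elsewhere.
    """
    annotated = scribble.sum(axis=-1) > 0      # masking function: True if labeled
    n_tot = annotated.sum()
    if n_tot == 0:
        return 0.0                             # nothing to supervise
    n_i = scribble.reshape(-1, scribble.shape[-1]).sum(axis=0)  # pixels per class
    weights = 1.0 - n_i / n_tot                # ASSUMED form of the scaling factor
    per_pixel = -(weights * scribble * np.log(probs + eps)).sum(axis=-1)
    # Unlabeled pixels are zeroed out; averaging over n_tot is an assumption.
    return float((per_pixel * annotated).sum() / n_tot)
```

With this particular weight, an image whose scribbles contain a single class yields a zero loss, which is a limitation of the assumed formula and not a claim about the original method.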

Adversarial Cost

Adversarial objectives are the result of a minimax game [goodfellow2014generative] between segmentor and discriminator, where the discriminator $\Delta$ is trained to maximize its capability of differentiating between real and generated segmentations, and the segmentor $\Sigma$ is trained to predict segmentation masks that are good enough to trick the discriminator and minimize its performance.

To address the difficulties of training GANs, which can lead to training instability [mao2018effectiveness], we adopt the Least Square GAN objective [mao2018effectiveness], which penalizes prediction errors of the discriminator based on their distances from the decision boundary.

Given an image $\mathbf{x}$ and an unpaired segmentation mask $\mathbf{y}$, we optimize the discriminator $\Delta$ and the segmentor $\Sigma$ by minimizing their respective Least Square GAN objectives.
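For reference, the general Least Square GAN objectives of [mao2018effectiveness] can be written in this setting as below, with the segmentor $\Sigma$ in the role of the generator and $\Delta$ as the discriminator. The symbols $V_{LS}$, $a$, $b$ and $c$ are notation assumed here: $a$ and $b$ are the target labels for fake and real samples, and $c$ is the value the segmentor wants the discriminator to output for its predictions. The specific label values used in this work are not stated in the text above.

```latex
\begin{aligned}
\min_{\Delta} V_{LS}(\Delta) &= \tfrac{1}{2}\,\mathbb{E}_{\mathbf{y}}\big[(\Delta(\mathbf{y}) - b)^2\big]
  + \tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}}\big[(\Delta(\Sigma(\mathbf{x})) - a)^2\big], \\
\min_{\Sigma} V_{LS}(\Sigma) &= \tfrac{1}{2}\,\mathbb{E}_{\mathbf{x}}\big[(\Delta(\Sigma(\mathbf{x})) - c)^2\big].
\end{aligned}
```

The squared penalty grows with the distance of a sample from the target label, which is what gives non-vanishing gradients for samples that are classified correctly but lie far from the decision boundary.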


Training Strategy

We iterate the training of the model over two steps: i) optimization over a batch of weakly annotated images, and ii) optimization over a batch of unlabeled images.

When scribble annotations are available, we minimize a weighted sum of the supervised cost and the adversarial cost of the segmentor. Crucially, we maintain a fixed ratio between the amplitude of supervised and adversarial contributions throughout the entire training process, preventing one factor from prevailing over the other. We achieve this using a dynamic value for the weight of the adversarial term.

When dealing with a batch of images with no annotation, we alternately optimize the model: first minimizing the discriminator loss, then the generator loss.

We give more importance to the supervised objective than to the adversarial loss because the discriminator only evaluates whether the predicted masks look realistic, while it does not say anything about their accuracy. Besides, the supervised cost requires the segmentor to learn the correct mapping from images to segmentation masks, which is what we are interested in. Thus, when training with weak supervision, we scale the adversarial contribution to be one order of magnitude smaller than the supervised one. Similarly, on the unlabeled data, we weight the losses so that generator and discriminator are trained equally.
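One way to keep a fixed ratio between the two contributions is to rescale the adversarial term at every step by the current loss magnitudes. The sketch below shows only the arithmetic; the function names are hypothetical, and the ratio of 0.1 reflects the "one order of magnitude smaller" statement above.

```python
def adversarial_weight(sup_loss, adv_loss, ratio=0.1, eps=1e-12):
    """Return w such that |w * adv_loss| is about ratio * |sup_loss|."""
    return ratio * abs(sup_loss) / (abs(adv_loss) + eps)

def total_segmentor_loss(sup_loss, adv_loss, ratio=0.1):
    """Supervised cost plus the dynamically rescaled adversarial cost.

    In an autodiff framework the weight would typically be treated as a
    constant (no gradient flowing through it); that detail is an assumption.
    """
    weight = adversarial_weight(sup_loss, adv_loss, ratio)
    return sup_loss + weight * adv_loss
```

For example, with a supervised loss of 2.0 and an adversarial loss of 4.0, the weight is 0.05, so the adversarial term contributes 0.2, one tenth of the supervised term.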

We minimize the loss function using Adam [kingma2014adam] and a batch size of 12. Most importantly, we find that learning from a limited number of weakly annotated images can easily trap the model in sharp, bad local minima, since the training data poorly represents the actual data distribution. Thus, we promote the search for flat and more generalizable solutions using a cyclical learning rate [smith2017cyclical] with a period of 20 epochs, which we oscillate between a lower and an upper bound. Similarly to previous work with weak annotations [lin2016scribblesup, dai2015boxsup], we train the model until an early stopping criterion is met: we stop the training when the loss between predicted and real segmentations stops decreasing on a validation set.
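A cyclical learning rate with a 20-epoch period can be sketched with the triangular policy of [smith2017cyclical]. Whether the authors use the triangular variant, and which bounds they use, is not stated above, so the shape and the values passed to `lr_min` and `lr_max` in any usage are assumptions.

```python
def triangular_lr(epoch, lr_min, lr_max, period=20):
    """Triangular cyclical learning rate.

    The rate rises linearly from lr_min to lr_max over the first half of
    each period, then falls back to lr_min over the second half.
    """
    half = period / 2.0
    position = epoch % period
    fraction = 1.0 - abs(position - half) / half   # 0 at cycle ends, 1 mid-cycle
    return lr_min + (lr_max - lr_min) * fraction
```

Calling `triangular_lr(epoch, lr_min, lr_max)` once per epoch and passing the result to the optimizer reproduces the oscillation; periodically raising the rate is what lets the optimizer leave sharp minima.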

IV Experimental Setup

IV-A Data

Below, we first describe the adopted datasets; then, we detail the procedure used to generate scribble annotations; finally, we define how we construct train, validation, and test set. We consider medical and vision datasets, for the segmentation of heart, abdominal organs, and human pose parts:

  1. ACDC [bernard2018deep]. The 2017 Automatic Cardiac Diagnosis Challenge dataset contains 2-dimensional cine-MR images obtained from 100 patients using various 1.5T and 3T MR scanners and different temporal resolutions. Manual segmentations are provided for the end-diastolic (ED) and end-systolic (ES) cardiac phases for right ventricle (RV), left ventricle (LV) and myocardium (MYO). We resample the data to a resolution of 1.51 and crop or pad them to a fixed size. We normalize the images of each patient by removing the median and dividing by the interquartile range computed per volume.

  2. LVSC [suinesiaputra2014collaborative]. The Left Ventricular Segmentation Challenge dataset, part of the Cardiac Atlas Project, contains gated SSFP cine images of 100 patients, obtained from a mix of 1.5T scanner types and imaging parameters. Manual segmentations are provided for the left ventricular myocardium (MYO) in all the cardiac phases. To compare with ACDC, we only consider segmentations for ES and ED phase instants. We resample the images to the average resolution of 1.45 and crop or pad them to a fixed size. We normalize the images of each patient by removing the median and dividing by the interquartile range computed on their MRI scan.

  3. CHAOS [chaos]. It contains abdominal MR images of 20 subjects, alongside segmentation masks for their liver, kidneys, and spleen. We test the robustness of our model on the T1 in-phase and T2 images. Images are resampled to a resolution of 1.89 and cropped to a fixed size, after their intensities have been rescaled to a fixed range.

  4. PPSS [luo2013pedestrian]. To demonstrate the broad utility of our approach, we use the (non-medical) Pedestrian Parsing in Surveillance Scenes dataset. PPSS contains RGB images of pedestrians with occlusions, extracted from 171 surveillance videos, using different cameras and resolutions. Alongside the images, ground truth segmentations are provided for seven parts of the pedestrians: hair, face, upper clothes, arms, legs, shoes, and background. Since segmentations are provided at a fixed size, we resample all the images to the same spatial resolution. Moreover, we normalize images between 0 and 1, dividing them by their maximum value.
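The two intensity normalizations mentioned in the list above (median and interquartile range for the cardiac MR volumes, division by the maximum for PPSS) can be sketched as follows. The function names and the small `eps` guard are illustrative additions, not part of the described pipeline.

```python
import numpy as np

def normalize_median_iqr(volume, eps=1e-8):
    """Subtract the median and divide by the interquartile range,
    both computed over the whole volume."""
    median = np.median(volume)
    q75, q25 = np.percentile(volume, [75, 25])
    return (volume - median) / (q75 - q25 + eps)   # eps avoids division by zero

def normalize_max(image):
    """Scale an image to [0, 1] by dividing by its maximum value.

    Assumes non-negative intensities with a non-zero maximum.
    """
    return image / image.max()
```

Median and interquartile range are less sensitive to intensity outliers than mean and standard deviation, which is a common reason for preferring them in MR preprocessing.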

Scribble Generation

Experts draw scribbles in a certain way (e.g., away from border regions). A dataset containing manual scribbles helps test a method more realistically than using simulated data from automatic procedures. Thus, in ACDC, we use ITK-SNAP [itksnap] to manually draw scribbles for ES and ED phases within the available segmentation masks. We obtained separate scribbles for RV, LV, and MYO, enabling us to test against ground truth segmentations. To identify pixels belonging to the background class (BGD), we additionally draw a scribble approximately around the heart, while leaving the rest of the pixels unlabeled. Scribbles for RV, MYO, LV, BGD had an average (standard deviation) image coverage of 0.1 (0.1)%, 0.2 (0.1)%, 0.1 (0.1)% and 10.4 (8.4)%, respectively.

For CHAOS and PPSS, we obtained scribbles by eroding the available segmentation masks [rajchl2017employing]. For each object, we followed standard skeletonisation by iterative identification and removal of border pixels, until connectivity is lost. Resulting scribbles are deterministic, typically falling along the object’s midline (as with manual ones [lin2016scribblesup]).

For LVSC, since MYO is thin, a skeleton is already too good an approximation of the full segmentation mask. Instead, we opt to generate scribbles with a random walk inside the mask. For every object, we first initialize an “empty” scribble, and we define the 2D coordinates $\mathbf{p}$ of a random pixel inside the segmentation mask. Then, we iterate 2500 times the steps: i) assign to the scribble the point $\mathbf{p}$; ii) randomly “move” in the image, adding or subtracting 1 to the coordinates of $\mathbf{p}$; iii) if the new point belongs to the segmentation mask, assign the new coordinates to $\mathbf{p}$. Scribbles for MYO and BGD had an average (standard deviation) image coverage of 0.2 (0.1) % and 1.9 (0.5) %, respectively.
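The random-walk procedure can be sketched in a few lines of NumPy. The text does not specify exactly how the random move is drawn; this sketch assumes a 4-connected step, i.e. one randomly chosen coordinate is changed by plus or minus 1 at each iteration. The function name is hypothetical.

```python
import numpy as np

def random_walk_scribble(mask, n_steps=2500, rng=None):
    """Generate a scribble by a random walk constrained to a binary mask.

    mask: 2D boolean array marking the object.
    Returns a boolean array of the same shape with the visited pixels.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.asarray(mask, dtype=bool)
    scribble = np.zeros_like(mask)
    candidates = np.argwhere(mask)
    if len(candidates) == 0:
        return scribble                        # no object, empty scribble
    point = candidates[rng.integers(len(candidates))]   # random start in mask
    for _ in range(n_steps):
        scribble[point[0], point[1]] = True    # i) add the current point
        step = np.zeros(2, dtype=int)
        step[rng.integers(2)] = rng.choice([-1, 1])     # ii) assumed 4-connected move
        new_point = point + step
        inside = (0 <= new_point[0] < mask.shape[0]
                  and 0 <= new_point[1] < mask.shape[1]
                  and mask[new_point[0], new_point[1]])
        if inside:                             # iii) accept only moves inside the mask
            point = new_point
    return scribble
```

Rejected moves leave the walker in place, so the scribble never leaves the object and remains connected.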

Dataset Splits

We divided the ACDC, LVSC, CHAOS-T1 and CHAOS-T2 datasets into groups of 70%, 15% and 15% of patients for the train, validation, and test sets, respectively. Following seminal semi-supervised learning approaches [chartsias2019disentangled, salimans2016improved], we additionally split the 70% of training data into two halves, the first of which is used to train the segmentor with weak labels (image-scribble pairs), while we use only the masks of the second half to train the discriminator $\Delta$. Correlations between groups are limited by: i) splitting the data by patient, rather than by images (limiting intra-subject leakage, as masks come from different subjects [chartsias2019disentangled]); and ii) discarding images associated with masks used to train the discriminator (thus, the segmentor $\Sigma$ never sees images whose masks are used to train $\Delta$).
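A patient-level split of this kind can be sketched with the standard library alone. The function name, the dictionary keys, the fixed seed, and the rounding of group sizes are assumptions for illustration.

```python
import random

def split_patients(patient_ids, seed=0):
    """Split patients 70/15/15 and halve the training group.

    Splitting by patient (not by image) keeps all images of one subject
    in a single group, which limits intra-subject leakage.
    """
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(0.70 * len(ids))
    n_val = round(0.15 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    half = len(train) // 2
    return {
        "segmentor_train": train[:half],        # image-scribble pairs
        "discriminator_train": train[half:],    # masks only, images discarded
        "validation": val,
        "test": test,
    }
```

With 100 patients this yields 35 patients for the segmentor, 35 for the discriminator, 15 for validation, and 15 for testing.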

For LVSC and ACDC, we also use the unlabeled images available for time instants different from ED and ES cardiac phases. Lacking any annotation, these images are only used to minimize the adversarial contribution of the loss function.

For PPSS, following [luo2013pedestrian], we use the video scenes from the last 71 cameras as test set, while we split images from the first 100 cameras to train (90% of images) and validate (10% of images) the model. As with the medical datasets, we further divide the training volumes into two halves, and we use one of them to exclusively train the discriminator, using the segmentation masks and discarding the associated images.

IV-B Baseline, Benchmark Methods and Upper Bounds

We evaluate the robustness of our method in terms of segmentation performance. We compare the results with:

  • UNetPCE and UNetWPCE [tang2018normalized]: The UNet [ronneberger2015u] is one of the most common choices for training with fully annotated segmentation masks. We evaluate its behavior when trained with the PCE loss proposed for scribble supervision in [tang2018normalized], or the WPCE loss introduced in (1).

  • UNetCRF: We also consider the previous UNetWPCE whose prediction is further processed by a CRF as RNN layer [chen2017deeplab, zheng2015conditional], because this can be trained end-to-end and does not require relabeling the training set. For ACDC and LVSC, we train the RNN using the same hyperparameters employed for cardiac segmentation in [can2018learning]. For the other datasets, we use the hyperparameters proposed in the original CRF as RNN [zheng2015conditional].

  • TS-UNetCRF: We compare our model to the two-steps procedure in [can2018learning]. We use the variant modeling CRF as a learnable layer [zheng2015conditional] rather than a separate post-processing step, because no relevant difference was observed between the two, and this is simpler to use at inference.

We specify that these approaches do not exploit unlabeled images, nor discriminators during training. Here, we investigate their performance in several training scenarios, including a varying number of available scribbles, and we compare to our model that, instead, can also leverage the information inside unlabeled images. A comparison with segmentors using also adversarial discriminators is reported during the ablation study (Section V-E). Finally, we consider two upper bounds, based on training with fully annotated segmentation masks:

  • UNetUB: UNet trained with strong annotations. In this case, we train the UNet in a fully-supervised way using image-segmentation pairs and a weighted cross-entropy loss (with per-class weights defined as in (1)).

  • UNetDUB: UNet as before, but with an additional vanilla mask discriminator, used to train on the unlabeled images. The discriminator is the same as that of our model, but it receives an input only at the highest resolution.

To compare methods, we always use the same UNet segmentor, learning rate, batch size, and early stopping criterion. We also use the same data splits during cross-validation. If a method does not use a discriminator, we simply discard the data we would have used to train the discriminator. Moreover, similar to Can et al. [can2018learning], we train the CRF as RNN layer of TS-UNetCRF with a learning rate smaller than that used for the UNet training, and we update the RNN weights only every 10 iterations.


To measure segmentation performance, we use the Dice score. To assess whether an improvement is statistically significant, we use the non-parametric Wilcoxon test, and we mark results with one (*) or two (**) asterisks according to the significance level reached. To avoid multiple comparisons, we compare our method only with the best model among the benchmarks.
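For reference, the Dice score between a predicted and a ground-truth binary mask can be computed as in the following minimal sketch. Masks are assumed to be flat lists of 0/1 values, and the convention of returning 1 when both masks are empty is an assumption of this example rather than something stated in the text.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Assumed convention: two empty masks are treated as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3)
```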

V Experiments and Discussion

Here we present and discuss the performance of our method in various experimental scenarios. Our primary question is: Can scribbles replace per-pixel annotations (Section V-A); and what happens when we have fewer scribble annotations (Section V-B)? Then, we consider two natural questions that extend the applicability of our approach: Can we learn from multiple scribbles per training image (Section V-C)? Can we mix per-pixel annotations with scribbles during training (Section V-D)? Finally, we ask: Why does Adversarial Attention Gating work (Section V-E)?

Figure 3: Examples of predicted segmentation masks for the considered methods on each task. Observe that our approach (rightmost column) learns spatial relationships in the image, thus preventing the prediction of isolated pixels in the mask, as well as unrealistic spatial relationships among the object parts.

Supervision  Model        ACDC          LVSC          CHAOS-T1      CHAOS-T2      PPSS
Scribbles    UNetPCE      79.0 (06)     62.3 (09)     34.4 (06)     37.5 (06)     71.9 (04)
Scribbles    UNetWPCE     69.4 (07)     59.1 (07)     40.0 (05)     52.1 (05)     69.3 (04)
Scribbles    UNetCRF      69.6 (07)     60.4 (08)     40.5 (05)     44.7 (06)     68.8 (04)
Scribbles    TS-UNetCRF   37.3 (08)     50.5 (07)     29.3 (05)     27.6 (05)     67.1 (04)
Scribbles    Ours         **84.3 (04)   **65.5 (08)   *56.8 (05)    57.8 (04)     **74.6 (04)
Masks        UNetUB       82.0 (05)     67.2 (07)     60.8 (06)     58.6 (01)     72.8 (04)
Masks        UNetDUB      83.9 (05)     67.9 (09)     63.9 (05)     60.8 (01)     77.2 (04)
Table I: Dice average and standard deviation (in parentheses) obtained by each method on medical and vision datasets. The leftmost column indicates whether the learning algorithm has been trained with full mask or scribble annotations. Asterisks (*, **) denote when our method is statistically better than the best of the scribble benchmarks (see Section IV-B).
Figure 4: Dice score obtained by each method when changing the percentage of available annotations in the training set (for clarity, shaded bands show standard errors rather than standard deviations). As upper bound (U.B.) we consider the UNetDUB, trained using all the fully annotated segmentation masks. Asterisks (*, **) have the same role as in Table I.

V-A Learning from Scribbles

A prime contribution of our work is to narrow down the performance gap between the most common strongly supervised models and weakly supervised approaches. Thus, we compare our method with other benchmarks and upper bounds quantitatively, in Table I, and qualitatively, in Fig. 3.

In particular, Table I reports average and standard deviation of the Dice score on test data for each dataset. We clarify that, as discussed in Section IV-A, these results refer to training the segmentors with half of the annotated training images.

Our method matches, and sometimes even improves on, the performance of approaches trained with strong supervision. For example, we improve on the Dice score of UNetUB on both ACDC and PPSS. This result further confirms the potential of weakly supervised approaches, which use annotations that are much easier to collect than segmentation masks.

Moreover, as can be seen from the upper part of the table (methods trained with scribble supervision), we consistently improve segmentation results. When compared to the best benchmark model, we obtain an improvement of up to about 15% on CHAOS-T1 (where the second best is UNetCRF). We speculate that such performance gains are due to the multi-scale interaction between adversarial signals and attention modules, which regularizes the segmentor to predict both locally and globally consistent masks. Unlike competing methods, our training strategy enforces shape constraints in the model, preventing the appearance of isolated pixels and unrealistic spatial relationships between the object parts (see Fig. 3).

Interestingly, we observe that weighting the loss contribution of each class based on its frequency (UNetPCE vs. UNetWPCE) is not always beneficial to the model, probably because scribble supervision, being sparse, suffers less from class imbalance than mask supervision does. We also did not find an evident performance boost in using CRF as RNN to post-process the UNet predictions (UNetWPCE vs. UNetCRF).

Finally, the two-step paradigm of TS-UNetCRF performs worst. We attribute this to the fact that errors reinforce themselves in self-learning schemes [chapelle2009semi]: unreliable proposals in the relabeled training set lead the retrained model to fit to errors.

V-B Semi-supervised Learning: Model Robustness to Limited Annotations

We analyze the robustness of the models to a scarcity of annotations in Fig. 4. We always use 50% of the training data exclusively to train the discriminator, if the method has one. The remaining 50% is used to train the segmentor, with a varying amount of labels: e.g., “5%” means we train with 5% labeled and 45% unlabeled images (adversarial setup). As upper bound, we consider the results of UNetDUB, trained with all the available image-segmentation pairs.
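The data partition described above can be sketched as follows. This is only an illustration under stated assumptions: image identifiers are shuffled at random with a fixed seed, and the function name and arguments are hypothetical; the splits actually used for cross-validation are not specified in this section.

```python
import random

def split_training_set(image_ids, labeled_pct, seed=0):
    """Split ids into discriminator, labeled and unlabeled segmentor subsets.

    labeled_pct is a percentage of the WHOLE training set, so 5 means
    5% labeled and 45% unlabeled images are used for the segmentor.
    """
    if not 0 <= labeled_pct <= 50:
        raise ValueError("labeled_pct must lie in [0, 50]")
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # assumed: random, reproducible split
    half = len(ids) // 2
    discriminator_ids = ids[:half]    # used exclusively for the discriminator
    segmentor_ids = ids[half:]
    n_labeled = round(len(ids) * labeled_pct / 100)
    labeled_ids = segmentor_ids[:n_labeled]
    unlabeled_ids = segmentor_ids[n_labeled:]
    return discriminator_ids, labeled_ids, unlabeled_ids

disc, lab, unlab = split_training_set(range(100), labeled_pct=5)
```

With 100 images and `labeled_pct=5`, the three subsets contain 50, 5 and 45 images respectively and do not overlap.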

As shown in Fig. 4, our model can rapidly approach the upper bound and, overall, it shows the best performance for almost every percentage of training annotations.

Figure 5: (a) Effect of training with labels from multiple annotators; and (b) performance in presence of mixed supervision (mask and scribbles) on ACDC. The upper bound (U.B.) is the UNetDUB, trained with all the dense segmentation masks.

V-C Combining Multiple Scribbles: Simulating Crowdsourcing

Here we investigate the possibility of training our model using multiple scribbles per training image. This scenario simulates crowdsourcing applications, which have been shown to be useful for annotating rare classes or for exploiting various levels of expertise among annotators [lin2014microsoft, orting2019crowdsourcing]. Here, we mimic the availability of scribble annotations collected by three different “sources”, using: i) expert-made scribbles; ii) scribbles approximated by skeletonization of the segmentation masks; iii) scribbles approximated by a random walk in the masks (see Section IV-A for a description of ii) and iii)).

For every training image, we combine multiple scribbles by summing the supervised loss (1) obtained for each of them. Thus, pixels that are labeled by several annotators are counted multiple times, while pixels labeled by only one annotator are counted once. Other ways of combining annotations are also possible (e.g., considering the union of the scribbles, or weighting each annotator differently [orting2019crowdsourcing]), but they are outside the scope of this manuscript.
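The effect of summing the per-annotator losses can be seen in a small sketch. Here a plain, unweighted and unnormalized cross-entropy stands in for the supervised loss (1), which is an assumption made to keep the counting explicit; the function names and the `None` convention for unlabeled pixels are likewise illustrative.

```python
import math

def scribble_loss(probs, scribble):
    """Summed cross-entropy over the pixels labeled in one scribble."""
    return sum(-math.log(max(p[label], 1e-12))
               for p, label in zip(probs, scribble) if label is not None)

def multi_annotator_loss(probs, scribbles):
    """Sum the per-scribble losses over all annotators of one image."""
    return sum(scribble_loss(probs, s) for s in scribbles)

probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
annotator_a = [0, 1, None]
annotator_b = [0, None, None]  # overlaps with annotator_a on pixel 0 only
loss = multi_annotator_loss(probs, [annotator_a, annotator_b])
```

Pixel 0, labeled by both annotators, contributes twice to `loss`, whereas pixel 1 contributes once and pixel 2 not at all.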

In Fig. 5a, we compare the Dice score of our method trained in a “single” vs. a “multiple” annotator scenario. As can be seen, multiple scribbles have a regularizing effect when annotated data are scarce.

Figure 6: UNet-like segmentor with (top) vs. without (bottom) adversarial conditioning of the attention gates in its decoder. Conditioned by an adversarial shape prior (w/ ADS), the model learns semantic attention maps able to localize the object to segment at multiple scales. Also, the shape prior encourages the segmentor to learn multi-scale relationships in the objects.

V-D Multitask Learning: Combining Mask and Scribble Supervision

Collecting homogeneous large-scale datasets can be difficult, but we often have access to multiple data sources, which can have different types of annotations. Here, we relax the assumption of using only scribble annotations, and we investigate whether we can train models that also leverage extra fully annotated data. For simplicity, we assume that 5% of the images have scribble annotations, and we gradually introduce from 0% to 25% of fully annotated images. We train the model in a multi-task learning setup, using as loss: (1) for scribble-annotated data, (2) for unlabeled data, and the weighted cross-entropy for fully annotated images. We report the results on ACDC in Fig. 5b, showing that mixing scribble and mask supervision is feasible and can increase the model performance.

V-E Why does Adversarial Attention Gating work?

Prior-conditioned Attention Maps are Object Localizers

Here we show that, unlike canonical attention gates, AAGs act as object localizers at multiple scales. Specifically, we consider our attention mechanism with and without the adversarial conditioning (ADS). In both cases, the probability attention map is obtained as in Section III-B: it results from a convolutional layer with softmax activation (which can be interpreted as a classifier), followed by a sum over all but one channel. In Fig. 6 we illustrate: i) the most active channels in the classifier output, and ii) the predicted attention maps, at multiple depth levels. As the attention maps show (Fig. 6, top), the adversarial conditioning of the attention gates encourages the segmentor, at multiple scales, to i) learn to localize objects of interest; and ii) suppress activations outside of them. Thus, scattered false positives (see the UNet’s prediction in Fig. 6) are prevented, and the model performance improves (see also Fig. 3).
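The computation of the probability attention map from the classifier output can be sketched in a few lines. The sketch assumes that the convolutional layer's output is available as a list of per-pixel logits and that the excluded channel is the one at index 0 (for instance, a background class); both the data layout and the choice of excluded channel are assumptions of this example, since the exact definition is given in Section III-B.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's channel logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def attention_map(logit_map, excluded_channel=0):
    """Per-pixel attention: sum of softmax probabilities over all
    channels except `excluded_channel` (assumed to be index 0 here)."""
    att = []
    for logits in logit_map:
        probs = softmax(logits)
        att.append(sum(p for c, p in enumerate(probs) if c != excluded_channel))
    return att

# Two pixels, three channels: the first pixel is dominated by channel 0.
logit_map = [[5.0, 0.0, 0.0], [0.0, 3.0, 1.0]]
att = attention_map(logit_map)
```

Because the softmax probabilities sum to one, each attention value lies in [0, 1] and equals one minus the probability of the excluded channel, so pixels dominated by that channel receive low attention.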

Adversarial Attention Gating Trains Deep Layers Better

We qualitatively show that AAGs improve the training of the segmentor's deepest layers. In Fig. 7, we show the distribution of weight values in the convolutional layers at depth d=4 in the absence vs. presence of adversarial conditioning (ADS) of the attention gates. As shown, attention gates with ADS force the segmentor to update its weights also in deeper layers, which would otherwise suffer from vanishing gradients [szegedy2015going, lee2015deeply].

Figure 7: Weight distribution for the convolutional layers at depth d=4 of the segmentor. We compare how the weight distribution changes during training, with and without the use of ADS on the segmentor. Notice that ADS helps the layer training, and the initially narrow distribution becomes broader in time.
Model                            5%          25%         50%
Ours                             40.7 (09)   80.6 (06)   84.3 (05)
w/o: Attention                   38.4 (13)   79.1 (06)   83.8 (04)
w/o: ADS                         39.4 (10)   77.3 (07)   84.0 (05)
w/o: ADS, Attention              55.8 (10)   60.2 (07)   61.8 (08)
w/o: ADS, Discriminator          34.8 (09)   71.6 (08)   71.0 (08)
w/o: ADS, Discrim., Attention    32.1 (09)   68.3 (09)   69.4 (07)
Table II: Ablations with various amounts of labels in ACDC (Dice average, with standard deviation in parentheses).
Ablation Study

We show ablations on ACDC in Table II. When we remove ADS, we leave the discriminator as a vanilla one, receiving inputs only at the highest resolution. Unless stated otherwise, when removing ADS we keep the attention gates in the segmentor, but without the adversarial conditioning. As shown, each model component contributes to the final performance.

VI Conclusion

We introduce a novel strategy to learn object segmentation using scribble supervision and a learned multi-scale shape prior. In an adversarial game, we force a segmentor to predict masks that satisfy short- and long-range dependencies in the image, narrowing down or eliminating the performance gap from strongly supervised models on medical and non-medical datasets. Fundamental to the success of our method are the proposed generalization of deep supervision and the novel adversarial conditioning of attention modules in the segmentor.

We show the robustness of our approach in diverse training scenarios, including: a varying number of scribble annotations in the training set, multiple annotators for an image (crowdsourcing), and the possibility to include fully annotated images during training. Hoping to inspire new studies in weakly-supervised learning, we will release manual scribble annotations for ACDC data, and the code used for the experiments.