Weakly Supervised Few-Shot Segmentation Via Meta-Learning

by   Pedro H. T. Gama, et al.

Semantic segmentation is a classic computer vision task with multiple applications, including medical and remote sensing image analysis. Despite recent advances with deep learning-based approaches, labeling samples (pixels) for training models is laborious and, in some cases, unfeasible. In this paper, we present two novel meta-learning methods, named WeaSeL and ProtoSeg, for the few-shot semantic segmentation task with sparse annotations. We conducted an extensive evaluation of the proposed methods in different applications (12 datasets) in medical imaging and agricultural remote sensing, which are very distinct fields of knowledge and usually subject to data scarcity. The results demonstrate the potential of our methods, which achieve suitable results for segmenting both coffee/orange crops and anatomical parts of the human body, in comparison with full dense annotation.




I Introduction

Image segmentation is a classical computer vision problem where, given an image, a model is required to assign a class to every pixel, defining fine boundaries for the objects of interest that compose the image. It has applications in many scenarios, including medical image analysis [ronneberger2015u, wang2018interactive], remote sensing [nogueira2015coffee, kataoka2016semantic], and others. State-of-the-art approaches to segmentation mostly use Deep Neural Network (DNN) methods, especially variations of Convolutional Neural Networks (CNNs). These approaches became popular after the work of [krizhevsky2012imagenet] and the advances in Graphical Processing Units (GPUs) that allowed the training of large, complex models. The main limitation of current state-of-the-art deep models is the reliance on a large annotated training set, hampering the use of such models in more specific real-world scenarios outside the mainstream visual learning tasks. It is common for DNNs to present underfitting or overfitting [Goodfellow-et-al-2016] problems when trained with a limited amount of data samples. Common semantic segmentation methods rely on labels for all pixels in an image. From now on, this annotation strategy will be referred to as full/dense annotation, characterized by the highly laborious process required to produce such ground truths. The expensive process of producing dense annotations is further aggravated in certain scenarios such as medical imaging or remote sensing, where usually only specialists are able to produce labels correctly. Thus, sparse annotations become an interesting solution, as they provide a label for only a small set of pixels of the image. This type of annotation reduces the time required to label an image, but it can be challenging to train a model with such a limited amount of available information. Multiple methods [lin2016scribblesup, vernaza2017learning, wang2018interactive] have successfully used sparse labels for image segmentation. Another strategy to reduce the cost of labeling a dataset is to reduce the total number of images in it, and consequently the number of labeled images. Such scenarios with small dataset sizes are commonly known as few-shot and have recently gained the interest of the computer vision community.
The few-shot learning literature contains a vast number of works focused on image classification, with notable examples [vinyals2016matching, snell2017prototypical, finn2017model, raghu2019rapid], although some methods for semantic segmentation have been proposed in recent years [dong2018few, rakelly2018conditional, hu2019attention, zhang2019sgone, wang2019panet]. One methodology that has been successfully applied to few-shot problems in recent years is the meta-learning framework [vinyals2016matching, snell2017prototypical, finn2017model]. Normally understood as learning to learn, meta-learning is an umbrella term for a collection of methods that improve the generalization of a learning algorithm through multiple multi-task learning episodes. A recent survey [hospedales2020metalearning] formalizes the meta-learning framework and proposes different ways to categorize methods that use this approach. One can summarize a meta-learning method as an algorithm that learns a set of parameters, called meta-knowledge, trained over a distribution of tasks, such that it generalizes well for the tasks in said distribution. The way a model trains the meta-knowledge is used to group meta-learning methods into categories. In this work, we extensively evaluate our previously proposed method WeaSeL [gama2021weakly] in a vast array of scenarios. Additionally, we introduce a novel semantic segmentation method, ProtoSeg, for problems with few sparsely annotated images. These two approaches are based on the meta-learning algorithms Model-Agnostic Meta-Learning (MAML) [finn2017model] and Prototypical Networks (ProtoNets) [snell2017prototypical], respectively.
The main contributions of this work are: (1) a novel meta-learning method for the problem of semantic segmentation with few-shot sparsely annotated images; (2) an extensive evaluation of our previous and new proposals on a large collection of tasks from medical and remote sensing scenarios; (3) a comparative analysis of five styles of sparse annotations, named Points, Grid, Contours, Skeletons, and Regions; and (4) two novel publicly available crop segmentation datasets with semantic labels for coffee and orange orchard crop regions. The coffee crop dataset has appeared in previous works [ferreira2018comparative, penatti2015deep], but only for the task of patch classification; this work is the first time it is made fully publicly available with its semantic segmentation labels.

II Related Work

II-A Weakly Supervised/Sparse Label Semantic Segmentation

Approaches to the problem of semantic segmentation with sparse labels can be mostly divided into two main groups: 1) methods that use the sparse labels without any kind of augmentation [cciccek20163d, bokhorst2018learning, silvestri2018stereology, zhu2019pick]; and 2) strategies that try to reconstruct dense annotations from the sparse labels [lin2016scribblesup, zhang2019sparse, bai2018recurrent, cai2018accurate]. In the first group, [cciccek20163d] and [bokhorst2018learning] use a weighted loss, [silvestri2018stereology] employ padding in the sparse labels, and [zhu2019pick] use a quality model to ensure a good segmentation based on the sparse annotation. [lin2016scribblesup] are among the first to use sparse labels for semantic segmentation. They use a label propagation scheme in conjunction with an FCN for segmentation. This propagation uses the scribble annotations provided and the prediction of the FCN network. They train their model by alternating which part is trained at each iteration. [tajbakhsh2020embracing] present a thorough review of deep learning solutions to medical image segmentation problems. They include a section on segmentation with noisy/sparse labels, in which the methods belong to one of the two groups described previously. All the methods reviewed by [tajbakhsh2020embracing] use a selective loss, that is, a type of loss function that assigns different weights to unlabeled pixels/voxels and can thus ignore such pixels when the total cost is computed.

II-B Few-Shot Semantic Segmentation

As in the few-shot classification problem, information from the support set plays an important role in the semantic segmentation case. Multiple works try to insert this information directly into the model's processing flow. Many use a two-branch structure, where one branch is responsible for extracting information from the support samples, which is fused into the other branch that processes the query images. [dong2018few] use a two-branch model, where one network produces prototypes of each class, similarly to ProtoNets. The first branch uses the support set and query image to produce prototypes, which are used for image classification in this branch. The encoded query image in the second branch is then fused with the prototypes to produce the probability maps. [hu2019attention] introduce a highly interconnected two-branch attention-based model. The attention modules receive query and support feature maps, and are present in multiple layers of the model. [zhang2019sgone] present another two-branch model. One branch, called guidance, is used to extract feature vectors from both query and support images. They compute class prototypes using masked average pooling on the features from support images. A similarity map between the query features and prototypes is calculated and fused with the query features to compute the final prediction. Other approaches use a single network to face the few-shot semantic segmentation problem. [wang2019panet] propose a direct adaptation of the Prototypical Networks. They use a CNN to produce feature vectors of the images in the support set and compute the prototypes for each class using masked average pooling, as in [zhang2019sgone]. During training, they include an alignment loss, where the prototypes are computed from the query image and the support set is the segmentation target. [rakelly2018conditional] proposed the Guided Networks (Guided Nets), the first algorithm for few-shot sparse segmentation. Although it fuses information from the support set into query features, this model uses a single feature extraction network. This network is a pre-trained CNN backbone that extracts features of the support set and the query image. The support features are averaged through a masked pooling using the sparse annotations provided for the images, and further globally averaged across all the support images available. This single averaged support feature multiplies the query features, reweighting them. Then, these features are further processed by a small convolutional segmentation head that gives the final predictions. In Table I, we present a summary of the related works and how our proposed methods fit in the literature.

Columns: Work | Semantic Segmentation | Few-Shot | Sparse Annotations. Rows: lin2016scribblesup, vernaza2017learning, wang2018interactive, zhang2019sparse, bai2018recurrent, cai2018accurate, cciccek20163d, bokhorst2018learning, silvestri2018stereology, zhu2019pick, snell2017prototypical, finn2017model, dong2018few, hu2019attention, zhang2019sgone, wang2019panet, rakelly2018conditional, WeaSeL (Ours), ProtoSeg (Ours).

TABLE I: Summary of related work.

III Methodology

III-A Problem Definition

For our problem setup, we employ most of the definitions from [gama2021weakly]. A dataset $\mathcal{D}$ is a set of pairs $(x, y)$, where $x \in \mathbb{R}^{H \times W \times B}$ is an image with dimensions $H \times W$ and $B$ bands/channels, and $y$ is the semantic label of the pixels in the image. This dataset is partitioned into two sets: $\mathcal{S}$ (support set) and $\mathcal{Q}$ (query set), such that $\mathcal{S} \cap \mathcal{Q} = \emptyset$. Given a dataset $\mathcal{D}$ and a target class $c$, we define a segmentation task as a tuple $\mathcal{T} = (\mathcal{D}, c)$ (or $(\mathcal{S}, \mathcal{Q}, c)$, for simplicity). A few-shot semantic segmentation task $\mathcal{T}^{few}$ is a specific type of segmentation task. It is also a tuple $(\mathcal{S}, \mathcal{Q}, c)$, but the samples of $\mathcal{S}$ have their labels sparsely annotated, and the labels in $\mathcal{Q}$ are absent or unknown. Moreover, the number of samples $k = |\mathcal{S}|$ is small; thus, we also call a few-shot task a $k$-shot task. Finally, the problem of few-shot semantic segmentation with sparse labels is defined as follows. Given a few-shot task $\mathcal{T}^{few}$ and available segmentation tasks $\mathcal{T}_1, \dots, \mathcal{T}_n$, we want to segment the images from the query set of $\mathcal{T}^{few}$ using information from the tasks $\mathcal{T}_i$ and from the support set of $\mathcal{T}^{few}$. Also, there is no information about the target objects of $\mathcal{T}^{few}$ other than the sparse annotations of its support samples. That is, no image/semantic-label pair of $\mathcal{T}^{few}$ is present in any task $\mathcal{T}_i$, in either its support or query partition.
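As an illustration of this setup, the pieces of a few-shot task can be sketched as a small container type. Field names and the use of -1 for unknown labels are our own illustrative choices, not the authors' code:

```python
from dataclasses import dataclass

@dataclass
class SegmentationTask:
    """A segmentation task: a dataset split into support/query sets plus a
    target class. Names here are illustrative, not the paper's code."""
    support_images: list   # k images, each H x W x B
    support_labels: list   # k sparse label maps (-1 marking unknown pixels)
    query_images: list     # images to be segmented
    target_class: int

    @property
    def k(self):
        # number of support samples: this makes it a k-shot task
        return len(self.support_images)
```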

III-B Gradient-based Sparse Segmentation with WeaSeL

We now reintroduce our previously proposed method from [gama2021weakly]. Weakly-supervised Segmentation Learning (WeaSeL) is an adaptation of the supervised MAML algorithm [finn2017model], as depicted in Figure 1.

(a) Visualization of the meta-training process. The global parameter $\theta$ is optimized for different tasks, obtaining the task parameters $\phi_i$ through optimization using the sparse labels from the tasks' support sets. $\phi_i^*$ is an optimal parameter that could be obtained if the model were trained with the query set samples and their dense annotations, which are only used to compute the task's outer loss. The hypothetical difference between these parameters is expected to be minimized during meta-training, leading to a faster/better learner for the few-shot task.
(b) Illustration of the fine-tuning step. The meta-optimized $\theta$ is trained in a supervised manner with the sparsely annotated samples of the few-shot support set. The labels of the query set are unknown, i.e., not seen by the model.
Fig. 1: Illustration of the WeaSeL method with toy examples in the meta-training/meta-test phase (a), and in the few-shot tuning phase (b).

Our meta-tasks are segmentation tasks (i.e., the set $\{\mathcal{T}_1, \dots, \mathcal{T}_n\}$), as defined in Section III-A. We employ the Cross-Entropy loss ($\mathcal{L}_{CE}$) commonly used in segmentation tasks, defined for a single pixel as:

$\mathcal{L}_{CE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$   (1)

where $C$ is the number of classes, $y_c$ is the true probability of class $c$ for the pixel, and $\hat{y}_c$ is the probability predicted by the model for class $c$ at that pixel. The loss in Equation 1 is averaged over all pixels to produce the final loss for an image. The meta-tasks have densely annotated samples. To train the model in a scenario similar to the target few-shot task, we simulate sparse annotations for the samples in the meta-task support sets. That is, for every task $\mathcal{T}_i$, the labels of samples in the support set $\mathcal{S}_i$ are randomly converted to a sparse version of themselves (this operation will be further discussed in Section IV-B). With this, we expect the model to learn to predict dense labels from sparse annotations and to more easily adapt to few-shot tasks. Given that the labels are sparse in the inner loop of meta-training, and during tuning in the few-shot task, we modify the classical Cross-Entropy into a Selective Cross-Entropy (SCE) loss as follows:

$\mathcal{L}_{SCE} = -\frac{1}{N} \sum_{p} \mathbb{1}_p \sum_{c=1}^{C} y_{p,c} \log(\hat{y}_{p,c})$   (2)

where $p$ is a pixel, $N$ is the total number of labeled pixels, and $\mathbb{1}_p$ is an indicator, with $\mathbb{1}_p = 0$ if $p$ has an unknown label and $\mathbb{1}_p = 1$ otherwise. That is, $\mathcal{L}_{SCE}$ ignores pixels with unknown labels via the binary weight, averaging the loss over all annotated pixels. Algorithm 1 summarizes the meta-training procedure using the segmentation meta-task distribution $p(\mathcal{T})$. In the inner loop, the loss is computed using the simulated sparse annotations of the support set of a task, and the outer loss uses the dense labels of the query set of a task $\mathcal{T}_i$. After the meta-training phase, we adapt the model to the few-shot task, performing a simple fine-tuning with samples from the support set of the few-shot task $\mathcal{T}^{few}$. That is, we use pairs $(x, y) \in \mathcal{S}^{few}$ to train the model in a supervised manner using the Selective Cross-Entropy loss.
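A minimal PyTorch sketch of such a selective loss, assuming unlabeled pixels are marked with -1 (our convention; the paper does not specify an encoding):

```python
import torch
import torch.nn.functional as F

UNKNOWN = -1  # label value marking pixels without annotation (assumption)

def selective_cross_entropy(logits, target):
    """Cross-entropy averaged only over annotated pixels.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer labels,
    with UNKNOWN marking unlabeled pixels that must not contribute.
    """
    mask = (target != UNKNOWN).float()            # the indicator weight
    if mask.sum() == 0:                           # no labeled pixels at all
        return logits.sum() * 0.0
    # clamp removes the -1 so cross_entropy indexes a valid class;
    # those positions are zeroed out by the mask anyway
    per_pixel = F.cross_entropy(logits, target.clamp(min=0), reduction="none")
    return (per_pixel * mask).sum() / mask.sum()  # mean over labeled pixels
```

PyTorch's built-in `ignore_index` argument of `F.cross_entropy` achieves the same averaging; the explicit mask above simply mirrors the indicator in the equation.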

Require: $p(\mathcal{T})$: distribution over tasks
Require: $\alpha$, $\beta$: step size hyperparameters
  Randomly initialize $\theta$
  while not done do
     Sample batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
     for all $\mathcal{T}_i$ do
        Sample batch of datapoints from the support set $\mathcal{S}_i$
        Compute the inner loss $\mathcal{L}_{SCE}$ using $\mathcal{S}_i$ and $f_\theta$
        Update parameters: $\phi_i = \theta - \alpha \nabla_\theta \mathcal{L}_{SCE}$
        Sample batch of datapoints from the query set $\mathcal{Q}_i$
     end for
     Update $\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{CE}(f_{\phi_i})$ using $\mathcal{Q}_i$ and the dense labels
  end while
Algorithm 1: Training algorithm for WeaSeL.
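The inner/outer structure of Algorithm 1 can be sketched with a functional forward pass, here via `torch.func.functional_call` (Torchmeta, which the authors use, offers similar machinery). This is a simplified single-inner-step sketch under our own naming, not the authors' implementation:

```python
import torch
from torch.func import functional_call

def weasel_meta_step(model, tasks, loss_fn, alpha=0.01, meta_opt=None):
    """One outer iteration of a MAML-style meta-step (simplified sketch).

    Each task supplies (sparsely labeled) support data for the inner update
    and (densely labeled) query data for the outer loss.
    """
    meta_opt.zero_grad()
    params = dict(model.named_parameters())
    outer_loss = 0.0
    for task in tasks:
        xs, ys = task["support"]
        xq, yq = task["query"]
        # inner step: adapt the global parameters theta on the support data
        inner = loss_fn(functional_call(model, params, (xs,)), ys)
        grads = torch.autograd.grad(inner, tuple(params.values()),
                                    create_graph=True)
        fast = {name: p - alpha * g
                for (name, p), g in zip(params.items(), grads)}
        # outer loss: evaluate the adapted parameters phi_i on the query data
        outer_loss = outer_loss + loss_fn(functional_call(model, fast, (xq,)), yq)
    outer_loss.backward()
    meta_opt.step()
    return float(outer_loss)
```

The `create_graph=True` flag is what makes the outer update differentiate through the inner update (the second derivatives mentioned in Section IV-D2).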

III-C Prototypical Seeds for Sparse Segmentation Via ProtoSeg

The proposed method for semantic segmentation based on Prototypical Networks [snell2017prototypical] is a straightforward adaptation of the original method. It uses the same premise of constructing a prototype vector for each class, with the distinction that prototypes are computed using the labeled pixels instead of whole image instances. Consider a support set $\mathcal{S} = \{(x_i, y_i)\}$, where $x_i \in \mathbb{R}^{H \times W \times B}$ is an image with height $H$, width $W$ and $B$ channels, and $y_i$ is a label image with the semantic class of each pixel in $x_i$. Since $y_i$ can be sparse, the possible values of a pixel in $y_i$ are in the set $\{1, \dots, C\} \cup \{u\}$, where $C$ is the total number of classes and $u$ represents the unknown class. In our adaptation of ProtoNets, we define the $d$-dimensional prototype vector $v_c$ of a class $c$ as:

$v_c = \frac{1}{N_c} \sum_{(x, y) \in \mathcal{S}} \sum_{p} (f_\theta(x) \odot M_c)_p$   (3)

where $f_\theta$ is our embedding function parametrized by $\theta$ (a CNN), $\odot$ is point-wise multiplication, and $M_c$ is a mask matrix where each value is defined as $(M_c)_p = 1$ if $y_p = c$, and $(M_c)_p = 0$ otherwise. $N_c$ is the total number of pixels of class $c$ across the whole support set $\mathcal{S}$. This means that our prototype vector is the mean embedding of all pixels of a class present in the support set. This is similar to a masked average pooling, but considering all pixels globally, as opposed to averaging within each sample and then averaging these pooled vectors (see Figure 2).

Fig. 2: Illustration of our global average pooling. After masking the features, our process creates the global average by considering all pixels in the set.
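A sketch of this global masked average pooling, assuming features have already been extracted and sparse labels mark unknown pixels with -1 (our convention):

```python
import torch

def class_prototypes(features, labels, num_classes):
    """Prototype v_c = mean embedding over all pixels labeled c, pooled
    globally across the support set (sketch, not the authors' code).

    features: (N, d, H, W) embeddings f_theta(x); labels: (N, H, W) with
    -1 for unknown pixels. Returns a (num_classes, d) tensor.
    """
    d = features.shape[1]
    protos = torch.zeros(num_classes, d)
    for c in range(num_classes):
        mask = (labels == c).unsqueeze(1).float()   # (N, 1, H, W) mask M_c
        n_c = mask.sum()                            # N_c: pixels of class c
        if n_c > 0:
            # sum masked features over every pixel of every support image,
            # then divide once: a single global average, not per-sample
            protos[c] = (features * mask).sum(dim=(0, 2, 3)) / n_c
    return protos
```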

Inference is the same as in the original Prototypical Networks, but applied to each pixel of the image. Formally, the probability of a pixel $p$ of a query image $x$ belonging to a class $c$ is computed as follows:

$P(y_p = c \mid x) = \frac{\exp(-d(f_\theta(x)_p, v_c))}{\sum_{c'=1}^{C} \exp(-d(f_\theta(x)_p, v_{c'}))}$   (4)

where $d$ is the squared Euclidean distance: $d(a, b) = \lVert a - b \rVert_2^2$. Similar to the case of the WeaSeL method, given the presence of unknown-labeled pixels in training, we modify our loss function to ignore such pixels. We define our new loss function as follows:

$\mathcal{L} = -\frac{1}{N} \sum_{x \in \mathcal{X}} \sum_{p} \sum_{c=1}^{C} (M_c)_p \log P(y_p = c \mid x)$   (5)

where $\mathcal{X}$ is the set of images used to compute the loss, $p$ is a pixel coordinate, and $c$ represents a class. Note that $c$ starts from $1$, thus not considering the unknown class $u$. We use $P(y_p = c \mid x)$ as defined in Equation 4. Given Equations 3 and 4, the model is trained using an episodic training strategy. This strategy resembles the training algorithm of WeaSeL and is presented in Algorithm 2. It uses the same distribution over tasks $p(\mathcal{T})$ as the first method, with the automatically generated sparse annotations of the meta-tasks during training. At each iteration, a batch of tasks is sampled, and for each task $\mathcal{T}_i$, a support set is constructed and used for training.
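Per-pixel inference then reduces to a softmax over negative squared distances to the prototypes. A minimal sketch for a single query image, under the same assumptions as above:

```python
import torch

def protoseg_predict(features, prototypes):
    """Per-pixel class probabilities: softmax over the negative squared
    Euclidean distance to each class prototype (sketch).

    features: (d, H, W) embedding of one query image; prototypes: (C, d).
    Returns a (C, H, W) probability map.
    """
    d, h, w = features.shape
    flat = features.reshape(d, -1).t()                    # (H*W, d) pixels
    # squared Euclidean distance from every pixel to every prototype
    dist = ((flat[:, None, :] - prototypes[None]) ** 2).sum(-1)  # (H*W, C)
    probs = torch.softmax(-dist, dim=1)                   # closer = likelier
    return probs.t().reshape(-1, h, w)                    # back to (C, H, W)
```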

Require: $p(\mathcal{T})$: distribution over tasks
  Randomly initialize $\theta$
  while not done do
     Sample batch of tasks $\mathcal{T}_i \sim p(\mathcal{T})$
     for all $\mathcal{T}_i$ do
        Sample a support set $\mathcal{S}_i$ from $\mathcal{T}_i$
        Compute the prototypes $v_c$ using $\mathcal{S}_i$, for all classes $c$, using Equation 3
        Sample a query batch $\mathcal{Q}_i$ from $\mathcal{T}_i$
        Compute the loss as defined in Equation 5, using $\mathcal{Q}_i$
        Update $\theta$ using gradient descent and the loss
     end for
  end while
Algorithm 2: Training algorithm for ProtoSeg.

IV Experimental Setup

In this section, we present the configurations used for our experiments. In Section IV-A, we briefly present the datasets used. The evaluated sparse label annotation styles are listed in Section IV-B. Next, in Section IV-C, we introduce the FCN architecture used, and in Section IV-D the baselines, protocol, and metrics are presented. All the code for the experiments was written in Python 3. For the models, we use the PyTorch framework (https://pytorch.org) and the Torchmeta module (https://github.com/tristandeleu/pytorch-meta). Regarding hardware, all experiments were performed on a 64-bit Intel i9-7920X Ubuntu machine with 64GB of RAM and GeForce RTX 2080 TI/Titan XP GPUs (only one GPU was used during the experiments).

Fig. 3: Examples of biomedical imaging datasets included in our medical meta-dataset.

IV-A Datasets

We design experiments to evaluate the proposed methods in medical and remote sensing applications for semantic segmentation. These areas share some characteristics that set them apart from others. Their images are rather distinct from common RGB images taken with surveillance or cellphone cameras, for instance. This hinders knowledge transfer from other generic domains or the use of models pre-trained on large datasets such as ImageNet. Another common aspect of these two areas is the limited availability of images due to multiple factors. Medical image datasets face privacy and ethical concerns, and also require a highly specialized radiologist to provide precise annotations. In remote sensing, annotation is extremely laborious and sometimes unfeasible, since it typically requires a specialist to collect information from large geographical areas, possibly even visiting the site to produce ground truths for these data.

IV-A1 Medical Imaging Datasets

We use a total of ten medical datasets in our experiments (Figure 3). Of these, six are Chest X-Ray (CRX) datasets: JSRT [JSRTshiraishi2000development] with labels for lungs, heart, and clavicles; the Montgomery and Shenzhen sets [jaeger2014two]; an annotated subset of Chest X-Ray8 [NIHwang2017chestx] by [tang2019xlsor], referred to as NIH-labeled; OpenIST (https://github.com/pi-null-mezon/OpenIST) with labels for lung segmentation; and the LIDC-IDRI-DRR dataset [LIDColiveira20203d], with generated rib annotations. We include two Mammographic X-Ray (MRX) image sets, namely INbreast [moreira2012inbreast] and MIAS [MIASsuckling1994mammographic], with labels for breast region and pectoral muscle segmentation. Also, two Dental X-Ray (DRX) datasets are included: Panoramic X-Ray [PANORAMICabdi2015automatic] with labels for the inferior mandible, and IVisionLab [IVISIONsilva2018automatic] annotated for teeth segmentation.

IV-A2 Remote Sensing Datasets

The Remote Sensing meta-dataset is composed of rural scenes for crop segmentation (Figure 4). More specifically, we use the Brazilian Coffee dataset, composed of images of four municipalities (namely, Arceburgo, Guaranésia, Guaxupé, and Montesanto) with pixel-level annotations for coffee crop regions, as well as the Orange Orchards (Ubirajara county, Brazil) dataset, with annotations for orange crop regions. Both datasets will be made publicly available upon the acceptance of this work. Further description of all datasets is presented in the supplementary material.

Fig. 4: Examples from the Brazilian Coffee and Orange Orchards Datasets.

IV-B Types of sparse annotation

Fig. 5: Illustration of the types of sparse annotations used. Annotations are illustrative and upscaled for better visualization.

In the experiments, we evaluate five types of sparse annotation, namely: points, grid, contours, skeletons, and regions. As mentioned, we simulate these annotations from the original dense labels of an image. Visual examples of these annotations are shown in Figure 5. We describe each type of annotation and explain how the sparse annotations are generated as follows:

  1. Points: simulates an annotator alternately picking pixels from the foreground and background classes. We use a parameter $p$ and randomly choose $p$ pixels from the foreground and $p$ pixels from the background. The remaining pixels are set as unknown.

  2. Grid: the annotator receives a pre-selected collection of pixels of the image, which are initially assumed to be from the background class; the pixels considered foreground should be annotated. These pre-selected pixels are disposed in a grid pattern generated using a spacing parameter $s$. First, a random pixel is selected within the rectangular region with upper-left corner $(0, 0)$ and bottom-right corner $(s, s)$. Afterward, a grid is created from this position with spacing $s$ horizontally and vertically. Pixels outside the grid are set as unknown.

  3. Contours: the annotator delineates the inner and outer boundaries of foreground objects. This style is useful for cases where a single connected foreground object is present. We simulate these annotations using morphological operations on the original binary dense labels. We use an erosion operation followed by a marching squares algorithm (scikit-image's find_contours: https://scikit-image.org/docs/0.8.0/api/skimage.measure.find_contours.html#find-contours) to find the inner contours. For the outer contours, we use a dilation operation on the original label mask and the same marching squares algorithm. Additionally, we use a parameter that determines the density of the sparse annotation.

  4. Skeleton-Based Scribble: resembles an annotator drawing a scribble roughly at the center of the foreground objects that more or less approximates the object form. The same process is applied to the background. These annotations are generated using the skeletonize algorithm (https://scikit-image.org/docs/0.8.0/api/skimage.morphology.html?highlight=skeletonize#skeletonize) on the binary dense label masks, which returns the skeletons of the foreground objects. The same process is applied to the negated dense label masks to obtain the skeleton of the background class. Dilation is applied to add thickness to the skeletons. We use a parameter to control the density of the annotation: we generate random binary blobs (using scikit-image's binary_blobs function: https://scikit-image.org/docs/dev/api/skimage.data.html#skimage.data.binary_blobs) that occupy a given percentage of the image space and use them to mask the computed skeletons.

  5. Regions: this type of annotation represents the process of an annotator assigning classes to pure superpixels. We define a pure superpixel as a (usually small) connected set of pixels with the same class. The annotator is provided with the superpixels of the image and then assigns the class of a subset of pure foreground and background superpixels. To generate these annotations, we first compute the superpixels of the images using the SLIC algorithm [slic2012achanta] with empirically chosen parameters for each dataset. Once the superpixels are computed, we randomly select a percentage of the superpixels for the foreground and a percentage for the background.
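As an illustration, the first two styles (Points and Grid) can be simulated from a dense binary mask along these lines. This is a NumPy sketch; the parameter names and the -1 unknown marker are our own choices, not the authors' code:

```python
import numpy as np

def points_annotation(dense, n_points, rng=None):
    """Points style: keep n_points labeled pixels per class (background 0,
    foreground 1) and mark every other pixel as unknown (-1)."""
    rng = rng or np.random.default_rng(0)
    sparse = np.full_like(dense, -1)
    for cls in (0, 1):
        ys, xs = np.nonzero(dense == cls)
        keep = rng.choice(len(ys), size=min(n_points, len(ys)), replace=False)
        sparse[ys[keep], xs[keep]] = cls
    return sparse

def grid_annotation(dense, spacing, rng=None):
    """Grid style: reveal the dense labels on a regular grid whose origin is
    a random pixel inside the first spacing x spacing cell."""
    rng = rng or np.random.default_rng(0)
    sparse = np.full_like(dense, -1)
    r0, c0 = rng.integers(0, spacing, size=2)
    sparse[r0::spacing, c0::spacing] = dense[r0::spacing, c0::spacing]
    return sparse
```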

IV-C miniUNet architecture

The network model used in all experiments – baselines, WeaSeL, and ProtoSeg – is a simplified version of the U-Net architecture [ronneberger2015u]. We call it miniUNet, since it is a smaller version of the original network. More information about the miniUNet architecture can be found in the supplementary material of this manuscript. In ProtoSeg, since we want to generate $d$-dimensional feature vectors, the last layer of the network is ignored and the output is gathered from the last decoder block. That is, the embedding function $f_\theta$ is the miniUNet model excluding the last convolutional layer; with this, the prototypes are $d$-dimensional.

IV-D Evaluation Protocol

IV-D1 Baselines

We use two baselines for comparison with our approaches: 1) From Scratch and 2) Fine-Tuning. Given our few-shot semantic segmentation problem, stated in terms of the set of segmentation tasks $\{\mathcal{T}_1, \dots, \mathcal{T}_n\}$ and a few-shot task $\mathcal{T}^{few}$, we define our baselines as follows. From Scratch: given our miniUNet network, we perform a simple supervised training with the few-shot task support set ($\mathcal{S}^{few}$). We use the same Cross-Entropy loss ignoring unlabeled pixels as our cost function (Equation 2). The Adam optimizer [kingma2017adam] was used, with the same parameters used in the training of our methods. Fine-Tuning: we use the miniUNet architecture, choose one task from our task set, and perform a supervised training with its data. Once that training is finished, we fine-tune (a supervised training) using the $\mathcal{S}^{few}$ set. Again, the same Cross-Entropy loss (Equation 2) is used with the same parametrized Adam optimizer. We choose not to present the Guided Nets [rakelly2018conditional] as a baseline in this work. The use of a pre-trained CNN as feature extractor seems to be essential to the efficiency of that model, and did not translate well to our evaluation scenarios (medical and remote sensing). Despite our best efforts, the model was not able to converge to a usable state with our meta-datasets. Thus, it did not seem fair to compare the Guided Nets to our approaches.

IV-D2 Protocol and Metrics

In order to assess the performance of our methods in a certain setting, we employ a Leave-One-Task-Out methodology. That is, all tasks but the pair chosen as the few-shot task ($\mathcal{T}^{few}$) are used in the meta-dataset, reserving $\mathcal{T}^{few}$ for the tuning/testing phase. This strategy simultaneously hides the target task from the meta-training while also allowing the experiments to evaluate the proposed algorithms and baselines in a myriad of scenarios. Moreover, we divide our tasks into two groups to perform the experiments: (I) Medical Tasks: all the medical datasets and their classes are used for these tasks; (II) Remote Sensing Tasks: the rural datasets (Brazilian Coffee and Orange Orchards) are used for these tasks. For each method, we used a different number of epochs in each of its training phases. In Table II, we show these numbers, which differ mostly due to training time. The Remote Sensing datasets are, in general, larger than the Medical datasets, and this made the training process (which includes validation) more time-consuming. We use the Adam optimizer [kingma2017adam]. The number of tasks sampled for the inner loop of WeaSeL and ProtoSeg was smaller in the Remote Sensing experiments than in the Medical experiments, due to memory constraints and the smaller total number of tasks in the Remote Sensing experiments.

Method | Medical: Pre/Meta-Training | Medical: Tuning | Remote Sensing: Pre/Meta-Training | Remote Sensing: Tuning
WeaSeL | 2000 | 80 | 200 | 40
ProtoSeg | 2000 | - | 200 | -
Fine-Tuning | 200 | 80 | 100 | 80
From Scratch | - | 80 | - | 100

TABLE II: Number of epochs for training the methods in different experiments.

We use a 5-fold cross-validation protocol in the experiments. Each dataset has a training and a validation partition for each fold. Once the experiment fold is fixed, the support sets for the tasks are obtained from the training partition of the dataset, while the query sets are the entire validation partition. All images and labels are resized to a fixed resolution (one for remote sensing images and another for medical images) prior to being fed to the models. This was due to our infrastructure limitations, and was done to standardize the input size and minimize the computational cost of the methods, especially the memory footprint of the WeaSeL method, which computes second derivatives on high-dimensional outputs. The metric within a fold is computed for all images in the query set according to the dense labels, and averaged over the images in that fold. The metric used is the Jaccard score (or Intersection over Union, IoU) of the validation images, a common metric for semantic segmentation.
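For a binary foreground mask, the Jaccard score reduces to a short computation; a small sketch:

```python
import numpy as np

def jaccard_score(pred, target):
    """Intersection over Union (Jaccard) for a binary foreground mask.

    pred, target: boolean or {0, 1} arrays of the same shape.
    """
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    union = (pred | target).sum()
    if union == 0:
        return 1.0                     # both empty: perfect agreement
    return (pred & target).sum() / union
```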

V Results and Discussion

In this section, we present and discuss the results of our experiments. Section V-A shows a comparison of the results of the proposed methods and baselines using multiple sparse annotations and their densely annotated counterparts. Section V-A1 focuses on the medical imaging datasets, while Section V-A2 describes the results obtained from remote sensing data. At last, in Section V-B we evaluate different sparse annotation styles regarding the number of user inputs and segmentation performance.

V-A Few-shot Semantic Segmentation: Sparse vs Dense labels

In this section, we present the results of our methods on multiple few-shot tasks in the Medical and Remote Sensing scenarios. We evaluated different numbers of shots and different parameters for each type of sparse annotation. In Sections V-A1 and V-A2, we present the results grouped in plots organized by sparse annotation type and number of shots. Dashed lines in the graphs represent the scores of the methods trained with dense annotations.

V-A1 Medical Tasks

Analyzing the results in the CRX tasks, two trends can be easily seen. First, an obvious insight that holds for most methods and scenarios is that better scores are obtained with more data: larger support sets (more shots) and more sparsely annotated pixels result in better performance. A second observed result is that WeaSeL outperformed ProtoSeg and the baselines in tasks with a larger domain shift to the other tasks in the meta-dataset. This was observed mainly in the JSRT Lungs (Figure 6) and JSRT Heart (Figure 7) experiments, as the JSRT dataset is visually the most distinct of the CRX datasets. Additionally, the Heart class is annotated only in this dataset, resulting in a large domain shift in the semantic space for this task in comparison to the other tasks used in the meta-training.
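WeaSeL adapts gradient-based meta-learning to sparse supervision [gama2021weakly]. Purely as an illustration of the mechanics (and of why the second derivatives mentioned earlier appear), the following toy sketch runs one first-order meta-step on a linear model; the model, loss, and names are hypothetical, and the actual method backpropagates through the inner update rather than using the first-order shortcut shown here:

```python
import numpy as np

def sparse_mse_grad(w, X, y, mask):
    """Gradient of an MSE loss computed only on annotated (mask == 1) pixels."""
    Xm, ym = X[mask == 1], y[mask == 1]
    if len(ym) == 0:
        return np.zeros_like(w)
    return 2.0 * Xm.T @ (Xm @ w - ym) / len(ym)

def meta_step(w, tasks, alpha=0.01, beta=0.01):
    """One first-order MAML-style meta-update over a batch of sparse tasks.
    Each task is (X_support, y_support, mask_support, X_query, y_query, mask_query)."""
    outer = np.zeros_like(w)
    for X_s, y_s, m_s, X_q, y_q, m_q in tasks:
        # inner loop: adapt the weights to the sparsely labeled support set
        w_task = w - alpha * sparse_mse_grad(w, X_s, y_s, m_s)
        # outer loop: accumulate the query-set gradient at the adapted weights
        outer += sparse_mse_grad(w_task, X_q, y_q, m_q)
    return w - beta * outer / len(tasks)
```

Differentiating the query loss through `w_task` with respect to `w` (as the full second-order variant does) is what produces the memory footprint discussed in the experimental setup.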

Fig. 6: Jaccard score of experiments with JSRT Lungs task.
Fig. 7: Jaccard score of experiments with JSRT Heart task.

For the remaining tasks, we observe that either the ProtoSeg method or some fine-tuning baseline is the best performer. Since some datasets are visually similar, fine-tuning a model trained on a similar dataset is a known viable solution that works well in these cases. Fine-tuning from similar tasks (e.g., OpenIST, Montgomery, or Shenzhen for lung segmentation) yields the best Jaccard scores in most cases, as exemplified in Figure 8 for the OpenIST dataset. We also observe that ProtoSeg is consistently comparable to these fine-tuned baselines. The corresponding plots for the Montgomery and Shenzhen tasks, omitted here due to space constraints, can be found in the supplementary material.

Fig. 8: Jaccard score of experiments with OpenIST Lungs task.

MRX and DXR tasks present tendencies similar to those observed in the CRX datasets. Again, fine-tuning from similar tasks appears as a solid solution, with WeaSeL obtaining comparable results to the baselines in most cases. In the MIAS Breast task (Figure 9), fine-tuning from INbreast Breast proved to be the best method, mainly because source and target share the same semantic space, closely followed by WeaSeL in most scenarios. Two DXR datasets are included in the meta-dataset in our experiments, so in experiments with one DXR dataset as target, the other one is always used for pretraining. However, the Panoramic dataset is labeled for mandibles while the IVisionLab data are labeled for teeth, so the two never share the same label space. In this scenario, without a task with a similar semantic space to fine-tune from, WeaSeL yields the best performance in the segmentation tasks, achieving the highest scores in the majority of experiments with both dense and sparse annotations. This can be observed in Figure 10 for the Panoramic Mandible task. As these are the most distinct tasks, even the from-scratch baseline yields results comparable to the more complex alternatives, in some cases even the best ones. ProtoSeg underperforms by a large margin compared to the other methods on Panoramic Mandible, which can be explained by the low prevalence of DXR data in the meta-dataset used for meta-training and by the large semantic domain shift even among the DXR datasets.

Fig. 9: Jaccard score of experiments with MIAS Breast task.
Fig. 10: Jaccard score of experiments with Panoramic Mandible task.

V-A2 Remote Sensing Tasks

In general, all remote sensing tasks proved to be considerably harder than the medical ones. Overall, no method achieved a Jaccard score above in any of the evaluated tasks, not even when using dense labels. Figures 11 and 12 depict the results for the Montesanto Coffee and Arceburgo Coffee tasks, from the Brazilian Coffee dataset, while Figure 13 shows results for the Orange Orchard task. One can easily observe that WeaSeL consistently outperforms the fine-tuning and from-scratch baselines, especially in configurations with little data (1-shot tasks). Despite sharing the same label space, coffee segmentation presents an intrinsic domain shift across the four counties in the dataset, due to distinct geographical features, coffee crop cycles, and/or plantation methods, which explains why simple fine-tuning is not always the best solution for coffee crop segmentation.

Fig. 11: Jaccard score of experiments with Montesanto Coffee task.
Fig. 12: Jaccard score of experiments with Arceburgo Coffee task.
Fig. 13: Jaccard score of experiments with Orange Orchard task.

ProtoSeg had consistent results in most agricultural tasks. For the Coffee tasks, it generally obtained Jaccard scores around , while the performance in the Orange task revolved around . In a tendency similar to the medical experiments, ProtoSeg seems to benefit from closely related tasks, particularly regarding the semantic space. When Orange Orchard is the target task, only the Coffee tasks are available for training, which explains its lower performance in comparison to the Coffee datasets.
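ProtoSeg's relative insensitivity to annotation density (discussed further in Section V-B) is consistent with prototype-based segmentation: support features are masked-average-pooled into one prototype per class, and each query pixel takes the label of its nearest prototype. A minimal numpy sketch of that idea (the mask convention and function names are ours, not the paper's code):

```python
import numpy as np

def prototypes(feats, mask):
    """Class prototypes via masked average pooling of support features.
    feats: (H, W, D) feature map; mask: (H, W) with 1 = foreground,
    0 = background, -1 = unlabeled (sparse annotation)."""
    fg = feats[mask == 1].mean(axis=0)
    bg = feats[mask == 0].mean(axis=0)
    return np.stack([bg, fg])  # shape (2, D)

def segment(query_feats, protos):
    """Label each query pixel with its nearest prototype (Euclidean distance)."""
    # (H, W, 1, D) - (2, D) broadcasts to (H, W, 2, D); norm over D
    dists = np.linalg.norm(query_feats[..., None, :] - protos, axis=-1)
    return dists.argmin(axis=-1)  # 0 = background, 1 = foreground
```

Because the prototypes are averages, a handful of annotated pixels can already place them close to their dense-label positions, which would explain the flat curves observed for ProtoSeg.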

V-B Sparse Label Efficiency Comparison

In this section, we present results for three types of sparse annotations: Points, Grid, and Regions. Contour and Skeleton annotations are not evaluated because of the way we generate them, which does not translate into a countable number of user inputs. We define the number of user inputs for an annotation type as the number of interactions an annotator would have to perform to sparsely annotate an image using that type. For a single image, the number of inputs for a Points annotation is the total number of positive and negative pixels selected. For the Grid annotation, the number of inputs is the number of positively labeled pixels in the grid, since grid pixels are initially assumed to be negative and the user picks the positive ones. For the Regions annotation, the number of inputs is the total number of regions selected, regardless of being positive or negative. After the number of inputs for a single image is computed, the values are summed for all images in the support set of the task and then averaged across the five folds.

Figures 14 and 15 present label efficiency plots for the JSRT Lungs and Montesanto Coffee tasks, respectively. We observe that, as seen in Section V-A, WeaSeL generally performs better with more data, which here increases with the number of user inputs. We also clearly see that ProtoSeg is almost indifferent to the sparsity and quantity of annotations, showing a low score deviation in the presented tasks. For the JSRT Lungs task (Figure 14), and medical tasks in general, the Grid annotation usually achieves a higher score than the other types for the same number of user inputs. Conversely, the Region annotation is commonly the best annotation type for the remote sensing tasks, reaching higher scores with the same number of inputs. The Points annotation is most often the worst performer. This was expected: with the same number of inputs, this type of annotation labels fewer pixels in total than the other types.
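The input-counting rules above can be summarized in a short sketch (the dictionary layout and function names are illustrative, not the paper's actual implementation):

```python
import numpy as np

def image_inputs(annotation):
    """User inputs for one sparsely annotated image."""
    style = annotation["style"]
    if style == "points":
        # every clicked pixel counts, positive or negative
        return len(annotation["positive"]) + len(annotation["negative"])
    if style == "grid":
        # grid pixels start as negative; only the positives are user picks
        return len(annotation["positive"])
    if style == "regions":
        # each selected superpixel region counts, regardless of its class
        return len(annotation["regions"])
    raise ValueError(f"uncountable annotation style: {style}")

def task_inputs(support_sets_per_fold):
    """Sum inputs over each fold's support set, then average across folds."""
    return float(np.mean([sum(image_inputs(a) for a in fold)
                          for fold in support_sets_per_fold]))
```

These averaged counts are the x-axis values of the label efficiency plots.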

Fig. 14: Number of user inputs versus Jaccard score in the JSRT Lungs task.
Fig. 15: Number of user inputs versus Jaccard score in the Montesanto Coffee task.

Comparing these results with those of the previous section (Section V-A), we can draw some conclusions. The Grid annotation is a solid annotation type that can lead to good results and is usually among the best for medical cases. However, it is also the most labor-intensive type, requiring a larger number of user inputs. The Regions annotation is another solid option: when the superpixel segmentation produces clean regions that are easy to label, it can make annotation simpler and quicker while still producing precise models, especially for remote sensing tasks. The Points annotation demands the least from the user. It does not produce the best models, but it can lead to comparable results with far fewer inputs. This annotation also guarantees a balanced number of pixel samples per class during training, making model optimization easier. The other two types of annotations, Contours and Skeletons, appear as valid options as well. The way we designed their generation process makes it difficult to translate them into a countable number of user inputs, which is why they are not compared in this section. Nevertheless, the results presented in Section V-A show that Contours and Skeletons are suitable styles, especially Contours for the medical tasks and Skeletons for the remote sensing tasks.

VI Conclusion and Future Works

In this work, we proposed a method (ProtoSeg) for the problem of weakly supervised few-shot semantic segmentation and conducted extensive experiments on a similar, previously proposed method (WeaSeL [gama2021weakly]). Despite being common in shallow interactive segmentation methods, few-shot segmentation from sparse labels is still not fully integrated with the advances that Deep Learning brought to computer vision. We evaluated the two meta-learning methods in a large number of experiments to verify their generalization capabilities across multiple image modalities, numbers of shots, annotation types, and label densities. We focused the experiments on two areas that can benefit from few-shot sparsely labeled semantic segmentation: medical imaging and remote sensing. WeaSeL [gama2021weakly] obtained promising results, mainly in scenarios with a large domain shift between the target and source tasks. The proposed ProtoSeg method yielded reliable segmentation predictions when multiple closely related source datasets were available, as its good results appear to be correlated with the availability of similar tasks during training. The five annotation types evaluated in our experiments (Points, Grid, Contours, Skeletons, and Regions) have their own pros and cons. The Grid annotation proved to be highly reliable and produced some of the best results, even though it often requires more user intervention. Region annotations can be a more efficient option, but their usefulness is highly affected by the performance of the superpixel segmentation algorithm. Points annotations are the least user-demanding, but they also show the largest gaps to dense annotation scores. Contours and Skeletons appear as valid options for medical imaging and remote sensing tasks, respectively; however, more experiments must be conducted to confirm their efficiency in comparison to the other label modalities.
For future work, we intend to investigate adding spatial reasoning to the segmentation predictions, in order to account for the location and feature representation of a given pixel relative to the annotated pixels. Additionally, further experiments in other medical imaging tasks (e.g., other 2D X-ray exams, volumetric images) and remote sensing tasks (e.g., urban segmentation) will be conducted using both ProtoSeg and WeaSeL. Finally, we shall investigate whether real annotations are needed at all during the meta-training phase; instead, we plan to replace the sparse masks of organs and crops with automatically generated weakly supervised masks of regions obtained by shallow unsupervised segmentation algorithms.