Attentional Prototype Inference for Few-Shot Semantic Segmentation

05/14/2021 ∙ by Haoliang Sun, et al. ∙ Shandong University ∙ Beihang University ∙ University of Amsterdam

This paper aims to address few-shot semantic segmentation. While existing prototype-based methods have achieved considerable success, they suffer from uncertainty and ambiguity caused by limited labelled examples. In this work, we propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot semantic segmentation. We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution. The probabilistic modeling of the prototype enhances the model's generalization ability by handling the inherent uncertainty caused by limited data and intra-class variations of objects. To further enhance the model, we introduce a local latent variable to represent the attention map of each query image, which enables the model to attend to foreground objects while suppressing background. The optimization of the proposed model is formulated as a variational Bayesian inference problem, which is solved by amortized inference networks. We conduct extensive experiments on three benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art methods. We also provide comprehensive analyses and ablation studies to gain insight into the effectiveness of our method for few-shot semantic segmentation.


1 Introduction

Semantic segmentation [5] has been a fundamental problem in computer vision with widespread application potential in a great variety of areas, e.g., autonomous driving [7] and scene understanding [14]. Existing models based on deep convolutional neural networks and trained on massive amounts of manually labelled images, e.g., [6, 30], have obtained impressive results. However, since their performance relies heavily on access to a large number of pixel-wise annotations, it remains challenging to achieve desirable results in practice when the training data is scarce. Therefore, few-shot semantic segmentation [38, 10, 19] has emerged as a popular task to address the annotation scarcity issue of traditional semantic segmentation methods.

Few-shot segmentation generalizes the idea of few-shot classification under the meta-learning setting. In meta-learning, the dataset is split into meta-training, meta-validation, and meta-testing sets. Few-shot segmentation methods usually sample an episode consisting of a support set and a query set from these meta-sets to train a learning procedure that takes the support set as input and produces predictions on the query set. The model can then adapt effectively to new tasks at test time. In this paper, we focus on few-shot semantic segmentation, where we aim to segment the object of an unseen category in a query image with the support of only a few annotated images.

Fig. 1: Deterministic model (previous work) vs. probabilistic model (this work). The deterministic model embeds the support set into a single deterministic vector as the prototype, measuring the distance between the prototype vector and the feature vectors of pixels in the query image. The deterministic prototype tends to be biased and lacks the ability to represent categorical concepts. The proposed probabilistic model infers the distribution of the prototype, which is treated as a latent variable, from the support set. The probabilistic prototype is more expressive of categorical concepts and endows the model with better generalization to unseen objects.

To alleviate the scarcity of annotated data, most existing works extract supervisory information about the objects from a small set of support images. Inspired by prototype theory from cognitive science [37, 60] and prototype networks for few-shot classification [42], many semantic segmentation models are designed to learn a prototype vector to represent a category in the support set [32, 56, 10, 49, 40]. The optimization goal is then to obtain a shared feature extractor that generalizes to the segmentation of new objects [10]. While prototype-based methods have shown great efficiency in few-shot segmentation, they suffer from three major deficiencies. 1) Existing methods map the support images into a deterministic prototype vector, which is often ambiguous and vulnerable to noise under the few-shot setting, especially for one-shot learning tasks. 2) The prototype vector loses the structure information of the object in the query image. 3) A deterministic prototype vector contains only first-order statistics, which are unable to represent large intra-class variations of objects in the same category. To address these issues, we propose a probabilistic latent variable framework, referred to as attentional prototype inference (API), for few-shot semantic segmentation.

We make three contributions in this work. (1) We provide the first probabilistic framework for few-shot semantic segmentation. We introduce a global latent variable into this model, which represents the prototype of each object category. The probabilistic framework models the prototype as a distribution rather than a vector, making it more robust to noise and better equipped to deal with ambiguity than a deterministic model. The probabilistic prototype also better represents object categories by effectively capturing intra-class variations. The advantage of probabilistic prototypes over deterministic ones is illustrated in Figure 1. (2) We introduce a variational attention mechanism. We define the attention vector as a local latent variable associated with each image and infer its probabilistic distribution, which is jointly estimated within the same framework as the variational prototype inference. The variational attention essentially enables the model to capture the appearance variation of an object. (3) We formulate the optimization as a variational inference problem to jointly estimate the posteriors over the latent variables. The optimization objective is built upon a newly derived evidence lower bound (ELBO), which fits the few-shot segmentation problem well and offers a principled way to model prototypes and attention for few-shot semantic segmentation.

A preliminary conference version of this work was presented in [47]. In this extended version, we add a variational attention mechanism into the probabilistic model to further improve semantic segmentation. In contrast to most previous works [53, 19, 39], we model attention as a latent variable by variational inference, which is seamlessly integrated into the probabilistic framework of prototypes.

To evaluate our attentional prototype inference, we conduct extensive experiments on three benchmarks, i.e., PASCAL-$5^i$ [38], COCO-$20^i$ [26] and FSS-1000 [51]. The comparison results show that our attentional prototype inference achieves at least competitive and often better performance than state-of-the-art methods on both the 1-shot and 5-shot semantic segmentation tasks, demonstrating its effectiveness for few-shot semantic segmentation. We also conduct ablation studies to gain insight into the proposed attentional prototype inference by demonstrating the benefit of the different model components to the overall performance.

2 Related Work

2.1 Many-Shot Semantic Segmentation

Semantic segmentation aims to segment a given image into several pre-defined classes and is often regarded as a pixel-level classification task [5]. State-of-the-art semantic segmentation methods based on deep convolutional neural networks [30, 58, 6, 36, 1, 25] have achieved astonishing success. The fully convolutional network (FCN) [30] was the first model to introduce end-to-end convolutional neural networks into segmentation tasks. The essential innovation in FCN is replacing the fully-connected layer with a fully convolutional architecture to preserve spatial information for better performance. Follow-up efforts have attempted to aggregate multiple pixels to explicitly model context. For example, DeepLab [6] introduces a dilated convolution operation to enlarge the receptive field while maintaining the resolution, and PSPNet [57] employs a pyramid pooling module to aggregate multi-scale context information. U-Net [36] adapts the encoder-decoder architecture to merge features from different network layers, while CRF-RNN applies differentiable conditional random fields (CRFs) [61, 29] to recover detailed visual structures.

More recently, several approaches tackle semantic segmentation beyond elaborating complex architectures. For instance, Esser et al. [11] introduced a variational encoder to integrate shape information into the image feature encoding. Wang et al. [50] employed a self-attention mechanism to exchange context between paired pixels so as to capture non-local information. Apart from exploiting more powerful feature representations, another line of work has turned to designing more advanced loss functions that directly exploit segmentation structure, replacing the cross-entropy loss with, for example, the Lovász loss [2] or the region mutual information loss [59].

Though they achieve impressive performance, these methods rely heavily on large numbers of parameters and training samples with pixel-level annotations. However, pixel-level annotations are expensive and difficult to obtain. Moreover, deep semantic segmentation models usually perform modestly on new categories of objects that are unseen in the training set, which restricts their use in practical applications.

2.2 Few-Shot Semantic Segmentation

Few-shot semantic segmentation aims to segment images from arbitrary classes by learning transferable knowledge from scarce annotated support images, and has recently gained popularity in computer vision due to its promise in practical applications. Shaban et al. [38] introduced the first few-shot segmentation network based on a two-branch architecture, which uses a support branch to predict the parameters of the last layer of the query branch for segmentation. Recent works [10, 53, 56, 34, 40] also follow this two-branch architecture for few-shot semantic segmentation. Dong and Xing [10] generalized the idea of prototype networks [42] from few-shot recognition to few-shot segmentation. They designed PLNet, in which the first branch takes images and annotations as input and outputs a prototype vector, while the second branch takes both a new image and the prototype as input and outputs the segmentation mask. Since then, prototype-based methods have been further developed using different strategies [56, 34, 40, 49, 54]. Rakelly et al. [34] concatenated the pooled support features and the query image to generate the segmentation maps. Zhang et al. [56] introduced a masked average pooling operation to extract a representative prototype vector from the support images and then estimated the cosine similarity between the extracted vector and the query feature map to predict the segmentation map. These works have demonstrated the effectiveness of prototype learning for few-shot semantic segmentation. However, a deterministic prototype vector is not sufficiently representative to capture the categorical concepts of objects, and can therefore cause bias and reduced generalization when objects in the same category vary. Hu et al. [19] adopted spatial attention for the integration of multi-scale context features between support and query images, which highlights context information from several scales. The democratized graph attention mechanism [48] was introduced to establish robust correspondence between the support set and query images, which coincides with the idea in [27]. To achieve a more sufficient representation of the class prototype, Yang et al. [52] designed prototype mixture models to correlate diverse image regions with multiple prototypes. Similar ideas were proposed in [28, 39] around the same period. Tian et al. [45] proposed a feature enrichment module to overcome spatial inconsistency. Other works [13, 44] conjoin the few-shot learning paradigm with instance segmentation. In this work, we develop a variational attention mechanism by placing a distribution over the attention vector, which enables the model to better capture the appearance variation of individual objects.

2.3 Variational Inference

Variational inference (VI) [20, 4] efficiently approximates the posterior densities of unknown quantities by maximizing the evidence lower bound (ELBO) given observations. The variational auto-encoder (VAE) [22, 35] is a generative model that introduces variational inference into the learning of directed graphical models. Sohn et al. [43] developed the conditional variational autoencoder (C-VAE) by extending the VAE into a conditional generative model for supervised learning.

Probabilistic models have been introduced to few-shot learning to handle the uncertainty caused by scarce training data [55, 16, 60]. Finn et al. [16] proposed a probabilistic meta-learning algorithm by extending model-agnostic meta-learning [15] into a probabilistic framework. Their model incorporates a parameter distribution that is trained via a variational lower bound, and handles uncertainty by sampling from this inferred distribution. VI has also been broadly applied in segmentation tasks. For example, Kohl et al. proposed the probabilistic U-Net [24], which combines the C-VAE with U-Net [36] for medical image segmentation. It learns a distribution over the segmentation masks to handle ambiguities, which are especially common in medical images. An extended version [23] decomposed the latent space using a hierarchical graphical model, implemented with the U-Net architecture. Dalca et al. [8] employed an auto-encoding variational CNN to characterize the anatomical prior, achieving unsupervised biomedical segmentation. Zhang et al. [55] deployed a latent variable to denote the distribution of the entire dataset, which is inferred from the support set. They also showed that their variational learning strategy can be modified to classify proposals for instance segmentation [31].

We address few-shot semantic segmentation based on prototypes using a probabilistic latent variable model. We treat the prototype that represents the concept of the object category as a global latent variable, which is modelled as a distribution instead of a single deterministic vector. We further introduce a local latent variable to generate the attention map, which is learned for each image to highlight foreground objects. We optimize the whole model by variational Bayesian inference, in which the latent variables are jointly learned in the same framework.

3 Methodology

We adopt the meta-learning setting to conduct few-shot segmentation. We learn a segmentation model on the meta-training set $\mathcal{D}_{\mathrm{train}}$ and then perform evaluation on the meta-testing set $\mathcal{D}_{\mathrm{test}}$. Different from traditional semantic segmentation, there is no overlap between the object categories in $\mathcal{D}_{\mathrm{train}}$ and $\mathcal{D}_{\mathrm{test}}$. To achieve few-shot semantic segmentation, we follow the episodic paradigm [46] for training and testing under a $K$-shot setting, where $K$ denotes the number of training images in an episode. In practice, we sample one episode at a time from $\mathcal{D}_{\mathrm{train}}$ for training or $\mathcal{D}_{\mathrm{test}}$ for evaluation. Each episode is composed of a support set $\mathcal{S} = \{(x_s^k, y_s^k)\}_{k=1}^{K}$ and a query set $\mathcal{Q} = \{(x, y)\}$. Here, $x_s^k \in \mathbb{R}^{H \times W \times 3}$ denotes a support image with height $H$ and width $W$, and $y_s^k$ is its corresponding support mask. Similarly, $x$ is the query image and $y$ is the associated ground-truth mask of the object to be segmented. The goal of the few-shot segmentation model is to extract transferable knowledge from the support set $\mathcal{S}$ and apply it to the segmentation of a query image $x$. The predicted segmentation map for $x$ is denoted as $\hat{y}$.
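To make the episodic setup concrete, the sketch below samples one $K$-shot episode; the `dataset` interface and function name are hypothetical stand-ins for an actual data pipeline, not our implementation.

```python
import random

def sample_episode(dataset, classes, k_shot):
    """Sample one K-shot episode: a support set of K (image, mask) pairs
    and a single query pair, all drawn from the same object category.

    `dataset[c]` is assumed to be a list of (image, mask) pairs for class c;
    this interface is hypothetical and stands in for the real data loader.
    """
    c = random.choice(classes)                     # pick one category for this episode
    pairs = random.sample(dataset[c], k_shot + 1)  # K support pairs + 1 query pair
    support_set = pairs[:k_shot]                   # S = {(x_s^k, y_s^k)}_{k=1..K}
    query_image, query_mask = pairs[k_shot]        # Q = {(x, y)}
    return support_set, query_image, query_mask
```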

3.1 Attentional Prototype Inference

From a probabilistic perspective, the purpose of few-shot semantic segmentation is to estimate the conditional predictive distribution $p(\hat{y} \mid x, \mathcal{S})$ over the segmentation map $\hat{y}$ for a given query image $x$, when provided with the support set $\mathcal{S}$.

3.1.1 Probabilistic Modelling

We introduce a latent variable $z$ to represent the class prototype, which is conditioned on the corresponding support set. By incorporating the latent variable, we have the conditional predictive log-likelihood as follows:

$\log p(\hat{y} \mid x, \mathcal{S}) = \log \int p(\hat{y} \mid x, z)\, p(z \mid \mathcal{S})\, dz$   (1)

where $p(z \mid \mathcal{S})$ is a conditional prior. The model in (1) provides a probabilistic modelling of prototypes for semantic segmentation, which was introduced in our preliminary work [47]. In this way, our prototype serves as a global representation of an object category, although, like previous prototypes, it does not take into account the local spatial structure of the image [10].

To further enhance the model in (1), we introduce a local latent variable $m$ to represent the attention map associated with each image, which highlights the foreground object. The conditional predictive log-likelihood with respect to the two latent variables takes the following form:

$\log p(\hat{y} \mid x, \mathcal{S}) = \log \iint p(\hat{y} \mid x, z, m)\, p(z \mid \mathcal{S})\, p(m \mid x)\, dz\, dm$   (2)

where we also deploy a conditional prior $p(m \mid x)$ for $m$, since the attention map should be specific to each individual query image $x$.

However, these posteriors are intractable in practice. Thus, we introduce variational posteriors to approximate the true posteriors by minimizing their Kullback-Leibler (KL) divergence. We employ the variational distributions $q_\phi(z \mid x, y, \mathcal{S})$ and $q_\theta(m \mid x, y)$ for the prototype $z$ and the attention map $m$, respectively. By incorporating the variational posteriors into the conditional log-likelihood of (2), we arrive at:

$\log p(\hat{y} \mid x, \mathcal{S}) = \log \iint q_\phi(z \mid x, y, \mathcal{S})\, q_\theta(m \mid x, y)\, \dfrac{p(\hat{y} \mid x, z, m)\, p(z \mid \mathcal{S})\, p(m \mid x)}{q_\phi(z \mid x, y, \mathcal{S})\, q_\theta(m \mid x, y)}\, dz\, dm$   (3)

Applying Jensen's inequality gives rise to the ELBO as follows:

$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q_\phi(z)\, q_\theta(m)}\big[\log p(\hat{y} \mid x, z, m)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, y, \mathcal{S}) \,\|\, p(z \mid \mathcal{S})\big) - D_{\mathrm{KL}}\big(q_\theta(m \mid x, y) \,\|\, p(m \mid x)\big)$   (4)

The first term of the ELBO is the expectation of the log-likelihood of the conditional generative distribution $p(\hat{y} \mid x, z, m)$ based on the inferred prototypes $z$ and attention maps $m$. The second term is the KL divergence between the estimated posterior distribution $q_\phi(z \mid x, y, \mathcal{S})$ and the prior distribution $p(z \mid \mathcal{S})$; minimizing this term encourages the model to leverage the object information in the support set for segmentation of the query image. Minimizing the third term, the KL divergence for the attention variable, enables the model to generate attention maps that highlight the foreground object. We derive the optimization objective based on this ELBO. A graphical illustration of attentional prototype inference is shown in Figure 2.
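As a minimal sketch of how the three ELBO terms combine into a training loss, assuming the inference networks already output the diagonal-Gaussian posteriors and priors as `torch.distributions.Normal` objects (the interfaces are illustrative, not our released code):

```python
import torch.nn.functional as F
from torch.distributions import kl_divergence

def elbo_loss(logits, query_mask, q_z, p_z, q_m, p_m):
    """Negative ELBO of (4) for one episode.

    logits:     segmentation logits from sampled (z, m), shape (1, C, H, W)
    query_mask: ground-truth mask y, shape (1, H, W), dtype long
    q_z, p_z:   Normal posterior q(z|x,y,S) and prior p(z|S) over the prototype
    q_m, p_m:   Normal posterior q(m|x,y) and prior p(m|x) over the attention vector
    """
    # First term: expected log-likelihood, realized as pixel-wise cross-entropy.
    recon = F.cross_entropy(logits, query_mask)
    # Second and third terms: KL between each posterior and its conditional prior,
    # available in closed form for diagonal Gaussians.
    kl_z = kl_divergence(q_z, p_z).sum()
    kl_m = kl_divergence(q_m, p_m).sum()
    return recon + kl_z + kl_m
```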

Fig. 2: Illustration of attentional prototype inference as a directed graphical model. $\mathcal{S}$ and $\mathcal{Q}$ are the support and query set of an episode. $(x, y)$ is the image and mask pair from the query set. $z$ is the prototype associated with each episode and $m$ is the attention variable associated with each image. The dashed lines indicate inference operations, while the solid lines denote conditional relationships between variables.

3.1.2 Optimization Objective

Maximizing the ELBO yields accurate predictions for the segmentation masks and narrows the gap between the posterior and the prior distributions. This encourages 1) the prototype inferred from the query pair together with the support set to match the one inferred from the support images alone, by minimizing the first KL term; and 2) the attention map inferred from the query pair to approach the one based merely on the query image, by minimizing the second KL term. Based on the ELBO, we define the empirical objective function for optimization.

Given a batch of episodes, the empirical objective for stochastic optimization with Monte Carlo estimates of the expectations is as follows:

$\hat{\mathcal{L}} = \sum_{t=1}^{T} \Big[ \frac{1}{L_z L_m} \sum_{l=1}^{L_z} \sum_{j=1}^{L_m} \log p\big(y_t \mid x_t, z^{(l)}, m^{(j)}\big) - D_{\mathrm{KL}}\big(q_\phi(z \mid x_t, y_t, \mathcal{S}_t) \,\|\, p(z \mid \mathcal{S}_t)\big) - D_{\mathrm{KL}}\big(q_\theta(m \mid x_t, y_t) \,\|\, p(m \mid x_t)\big) \Big]$   (5)

where $t$ indexes the sampled episodes in the meta-training set $\mathcal{D}_{\mathrm{train}}$, and $z^{(l)}$ and $m^{(j)}$ are the variables sampled from their variational distributions.

$L_z$ and $L_m$ are the numbers of samples. Generally, the parameters of a neural network are optimized jointly with stochastic gradient descent. However, it is usually intractable to calculate the gradient of the sampling operation. Therefore, we deploy the reparameterization trick [22] to handle the non-differentiability of the sampling process. Specifically, supposing the two variational posterior distributions take the form of a multivariate Gaussian with a diagonal covariance, the sampling process is formulated as:

$z \sim q_\phi(z \mid x, y, \mathcal{S}) = \mathcal{N}\big(\mu_z, \mathrm{diag}(\sigma_z^2)\big), \quad m \sim q_\theta(m \mid x, y) = \mathcal{N}\big(\mu_m, \mathrm{diag}(\sigma_m^2)\big)$   (6)

During training, the samples of the class prototype are obtained by:

$z^{(l)} = \mu_z + \sigma_z \odot \epsilon^{(l)}$   (7)

where $\odot$ denotes an element-wise multiplication and $\epsilon^{(l)} \sim \mathcal{N}(0, I)$. The same operation is also deployed for sampling $m$.
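The reparameterized sampling in (7) takes only a few lines; this sketch assumes the inference network outputs the mean and the log-variance of the diagonal Gaussian, a common but here assumed parameterization:

```python
import torch

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that gradients can
    flow through mu and log_var despite the stochastic sampling step."""
    sigma = torch.exp(0.5 * log_var)   # standard deviation from log-variance
    eps = torch.randn_like(sigma)      # auxiliary noise, eps ~ N(0, I)
    return mu + sigma * eps            # element-wise multiplication as in (7)
```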

The first term of the empirical loss in (5) is implemented as a least-squares loss in [22]; we generalize this to a pixel-wise cross-entropy loss to penalize the difference between the predicted segmentation map $\hat{y}$ and the ground truth $y$. The numbers of samples $L_z$ and $L_m$ are set to 1 during training to speed up the learning process. Since the KL terms minimize the discrepancy between two distributions, the prior networks learn to mimic the behavior of the posterior networks, which produce effective prototypes and attention maps at training time.

Fig. 3: Attentional prototype inference for one-shot semantic segmentation, implemented as amortized neural networks with an auto-encoder architecture. The prior nets and the posterior nets map the input images into the latent space, producing the prior distributions $p(z \mid \mathcal{S})$ and $p(m \mid x)$ and the posteriors $q_\phi(z \mid x, y, \mathcal{S})$ and $q_\theta(m \mid x, y)$. The segmentation net takes the query image $x$, the sampled prototype vector $z$ and the attention map $m$ to generate a distribution over the segmentation map: $p(\hat{y} \mid x, z, m)$. Monte Carlo estimation is employed to predict an accurate segmentation map by averaging multiple outputs.

3.1.3 Segmentation Map Inference

The inference of segmentation maps differs between the learning and inference stages. At test time, instead of sampling from the variational posterior distributions, we draw $L_z$ samples of prototypes from the prior $p(z \mid \mathcal{S})$ and $L_m$ samples of attention vectors from $p(m \mid x)$. The prediction $\hat{y}$ is obtained by taking the average of the segmentation maps from these samples:

$\hat{y} = \frac{1}{L_z L_m} \sum_{l=1}^{L_z} \sum_{j=1}^{L_m} p\big(\hat{y} \mid x, z^{(l)}, m^{(j)}\big)$   (8)

where

$z^{(l)} \sim p(z \mid \mathcal{S})$   (9)

and

$m^{(j)} \sim p(m \mid x)$   (10)

For the $K$-shot setting, we adopt a variance-weighted average strategy [47] that generates a prior for each of the $K$ support pairs and aggregates those priors weighted by their variances.
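A plausible realization of this variance-weighted aggregation is sketched below, fusing the $K$ per-support-pair Gaussian priors by their inverse variances so that more confident priors contribute more; the exact weighting scheme of [47] may differ in detail.

```python
import torch

def fuse_priors(mus, sigmas):
    """Variance-weighted fusion of the K per-support-pair Gaussian priors.

    mus, sigmas: tensors of shape (K, D) holding the mean and standard
    deviation predicted from each support pair. This sketch weights each
    prior by its inverse variance (the product-of-Gaussians rule), which
    is one plausible reading of the variance-weighted average.
    """
    precision = 1.0 / (sigmas ** 2 + 1e-8)       # per-dimension inverse variance
    weights = precision / precision.sum(dim=0)   # normalize over the K priors
    mu = (weights * mus).sum(dim=0)              # fused mean
    var = 1.0 / precision.sum(dim=0)             # fused variance
    return mu, var.sqrt()
```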

3.2 Neural Network Implementation

We implement our attentional prototype inference with neural networks using the amortization technique [22], seamlessly integrated into an auto-encoder architecture, as shown in Figure 3. We parameterize the distributions as factorized Gaussian distributions with diagonal covariance matrices.

Prior Networks. The prior network for prototypes embeds the support set into a function space, where the conditional prior distribution $p(z \mid \mathcal{S})$ lies. The prior network for attention maps is encouraged to generate an effective attention map via $p(m \mid x)$. For the prior network of prototypes, we construct a multi-layer perceptron (MLP) with three fully connected layers. The extracted deep features of the support images are selected with their segmentation masks to obtain the foreground features. A permutation-invariant pooling layer [40] then squeezes them into a single vector $v$. In this work, we assume that the prior follows a Gaussian distribution with diagonal covariance. Given the single feature vector, the mean and variance w.r.t. $z$ come from the output of the MLP:

$\mu_z, \sigma_z = \mathrm{MLP}(v)$   (11)

The main spirit behind the prior network of the attention maps is similar, but we employ a transformer architecture [3] to extract the structure information with high fidelity. This prior transformer contains pixel-level self-attention and aggregates all pixels into a vector. A two-layer perceptron is then attached to this output to produce $\mu_m$ and $\sigma_m$ w.r.t. $m$. The sampled $m$ is directly used for computing the attention map.
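A minimal sketch of the amortized prior head in (11): an MLP with three fully connected layers maps the pooled foreground vector $v$ to the mean and log-variance of the prototype distribution. The layer widths, latent dimension, and log-variance parameterization are illustrative assumptions.

```python
import torch.nn as nn

class PrototypePrior(nn.Module):
    """Maps the pooled foreground feature v to (mu_z, log_var_z) of p(z|S)."""

    def __init__(self, feat_dim=256, latent_dim=128):
        super().__init__()
        # Three fully connected layers, as described for the prior MLP;
        # the hidden width and latent size here are illustrative choices.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # concatenated mean and log-variance
        )

    def forward(self, v):
        mu, log_var = self.mlp(v).chunk(2, dim=-1)  # split into the two halves
        return mu, log_var
```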

Posterior Networks. As demonstrated in Figure 3, the posterior network for prototypes learns to approximate the true posterior distribution given a query pair $(x, y)$ and the support set $\mathcal{S}$. The posterior network for attention maps is trained in a similar way but uses only the query pair. The posterior networks have the same architecture as the prior network of prototypes; however, their outputs come from aggregating the outputs over all $K{+}1$ pairs of inputs. The attention posterior network generates an attention vector $m$ from its input pair. We then compute the cosine similarity between the attention vector and the feature embedding at the pixel level to construct an attention map $M$:

$M(i, j) = \dfrac{m \cdot F(i, j)}{\|m\|\, \|F(i, j)\|}$   (12)

where $F(i, j)$ denotes the feature embedding at pixel $(i, j)$.
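The pixel-level cosine map in (12) can be computed in a vectorized way; the sketch below assumes a batched attention vector of shape (B, C) and a feature map of shape (B, C, H, W), which are illustrative conventions rather than our exact tensor layout.

```python
import torch
import torch.nn.functional as F

def attention_map(m, feat):
    """Cosine similarity between the attention vector m (B, C) and the
    feature embedding feat (B, C, H, W) at every pixel, giving M (B, H, W)."""
    m = F.normalize(m, dim=1)                      # unit-norm attention vector
    feat = F.normalize(feat, dim=1)                # unit-norm pixel embeddings
    return torch.einsum('bc,bchw->bhw', m, feat)   # per-pixel cosine similarity
```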

Segmentation Network. Finally, the segmentation network takes the query image $x$, the sampled prototype vector $z$, and the sampled attention map as inputs to predict the segmentation map $\hat{y}$, which yields a Monte Carlo estimation of the conditional generative distribution $p(\hat{y} \mid x, z, m)$. Once we sample an attention vector from its distribution, we multiply the resulting attention map with the deep feature of the query image to obtain an enhanced, structured embedding. The segmentation net concatenates this attentive embedding with the prototype vector sampled from the prior (see Figure 3) and produces the output segmentation map:

$\hat{y} = f_{\mathrm{seg}}\big([\, M \odot F(x);\ z \,]\big)$   (13)

At testing time, API generates multiple samples of the prototype and the attention map, achieving a more accurate prediction through the ensemble of all outputs:

$\hat{y} = \frac{1}{L_z L_m} \sum_{l=1}^{L_z} \sum_{j=1}^{L_m} f_{\mathrm{seg}}\big([\, M^{(j)} \odot F(x);\ z^{(l)} \,]\big)$   (14)

The segmentation network adopts a multi-layer skip-connection structure [36] to incorporate more spatial information. In addition, a CNN-based encoder for the feature embedding is shared by the prior, posterior, and segmentation networks. All networks are jointly optimized end-to-end by minimizing the objective in (5). The source code will be made publicly available.
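Putting the test-time procedure of (8)–(14) together, a sketch of the Monte Carlo prediction loop is given below; `prior_z`, `prior_m` and `seg_net` are hypothetical modules with the interfaces indicated in the comments, not our released implementation.

```python
import torch

@torch.no_grad()
def predict(query_image, support_set, prior_z, prior_m, seg_net, L_z=15, L_m=15):
    """Monte Carlo prediction: average the segmentation maps produced by
    L_z prototype samples z ~ p(z|S) and L_m attention samples m ~ p(m|x).
    The prior nets are assumed to return torch.distributions.Normal objects,
    and seg_net to return per-pixel segmentation probabilities."""
    p_z = prior_z(support_set)     # Normal over the prototype z
    p_m = prior_m(query_image)     # Normal over the attention vector m
    maps = [seg_net(query_image, p_z.sample(), p_m.sample())
            for _ in range(L_z) for _ in range(L_m)]
    return torch.stack(maps).mean(dim=0)   # ensemble of all L_z * L_m outputs
```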

4 Experiments

1-shot 5-shot
Backbone fold-0 fold-1 fold-2 fold-3 mean fold-0 fold-1 fold-2 fold-3 mean
OSLSM [38] VGG16 33.6 55.3 40.9 33.5 40.8 35.9 58.1 42.7 39.1 43.9
Co-FCN [34] VGG16 36.7 50.6 44.9 32.4 41.1 37.5 50.0 44.1 33.9 41.4
AMP [40] VGG16 41.9 50.2 46.7 34.7 43.4 40.3 55.3 49.9 40.1 46.4
SG-One [56] VGG16 40.2 58.4 48.4 38.4 46.3 41.9 58.6 48.6 39.4 47.1
PANet [49] VGG16 42.3 58.0 51.1 41.2 48.1 51.8 64.6 59.8 46.5 55.7
CANet [54] ResNet50 52.5 65.9 51.3 51.9 55.4 55.5 67.8 51.9 53.2 57.1
PGNet [53] ResNet50 56.0 66.9 50.6 50.4 56.0 57.7 68.7 52.9 54.6 58.5
PMM [52] ResNet50 55.2 66.9 52.6 50.7 56.3 56.3 67.3 54.5 51.0 57.3
CRNet [27] ResNet50 - - - - 55.7 - - - - 58.8
FWB [32] ResNet101 51.3 64.5 56.7 52.2 56.2 54.8 67.4 62.2 55.3 59.9
API (This paper) ResNet101 54.4 67.1 53.8 54.1 57.4 60.1 68.5 55.6 58.7 60.7
TABLE I: Comparison with state-of-the-art in terms of Class-IoU on PASCAL-$5^i$.
Backbone 1-shot 5-shot
Co-FCN [34] VGG16 60.1 60.2
AMP [40] VGG16 60.1 62.1
PL+SEG [10] VGG16 61.2 62.3
A-MCG [19] VGG16 61.2 62.2
OSLSM [38] VGG16 61.3 61.5
SG-One [56] VGG16 63.9 65.9
PANet [49] VGG16 66.5 70.7
CANet [54] ResNet50 66.2 69.6
PGNet [53] ResNet50 69.9 70.5
API (This paper) ResNet101 71.3 73.2
TABLE II: Comparison with the state-of-the-art in terms of Binary-IoU on PASCAL-$5^i$.

4.1 Datasets and Implementation Details

We conduct experiments on three commonly used few-shot semantic segmentation benchmarks, including PASCAL-$5^i$ [38], COCO-$20^i$ [19] and FSS-1000 [51]. We provide detailed descriptions of these datasets and the associated experimental settings as follows.

4.1.1 Datasets

PASCAL-$5^i$ originates from PASCAL VOC12 [12] and extends its annotations with SDS [17]. We follow the settings in [38], splitting the 20 original classes into four folds and conducting cross-validation among them. Specifically, we select 15 classes for training, while the remaining 5 classes are used for testing. For a fair comparison, we adopt the same strategy as [52], randomly sampling 1,000 episodes of support-query pairs for evaluation.

COCO-$20^i$ [19] is a challenging dataset built upon MS-COCO [26] with 80 object categories. We also divide the 80 classes into four folds and conduct four-fold cross-validation. Under the same settings as PASCAL-$5^i$, 60 object categories are selected for training, while the remaining 20 categories are used for testing. In each fold, we sample 1,000 support-query pairs from the 20 testing classes for evaluation, following [52].

FSS-1000 [51] is a specialized few-shot segmentation dataset. It contains 1,000 object categories including 520 classes for training, 240 classes for validation, and 240 classes for testing. Following [51], we choose the same 240 categories for testing and train the model on the specified 520 classes. The number of testing episodes is 1,000.

4.1.2 Implementation Details

We adopt a ResNet101 [18] backbone pre-trained on ImageNet [9] as the encoder. The decoder is designed as a skip-connection structure [36] composed of three convolutional blocks to generate segmentation maps. Through the skip connections, each block receives as input the concatenation of the corresponding encoded feature and the decoding features. Data augmentation is applied to the support and query images in the training stage; the augmentation operations include Gaussian blurring, random cropping, horizontal flipping, rotation, and scaling. We choose Adam [21] as the optimizer and train the model on an NVIDIA Tesla V100 for around 40,000 iterations. Separate fixed learning rates are used for the backbone and the other layers, and the batch normalization (BN) layers are frozen during training. The numbers of samples $L_z$ and $L_m$ are set to 15 during the test phase, which is analyzed in detail in our ablation study in Section 4.3.1.

We adopt the same metrics as [38, 34] for evaluation, i.e., Class-IoU and Binary-IoU. Class-IoU measures the intersection-over-union $\mathrm{TP}_c / (\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c)$ for each class, where $\mathrm{TP}_c$, $\mathrm{FP}_c$ and $\mathrm{FN}_c$ are the numbers of true positive, false positive and false negative pixels of the predicted segmentation masks for foreground category $c$. Binary-IoU measures the IoU between the foreground and background pixels, where all object classes are treated as foreground.
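For reference, the two metrics can be computed from pixel counts as sketched below; predictions and ground-truth masks are assumed to be integer label maps with 0 denoting background, and Binary-IoU is realized here as the mean of foreground and background IoU, a common but here assumed convention.

```python
import numpy as np

def class_iou(preds, masks, num_classes):
    """Mean IoU over foreground classes: IoU_c = TP_c / (TP_c + FP_c + FN_c)."""
    ious = []
    for c in range(1, num_classes + 1):            # class 0 is background
        tp = fp = fn = 0
        for pred, mask in zip(preds, masks):       # accumulate counts over episodes
            tp += np.sum((pred == c) & (mask == c))
            fp += np.sum((pred == c) & (mask != c))
            fn += np.sum((pred != c) & (mask == c))
        ious.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(ious))

def binary_iou(preds, masks):
    """Average of foreground IoU and background IoU, all classes as foreground."""
    fg, bg = [], []
    for pred, mask in zip(preds, masks):
        fg.append(np.sum((pred > 0) & (mask > 0)) / max(np.sum((pred > 0) | (mask > 0)), 1))
        bg.append(np.sum((pred == 0) & (mask == 0)) / max(np.sum((pred == 0) | (mask == 0)), 1))
    return float(np.mean(fg + bg))
```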

Fig. 4: Visualization of 1-shot segmentation results on PASCAL-$5^i$. From top to bottom: support image, ground truth, and prediction. Our method achieves accurate segmentation maps for query images, even in cases of considerable variation of objects from support to query images.
Fig. 5: Visualization of 1-shot segmentation results on COCO-$20^i$. Our method successfully predicts the segmentation maps for query images, though the objects in the query image differ considerably from those in the support images in terms of appearance, size and viewpoint.
Fig. 6: Visualization of 1-shot segmentation results on FSS-1000. Our method is able to accurately segment objects in challenging scenarios, including those with variations in background and appearance.

4.2 Comparison with State-of-the-Arts

4.2.1 Performance on PASCAL-$5^i$

In Table I, we compare the performance of API with the state-of-the-art on PASCAL-$5^i$ in terms of the Class-IoU metric. API outperforms the other methods by good margins under both the 1-shot and 5-shot settings (57.4% and 60.7% mean Class-IoU). The performance improvement of API is most notable under the 1-shot setting, which is more challenging than the 5-shot setting due to the much larger effect of intra-class variation. The Monte Carlo estimation in our probabilistic model serves as an ensemble over the prediction results, which accounts for the robustness of API in the 1-shot case. We also evaluate the model in terms of Binary-IoU in Table II. Our model again yields state-of-the-art performance under both the 1-shot and 5-shot settings, with 71.3% and 73.2%, respectively.

Some qualitative results on PASCAL-$5^i$ are visualized in Figure 4. The proposed API is capable of producing accurate segmentation under various challenging scenarios, where the query images vary in appearance and object size from the associated support images. For instance, in the fifth column, the size and viewpoint of the bus in the query image are significantly different from those of the annotated object in the support image; in the eighth column, the annotated boat in the support image is much smaller than the one in the query image.

Class-IoU Binary-IoU
Backbone 1-shot 5-shot 1-shot 5-shot
PANet [49] VGG16 20.9 29.7 59.2 63.0
PMM [52] ResNet50 30.6 35.5 - -
A-MCG [19] ResNet101 - - 52.0 54.7
FWB [32] ResNet101 21.2 23.7 - -
API (This paper) ResNet101 31.2 35.9 61.4 62.3
TABLE III: Comparison of two metrics on COCO-$20^i$.
Positive-IoU
Backbone 1-shot 5-shot
OSLSM [38] VGG16 70.3 73.0
Co-FCN [33] VGG16 71.9 74.3
FSS-1000 [51] VGG16 73.5 80.1
API (This paper) VGG16 83.4 85.3
API (This paper) ResNet101 85.6 88.0
TABLE IV: Comparison in terms of Positive-IoU on FSS-1000.
VGG ResNet50 ResNet101
Class-IoU Binary-IoU Class-IoU Binary-IoU Class-IoU Binary-IoU
k-shot 1 5 1 5 1 5 1 5 1 5 1 5
Deterministic model 51.6 53.1 64.1 65.3 54.1 57.4 65.7 68.9 55.4 58.7 66.3 69.9
API (This paper) 52.7 55.6 65.2 66.0 55.9 59.8 69.3 71.7 57.4 60.7 71.3 73.2
TABLE V: Benefit of probabilistic modeling on PASCAL-$5^i$. The proposed probabilistic modeling shows consistent advantages over the deterministic model in terms of both metrics, with different backbone networks, and under both the 1-shot and 5-shot settings.
Fig. 7: Visualization of the attention maps. The foreground object regions are highlighted, which helps achieve better prediction of segmentation maps.
PASCAL-$5^i$ 1-shot C-IoU PASCAL-$5^i$ 5-shot C-IoU FSS 1-shot
fold-0 fold-1 fold-2 fold-3 mean B-IoU fold-0 fold-1 fold-2 fold-3 mean B-IoU P-IoU
VPI 53.4 65.6 57.3 52.9 57.3 70.3 55.8 67.5 62.6 55.7 60.4 72.1 84.3
API 54.4 67.1 53.8 54.1 57.4 71.3 60.1 68.5 55.6 58.7 60.7 73.2 85.6
COCO-$20^i$ 1-shot C-IoU COCO-$20^i$ 5-shot C-IoU FSS 5-shot
fold-0 fold-1 fold-2 fold-3 mean B-IoU fold-0 fold-1 fold-2 fold-3 mean B-IoU P-IoU
VPI 24.9 25.8 21.9 21.1 23.4 61.1 27.8 29.3 24.7 29.4 27.8 63.0 87.7
API 33.9 46.6 24.8 21.4 31.2 61.4 35.7 47.5 31.6 28.8 35.9 62.3 88.0
TABLE VI: Comparison with VPI [47] on three benchmarks.
Fig. 8: The attention mechanism increases the inference time only slightly: less than 0.1 second under the 5-shot setting.
Fig. 9: Effect of Monte Carlo sampling. Segmentation maps produced by individual sampled prototypes tend to be noisy, but the final segmentation map from the aggregated prototypes has less noise.

4.2.2 Performance on COCO-$20^i$

COCO-$20^i$ is more challenging than PASCAL-$5^i$ since its scenes are more complex and exhibit more intra-class diversity. Therefore, few-shot segmentation on COCO-$20^i$ involves more ambiguity, and it is difficult to acquire an effective class-specific prototype. As can be seen in Table III, our method outperforms the state-of-the-art PMM [52] by 0.6% and 0.4% in terms of Class-IoU under the 1-shot and 5-shot settings, respectively. Qualitative results are provided in Figure 5. Our method successfully predicts the segmentation maps for query images, even though the objects in the query image differ significantly from those in the support images in terms of appearance, size and viewpoint. As shown in the third column, the object can also be successfully segmented even under severe occlusion.

4.2.3 Performance on FSS-1000

We evaluate our method following the official evaluation protocols in [51]. The evaluation metric used for FSS-1000 is the IoU of positive labels in a binary segmentation map. A performance comparison with other models in terms of Positive-IoU is provided in Table IV. Our method improves over the previous best results of Wei et al. [51] in both the 1-shot and 5-shot settings (85.6% vs. 73.5% and 88.0% vs. 80.1% with a ResNet101 backbone), demonstrating the effectiveness of our proposal for few-shot semantic segmentation across large-scale semantic categories. Figure 6 visualizes segmentation results on the FSS-1000 dataset, where API produces accurate segmentation maps close to the ground truth.

4.3 Ablation Study

4.3.1 Benefit of Probabilistic Modeling

Different from previous deterministic models that learn a deterministic prototype vector, our probabilistic method infers the distributions of the class prototype and of the attention vector for each image. To demonstrate the advantage of the proposed probabilistic modeling, we implement a deterministic counterpart. For a fair comparison, we utilize the same network architecture, predict a deterministic class prototype vector or attention map with each branch, and remove the KL divergence terms during training. We implement both models with VGG-16 [41], ResNet50 [18], and ResNet101 [18] backbones, which are commonly adopted in previous works [32, 53].

The results on PASCAL-$5^i$ are shown in Table V. Our attentional prototype inference consistently achieves better performance than the deterministic models under both the 1-shot and 5-shot settings in terms of the Class-IoU and Binary-IoU metrics, because the proposed probabilistic modeling of prototypes and attention maps is more expressive of object classes and has a compelling capability of capturing the categorical concepts of objects. The learned model is therefore endowed with a stronger generalization ability for query images, which usually exhibit large variations. These results illustrate the advantage of probabilistic modeling for few-shot semantic segmentation. As the ResNet101 backbone outperforms both VGG16 and ResNet50, we adopt ResNet101 as the backbone network in our experiments.

(a) Class mean IoU for 1-shot
(b) Class mean IoU for 5-shot
(c) Binary IoU for 1-shot
(d) Binary IoU for 5-shot
Fig. 10: Effect of the number of samples $L_z$ and $L_m$. The Class-IoU and Binary-IoU metrics improve as $L_z$ and $L_m$ increase. We consider $L_z = L_m = 15$ a good trade-off between performance and inference time.

4.3.2 Effectiveness of Latent Attention Mechanism

The newly introduced attention mechanism leverages the structure information that is neglected by the prototype and is essential for mask prediction. The transformer architecture adopted in our model utilizes the context from all pixels with high fidelity and enhances the foreground to achieve accurate prediction. We conduct extensive comparison experiments with the baseline variational prototype inference (VPI) [47] to show the effectiveness of the latent attention mechanism. As shown in Table VI, we evaluate their performance on three benchmarks with three metrics. API achieves considerable improvements over VPI in most cases. In particular, API improves over VPI by 7.8% under the 1-shot and by 8.1% under the 5-shot setting in terms of Class-IoU on COCO-$20^i$. The advantage of API over VPI in Binary-IoU is relatively smaller than in Class-IoU on COCO-$20^i$, due to the bias of Binary-IoU towards objects that cover a large part of the foreground and background areas.

We also visualize the attention maps computed by our API model in Figure 7. Pixels on the foreground object are highlighted, which enhances the prediction of the segmentation maps; the proposed latent attention module captures the outline of the foreground objects. Moreover, this module adds very little inference time. As evidenced in Figure 8, when we fix $L_z$ and increase $L_m$ from 1 to 15, the extra time introduced by the attention module is less than 0.1 second, even under the 5-shot setting. These observations indicate the computational efficiency of our strategy.

4.3.3 Effect of Monte Carlo Sampling

The segmentation map is estimated by Monte Carlo sampling: we draw multiple prototypes $z$ and attention maps $m$, produce multiple outputs, and then aggregate all outputs into the final segmentation map. We conduct a qualitative analysis of the effect of Monte Carlo sampling on the segmentation results. As shown in Figure 9, the segmentation map for each sampled prototype is not always adequate. For example, in the first row, the segmentation map generated by the third sample does not completely cover the object. By averaging the segmentation maps produced by the individual samples, the final segmentation map tends to be more complete and robust.

The quantitative results in Figure 10 show that the predictions improve as the number of samples increases. Although the segmentation results are more accurate given more samples, inference also takes more time. We observe that the performance tends to saturate when $L_z$ and $L_m$ reach 15. Therefore, in our experiments, we set $L_z$ and $L_m$ to 15 during inference to achieve precise segmentation maps with an acceptable inference time.

5 Conclusion

This paper tackles few-shot semantic segmentation from a probabilistic perspective. We define a global latent variable to represent the prototype of each object category, which is inferred from the data. We also incorporate a local latent variable to represent an attention map for each individual image, which highlights the foreground object for improved segmentation. We develop attentional prototype inference (API) to leverage variational inference for efficient optimization. Through probabilistic modeling, API enhances its generalization ability by handling the inherent uncertainty caused by limited data and the intra-class variations of objects, which is essential for generalizing to new, unseen categories. The local latent attention mechanism enables the model to attend to foreground objects while suppressing background. We conduct comprehensive experiments on three benchmark datasets; the results illustrate the effectiveness of the proposed method, and the ablation studies further demonstrate the benefit of our probabilistic attentional prototype inference.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §2.1.
  • [2] M. Berman, A. R. Triki, and M. B. Blaschko (2018) The lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Computer Vision and Pattern Recognition, pp. 4413–4421. Cited by: §2.1.
  • [3] G. Bhat, F. J. Lawin, M. Danelljan, A. Robinson, M. Felsberg, L. Van Gool, and R. Timofte (2020) Learning what to learn for video object segmentation. In European Conference on Computer Vision, pp. 777–794. Cited by: §3.2.
  • [4] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American statistical Association 112 (518), pp. 859–877. Cited by: §2.3.
  • [5] G. J. Brostow, J. Fauqueur, and R. Cipolla (2009) Semantic object classes in video: a high-definition ground truth database. Pattern Recognition Letters 30 (2), pp. 88–97. Cited by: §1, §2.1.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Cited by: §1, §2.1.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition, pp. 3213–3223. Cited by: §1.
  • [8] A. V. Dalca, J. Guttag, and M. R. Sabuncu (2018) Anatomical priors in convolutional networks for unsupervised biomedical segmentation. In Computer Vision and Pattern Recognition, pp. 9290–9299. Cited by: §2.3.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §4.1.2.
  • [10] N. Dong and E. Xing (2018) Few-shot semantic segmentation with prototype learning.. In British Machine Vision Conference, Cited by: §1, §1, §2.2, §3.1.1, TABLE II.
  • [11] P. Esser, E. Sutter, and B. Ommer (2018) A variational u-net for conditional appearance and shape generation. In Computer Vision and Pattern Recognition, pp. 8857–8866. Cited by: §2.1.
  • [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. Cited by: §4.1.1.
  • [13] Z. Fan, J. Yu, Z. Liang, J. Ou, C. Gao, G. Xia, and Y. Li (2020) Fgn: fully guided network for few-shot instance segmentation. In Computer Vision and Pattern Recognition, pp. 9172–9181. Cited by: §2.2.
  • [14] C. Farabet, C. Couprie, L. Najman, and Y. LeCun (2012) Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1915–1929. Cited by: §1.
  • [15] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pp. 1126–1135. Cited by: §2.3.
  • [16] C. Finn, K. Xu, and S. Levine (2018) Probabilistic model-agnostic meta-learning. In Neural Information Processing Systems, pp. 9516–9527. Cited by: §2.3.
  • [17] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In International Conference on Computer Vision, pp. 991–998. Cited by: §4.1.1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, Cited by: §4.1.2, §4.3.1.
  • [19] T. Hu, P. Yang, C. Zhang, G. Yu, Y. Mu, and C. G. M. Snoek (2019) Attention-based multi-context guiding for few-shot semantic segmentation. In AAAI Conference on Artificial Intelligence, Cited by: §1, §1, §2.2, §4.1.1, §4.1, TABLE II, TABLE IV.
  • [20] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233. Cited by: §2.3.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.2.
  • [22] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.3, §3.1.2, §3.1.2, §3.2.
  • [23] S. A. Kohl, B. Romera-Paredes, K. H. Maier-Hein, D. J. Rezende, S. Eslami, P. Kohli, A. Zisserman, and O. Ronneberger (2019) A hierarchical probabilistic u-net for modeling multi-scale ambiguities. arXiv preprint arXiv:1905.13077. Cited by: §2.3.
  • [24] S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J. R. Ledsam, K. Maier-Hein, S. A. Eslami, D. J. Rezende, and O. Ronneberger (2018) A probabilistic u-net for segmentation of ambiguous images. In Neural Information Processing Systems, pp. 6965–6975. Cited by: §2.3.
  • [25] G. Lin, A. Milan, C. Shen, and I. Reid (2017) Refinenet: multi-path refinement networks for high-resolution semantic segmentation. In Computer Vision and Pattern Recognition, pp. 1925–1934. Cited by: §2.1.
  • [26] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §1, §4.1.1.
  • [27] W. Liu, C. Zhang, G. Lin, and F. Liu (2020) Crnet: cross-reference networks for few-shot segmentation. In Computer Vision and Pattern Recognition, pp. 4165–4173. Cited by: §2.2, TABLE I.
  • [28] Y. Liu, X. Zhang, S. Zhang, and X. He (2020) Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision, pp. 142–158. Cited by: §2.2.
  • [29] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang (2017) Deep learning markov random field for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (8), pp. 1814–1828. Cited by: §2.1.
  • [30] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, pp. 3431–3440. Cited by: §1, §2.1.
  • [31] C. Michaelis, M. Bethge, and A. S. Ecker (2018) One-shot segmentation in clutter. International Conference on Machine Learning. Cited by: §2.3.
  • [32] K. Nguyen and S. Todorovic (2019) Feature weighting and boosting for few-shot segmentation. In International Conference on Computer Vision, pp. 622–631. Cited by: §1, §4.3.1, TABLE I, TABLE IV.
  • [33] K. Rakelly, E. Shelhamer, T. Darrell, A. A. Efros, and S. Levine (2018) Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373. Cited by: TABLE IV.
  • [34] K. Rakelly, E. Shelhamer, T. Darrell, A. Efros, and S. Levine (2018) Conditional networks for few-shot semantic segmentation. In International Conference on Learning Representations Workshop, Cited by: §2.2, §4.1.2, TABLE I, TABLE II.
  • [35] D. J. Rezende, S. Mohamed, and D. Wierstra (2014) Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286. Cited by: §2.3.
  • [36] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 234–241. Cited by: §2.1, §2.3, §3.2, §4.1.2.
  • [37] E. H. Rosch (1973) Natural categories. Cognitive psychology 4 (3), pp. 328–350. Cited by: §1.
  • [38] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots (2017) One-shot learning for semantic segmentation. British Machine Vision Conference. Cited by: §1, §1, §2.2, §4.1.1, §4.1.2, §4.1, TABLE I, TABLE II, TABLE IV.
  • [39] M. Siam, N. Doraiswamy, B. N. Oreshkin, H. Yao, and M. Jagersand (2020) Weakly supervised few-shot object segmentation using co-attention with visual and semantic embeddings. arXiv e-prints, pp. arXiv–2001. Cited by: §1, §2.2.
  • [40] M. Siam, B. N. Oreshkin, and M. Jagersand (2019) AMP: adaptive masked proxies for few-shot segmentation. In International Conference on Computer Vision, pp. 5249–5258. Cited by: §1, §2.2, §3.2, TABLE I, TABLE II.
  • [41] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.3.1.
  • [42] J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Neural Information Processing Systems, pp. 4077–4087. Cited by: §1, §2.2.
  • [43] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Neural Information Processing Systems, pp. 3483–3491. Cited by: §2.3.
  • [44] P. Tian, Z. Wu, L. Qi, L. Wang, Y. Shi, and Y. Gao (2020) Differentiable meta-learning model for few-shot semantic segmentation. In AAAI Conference on Artificial Intelligence, pp. 12087–12094. Cited by: §2.2.
  • [45] Z. Tian, H. Zhao, M. Shu, Z. Yang, R. Li, and J. Jia (2020) Prior guided feature enrichment network for few-shot segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.2.
  • [46] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Neural Information Processing Systems, pp. 3630–3638. Cited by: §3.
  • [47] H. Wang, Y. Yang, X. Cao, X. Zhen, C. Snoek, and L. Shao (2021) Variational prototype inference for few-shot semantic segmentation. In Winter Conference on Applications of Computer Vision, pp. 525–534. Cited by: §1, §3.1.1, §3.1.3, §4.3.2, TABLE VI.
  • [48] H. Wang, X. Zhang, Y. Hu, Y. Yang, X. Cao, and X. Zhen (2020) Few-shot semantic segmentation with democratic attention networks. In European Conference on Computer Vision, Cited by: §2.2.
  • [49] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng (2019) PANet: few-shot image semantic segmentation with prototype alignment. In International Conference on Computer Vision, pp. 9197–9206. Cited by: §1, §2.2, TABLE I, TABLE II, TABLE IV.
  • [50] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §2.1.
  • [51] T. Wei, X. Li, Y. P. Chen, Y. Tai, and C. Tang (2020) FSS-1000: a 1000-class dataset for few-shot segmentation. Computer Vision and Pattern Recognition. Cited by: §1, §4.1.1, §4.1, §4.2.3, TABLE IV.
  • [52] B. Yang, C. Liu, B. Li, J. Jiao, and Q. Ye (2020) Prototype mixture models for few-shot semantic segmentation. In European Conference on Computer Vision, pp. 763–778. Cited by: §2.2, §4.1.1, §4.1.1, §4.2.2, TABLE I, TABLE IV.
  • [53] C. Zhang, G. Lin, F. Liu, J. Guo, Q. Wu, and R. Yao (2019) Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Computer Vision and Pattern Recognition, pp. 9587–9595. Cited by: §1, §2.2, §4.3.1, TABLE I, TABLE II.
  • [54] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen (2019) CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Computer Vision and Pattern Recognition, pp. 5217–5226. Cited by: §2.2, TABLE I, TABLE II.
  • [55] J. Zhang, C. Zhao, B. Ni, M. Xu, and X. Yang (2019) Variational few-shot learning. In International Conference on Computer Vision, pp. 1685–1694. Cited by: §2.3.
  • [56] X. Zhang, Y. Wei, Y. Yang, and T. Huang (2018) Sg-one: similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091. Cited by: §1, §2.2, TABLE I, TABLE II.
  • [57] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §2.1.
  • [58] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Computer Vision and Pattern Recognition, pp. 2881–2890. Cited by: §2.1.
  • [59] S. Zhao, Y. Wang, Z. Yang, and D. Cai (2019) Region mutual information loss for semantic segmentation. arXiv preprint arXiv:1910.12037. Cited by: §2.1.
  • [60] X. Zhen, Y. Du, H. Xiong, Q. Qiu, C. Snoek, and L. Shao (2020) Learning to learn variational semantic memory. Neural Information Processing Systems. Cited by: §1, §2.3.
  • [61] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr (2015) Conditional random fields as recurrent neural networks. In International Conference on Computer Vision, pp. 1529–1537. Cited by: §2.1.