Integrative Few-Shot Learning for Classification and Segmentation

03/29/2022
by   Dahyun Kang, et al.
POSTECH

We introduce the integrative task of few-shot classification and segmentation (FS-CS) that aims to both classify and segment target objects in a query image when the target classes are given with a few examples. This task combines two conventional few-shot learning problems, few-shot classification and segmentation. FS-CS generalizes them to more realistic episodes with arbitrary image pairs, where each target class may or may not be present in the query. To address the task, we propose the integrative few-shot learning (iFSL) framework for FS-CS, which trains a learner to construct class-wise foreground maps for multi-label classification and pixel-wise segmentation. We also develop an effective iFSL model, attentive squeeze network (ASNet), that leverages deep semantic correlation and global self-attention to produce reliable foreground maps. In experiments, the proposed method shows promising performance on the FS-CS task and also achieves the state of the art on standard few-shot segmentation benchmarks.


1 Introduction

Few-shot learning [fei2006one, fink2005object, wu2010towards, lake2015human, wang2020generalizing] is the learning problem where a learner experiences only a limited number of examples as supervision. In computer vision, it has been most actively studied for the tasks of image classification [alexnet, vgg, resnet] and semantic segmentation [deeplab, fcn, deconvnet, unet], among many others [han2021query, ojha2021few, ramon2021h3d, yue2021prototypical, zhao2021few]. Few-shot classification (FS-C) aims to classify a query image into target classes when a few support examples are given for each target class. Few-shot segmentation (FS-S) aims to segment the target class regions in the query image in a similar setup. While being closely related to each other [li2009towards, yao2012describing, zhou2019collaborative], these two few-shot learning problems have so far been treated individually. Furthermore, the conventional setups for the two problems are limited and do not reflect realistic scenarios; FS-C [matchingnet, ravi2016optimization, koch2015siamese] presumes that the query always contains exactly one of the target classes, while FS-S [shaban2017oslsm, rakelly2018cofcn, hu2019amcg] allows the presence of multiple classes but does not handle the absence of the target classes. These limitations prevent few-shot learning from generalizing to, and being evaluated on, more realistic cases in the wild. For example, when a query image without any target class is given, as in Fig. 1, FS-S learners typically segment out arbitrary salient objects in the query.

To address the aforementioned issues, we introduce the integrative task of few-shot classification and segmentation (FS-CS) that combines the two few-shot learning problems into a multi-label and background-aware prediction problem. Given a query image and a few-shot support set for target classes, FS-CS aims to identify the presence of each target class and predict its foreground mask from the query. Unlike FS-C and FS-S, it does not presume either the class exclusiveness in classification or the presence of all the target classes in segmentation.

As a learning framework for FS-CS, we propose integrative few-shot learning (iFSL), which learns to construct shared foreground maps for both classification and segmentation. It naturally combines multi-label classification and pixel-wise segmentation by sharing class-wise foreground maps, and it allows learning with either class tags or segmentation annotations. For effective iFSL, we design the attentive squeeze network (ASNet), which computes semantic correlation tensors between the query and support image features and transforms them into foreground maps via strided self-attention. It generates reliable foreground maps for iFSL by leveraging multi-layer neural features [hpf, hsnet] and global self-attention [transformers, vit]. In experiments, we demonstrate the efficacy of the iFSL framework on FS-CS and compare ASNet with recent methods [xie2021few, wu2021learning, hsnet, xie2021scale]. Our method significantly improves over the other methods on FS-CS in terms of classification and segmentation accuracy and also outperforms recent FS-S methods on the conventional FS-S task. We also cross-validate the task transferability between the FS-C, FS-S, and FS-CS learners, and show that FS-CS learners generalize effectively when transferred to the FS-C and FS-S tasks.

Our contribution is summarized as follows:

  • We introduce the task of integrative few-shot classification and segmentation (FS-CS), which combines few-shot classification and few-shot segmentation into an integrative task by addressing their limitations.

  • We propose the integrative few-shot learning framework (iFSL), which learns to both classify and segment a query image using class-wise foreground maps.

  • We design the attentive squeeze network (ASNet), which squeezes semantic correlations into a foreground map for iFSL via strided global self-attention.

  • We show in extensive experiments that the framework, iFSL, and the architecture, ASNet, are both effective, achieving a significant gain on FS-S as well as FS-CS.

2 Related work

Few-shot classification (FS-C). Recent FS-C methods typically learn neural networks that maximize positive class similarity and suppress the rest to predict the most probable class. Such a similarity function is obtained by a) meta-learning embedding functions [koch2015siamese, matchingnet, protonet, allen2019infinite, tewam, can, feat, deepemd, renet], b) meta-learning to optimize classifier weights [maml, leo, mtl], or c) transfer learning [closer, rfs, dhillon2019baseline, wang2020few, negmargin, gidaris2018dynamic, qi2018low, rodriguez2020embedding], all of which aim to generalize to unseen classes. This conventional formulation is applicable only if a query image corresponds to exactly one class among the target classes. To generalize FS-C to images associated with either none or multiple classes, we employ multi-label classification [mccallum1999multi, boutell2004learning, cole2021multi, lanchantin2021general, durand2019learning]. While conventional FS-C methods exploit the class-uniqueness property via the categorical cross-entropy, we instead devise a learning framework that compares the binary relationship between the query and each support image individually and estimates the binary presence of the corresponding class.

Few-shot semantic segmentation (FS-S). A prevalent FS-S approach is learning to match a query feature map with a set of support feature embeddings that are obtained by collapsing spatial dimensions at the cost of spatial structures [wang2019panet, zhang2021self, siam2019amp, yang2021mining, liu2021anti, dong2018few, nguyen2019fwb, zhang2019canet, gairola2020simpropnet, yang2020pmm, liu2020ppnet]. Recent methods [zhang2019pgnet, xie2021scale, xie2021few, wu2021learning, tian2020pfenet] focus on learning structural details by leveraging dense feature correlation tensors between the query and each support. HSNet [hsnet] learns to squeeze a dense feature correlation tensor and transform it into a segmentation mask via high-dimensional convolutions that analyze the local correlation patterns on the correlation pyramid. We inherit the idea of learning to squeeze correlations and improve it by analyzing the spatial context of the correlation with effective global self-attention [transformers]. Note that several methods [yang2020brinet, wang2020dan, sun2021boosting] adopt the non-local self-attention [nlsa] of the query-key-value interaction for FS-S, but they are distinct from ours in the sense that they learn to transform image feature maps, whereas our method focuses on transforming dense correlation maps via self-attention.

FS-S has predominantly been investigated as a one-way segmentation task, i.e., foreground/background segmentation, since the task is defined such that every target (support) class object appears in the query image, which makes it not straightforward to extend to a multi-class problem in the wild. Consequently, most work on FS-S, except for a few [wang2019panet, tian2020differentiable, liu2020ppnet, dong2018few], focuses on one-way segmentation, and among the few, the work of [tian2020differentiable, dong2018few] presents two-way segmentation results from person-and-object images only, e.g., images containing (person, dog) or (person, table).

Comparison with other few-shot approaches. Here we contrast FS-CS with other loosely-related work for generalized few-shot learning. Few-shot open-set classification [liu2020few] brings the idea of the open-set problem [scheirer2012toward, fei2016breaking] to few-shot classification by allowing a query to have no target classes. This formulation enables background-aware classification as in FS-CS, whereas multi-label classification is not considered. The work of [tian2020generalized, ganea2021incremental] generalizes few-shot segmentation to a multi-class task, but it is mainly studied under the umbrella of incremental learning [mccloskey1989catastrophic, rebuffi2017icarl, castro2018end]. The work of [siam2020weakly] investigates weakly-supervised few-shot segmentation using image-level vision and language supervision, while FS-CS uses visual supervision only. The aforementioned tasks generalize few-shot learning but differ from FS-CS in the sense that FS-CS integrates two related problems under more general and relaxed constraints.

3 Problem formulation

Given a query image and a few support images for target classes, we aim to identify the presence of each class and predict its foreground mask from the query (Fig. 1), which we call integrative few-shot classification and segmentation (FS-CS). Specifically, let us assume a target (support) class set $\mathcal{C}$ of $N$ classes and its support set $\mathcal{S}$, which contains $K$ labeled instances for each of the $N$ classes, i.e., $N$-way $K$-shot [matchingnet, ravi2016optimization]. The label is either a class tag (weak label) or a segmentation annotation (strong label). For a given query image, we aim to identify the multi-hot occurrence of the $N$ target classes and also predict the segmentation mask corresponding to the classes. We assume the class set of the query, $\mathcal{C}^{\mathrm{q}}$, is a subset of the target class set, i.e., $\mathcal{C}^{\mathrm{q}} \subseteq \mathcal{C}$, so that both $\mathcal{C}^{\mathrm{q}} = \varnothing$ and $\mathcal{C}^{\mathrm{q}} \subsetneq \mathcal{C}$ are possible. This naturally generalizes the existing few-shot classification [matchingnet, protonet] and few-shot segmentation [shaban2017oslsm, rakelly2018cofcn].
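To make the episode structure concrete, the following is a minimal sketch of a container for one FS-CS episode; the class and field names are illustrative choices of ours, not the authors' data format.

import torch
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FSCSEpisode:
    # One N-way K-shot FS-CS episode (hypothetical container; PyTorch tensors assumed).
    query_image: torch.Tensor                           # (3, H, W)
    support_images: List[List[torch.Tensor]]            # N classes x K shots, each (3, H, W)
    support_masks: Optional[List[List[torch.Tensor]]]   # strong labels (masks), or None with class tags only
    query_occurrence: torch.Tensor                      # (N,) multi-hot; may be all zeros (background episode)
    query_mask: torch.Tensor                            # (H, W) labels in {0, ..., N}, 0 = background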

Multi-label background-aware prediction. The conventional formulation of few-shot classification (FS-C) [matchingnet, protonet, maml] assigns the query to exactly one class among the target classes and ignores the possibility of the query belonging to none or multiple target classes. FS-CS tackles this limitation and generalizes FS-C to multi-label classification with a background class. A multi-label few-shot classification learner compares semantic similarities between the query and the support images and estimates the class-wise occurrences as an $N$-dimensional multi-hot vector $\hat{\mathbf{y}} \in \{0, 1\}^{N}$, each entry of which indicates the occurrence of the corresponding target class. Note that the query is classified into a background class if none of the target classes is detected. Thanks to the relaxed constraint on the query, i.e., the query not always belonging to exactly one class, FS-CS is more general than FS-C.

Integration of classification and segmentation. FS-CS integrates multi-label few-shot classification with semantic segmentation by adopting pixel-level spatial reasoning. While the conventional FS-S [shaban2017oslsm, rakelly2018cofcn, wang2019panet, siam2019amp, nguyen2019fwb] assumes that the query class set exactly matches the support class set, i.e., $\mathcal{C}^{\mathrm{q}} = \mathcal{C}$, FS-CS relaxes this assumption such that the query class set can be any subset of the support class set, i.e., $\mathcal{C}^{\mathrm{q}} \subseteq \mathcal{C}$. In this generalized segmentation setup along with classification, an integrative FS-CS learner estimates both the class-wise occurrences $\hat{\mathbf{y}}$ and the semantic segmentation map $\hat{\mathbf{M}}$. This combined and generalized formulation gives a high degree of freedom to both few-shot learning tasks, which has been missing in the literature; the integrative few-shot learner predicts multi-label background-aware class occurrences and segmentation maps simultaneously under a relaxed constraint on the few-shot episodes.

Figure 2: Overview of ASNet. ASNet first constructs a hypercorrelation [hsnet] from the image feature maps of a query (colored red) and a support (colored blue), where each 4D correlation is depicted as two 2D squares for simplicity of illustration. ASNet then learns to transform the correlation into a foreground map by gradually squeezing the support dimensions at each query dimension via global self-attention. Each input correlation, intermediate feature, and output foreground map has a channel dimension, which is omitted in the illustration.

4 Integrative Few-Shot Learning (iFSL)

To solve the FS-CS problem, we propose an effective learning framework, integrative few-shot learning (iFSL). The iFSL framework is designed to jointly solve few-shot classification and few-shot segmentation using either class tag or segmentation supervision. The integrative few-shot learner $f_{\theta}$ takes as input the query image $I^{\mathrm{q}}$ and the support set $\mathcal{S}$ and produces as output the class-wise foreground maps for the $N$ classes:

$\mathbf{Y}_{n} = f_{\theta}(I^{\mathrm{q}}, \mathcal{S}) \in [0, 1]^{H \times W}, \quad n = 1, \dots, N,$    (1)

where $H \times W$ denotes the size of each map and $\theta$ denotes the parameters to be meta-learned. The output at each position of the map represents the probability of that position lying on a foreground region of the corresponding class.

Inference. iFSL infers both class-wise occurrences and segmentation masks on top of the set of foreground maps $\{\mathbf{Y}_{n}\}_{n=1}^{N}$. For class-wise occurrences, a multi-hot vector $\hat{\mathbf{y}}$ is predicted via max pooling followed by thresholding:

$\hat{y}_{n} = \mathbb{1}\big[\max_{\mathbf{p}} \mathbf{Y}_{n}(\mathbf{p}) \geq \delta\big], \quad n \in [N],$    (2)

where $\mathbf{p}$ denotes a 2D position, $\delta$ is a threshold, and $[N]$ denotes the set of integers from 1 to $N$, i.e., $\{1, \dots, N\}$. We find that inference with average pooling is prone to missing small objects in multi-label classification and thus choose max pooling. A detected class at any position on the spatial map signifies the presence of the class.
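As a concrete illustration, the class-occurrence inference of Eq. (2) can be sketched in a few lines of PyTorch; the function name and the default threshold value here are ours, for illustration only.

import torch

def infer_class_occurrence(foreground_maps: torch.Tensor, delta: float = 0.5) -> torch.Tensor:
    # foreground_maps: (N, H, W) class-wise foreground probability maps Y_n.
    # Returns an N-dimensional multi-hot vector of estimated class occurrences (Eq. 2).
    max_response = foreground_maps.flatten(start_dim=1).max(dim=1).values  # max pooling over positions
    return (max_response >= delta).long()                                  # thresholding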

For segmentation, a segmentation probability tensor is derived from the class-wise foreground maps. As the background class is not given as a separate support, we estimate the background map in the context of the given supports; we combine the class-wise background maps into an episodic background map on the fly. Specifically, we compute the episodic background map $\mathbf{Y}_{0}$ by averaging the probability maps of not being foreground and then concatenate it with the class-wise foreground maps to obtain a segmentation probability tensor $\hat{\mathbf{Y}}$:

$\mathbf{Y}_{0} = \frac{1}{N} \sum_{n=1}^{N} \big(1 - \mathbf{Y}_{n}\big),$    (3)
$\hat{\mathbf{Y}} = \big[\mathbf{Y}_{0}; \mathbf{Y}_{1}; \dots; \mathbf{Y}_{N}\big] \in [0, 1]^{(N+1) \times H \times W}.$    (4)

The final segmentation mask $\hat{\mathbf{M}}$ is obtained by taking the most probable class label at each position:

$\hat{\mathbf{M}}(\mathbf{p}) = \arg\max_{n \in \{0, 1, \dots, N\}} \hat{\mathbf{Y}}_{n}(\mathbf{p}).$    (5)
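A corresponding sketch of the segmentation inference of Eqs. (3)-(5), assuming the foreground maps are already probabilities in [0, 1]; the function name is ours.

import torch

def infer_segmentation(foreground_maps: torch.Tensor) -> torch.Tensor:
    # foreground_maps: (N, H, W) class-wise foreground maps Y_1..Y_N.
    # Returns an (H, W) mask with label 0 for background and 1..N for the target classes.
    background = (1.0 - foreground_maps).mean(dim=0, keepdim=True)  # Eq. (3): episodic background map
    probs = torch.cat([background, foreground_maps], dim=0)         # Eq. (4): (N+1, H, W) probability tensor
    return probs.argmax(dim=0)                                      # Eq. (5): most probable label per position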

Learning objective. The iFSL framework allows a learner to be trained with either a class tag or a segmentation annotation, using the classification loss or the segmentation loss, respectively. The classification loss is formulated as the average binary cross-entropy between the spatially average-pooled class scores and the ground-truth class labels:

$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N} \sum_{n=1}^{N} \big[\, y^{*}_{n} \log \bar{y}_{n} + (1 - y^{*}_{n}) \log (1 - \bar{y}_{n}) \,\big], \quad \bar{y}_{n} = \frac{1}{HW} \sum_{\mathbf{p}} \mathbf{Y}_{n}(\mathbf{p}),$    (6)

where $\mathbf{y}^{*} = [y^{*}_{1}, \dots, y^{*}_{N}]$ denotes the multi-hot encoded ground-truth class occurrence.

The segmentation loss is formulated as the average cross-entropy between the class distribution at each individual position and its ground-truth segmentation annotation:

$\mathcal{L}_{\mathrm{seg}} = -\frac{1}{HW} \sum_{\mathbf{p}} \log \hat{\mathbf{Y}}_{\mathbf{M}^{*}(\mathbf{p})}(\mathbf{p}),$    (7)

where $\mathbf{M}^{*}$ denotes the ground-truth segmentation mask.

These two losses share a similar goal of classification but differ in whether to classify each image or each pixel. Either of them is thus chosen according to the given level of supervision for training.
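The two objectives can be sketched as follows; this is a minimal illustration assuming probability-valued maps and is not the authors' training code.

import torch
import torch.nn.functional as F

def classification_loss(foreground_maps, class_labels):
    # Eq. (6): BCE between spatially average-pooled class scores and the multi-hot labels.
    # foreground_maps: (N, H, W) probabilities; class_labels: (N,) multi-hot ground truth.
    pooled = foreground_maps.flatten(start_dim=1).mean(dim=1)
    return F.binary_cross_entropy(pooled, class_labels.float())

def segmentation_loss(seg_probs, gt_mask):
    # Eq. (7): average cross-entropy between per-pixel class distributions and the mask.
    # seg_probs: (N+1, H, W) probabilities incl. background; gt_mask: (H, W) long labels in {0..N}.
    log_probs = torch.log(seg_probs.clamp_min(1e-8)).unsqueeze(0)  # (1, N+1, H, W)
    return F.nll_loss(log_probs, gt_mask.unsqueeze(0))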

5 Model architecture

In this section, we present Attentive Squeeze Network (ASNet) of an effective iFSL model. The main building block of ASNet is the attentive squeeze layer (AS layer), which is a high-order self-attention layer that takes a correlation tensor and returns another level of correlational representation. ASNet takes as input the pyramidal cross-correlation tensors between a query and a support image feature pyramids, i.e., a hypercorrelation [hsnet]. The pyramidal correlations are fed to pyramidal AS layers that gradually squeeze the spatial dimensions of the support image, and the pyramidal outputs are merged to a final foreground map in a bottom-up pathway [hsnet, fpn, refinenet]. Figure 2 illustrates the overall process of ASNet. The -way output maps are computed in parallel and collected to prepare the class-wise foreground maps in Eq. (1) for iFSL.

5.1 Attentive Squeeze Network (ASNet)

Hypercorrelation construction. Our method first constructs hypercorrelations [hsnet] between the query and each support image and then learns to generate a foreground segmentation mask w.r.t. each support input. To prepare the input hypercorrelations, an episode, i.e., a query and a support set, is enumerated into a list of (query image, support image, support label) triplets. Each input image is fed to the stacked convolutional layers of a CNN, and its mid- to high-level output feature maps are collected to build a feature pyramid $\{\mathbf{F}_{l}\}$, where $l$ denotes the index of a unit layer, e.g., a Bottleneck layer in ResNet50 [resnet]. We then compute the cosine similarity between each pair of feature maps from the query and support feature pyramids to obtain 4D correlation tensors of size $H_{l} \times W_{l} \times H_{l} \times W_{l}$, followed by ReLU [relu]:

$\mathbf{C}_{l}(\mathbf{p}^{\mathrm{q}}, \mathbf{p}^{\mathrm{s}}) = \mathrm{ReLU}\!\left( \frac{\mathbf{F}^{\mathrm{q}}_{l}(\mathbf{p}^{\mathrm{q}}) \cdot \mathbf{F}^{\mathrm{s}}_{l}(\mathbf{p}^{\mathrm{s}})}{\lVert \mathbf{F}^{\mathrm{q}}_{l}(\mathbf{p}^{\mathrm{q}}) \rVert \, \lVert \mathbf{F}^{\mathrm{s}}_{l}(\mathbf{p}^{\mathrm{s}}) \rVert} \right).$    (8)

These correlation tensors are grouped by identical spatial sizes, and the tensors in each group are concatenated along a new channel dimension to build a hypercorrelation pyramid, such that the channel size of each pyramid level corresponds to the number of concatenated tensors in its group. We hereafter denote the first two spatial dimensions of a correlation tensor, i.e., those of the query, as the query dimensions, and the last two, i.e., those of the support, as the support dimensions.
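A minimal sketch of building one level of the hypercorrelation from Eq. (8); the function name is ours, and batching over pyramid layers and support images is omitted.

import torch
import torch.nn.functional as F

def correlation_4d(query_feat: torch.Tensor, support_feat: torch.Tensor) -> torch.Tensor:
    # query_feat: (C, Hq, Wq), support_feat: (C, Hs, Ws) from the same pyramid layer l.
    c, hq, wq = query_feat.shape
    _, hs, ws = support_feat.shape
    q = F.normalize(query_feat.reshape(c, -1), dim=0)  # L2-normalize channels -> cosine similarity
    s = F.normalize(support_feat.reshape(c, -1), dim=0)
    corr = torch.relu(q.t() @ s)                       # (Hq*Wq, Hs*Ws), clamped by ReLU as in Eq. (8)
    return corr.view(hq, wq, hs, ws)                   # 4D correlation tensor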

Attentive squeeze layer (AS layer). The AS layer transforms a correlation tensor into another with smaller support dimensions via strided self-attention. The tensor is recast as a matrix whose elements each represent a support pattern. Given a correlation tensor $\mathbf{C} \in \mathbb{R}^{H^{\mathrm{q}} \times W^{\mathrm{q}} \times H^{\mathrm{s}} \times W^{\mathrm{s}} \times C}$ in the hypercorrelation pyramid, we start by reshaping it as a block matrix of size $H^{\mathrm{q}} \times W^{\mathrm{q}}$ whose element at query position $\mathbf{x}^{\mathrm{q}}$ is a correlation tensor $\mathbf{C}_{\mathbf{x}^{\mathrm{q}}} \in \mathbb{R}^{H^{\mathrm{s}} \times W^{\mathrm{s}} \times C}$, such that

$\mathbf{C} = \big[\, \mathbf{C}_{\mathbf{x}^{\mathrm{q}}} \,\big]_{\mathbf{x}^{\mathrm{q}} \in [H^{\mathrm{q}}] \times [W^{\mathrm{q}}]}.$    (9)

We call each element $\mathbf{C}_{\mathbf{x}^{\mathrm{q}}}$ a support correlation tensor. The goal of an AS layer is to analyze the global context of each support correlation tensor and extract a correlational representation with reduced support dimensions while the query dimensions are preserved, i.e., $\mathbb{R}^{H^{\mathrm{q}} \times W^{\mathrm{q}} \times H^{\mathrm{s}} \times W^{\mathrm{s}} \times C} \rightarrow \mathbb{R}^{H^{\mathrm{q}} \times W^{\mathrm{q}} \times H'^{\mathrm{s}} \times W'^{\mathrm{s}} \times C'}$, where $H'^{\mathrm{s}} \leq H^{\mathrm{s}}$ and $W'^{\mathrm{s}} \leq W^{\mathrm{s}}$. To learn a holistic pattern of each support correlation, we adopt the global self-attention mechanism [transformers] for correlational feature transform. The self-attention weights are shared across all query positions and processed in parallel.

Let us denote the support correlation tensor at an arbitrary query position by $\mathbf{C}^{\mathrm{s}}$ for notational brevity, as all positions share the following computation. (In this section, we adopt the term "target" to indicate the "query" embedding in the context of self-attention learning [transformers, vit, lsa, pvt, lrnet], to avoid confusion with the "query" image to be segmented.) The self-attention computation starts by embedding the support correlation tensor into a target, key, and value triplet $(\mathbf{T}, \mathbf{K}, \mathbf{V})$ using three convolutions whose strides, greater than or equal to one, govern the output size. The resultant target and key correlational representations, $\mathbf{T}$ and $\mathbf{K}$, are then used to compute an attention context $\mathbf{A}$ by matrix multiplication:

$\mathbf{A} = \mathbf{T} \mathbf{K}^{\top} / \sqrt{C'}.$    (10)

Next, the attention context is normalized by softmax so that the votes on key foreground positions sum to one; when the support mask annotation is available, the attention is masked with it so as to attend to the foreground region:

$\bar{\mathbf{A}}_{ij} = \frac{\exp(\mathbf{A}_{ij})\, m_{j}}{\sum_{j'} \exp(\mathbf{A}_{ij'})\, m_{j'}},$    (11)

where $m_{j} \in \{0, 1\}$ indicates whether the $j$-th key position lies on the support foreground and is set to 1 everywhere if the mask is unavailable.

The masked attention context is then used to aggregate the value embedding $\mathbf{V}$:

$\mathbf{O} = \bar{\mathbf{A}} \mathbf{V}.$    (12)

The attended representation is fed to an MLP and added to the input; in case the input and output dimensions mismatch, the input is fed to an optional convolutional layer $\phi$. The addition is followed by an activation layer consisting of group normalization [groupnorm] and ReLU activation [relu]:

$\mathbf{C}' = \mathrm{ReLU}\big(\mathrm{GN}\big(\mathrm{MLP}(\mathbf{O}) + \phi(\mathbf{C}^{\mathrm{s}})\big)\big).$    (13)

The output is then fed to another MLP, which concludes the unit operation of an AS layer:

$\hat{\mathbf{C}}^{\mathrm{s}} = \mathrm{MLP}(\mathbf{C}'),$    (14)

and the result is embedded back at the corresponding query position in the block matrix of Eq. (9). Note that AS layers can be stacked to progressively reduce the support dimensions of the correlation tensor. The overall pipeline of the AS layer is illustrated in the supplementary material.
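To make the AS layer concrete, here is a minimal single-head PyTorch sketch that operates on the support correlation tensor at one query position, following Eqs. (10)-(14). The module name, the group count in GroupNorm (out_ch is assumed divisible by 4), and the assumption that the support mask is already resized to the support dimensions are our simplifications; the actual model uses multi-head attention shared across all query positions.

import torch
import torch.nn as nn

class AttentiveSqueezeSketch(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, stride: int = 2):
        super().__init__()
        self.to_target = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1)  # strided: squeezes support dims
        self.to_key    = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.to_value  = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.mlp       = nn.Conv2d(out_ch, out_ch, 1)
        self.shortcut  = nn.Conv2d(in_ch, out_ch, 1, stride=stride)             # matches dims for the residual
        self.norm      = nn.GroupNorm(4, out_ch)
        self.out_mlp   = nn.Conv2d(out_ch, out_ch, 1)

    def forward(self, corr: torch.Tensor, support_mask: torch.Tensor = None) -> torch.Tensor:
        # corr: (B, C, Hs, Ws), the support correlation tensor at one query position.
        t, k, v = self.to_target(corr), self.to_key(corr), self.to_value(corr)
        b, c, ht, wt = t.shape
        attn = torch.einsum('bci,bcj->bij', t.flatten(2), k.flatten(2)) / c ** 0.5  # Eq. (10)
        if support_mask is not None:                                                # Eq. (11): keep foreground keys
            attn = attn.masked_fill((support_mask.flatten(1) == 0)[:, None, :], -1e9)
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bij,bcj->bci', attn, v.flatten(2)).view(b, c, ht, wt)   # Eq. (12)
        out = torch.relu(self.norm(self.mlp(out) + self.shortcut(corr)))            # Eq. (13)
        return self.out_mlp(out)                                                    # Eq. (14)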

Multi-layer fusion. The pyramidal correlational representations are merged from the coarsest to the finest level by cascading a pair-wise operation of three steps: upsampling, addition, and non-linear transform. We first bilinearly upsample the bottommost correlational representation to the query spatial dimensions of its adjacent, earlier one and then add the two representations to obtain a mixed one. The mixed representation is fed to two sequential AS layers until its support dimensions collapse to a point feature, which is passed to the subsequent pyramidal fusion. The output of the earliest fusion layer is fed to a convolutional decoder, which consists of interleaved 2D convolutions and bilinear upsampling that map the channel dimension to 2 (foreground and background) and the output spatial size to the input query image size. See Fig. 2 for the overall process of multi-layer fusion.

Class-wise foreground map computation. The $K$-shot output foreground activation maps are averaged to produce a single mask prediction for each class. The averaged output map is normalized by softmax over the two channels of the binary segmentation map to obtain the foreground probability map $\mathbf{Y}_{n}$ used in Eq. (1).
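For a single class, the K-shot averaging and the two-channel normalization can be sketched as follows; the names are ours, for illustration.

import torch

def classwise_foreground(shot_logits: torch.Tensor) -> torch.Tensor:
    # shot_logits: (K, 2, H, W) two-channel (background, foreground) maps, one per support shot.
    # Returns the (H, W) foreground probability map for the class.
    avg = shot_logits.mean(dim=0)   # average the K shot-wise predictions
    probs = avg.softmax(dim=0)      # normalize over the two channels
    return probs[1]                 # foreground probability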

1-way 1-shot / 2-way 1-shot; each block lists the four folds (5^0, 5^1, 5^2, 5^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
PANet [wang2019panet] | 69.9 67.7 68.8 69.4 69.0 | 32.8 45.8 31.0 35.1 36.2 | 56.2 47.5 44.6 55.4 50.9 | 33.3 46.0 31.2 38.4 37.2
PFENet [tian2020pfenet] | 69.8 82.4 68.1 77.9 74.6 | 38.3 54.7 35.1 43.8 43.0 | 22.5 61.7 40.3 39.5 41.0 | 31.1 47.3 30.8 32.2 35.3
HSNet [hsnet] | 86.6 84.8 76.9 86.3 83.7 | 49.1 59.7 41.0 49.0 49.7 | 68.0 73.2 57.0 70.9 67.3 | 42.4 53.7 34.0 43.9 43.5
ASNet (weak labels) | 86.4 86.3 70.9 84.5 82.0 | 10.8 20.2 13.1 16.1 15.0 | 71.6 72.4 46.4 68.0 64.6 | 11.4 20.8 12.5 15.9 15.1
ASNet | 84.9 89.6 79.0 86.2 84.9 | 51.7 61.5 43.3 52.8 52.3 | 68.5 76.2 58.6 70.0 68.3 | 48.5 58.3 36.3 48.3 47.8
Table 1: Performance comparison of ASNet and others on FS-CS on Pascal-5^i [shaban2017oslsm]. All methods are trained and evaluated under the iFSL framework given strong labels, i.e., class segmentation masks, except for the ASNet (weak labels) variant, which is trained only with weak labels, i.e., class tags.
1-way 1-shot / 2-way 1-shot
method | ER | mIoU | ER | mIoU
PANet [wang2019panet] | 66.7 | 25.2 | 48.5 | 23.6
PFENet [tian2020pfenet] | 71.4 | 31.9 | 36.5 | 22.6
HSNet [hsnet] | 77.0 | 34.3 | 62.5 | 29.5
ASNet | 78.6 | 35.8 | 63.1 | 31.6
Table 2: Performance comparison of ASNet and others on FS-CS on COCO-20^i [nguyen2019fwb].

6 Experiments

In this section, we report our experimental results on the FS-CS task, the iFSL framework, and the ASNet architecture, after briefly describing implementation details and evaluation benchmarks. See the supplementary material for additional results, analyses, and experimental details.

6.1 Experimental setups

Experimental settings. We select ResNet50 and ResNet101 [resnet] pretrained on ImageNet [russakovsky2015imagenet] as our backbone networks for a fair comparison with other methods and freeze the backbones during training, similarly to previous work [tian2020pfenet, hsnet]. We train models using the Adam [adam] optimizer, with separate learning rates for the classification loss and the segmentation loss. We train all models with 1-way 1-shot training episodes and evaluate them on arbitrary $N$-way $K$-shot episodes. For inferring class occurrences, we use the threshold $\delta = 0.5$. All AS layers are implemented as multi-head attention with 8 heads. The correlation pyramid is constructed from the mid- to high-level feature layers as described in Sec. 5.1.

Dataset. For the new task of FS-CS, we construct a benchmark adopting the images and splits of the two widely-used FS-S datasets, Pascal-5^i [shaban2017oslsm, pascal] and COCO-20^i [nguyen2019fwb, coco], which are also suitable for multi-label classification [wang2017multi]. Within each fold, we construct an episode by randomly sampling a query and an $N$-way $K$-shot support set, which annotates the query with $N$-way class labels and a segmentation mask defined in the context of the support set. For the FS-S task, we also use Pascal-5^i and COCO-20^i following the same data splits as [shaban2017oslsm] and [nguyen2019fwb], respectively.

Evaluation. Each dataset is split into four mutually disjoint class sets and cross-validated. As the multi-label classification metric, we use the 0/1 exact ratio [durand2019learning], which counts a query as correct only if the occurrences of all $N$ classes are predicted correctly; in the supplementary material, we also report the results in classification accuracy, which averages the per-class correctness for each query. For segmentation, we use the mean IoU, $\mathrm{mIoU} = \frac{1}{|\mathcal{C}|} \sum_{c} \mathrm{IoU}_{c}$ [shaban2017oslsm, wang2019panet], where $\mathrm{IoU}_{c}$ denotes the IoU value of class $c$.
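The two metrics, as described above, can be sketched as follows; this is a per-episode illustration, whereas the benchmarks typically accumulate intersections and unions per class over the whole test split before averaging.

import torch

def exact_ratio(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # 0/1 exact ratio: an episode counts as correct only if all N class occurrences are predicted correctly.
    # pred, gt: (B, N) multi-hot tensors over B episodes.
    return (pred == gt).all(dim=1).float().mean().item()

def mean_iou(pred_mask: torch.Tensor, gt_mask: torch.Tensor, num_classes: int) -> float:
    # Mean IoU over the foreground classes 1..num_classes (background label 0 is excluded).
    ious = []
    for c in range(1, num_classes + 1):
        inter = ((pred_mask == c) & (gt_mask == c)).sum().item()
        union = ((pred_mask == c) | (gt_mask == c)).sum().item()
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / max(len(ious), 1)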

Figure 3: 2-way 1-shot segmentation results of ASNet on FS-CS. The examples cover all three cases in which the query contains both, one, or neither of the two support classes. The images are resized to a square shape for visualization.

6.2 Experimental evaluation of iFSL on FS-CS

In this subsection, we investigate the iFSL learning framework on the FS-CS task. All ablation studies are conducted using ResNet50 on Pascal-5^i and evaluated in the 1-way 1-shot setup unless specified otherwise. Note that it is difficult to present a fair and direct comparison between the conventional FS-C and our few-shot classification task, since FS-C is always evaluated on single-label classification benchmarks [matchingnet, tieredimagenet, cifarfs, metaoptnet, metadataset], whereas our task is evaluated on multi-label benchmarks [pascal, coco], which are in general irreducible to single-label ones.

Effectiveness of iFSL on FS-CS. We validate the iFSL framework on FS-CS and also compare the performance of ASNet with those of three recent state-of-the-art methods, PANet [wang2019panet], PFENet [tian2020pfenet], and HSNet [hsnet], which are originally proposed for the conventional FS-S task; all the models are trained by iFSL for a fair comparison. Note that we exclude the background merging step (Eqs. 3 and 4) for PANet as its own pipeline produces a multi-class output including background. Tables 1 and 2 validate the iFSL framework on the FS-CS task quantitatively, where our ASNet surpasses other methods on both 1-way and 2-way setups in terms of few-shot classification as well as the segmentation performance. The 2-way segmentation results are also qualitatively demonstrated in Fig. 3 visualizing exhaustive inclusion relations between a query class set and a target (support) class set in a 2-way setup.

Figure 4: $N$-way 1-shot FS-CS performance comparison of four methods, varying $N$ from 1 to 5.

Weakly-supervised iFSL. The iFSL framework is versatile across levels of supervision: weak labels (class tags) or strong labels (segmentation masks). Assuming weak labels are available but strong labels are not, ASNet is trainable with the classification learning objective of iFSL (Eq. 6); its results are presented as the weak-label variant in Table 1. This variant performs on par with ASNet in terms of classification ER (82.0% vs. 84.9% on 1-way 1-shot) but performs poorly on the segmentation task (15.0% vs. 52.3% on 1-way 1-shot). The result implies that class tag labels are sufficient for a model to recognize class occurrences but too weak to endow the model with precise spatial recognition ability.

Multi-class scalability of FS-CS. In addition, FS-CS is extensible to a multi-class problem with an arbitrary number of classes, while FS-S is not as flexible in the wild. Figure 4 compares the FS-CS performances of four methods by varying the number of classes $N$ from one to five, where the other experimental settings follow those of Table 1. Our ASNet shows consistently better performance than the other methods on FS-CS across varying numbers of classes.

Robustness of FS-CS against task transfer. We evaluate the transferability between FS-CS, FS-C, and FS-S by training a model on one task and evaluating it on another. The results are compared in Fig. 5, in which "FS-S → FS-CS" denotes the result where a model trained on the FS-S task (with the guarantee of support class presence) is evaluated on the FS-CS setup. To construct training and validation splits for FS-C or FS-S, we sample episodes that satisfy the corresponding constraint on support class occurrences: we sample 2-way 1-shot episodes having a single positive class for training or evaluating on FS-C, and we collect 1-way 1-shot episodes sampled from the same class for training or evaluating on FS-S. For training FS-C models, we use the class tag supervision only. All other settings are kept the same, e.g., we use ASNet with ResNet50 on Pascal-5^i.

The results show that FS-CS learners, i.e., models trained on FS-CS, are transferable to the two conventional few-shot learning tasks and yet overcome their shortcomings. The transferability between the few-shot classification tasks, i.e., FS-C and FS-CS with weak labels, is presented in Fig. 5 (a). In this setup, the learner is evaluated by predicting the higher class response between the two classes, although it is trained using the multi-label classification objective. The FS-CS learner closely competes with the FS-C learner on FS-C in terms of classification accuracy. In contrast, the task transfer between the segmentation tasks, FS-S and FS-CS, yields asymmetric outcomes, as shown in Fig. 5 (b) and (c). The FS-CS learner shows a relatively small performance drop on FS-S, whereas the FS-S learner suffers a severe performance drop on FS-CS. Qualitative examples in Fig. 1 demonstrate that the FS-S learner predicts a vast number of false-positive pixels, resulting in poor performance. In contrast, the FS-CS learner successfully distinguishes the region of interest by analyzing the semantic relevance of the query objects to the support set.

Figure 5: Results of task transfer. A → B denotes a model trained on task A and evaluated on task B. (a): Exclusive 2-way 1-shot classification accuracy of the FS-C learner and the FS-CS learner trained with weak labels, evaluated on FS-C. (b): 1-way 1-shot segmentation mIoU of FS-S or FS-CS learners on FS-CS. (c): 1-way 1-shot segmentation mIoU of FS-S or FS-CS learners on FS-S.
1-way 1-shot / 1-way 5-shot
method | mIoU (folds 5^0-5^3, mean) | FB-IoU | mIoU (folds 5^0-5^3, mean) | FB-IoU | # learn. params.
R50 CANet [zhang2019canet] 52.5 65.9 51.3 51.9 55.4 66.2 55.5 67.8 51.9 53.2 57.1 69.6 -
PPNet [liu2020ppnet] 47.8 58.8 53.8 45.6 51.5 69.2 58.4 67.8 64.9 56.7 62.0 75.8 23.5 M
PFENet [tian2020pfenet] 61.7 69.5 55.4 56.3 60.8 73.3 63.1 70.7 55.8 57.9 61.9 73.9 31.5 M
SAGNN [xie2021scale] 64.7 69.6 57.0 57.2 62.1 73.2 64.9 70.0 57.0 59.3 62.8 73.3 -
MMNet [wu2021learning] 62.7 70.2 57.3 57.0 61.8 - 62.2 71.5 57.5 62.4 63.4 - 10.4 M
CMN [xie2021few] 64.3 70.0 57.4 59.4 62.8 72.3 65.8 70.4 57.6 60.8 63.7 72.8 -
MLC [yang2021mining] 59.2 71.2 65.6 52.5 62.1 - 63.5 71.6 71.2 58.1 66.1 - 8.7 M
HSNet [hsnet] 64.3 70.7 60.3 60.5 64.0 76.7 70.3 73.2 67.4 67.1 69.5 80.6 2.6 M
ASNet 68.9 71.7 61.1 62.7 66.1 77.7 72.6 74.3 65.3 67.1 70.8 80.4 1.4 M
Table 3: FS-S results on 1-way 1-shot and 1-way 5-shot setups on Pascal-5^i [shaban2017oslsm] using ResNet50 [resnet] (R50).
1-way 1-shot 1-way 5-shot # learn.
method mIoU FBIoU mIoU FBIoU params.
R50 RPMM [yang2020pmm] 30.6 - 35.5 - 38.6 M
RePRI [malik2021repri] 34.0 - 42.1 - -
MMNet [wu2021learning] 37.5 - 38.2 - 10.4 M
MLC [yang2021mining] 33.9 - 40.6 - 8.7 M
CMN [xie2021few] 39.3 61.7 43.1 63.3 -
HSNet [hsnet] 39.2 68.2 46.9 70.7 2.6 M
ASNet 42.2 68.8 47.9 71.6 1.4 M
Table 4: FS-S results on 1-way 1-shot and 1-way 5-shot setups on COCO-20^i [nguyen2019fwb].
method | ER | mIoU
(a) global → local self-attention | 83.9 | 44.6
(b) w/o masked attention | 83.8 | 50.8
(c) w/o multi-layer fusion | 83.1 | 51.6
ASNet | 84.9 | 52.3
Table 5: Ablation study of the AS layer on 1-way 1-shot FS-CS on Pascal-5^i [shaban2017oslsm] using ResNet50 [resnet].

6.3 Comparison with recent FS-S methods on FS-S

Tables 3 and 4 compare the results of recent few-shot semantic segmentation methods and ASNet on the conventional FS-S task. All model performances in the tables are taken from the corresponding papers, and the numbers of learnable parameters are either taken from the papers or counted from the official implementations. For a fair comparison, methods that incorporate extra unlabeled images [yang2021mining, liu2020ppnet] are reported with their performances measured in the absence of the extra data. Note that ASNet in Tables 3 and 4 is trained and evaluated following the FS-S setup, not the proposed FS-CS one.

The results verify that ASNet outperforms the existing methods, including the most recent ones [wu2021learning, xie2021few, yang2021mining]. In particular, the methods that cast few-shot segmentation as correlation feature transform, ASNet and HSNet [hsnet], outperform the other, visual-feature-transform methods, indicating that learning correlations is beneficial for both FS-CS and FS-S. Note that ASNet is also the most lightweight, as it processes correlation features with small channel dimensions, e.g., at most 128, rather than visual features with large channel dimensions, e.g., up to 2048 in ResNet50.

6.4 Analyses on the model architecture

We perform ablation studies on the model architecture to reveal the benefit of each component. We replace the global self-attention in the AS layer with local self-attention [lsa] to see the effect of global self-attention (Table 5a). The local self-attention variant is comparable to the global one in terms of the classification exact ratio but degrades the segmentation mIoU significantly, signifying the importance of learning the global context of feature correlations. Next, we ablate the attention masking of Eq. (11), which verifies that the attention masking prior is effective (Table 5b). Lastly, we replace the multi-layer fusion path with spatial average pooling over the support dimensions followed by element-wise addition (Table 5c); the result indicates that fusing the outputs from the multi-layer correlations is crucial for precisely estimating class occurrences and segmentation masks.

7 Discussion

We have introduced the integrative task of few-shot classification and segmentation (FS-CS) that generalizes two existing few-shot learning problems. Our proposed integrative few-shot learning (iFSL) framework is shown to be effective on FS-CS; in addition, our attentive squeeze network (ASNet) outperforms recent state-of-the-art methods on both FS-CS and FS-S. The iFSL design allows a model to learn with either weak or strong labels; that said, learning our method with weak labels achieves low segmentation performance. This result opens a future direction of effectively boosting segmentation performance by leveraging weak labels in the absence of strong labels for FS-CS.

Acknowledgements.

This work was supported by Samsung Advanced Institute of Technology (SAIT) and also by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD).

References

A Supplementary Material

a.1 Detailed model architecture

The comprehensive configuration of the attentive squeeze network is summarized in Table a.6, and its building block, the attentive squeeze layer, is depicted in Fig. a.6. The channel sizes of the input correlations are determined by the number of unit layers per pyramid level of the respective backbones, ResNet50 [resnet], ResNet101, and VGG-16 [vgg].

a.2 Implementation details

Our framework is implemented in PyTorch [pytorch] using the PyTorch Lightning [falcon2019pytorch] framework. To reproduce the existing methods, we heavily borrow from publicly available code bases (PANet [wang2019panet]: https://github.com/kaixin96/PANet, PFENet [tian2020pfenet]: https://github.com/dvlab-research/PFENet, HSNet [hsnet]: https://github.com/juhongm999/hsnet). We use the officially provided hyper-parameters for each method while sharing generic techniques across all methods, e.g., excluding images with small support objects from support sets or switching the roles of the query and the support during training. NVIDIA GeForce RTX 2080 Ti or NVIDIA TITAN Xp GPUs are used in all experiments; we train models using two GPUs on Pascal-5^i [shaban2017oslsm] and four GPUs on COCO-20^i [nguyen2019fwb]. Model training is halted either when it reaches the maximum number of epochs or when it starts to overfit. We resize input images to a fixed square resolution without any data augmentation during both training and testing for all methods. For segmentation evaluation, we recover the two-channel output foreground map to the original image size by bilinear interpolation. Pascal-5^i and COCO-20^i are derived from Pascal Visual Object Classes 2012 [pascal] and Microsoft Common Objects in Context 2014 [coco], respectively. To construct episodes, we sample support sets such that one of the query classes is included in the support set with probability 0.5, balancing the ratio of background episodes across the benchmarks.
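The class-balanced episode sampling described above can be sketched as follows; the function and variable names are illustrative and not taken from the released code.

import random

def sample_support_classes(query_classes, candidate_classes, n_way):
    # Include one of the query classes in the N-way support set with probability 0.5,
    # so that positive and background episodes are roughly balanced.
    support = []
    if query_classes and random.random() < 0.5:
        support.append(random.choice(sorted(query_classes)))
    negatives = [c for c in candidate_classes if c not in query_classes and c not in support]
    support += random.sample(negatives, n_way - len(support))
    random.shuffle(support)
    return support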

Figure a.6: Illustration of the proposed attentive squeeze layer (Sec. 5.1. in the main paper). The shape of each output tensor is denoted next to arrows.
[Table a.6 body: the numeric layer specifications did not survive extraction. The recoverable sequence of operations is: AS layers that pool the support dimensions by half, AS layers that pool the support dimensions to a point, repeated pairs of upsampling the query dimensions and element-wise addition for multi-layer fusion, and a final interpolation of the query dimensions to the input size.]
Table a.6: Comprehensive configuration of ASNet, whose overview is illustrated in Fig. 2 of the main paper. The top of the table specifies the input of the model and the rows below it the detailed architecture. Each AS layer entry denotes its kernel size, stride, and padding size for the convolutional embedding, together with its input and output channel sizes.

a.3 Further analyses

In this subsection, we provide supplementary analyses of the iFSL framework and ASNet. All experimental results are obtained using ResNet50 on Pascal-5^i and evaluated with 1-way 1-shot episodes unless specified otherwise.

Figure a.7: Classification threshold and its effects.

The classification occurrence threshold $\delta$. Equation 2 in the main paper describes the process of detecting object classes on the shared foreground maps by thresholding the highest foreground probability response on each foreground map. As the foreground probability is bounded between 0 and 1, we set the threshold to $\delta = 0.5$ for simplicity. A high threshold value makes the classifier reject low-probability responses as class presences. Figure a.7 shows the classification 0/1 exact ratio as the threshold varies; the performance is highest around the middle of the range. Fine-tuning the threshold for the best classification performance is not the focus of this work, so we opt for the most straightforward threshold in all experiments.

Figure a.8: Visualization of background map for each support class and the merged background map for the query. High background response is illustrated in black.

Visualization of the episodic background map. Figure a.8 visually demonstrates the background merging step of iFSL in Eq. (3) of the main paper. The background maps are taken from 2-way 1-shot episodes. The background response of the negative class is relatively even, i.e., the majority of pixels are estimated as background, whereas the background response of the positive class contributes most to the merged background map.

1-way 1-shot / 2-way 1-shot; each block lists the four folds (5^0-5^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
ASNet (L_cls) | 86.4 86.3 70.9 84.5 82.0 | 10.8 20.2 13.1 16.1 15.0 | 71.6 72.4 46.4 68.0 64.6 | 11.4 20.8 12.5 15.9 15.1
ASNet (L_seg) | 84.9 89.6 79.0 86.2 84.9 | 51.7 61.5 43.3 52.8 52.3 | 68.5 76.2 58.6 70.0 68.3 | 48.5 58.3 36.3 48.3 47.8
ASNet (L_cls + L_seg) | 86.9 87.4 75.8 88.7 84.7 | 51.6 61.2 42.4 53.2 52.1 | 70.1 72.4 54.8 74.8 68.0 | 48.1 57.1 36.0 50.1 47.8
Table a.7: FS-CS results of ASNet trained with the iFSL objectives. L_cls, L_seg, and L_cls + L_seg correspond to the iFSL learning objectives given classification tags, segmentation annotations, or both, respectively.
1-way 1-shot / 2-way 1-shot; each block lists the four folds (5^0-5^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
PANet [wang2019panet] 80.8 76.6 74.4 75.5 76.8 33.6 48.6 32.3 37.6 38.0 72.4 64.5 53.4 64.7 63.8 37.4 49.1 33.1 39.7 39.8
PFENet [tian2020pfenet] 68.4 83.0 65.8 75.2 73.1 37.7 55.3 34.5 44.8 43.1 25.9 56.2 44.6 38.8 41.4 31.2 47.2 28.9 33.5 35.2
HSNet [hsnet] 86.6 86.6 75.7 86.0 83.7 49.0 60.6 42.5 52.3 51.1 74.6 74.4 55.6 70.8 68.9 40.9 52.0 36.4 47.8 44.3
ASNet 87.2 88.1 77.2 87.2 84.9 53.5 62.0 43.9 55.1 53.6 73.1 76.8 56.7 74.7 70.3 49.5 56.3 40.0 50.0 48.9
Table a.8: FS-CS results on Pascal-5^i using ResNet101.

iFSL with weak labels, strong labels, and both. Table a.7 compares the FS-CS performances of three ASNets, each trained with the classification loss (Eq. (6) in the main paper), the segmentation loss (Eq. (7) in the main paper), or both. The loss is chosen according to the level of supervision on the support sets: classification tags (weak labels) or segmentation annotations (strong labels). We observe that neither the classification nor the segmentation performance deviates significantly between training with the segmentation loss alone and training with both losses; their average performances differ by less than 0.3%p. As a segmentation annotation is a dense form of classification tags, the classification loss has an insignificant influence when the segmentation loss is already used for training. We thus choose to use the segmentation loss exclusively in the presence of segmentation annotations.

a.4 Additional results

Here we provide several extra experimental results that are omitted from the main paper due to the lack of space. The contents include results using other backbone networks, another evaluation metric, and setups with more than one shot per class.

iFSL on FS-CS using ResNet101. We include the FS-CS results of the iFSL framework on Pascal-5^i using ResNet101 [resnet] in Table a.8, which is omitted from the main paper due to the page limit. All other experimental setups match those of Table 1 in the main paper except for the backbone network. ASNet again outperforms the previous methods on both the classification and segmentation tasks with this backbone.

2-way 1-shot; each block lists the four folds (5^0-5^3) followed by the average.
method | classification 0/1 exact ratio (%) | classification accuracy (%)
PANet [wang2019panet] 56.2 47.5 44.6 55.4 50.9 74.9 70.2 67.8 74.8 71.9
PFENet [tian2020pfenet] 22.5 61.7 40.3 39.5 41.0 64.1 79.5 66.4 66.1 69.0
HSNet [hsnet] 68.0 73.2 57.0 70.9 67.3 82.4 85.6 76.0 84.5 82.1
ASNet (weak labels) 71.6 72.1 46.4 68.0 64.6 84.9 85.4 69.2 82.2 80.4
ASNet 68.5 76.2 58.6 70.0 68.3 82.9 87.5 76.7 84.0 82.8
Table a.9: FS-CS classification accuracy (%) and 0/1 exact ratio (%) on Pascal-5^i using ResNet50.

FS-CS classification metrics: 0/1 exact ratio and accuracy. Table a.9 presents the results of two classification evaluation metrics of FS-CS: 0/1 exact ratio [durand2019learning] and classification accuracy. The classification accuracy metric takes the average of correct predictions for each class for each query, while 0/1 exact ratio measures the binary correctness for all classes for each query, thus being stricter than the accuracy; the exact formulations are in Sec. 6.1. of the main paper. ASNet shows higher classification performance in both classification metrics than others.

1-way 5-shot / 2-way 5-shot; each block lists the four folds (5^0-5^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
PANet [wang2019panet] 72.5 70.2 70.7 74.6 72.0 45.6 56.2 44.6 49.2 48.9 61.1 46.8 44.0 66.2 54.5 46.2 57.4 46.7 47.6 49.5
PFENet [tian2020pfenet] 70.9 84.5 67.1 80.4 75.7 42.8 56.3 36.2 47.3 45.7 22.3 63.2 42.5 40.6 42.2 35.9 50.5 33.3 35.4 38.8
HSNet [hsnet] 91.1 88.1 82.0 90.7 88.0 56.2 61.3 40.2 54.2 53.0 79.7 81.0 65.0 81.0 76.7 42.5 58.9 32.0 44.1 44.4
ASNet 90.5 90.4 82.3 91.8 88.8 59.2 63.5 41.2 58.7 55.7 81.4 81.4 68.0 80.6 77.9 53.4 60.4 35.9 50.6 50.1
Table a.10: FS-CS results on 5-shot setups on Pascal-5^i using ResNet50.
1-way 5-shot / 2-way 5-shot; each block lists the four folds (5^0-5^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
PANet [wang2019panet] 83.7 81.6 78.3 81.3 81.2 48.2 59.1 45.5 50.5 50.8 79.0 68.4 60.5 72.3 70.1 49.1 59.6 46.8 50.1 51.4
PFENet [tian2020pfenet] 70.3 85.3 65.9 78.6 75.0 42.2 56.0 35.7 48.7 45.7 26.9 56.0 49.2 37.3 42.4 35.7 49.6 31.4 36.9 38.4
HSNet [hsnet] 91.4 89.5 79.4 90.9 87.8 55.2 64.2 41.7 58.4 54.9 85.6 80.8 61.3 81.7 77.4 38.5 57.6 34.8 49.8 45.2
ASNet 91.5 90.2 80.6 93.4 88.9 60.3 64.7 41.4 58.5 56.2 82.8 81.1 65.1 85.5 78.6 53.8 61.0 34.2 52.2 50.3
Table a.11: FS-CS results on 5-shot setups on Pascal-5^i using ResNet101.

iFSL on 5-shot FS-CS. Tables a.10 and a.11 compare four different methods on the 1-way 5-shot and 2-way 5-shot FS-CS setups, which are omitted from the main paper due to the page limit. All other experimental setups match those of Table 1 in the main paper except for the number of support samples per class, i.e., the number of shots. ASNet also outperforms the other methods on the multi-shot setups.

1-way 1-shot / 1-way 5-shot
method | mIoU (folds 5^0-5^3, mean) | FB-IoU | mIoU (folds 5^0-5^3, mean) | FB-IoU | # learn. params.
VGG-16 OSLSM [shaban2017oslsm] 33.6 55.3 40.9 33.5 40.8 - 35.9 58.1 42.7 39.1 43.9 - 276.7 M
PANet [wang2019panet] 42.3 58.0 51.1 41.2 48.1 66.5 51.8 64.6 59.8 46.5 55.7 70.7 14.7 M
FWB [nguyen2019fwb] 47.0 59.6 52.6 48.3 51.9 - 50.9 62.9 56.5 50.1 55.1 - -
RPMMs [yang2020pmm] 47.1 65.8 50.6 48.5 53.0 - 50.0 66.5 51.9 47.6 54.0 - -
PFENet [tian2020pfenet] 56.9 68.2 54.4 52.4 58.0 72.0 59.0 69.1 54.8 52.9 59.0 72.3 10.4 M
HSNet [hsnet] 59.6 65.7 59.6 54.0 59.7 73.4 64.9 69.0 64.1 58.6 64.1 76.6 2.6 M
ASNet 61.7 66.7 58.6 55.3 60.6 72.6 66.5 68.7 63.0 58.4 64.1 75.8 1.4 M
R101 FWB [nguyen2019fwb] 51.3 64.5 56.7 52.2 56.2 - 54.8 67.4 62.2 55.3 59.9 - 43.0 M
DAN [wang2020dan] 54.7 68.6 57.8 51.6 58.2 71.9 57.9 69.0 60.1 54.9 60.5 72.3 -
RePRI [malik2021repri] 59.6 68.6 62.2 47.2 59.4 - 66.2 71.4 67.0 57.7 65.6 - 65.7 M
PFENet [tian2020pfenet] 60.5 69.4 54.4 55.9 60.1 72.9 62.8 70.4 54.9 57.6 61.4 73.5 10.8 M
MLC [yang2021mining] 60.8 71.3 61.5 56.9 62.6 - 65.8 74.9 71.4 63.1 68.8 - 27.7 M
HSNet [hsnet] 67.3 72.3 62.0 63.1 66.2 77.6 71.8 74.4 67.0 68.3 70.4 80.6 2.6 M
ASNet 69.0 73.1 62.0 63.6 66.9 78.0 73.1 75.6 65.7 69.9 71.1 81.0 1.4 M
Table a.12: FS-S results on 1-way 1-shot and 1-way 5-shot setups on PASCAL-5^i using VGG-16 [vgg] and ResNet101 [resnet].
N-way 1-shot
method | classification 0/1 exact ratio (%) for N = 1, 2, 3, 4, 5 | segmentation mIoU (%) for N = 1, 2, 3, 4, 5
PANet [wang2019panet] 69.0 50.9 39.3 29.1 22.2 36.2 37.2 37.1 36.6 35.3
PFENet [tian2020pfenet] 74.6 41.0 24.9 14.5 7.9 43.0 35.3 30.8 27.6 24.9
HSNet [hsnet] 82.7 67.3 52.5 45.2 36.8 49.7 43.5 39.8 38.1 36.2
ASNet 84.9 68.3 55.8 46.8 37.3 52.3 47.8 45.4 44.5 42.4
Table a.13: Numerical results of Fig. 4 in the main paper: FS-CS performances on N-way 1-shot, varying N from 1 to 5.

ASNet on FS-S using VGG-16. Table a.12 compares the recent state-of-the-art methods and ASNet on FS-S using VGG-16 [vgg]. We train and evaluate ASNet under the FS-S problem setup to compare fairly with the recent methods. All other experimental variables are detailed in Sec. 6.3 and Table 3 of the main paper. ASNet consistently shows strong performance with the VGG-16 backbone, as observed in the experiments using ResNets.

Figure a.9: 2-way 1-shot FS-CS segmentation prediction maps on the COCO-20^i benchmark.

Qualitative results. We attach additional segmentation predictions of ASNet learned with the iFSL framework on the FS-CS task in Fig. a.9. We observe that ASNet successfully predicts segmentation maps in challenging scenarios in the wild, such as a) segmenting tiny objects, b) segmenting non-salient objects, c) segmenting multiple objects, and d) segmenting a query given a small support object annotation.

Figure a.10: 2-way 1-shot FS-CS segmentation prediction maps of the FS-S learner and the FS-CS learner.

Qualitative results of the FS-S learner on FS-CS. Figure a.10 visualizes typical failure cases of the FS-S learner in comparison with the FS-CS learner; these examples qualitatively show the severe performance drop of the FS-S learner on FS-CS, which is quantitatively presented in Fig. 5 (b) of the main paper. Sharing the same ASNet architecture, each model is trained on either the FS-S or the FS-CS setup and evaluated on the 2-way 1-shot FS-CS setup. The results demonstrate that the FS-S learner is unaware of object classes and gives foreground predictions on any existing objects, whereas the FS-CS learner effectively distinguishes the object classes based on the support classes and produces clean and adequate segmentation maps.

1-way 1-shot / 2-way 1-shot; each block lists the four folds (20^0-20^3) followed by the average.
method | 1-way cls. 0/1 ER (%) | 1-way seg. mIoU (%) | 2-way cls. 0/1 ER (%) | 2-way seg. mIoU (%)
PANet [wang2019panet] 64.3 66.5 68.0 67.9 66.7 25.5 24.7 25.7 24.7 25.2 42.5 49.9 53.6 47.8 48.5 24.9 25.0 23.3 21.4 23.6
PFENet [tian2020pfenet] 70.7 70.6 71.2 72.9 71.4 30.6 34.8 29.4 32.6 31.9 35.6 34.3 43.1 32.8 36.5 23.3 23.8 20.2 23.1 22.6
HSNet [hsnet] 74.7 77.2 78.5 77.6 77.0 36.2 34.3 32.9 34.0 34.3 57.7 62.4 67.1 62.6 62.5 28.9 29.6 30.3 29.3 29.5
ASNet 76.2 78.8 79.2 80.2 78.6 35.7 36.8 35.3 35.6 35.8 59.5 61.5 68.8 62.4 63.1 29.8 33.0 33.4 30.4 31.6
Table a.14: Fold-wise FS-CS results on COCO-20^i using ResNet50. The results correspond to Table 2 in the main paper.
1-way 1-shot / 1-way 5-shot
method | mIoU (folds 20^0-20^3, mean) | FB-IoU | mIoU (folds 20^0-20^3, mean) | FB-IoU | # learn. params.
R50 RPMM [yang2020pmm] 29.5 36.8 28.9 27.0 30.6 - 33.8 42.0 33.0 33.3 35.5 - 38.6 M
RePRI [malik2021repri] 31.2 38.1 33.3 33.0 34.0 - 38.5 46.2 40.0 43.6 42.1 - -
MMNet [wu2021learning] 34.9 41.0 37.2 37.0 37.5 - 37.0 40.3 39.3 36.0 38.2 - 10.4 M
MLC [yang2021mining] 46.8 35.3 26.2 27.1 33.9 - 54.1 41.2 34.1 33.1 40.6 - 8.7 M
CMN [xie2021few] 37.9 44.8 38.7 35.6 39.3 61.7 42.0 50.5 41.0 38.9 43.1 63.3 -
HSNet [hsnet] 36.3 43.1 38.7 38.7 39.2 68.2 43.3 51.3 48.2 45.0 46.9 70.7 2.6 M
ASNet 41.5 44.1 42.8 40.6 42.2 68.8 47.6 50.1 47.7 46.4 47.9 71.6 1.4 M
R101 FWB [nguyen2019fwb] 17.0 18.0 21.0 28.9 21.2 - 19.1 21.5 23.9 30.1 23.7 - 43.0 M
DAN [wang2020dan] - - - - 24.4 62.3 - - - - 29.6 63.9 -
PFENet [tian2020pfenet] 34.3 33.0 32.3 30.1 32.4 58.6 38.5 38.6 38.2 34.3 37.4 61.9 10.8 M
SAGNN [xie2021scale] 36.1 41.0 38.2 33.5 37.2 60.9 40.9 48.3 42.6 38.9 42.7 63.4 -
MLC [yang2021mining] 50.2 37.8 27.1 30.4 36.4 - 57.0 46.2 37.3 37.2 44.4 - 27.7 M
HSNet [hsnet] 37.2 44.1 42.4 41.3 41.2 69.1 45.9 53.0 51.8 47.1 49.5 72.4 2.6 M
ASNet 41.8 45.4 43.2 41.9 43.1 69.4 48.0 52.1 49.7 48.2 49.5 72.7 1.4 M
Table a.15: Fold-wise FS-S results on 1-way 1-shot and 1-way 5-shot setups on COCO-20^i using ResNet50 (R50) and ResNet101 (R101).

Fold-wise results on COCO-20^i. Tables a.14 and a.15 present fold-wise performance comparisons on the FS-CS and FS-S tasks, respectively. We validate that ASNet outperforms the competitors by large margins on both the FS-CS and FS-S tasks on the challenging COCO-20^i benchmark.

Numerical results of Fig. 4 in the main paper. We report the numerical results of Fig. 4 in the main paper in Table a.13 as a reference for future research.