Semantically Meaningful Class Prototype Learning for One-Shot Image Semantic Segmentation

02/22/2021
by   Tao Chen, et al.

One-shot semantic image segmentation aims to segment the object regions of a novel class with only one annotated image. Recent works adopt an episodic training strategy to mimic the expected situation at testing time. However, these existing approaches simulate the test conditions too strictly during training and thus cannot make full use of the given label information. Besides, these approaches mainly focus on the foreground-background target class segmentation setting and only utilize binary mask labels for training. In this paper, we propose to leverage the multi-class label information during the episodic training, which encourages the network to generate more semantically meaningful features for each category. After integrating the target class cues into the query features, we then propose a pyramid feature fusion module to mine the fused features for the final classifier. Furthermore, to take more advantage of the support image-mask pair, we propose a self-prototype guidance branch for support image segmentation, which constrains the network to generate more compact features and a robust prototype for each semantic class. For inference, we propose a fused prototype guidance branch for the segmentation of the query image. Specifically, we leverage the prediction of the query image to extract a pseudo-prototype and combine it with the initial support prototype. We then utilize the fused prototype to guide the final segmentation of the query image. Extensive experiments demonstrate the superiority of our proposed approach.


I Introduction

Deep convolutional neural networks (CNNs) have achieved significant breakthroughs in many computer vision tasks (e.g., image classification [1, 2, 3, 4, 5], image retrieval [6, 7], object detection [8], semantic segmentation [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], and video streaming [20, 21]). However, training deep CNNs typically requires large-scale labeled datasets, which are expensive to obtain. Although semi-supervised [22], weakly-supervised [23, 24, 25, 26, 27], and unsupervised methods [28, 29, 30, 31, 32, 33, 34, 35, 36, 37] have recently been proposed to alleviate the annotation burden, these traditional deep CNNs are trained for predefined classes. Consequently, they cannot generalize well to tasks with new incremental categories that are not defined during training. Even when given several samples of the novel classes, the trained network is still hard to fine-tune in a data-efficient way [39, 40, 41, 38]. In contrast, after learning from only one annotated image containing a new class, humans can successfully recognize this new category and segment its regions. To imitate such a generalization ability of human beings, researchers have recently turned their attention to the task of one-shot learning, which can also reduce the data-gathering effort. In this paper, we focus on solving the one-shot learning problem for semantic segmentation. In our setting, each test image contains one new target category, and our task of one-shot image segmentation [42, 43, 44, 45, 46, 47, 48, 49] is to predict the object regions of that unseen semantic class with only one annotated image for that new class.

Here, we first define the terms for "support image", "support mask", "query image", "pseudo-mask", "support prototype" and "pseudo-prototype". We refer to the image-mask pair that provides clues for the new class as the support set (the image and mask are denoted as the support image and support mask, respectively). Meanwhile, we refer to the image that needs to be segmented as the query image. During testing, we denote the initial prediction of the query image as the pseudo-mask. The class prototype obtained with the support image-mask pair is denoted as the support prototype. In contrast, the prototype obtained with the query image and its pseudo-mask is denoted as the pseudo-prototype.

Fig. 1: Overview of our framework. Different from existing works, we propose to leverage the multi-class label information during the episodic training. After obtaining the guided features, we propose a pyramid feature fusion module to mine the corresponding features for the target class in a multi-scale way.

Humans’ capability of learning new tasks from only limited examples is, to some extent, based on their past experiences. It is thus crucial for one-shot learning to leverage prior knowledge, for example, a large amount of annotated seen-class images. Since fine-tuning a traditional semantic segmentation network on a single labeled image of the new category is prone to over-fitting, recent works adopt an episodic strategy when training the network on a dataset that provides the prior knowledge. For episodic training, the setting is constructed to mimic the expected situation at testing time: conditioned on a single support example with the annotation of an unseen class, the network trained in such a meta-learning way aims to segment the target class well in the query image. However, these existing approaches simulate the test conditions too strictly during the training process and cannot make full use of the given label information. For example, these approaches mainly focus on the foreground-background target class segmentation setting; they only utilize binary mask labels for training and discard the multi-class label information. Without the multi-class label information, the features extracted from the encoder lack semantic information, and the network is also more prone to over-fitting, which hinders the design of subsequent fusion networks for the guided features. Besides, in the existing training strategy for one-shot segmentation, the support image-mask pair is only used in the conditional branch to provide guidance information for the segmentation branch, whereas it can be further utilized to facilitate network training.

To tackle the above challenges, as illustrated in Fig. 1, in this work, we first propose to take advantage of the multi-class label information during the episodic training. This will encourage the feature encoder to generate more semantically meaningful feature representations for both support and query images. The class-aware semantic features of the support branch will then lead to a semantically meaningful class prototype for the target class. We then further propose a pyramid feature fusion module to integrate the target class cues obtained from the support branch into the query features. Our pyramid feature fusion module can mine the corresponding features for the target class in a multi-scale way to generate class-agnostic but robust features for the final atrous spatial pyramid pooling (ASPP) [17] classifier. That is, on the one hand, our feature encoder focuses on extracting class-aware semantic feature representations from the input images. On the other hand, the pyramid feature fusion module aims to exploit the fused information and generate class-agnostic features for the binary prediction of the target category.

Additionally, to take more advantage of the support image-mask pair, we propose a self-prototype guidance branch for support image segmentation. After obtaining the target class prototype with the support image-mask pair, as illustrated in Fig. 2, we not only use it to guide the segmentation of the query image but also leverage the class prototype to guide the segmentation of the support image itself. Our proposed training scheme constrains the network to generate more compact features and a robust prototype for each semantic class, which helps the network locate the corresponding regions of the target category more accurately. Finally, we propose a fused prototype guidance branch at test time, which helps obtain a more robust target class prototype for the segmentation of the query image. Since the label of the query image is not available in the test phase, we take the predicted binary segmentation map as the pseudo-mask to extract the pseudo-prototype for the query branch. We then fuse the class prototype generated from the support image with the pseudo-prototype from the query image and utilize the fused prototype to guide the final segmentation of the query image. The main contributions of this work are:

1) We propose to leverage the multi-class label information to constrain the network during the episodic training of one-shot semantic image segmentation. This will help the network extract class-aware semantic feature representations from the input images and generate a more semantically meaningful class prototype for the target category.

2) To take advantage of the support image-mask pair, we propose a self-prototype guidance branch for the support image segmentation. This will encourage the network to learn more compact features and a more robust target class prototype, and thus can help to locate the corresponding area of the target class more accurately.

3) During testing, we propose a fused prototype guidance branch to leverage the target class prototypes from both support and query images to guide the final segmentation of the query image.

4) Extensive experiments on the PASCAL-5^i and COCO-20^i datasets demonstrate the superiority of our proposed approach for one-shot image segmentation.

The rest of this paper is organized as follows: the related work is described in Section II and our approach is introduced in Section III; we then report our evaluations on two widely-used datasets for one-shot image segmentation task in Section IV; we report the ablation studies in Section V and finally conclude our work in Section VI.

II Related Work

II-A Semantic Segmentation

Different from image classification, which only needs to attach one label to an input image, semantic segmentation [9, 10, 11, 16, 17, 18] aims to label every pixel of the input image. After the introduction of the fully convolutional network (FCN) [9], deep learning has achieved great success in semantic segmentation. For example, the early works of UNet [10] and SegNet [11] used encoder-decoder architectures to recover the spatial resolution of the input image, for medical image segmentation and street scene parsing, respectively. Dilated/atrous convolution was then proposed in [16, 17] to enlarge the receptive field of the network while maintaining the spatial resolution of the feature maps. To produce high-resolution segmentation maps, RefineNet [19] leveraged a cascaded architecture to combine low-resolution semantic features and fine-grained low-level features in a recursive manner. Recently, contextual information for semantic segmentation was explored in [50] to capture the global semantic context of scenes and selectively highlight class-dependent features. To capture long-range dependencies, the works of [51, 52, 18] leveraged non-local self-attention to adaptively integrate local features with their global dependencies. A joint multi-task learning framework for semantic segmentation and boundary detection was proposed in [53]; an iterative pyramid context module was adopted to couple the two tasks and store the shared latent semantics exchanged between semantic segmentation and boundary detection. Dynamic routing for semantic segmentation [54] was proposed to alleviate scale variance by generating data-dependent routes that adapt to the scale distribution of each image.

II-B One-Shot Classification

One-shot classification algorithms [55, 56, 57, 58, 59, 60, 61] aim to learn information about object categories with only one training example per category. Due to the scarcity of samples, recent works choose to generalize knowledge acquired from seen classes during training to new categories rather than directly using supervised learning-based approaches. For example, model-agnostic meta-learning [55] is compatible with any gradient-based training model and can generalize well with only a small number of training samples. MetaNet [56] acquired knowledge in a meta-learning way and transferred its parameters and biases via fast parameterization for new tasks. Recently, metric-based approaches were proposed to learn a metric space for classification during meta-learning. Prototypical networks [57] classify images of new classes by computing distances to prototype representations of each class. Relation network [58] classifies images by computing relation scores between the query and the support image of each new category. To address the high-variance issue, an ensemble of deep networks was designed in [62] to encourage the networks to cooperate. The work of [60] jointly incorporated visual feature learning, knowledge inferring, and classifier learning into one unified framework for knowledge transfer. DeepEMD [63] formalized one-shot classification as an optimal matching problem between image regions and adopted the Earth Mover’s Distance as the distance metric between the structured representations to determine image relevance.

II-C One-Shot Semantic Segmentation

One-shot semantic image segmentation is the task of segmenting the object pixels of a class given only one annotated image [42, 43, 49, 45, 46, 44, 47, 64, 65]. The target class is defined by the binary ground-truth segmentation mask of the support image. Shaban et al. proposed the first model, OSLSM [42], for one-shot semantic segmentation with a two-branched architecture. In their approach, a conditioning branch analyzes the target class in the support image and generates parameters for the query branch to perform segmentation. Recent progress in one-shot segmentation has mainly followed this two-branched architecture. For example, co-FCN [43] and CANet [49] extended OSLSM [42] by leveraging the support branch to extract an encoded feature embedding; the feature representation containing the target class information is then fused into the query branch as guidance. A-MCG [45] proposed an attention-based multi-context guiding network to integrate multi-scale context features between the support and query branches, and adopted spatial attention along the fusion branch to enhance self-supervision in one-shot learning. Another extension is the similarity guidance network SG-One [46], which computes the similarity between the masked-pooled support feature and the feature maps of the query image to guide the segmentation of the query branch. While the above methods adopt a parametric module to fuse the support information with the query features, PL [44] and PANet [47] leverage non-parametric metric learning to solve the segmentation task. They extract class-specific prototype representations within an embedding space, and segmentation is then performed over the query images by matching each pixel to the learned prototypes. The key to the success of these methods lies in the effective utilization of information from the support image to assist the segmentation of the query image. They leverage metric learning to perform feature matching [42, 43, 49, 45, 44] (or calculate a certain distance, such as cosine similarity [46, 47]) to mine the connection between the support and query images, thereby improving the query image’s segmentation results. However, existing methods only utilize binary mask labels for training and discard the multi-class label information. Therefore, the features extracted from the encoder are less semantically meaningful, which hinders the design of subsequent fusion networks for the guided features.

Fig. 2: Illustration of our training framework. For each episode, a feature encoder is first used to extract deep features for the support and query images. On the one hand, the support and query features are forwarded to the multi-class label guidance branch for traditional semantic segmentation training. On the other hand, masked average pooling is applied with the support mask to obtain the target class feature prototype from the deep support feature maps. The target class prototype vector is then up-sampled and concatenated to both the support and query features for the binary prediction of the target category. Best viewed in color.

III The Proposed Approach

III-A Problem Setting

Our task is to learn a segmentation model that can predict the pixels of an unseen semantic class in the query image, given only one support image with a corresponding binary mask for that class during testing. Note that we follow the early works [42, 43, 44, 45, 46, 47] and focus on solving the one-way setting, which means each image contains only one target class. During training, we can access a large set of images with pixel-level ground-truth labels. For each training image I, we denote its semantic label as Y ∈ {1, ..., c}^{h×w}, where (h, w) is the size of the image and c is the number of classes. For the one-shot semantic image segmentation setting, the training class set C_train and the test class set C_test are two non-overlapping sets of classes (C_train ∩ C_test = ∅), which is different from the traditional image segmentation task. In other words, the classes used during testing are never labeled during the training process.

We take the widely adopted episodic training strategy to match the expected situation at testing time. Each training episode instantiates a one-shot semantic image segmentation task for a class c ∈ C_train; specifically, two images containing class c are selected as a pair of support-query images. Similarly, each testing episode instantiates a one-shot task for a class c ∈ C_test, and two images containing class c are selected as a pair of support-query images. Unlike existing one-shot image segmentation approaches [42, 43, 44, 45, 46, 47] that only leverage the binary mask for class c, we also take advantage of the pixel-level ground-truth labels during training, as in traditional semantic segmentation.
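
To make the episodic setting concrete, the following sketch draws one training episode. It assumes a hypothetical index `images_by_class` that maps each class id to the images containing it; the function name and data structure are illustrative placeholders, not the authors' implementation.

```python
import random

def sample_episode(images_by_class, train_classes):
    """Sample one training episode: a support/query pair sharing a target class."""
    c = random.choice(sorted(train_classes))                      # target class c in C_train
    support_id, query_id = random.sample(images_by_class[c], 2)   # two distinct images containing c
    return c, support_id, query_id
```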

III-B Base Network

Fig. 2 shows the architecture of our episodic training scheme. For each episode, the base network first uses a feature encoder to extract deep features for the support and query images. Then masked average pooling [46] is applied with the support mask to obtain the prototype vector of the target class from the support feature maps. We up-sample the support features to the same size as the mask, and the prototype is then calculated as:

p_s = \frac{\sum_{(i,j)} F_s(i,j) \cdot M_s(i,j)}{\sum_{(i,j)} M_s(i,j)}    (1)

where F_s and M_s denote the up-sampled support feature maps and the binary support mask, (h, w) is the size of the input image and mask, and the sum runs over all spatial locations (i, j). We then use this prototype, which carries the class information, to guide the segmentation of the query image. We up-sample the prototype vector to the same spatial size as the query features and concatenate it to the query features. For the base network, we use a single convolutional layer to fuse the concatenated features; this layer will be replaced by our proposed pyramid feature fusion module once we introduce the multi-class label information. A foreground-background ASPP classifier is then applied on top of the fused features to obtain the segmentation map of the query image. The one-shot segmentation loss for the query image is defined as the cross-entropy loss between the prediction P_q and the query mask M_q:

L_q = -\frac{1}{hw} \sum_{(i,j)} \sum_{k \in \{0,1\}} \mathbb{1}[M_q(i,j) = k] \log P_q^{k}(i,j)    (2)

Here, k is the class label that denotes whether a pixel belongs to the target class, and (h, w) is the size of the input query image and mask.
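
As a minimal PyTorch sketch of Eqs. (1) and (2), masked average pooling, prototype tiling, and the binary cross-entropy loss can be written as below. Tensor shapes and function names are assumptions for illustration rather than the released code.

```python
import torch
import torch.nn.functional as F

def masked_average_pooling(feat, mask):
    """Eq. (1): prototype from feature maps and a binary mask of the target class.
    feat: (B, C, H, W) features already up-sampled to the mask size; mask: (B, H, W)."""
    mask = mask.unsqueeze(1).float()                                  # (B, 1, H, W)
    return (feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)

def guide_query(query_feat, prototype):
    """Tile the (B, C) prototype spatially and concatenate it to the query features."""
    B, C, H, W = query_feat.shape
    tiled = prototype.view(B, -1, 1, 1).expand(-1, -1, H, W)
    return torch.cat([query_feat, tiled], dim=1)                      # (B, 2C, H, W)

def binary_seg_loss(logits, target_mask):
    """Eq. (2): cross-entropy between the (B, 2, h, w) prediction and the binary mask."""
    return F.cross_entropy(logits, target_mask.long())
```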

III-C Multi-Class Label Guidance

For traditional semantic image segmentation, pixel-level multi-class labels are provided to train the network. When it comes to the one-shot task, recent works only leverage the foreground-background binary mask of the target class; they intend to meta-learn a class-agnostic segmentation network. However, as pointed out in OSLSM [42], part of the recent algorithms’ ability to generalize well to unseen classes comes from the pre-training performed on ImageNet [66]. The weak image-level annotations for a large number of categories endow the pre-trained network with the ability to extract discriminative features for both seen classes during training and unseen classes during testing. Existing approaches trained without any class label information will gradually produce less semantically meaningful features. Therefore, we propose to leverage the pixel-level multi-class labels to constrain the feature extraction of the encoder. This encourages the encoder to generate more discriminative features for each category. The more semantically meaningful features lead to a target class feature prototype with more robust class information, which in turn helps the network locate the area of the target class more accurately. As shown in Fig. 2, we apply a parameter-shared multi-class classifier on top of the features of both support and query images for traditional multi-class semantic image segmentation. The multi-class segmentation loss is defined as the cross-entropy loss between the prediction P and the multi-class label Y:

L_{mc} = -\frac{1}{hw} \sum_{(i,j)} \sum_{k=1}^{c} \mathbb{1}[Y(i,j) = k] \log P^{k}(i,j)    (3)

Here, Y denotes the multi-class ground-truth label, k indexes the c classes, and (h, w) is the size of the input image and mask. With the multi-class label guidance, our feature encoder focuses on extracting class-aware semantic feature representations. Our pyramid feature fusion module will then process the fused features and generate class-agnostic features for the binary prediction of the target category.
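
A minimal sketch of the parameter-shared multi-class guidance branch is given below, with a single 1x1 convolution standing in for the ASPP classifier and 16 output labels (15 seen classes plus background, as in a PASCAL-5^i split); both simplifications are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiClassGuidance(nn.Module):
    """Shared multi-class classifier applied to both support and query features (Eq. (3))."""

    def __init__(self, feat_dim=512, num_labels=16):   # 15 seen classes + background (assumed)
        super().__init__()
        self.classifier = nn.Conv2d(feat_dim, num_labels, kernel_size=1)

    def forward(self, support_feat, query_feat, support_label, query_label):
        loss = 0.0
        for feat, label in ((support_feat, support_label), (query_feat, query_label)):
            logits = self.classifier(feat)
            logits = F.interpolate(logits, size=label.shape[-2:],
                                   mode="bilinear", align_corners=False)  # match label size
            loss = loss + F.cross_entropy(logits, label)                   # multi-class CE
        return loss
```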

Fig. 3: The architecture of our proposed pyramid feature fusion module.
Fig. 4: Illustration of the testing framework for our proposed approach. After obtaining the segmentation map for the query image, we treat it as the pseudo-mask to extract the pseudo-prototype for the target class. Then we fuse the pseudo-prototype with the support prototype to guide the final segmentation of the query image. Best viewed in color.

III-D Pyramid Feature Fusion Module

Given more discriminative features and semantically meaningful guidance information for the target class, we propose a pyramid feature fusion module to better integrate the segmentation cues into the query features. The architecture is shown in Fig. 3. The module’s input is the concatenation of the query feature maps and the up-sampled support prototype. We first apply a convolutional layer to reduce the dimension of the concatenated features from 1024 to 512. We then down-sample the feature maps to 1/2 and 1/4 of the original spatial size. After that, we forward the features of each size into a separate convolutional layer with 512 filters to mine the features in a multi-scale way, and then up-sample them back to the original size before adding them together element-wise. Finally, we add two residual blocks to further enhance the fused features and obtain the class-agnostic features for the binary prediction of the target category. Each residual block contains three convolutional layers with 64, 64, and 512 filters, respectively. Our pyramid feature fusion module mines the corresponding features for the target class and generates features robust to object scale for the final ASPP classifier. Compared with pyramidal feature hierarchy methods [8, 19], which require multi-stage features, our three down/up-sampling paths construct different-resolution representations at a single stage. Compared to ASPP, which enlarges the receptive field while maintaining feature resolution, our pyramidal fusion aims to generate features at various resolutions for both coarsely and finely locating the regions of the target class. Note that we combine our pyramid feature fusion module and ASPP as the feature fusion module and the classification module, respectively, to construct the segmentation head.
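
The module described above might be sketched as follows. The kernel sizes, activation functions, and interpolation mode are not specified in the text, so the choices below are assumptions; only the channel widths (1024 to 512, residual blocks with 64, 64, and 512 filters) and the three-scale layout follow the description.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Bottleneck residual block with 64, 64, and 512 filters."""
    def __init__(self, channels=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, channels, 1))

    def forward(self, x):
        return F.relu(x + self.body(x))

class PyramidFeatureFusion(nn.Module):
    """Fuse the concatenated query features and prototype at three resolutions."""
    def __init__(self, in_dim=1024, dim=512):
        super().__init__()
        self.reduce = nn.Conv2d(in_dim, dim, 3, padding=1)     # 1024 -> 512
        self.branches = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1) for _ in range(3))
        self.refine = nn.Sequential(ResidualBlock(dim), ResidualBlock(dim))

    def forward(self, x):
        x = self.reduce(x)
        size = x.shape[-2:]
        fused = 0
        for conv, scale in zip(self.branches, (1.0, 0.5, 0.25)):   # original, 1/2, 1/4 size
            y = x if scale == 1.0 else F.interpolate(x, scale_factor=scale,
                                                     mode="bilinear", align_corners=False)
            y = conv(y)                                            # mine features at this scale
            if scale != 1.0:
                y = F.interpolate(y, size=size, mode="bilinear", align_corners=False)
            fused = fused + y                                      # element-wise sum
        return self.refine(fused)                                  # class-agnostic features for ASPP
```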

III-E Self-Prototype Training Branch

In existing approaches, the network learned during training is directly applied to the one-shot segmentation task at test time without fine-tuning. Therefore, during testing, the support image-mask pair is only used to provide guidance information about the target class for the segmentation of the query image. Nevertheless, the support image-mask pair can be further utilized during training. To generate a robust class prototype that is informative enough to guide the network in locating the corresponding target area, we propose a self-prototype guidance branch at training time. We expect that the target class prototype generated from the conditioning branch can also effectively guide the segmentation of the support image itself. On top of the fused features of the support feature maps and the up-sampled support prototype, we apply the same segmentation head used for the query image to obtain the binary segmentation map of the support image. The benefits of the self-prototype guidance branch are threefold. First, it provides more supervision for the segmentation head applied on top of the fused features, which alleviates the confusion of the segmentation head when the differences between the support and query features are significant. Second, the self-prototype guidance branch ensures that the class prototype extracted from the support features can, in turn, effectively help locate the target class in the support image; this constrains the network to generate more compact features and a robust prototype for each semantic class. Third, it echoes the fused prototype guidance branch used during testing (Section III-G), which leverages the self-prototype (pseudo-prototype) of the query image to guide the final segmentation of the query image. Similar to Equation (2), the one-shot segmentation loss for the support image is defined as the cross-entropy loss between the prediction P_s and the support mask M_s:

L_s = -\frac{1}{hw} \sum_{(i,j)} \sum_{k \in \{0,1\}} \mathbb{1}[M_s(i,j) = k] \log P_s^{k}(i,j)    (4)

Here, k is the class label that denotes whether a pixel belongs to the target class, and (h, w) is the size of the input support image and mask.

III-F Overall Training Objective

The overall training objective is to learn a one-shot image segmentation network. At the same time, we leverage the multi-class label information to encourage the feature encoder to learn semantically meaningful features for each category. The overall cost function is:

L = L_q + L_s + \lambda L_{mc}    (5)

Here, L_q and L_s are the one-shot segmentation losses of the query and support branches defined in Eqs. (2) and (4), L_{mc} denotes the multi-class segmentation loss of Eq. (3) computed on both the support and query branches, and λ is the hyper-parameter that controls the relative importance of the one-shot image segmentation loss and the traditional semantic segmentation loss of the multi-class label guidance branch.
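
Assuming the two one-shot losses enter Eq. (5) with unit weight, which is how the description above reads, the overall objective reduces to a single weighted sum:

```python
def total_loss(loss_query, loss_support, loss_multiclass, lam=0.1):
    """Eq. (5): one-shot losses of the query and support branches plus the weighted
    multi-class guidance loss (lam = 0.1 in the experiments of Section V-D)."""
    return loss_query + loss_support + lam * loss_multiclass
```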

III-G Testing Architecture with Prototype Fusion

Despite our efforts to learn class-aware semantic features and extract a semantically meaningful prototype for the target class during training, the visual appearance and layout differences between the support and query images will make their features more or less different. Therefore, at test time, we propose a fused prototype guidance branch to guide the segmentation of the query image with more robust class cues. We first let the network predict a segmentation map for the query image and treat the predicted binary map as the pseudo-mask to extract the pseudo-prototype of the target class. Masked average pooling [46] is applied with the pseudo-mask to obtain the prototype vector of the target class from the query feature maps. We up-sample the query features to the same size as the pseudo-mask, and the pseudo-prototype is then calculated as:

p_q = \frac{\sum_{(i,j)} F_q(i,j) \cdot \hat{M}_q(i,j)}{\sum_{(i,j)} \hat{M}_q(i,j)}    (6)

where F_q denotes the up-sampled query feature maps, \hat{M}_q is the binary pseudo-mask, and (h, w) is the size of the query image and pseudo-mask. Benefiting from the self-prototype guidance branch during training, the network can be directly applied in this pseudo-prototype (self-prototype of the query image) setting. However, the pseudo-prototype of the query image might be noisy due to the coarseness of the pseudo-mask. As illustrated in Fig. 4, we therefore fuse the pseudo-prototype of the query image with the support prototype, by averaging the two prototypes, to guide the final segmentation of the query image. Our experiments in Section V-B show that utilizing the fused prototype achieves better performance than using either the query pseudo-prototype or the support prototype only.
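
Putting the pieces together, the test-time procedure can be outlined as a two-pass sketch. Here `encoder` and `seg_head` stand for the feature encoder and the segmentation head (pyramid feature fusion followed by the ASPP classifier), and `masked_average_pooling` and `guide_query` are the helpers sketched in Section III-B; the whole function is an illustrative outline, not the released implementation.

```python
import torch
import torch.nn.functional as F

def upsample_to(feat, mask):
    """Bilinearly up-sample (B, C, H, W) features to the spatial size of a (B, h, w) mask."""
    return F.interpolate(feat, size=mask.shape[-2:], mode="bilinear", align_corners=False)

@torch.no_grad()
def predict_with_prototype_fusion(encoder, seg_head, support_img, support_mask, query_img):
    support_feat, query_feat = encoder(support_img), encoder(query_img)

    # Pass 1: the support prototype guides the initial query prediction (pseudo-mask).
    proto_s = masked_average_pooling(upsample_to(support_feat, support_mask), support_mask)
    pseudo_mask = seg_head(guide_query(query_feat, proto_s)).argmax(dim=1)

    # Pass 2: Eq. (6) pseudo-prototype from the query itself, fused by averaging.
    proto_q = masked_average_pooling(upsample_to(query_feat, pseudo_mask), pseudo_mask)
    proto_fused = 0.5 * (proto_s + proto_q)
    return seg_head(guide_query(query_feat, proto_fused))          # final query prediction
```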

Method split-1 split-2 split-3 split-4 Mean
LogReg [42] 26.9 42.9 37.1 18.4 31.4
Siamese [42] 28.1 39.9 31.8 25.8 31.4
1-NN [42] 25.3 44.9 41.7 18.4 32.6
OSLSM [42] 33.6 55.3 40.9 33.5 40.8
co-FCN [43] 36.7 50.6 44.9 32.4 41.1
AMP [48] 41.9 50.2 46.7 34.7 43.4
SG-One [46] 40.2 58.4 48.4 38.4 46.3
PANet [47] 42.3 58.0 51.1 41.2 48.1
Ours 50.6 61.9 49.4 48.4 52.6
TABLE I: Results of one-shot segmentation on the PASCAL-5^i dataset evaluated with the mean-IoU metric.
Method binary-IoU
FG-BG [43] 55.0
Fine-tuning [43] 55.1
co-FCN [43] 60.1
OSLSM [42] 61.3
PL [44] 61.2
A-MCG [45] 61.2
AMP [48] 62.2
SG-One [46] 63.9
PANet [47] 66.5
Ours 68.7
TABLE II: Results of one-shot segmentation on the PASCAL-5^i dataset evaluated with the binary-IoU metric. The experiments are conducted on 4 splits and the average results are reported.
Fig. 5: Example results of one-shot image segmentation on the PASCAL-5^i dataset. From top to bottom, we show the support image with the mask label that defines the target object class, the query image with the ground-truth label, the query image with the prediction of PANet, and the query image with the prediction of our method. Best viewed in color.
Method split-1 split-2 split-3 split-4 Mean
PANet [47] - - - - 20.9
Ours 32.2 23.5 19.6 19.0 23.6
TABLE III: Results of one-shot segmentation on the COCO-20^i dataset evaluated with the mean-IoU metric.

IV Experiments

IV-A Datasets and Evaluation Metrics

Datasets. We train and evaluate our proposed method for one-shot segmentation on the PASCAL-5^i [42] and COCO-20^i [45] datasets. PASCAL-5^i is built from the PASCAL VOC 2012 dataset [67], which is expanded by SBD [68]. The 20 semantic classes in PASCAL VOC are evenly divided into 4 splits, each containing 5 classes. COCO-20^i is created from the MS COCO dataset [69], which contains 80 foreground categories. Similarly, these categories are evenly divided into 4 splits, and each split contains 20 object categories. Experiments are conducted in a cross-validation manner: we train the model on the seen classes of 3 splits and test it on the unseen classes of the remaining split.
Evaluation Metrics.

Intersection-over-Union (IoU) is taken as the evaluation metric:

IoU = \frac{TP}{TP + FP + FN}    (7)

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively. We adopt the mean-IoU as the primary metric for model evaluation. We first calculate the standard Intersection-over-Union (IoU) for each foreground class given the predicted masks of a split, and then average the class-wise IoU of all 5 classes as the mean-IoU of this split. To compare our method with earlier approaches, we also report the binary-IoU, which treats all object categories as one foreground class and averages the IoU of foreground and background.
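
The two metrics can be computed as below. The sketch assumes that TP, FP, and FN are accumulated over all episodes of a split before taking the ratio; the exact accumulation protocol is an assumption.

```python
import numpy as np

def class_iou(preds, gts, cls):
    """Eq. (7) for one class, with TP/FP/FN accumulated over all episodes of a split."""
    tp = fp = fn = 0
    for pred, gt in zip(preds, gts):
        tp += np.logical_and(pred == cls, gt == cls).sum()
        fp += np.logical_and(pred == cls, gt != cls).sum()
        fn += np.logical_and(pred != cls, gt == cls).sum()
    return tp / float(tp + fp + fn + 1e-10)

def mean_iou(preds, gts, split_classes):
    """mean-IoU: average the class-wise IoU over the test classes of the split."""
    return float(np.mean([class_iou(preds, gts, c) for c in split_classes]))

def binary_iou(binary_preds, binary_gts):
    """binary-IoU: all object classes count as one foreground class (label 1);
    average the IoU of the foreground and the background (label 0)."""
    return float(np.mean([class_iou(binary_preds, binary_gts, c) for c in (1, 0)]))
```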

IV-B Implementation Details

For the feature encoder, we adopt the VGG-16 model [1] pre-trained on ImageNet [66] as our backbone. We remove the last two pooling layers so that the resolution of the output feature maps is 1/8 of the input image size. To enlarge the receptive field, we apply atrous convolution [17] in the conv5 layers with a rate of 2. The fully-connected layers are replaced by two convolutional layers with a dilation rate of 4. We utilize atrous spatial pyramid pooling (ASPP) [17] as the classifier for both the multi-class label guidance branch and the foreground-background one-shot image segmentation. We employ an up-sampling layer along with the softmax output of the classifiers to match the size of the input image.

Following the setting in PANet [47], input images are resized to (417, 417) and augmented with random horizontal flipping. We also average the results of 5 runs with different random seeds on both the PASCAL-5^i [42] and COCO-20^i [45] datasets; each run contains 1,000 episodes. We use SGD [70] as the optimizer with a momentum of 0.9 and weight decay, and the initial learning rate is decreased using polynomial decay with a power of 0.9. We set the batch size to 1 for training on PASCAL-5^i and to 4 for training on COCO-20^i. We conduct a parameter search to choose the best parameters for the framework; the detailed parameter analysis for λ, the batch size, and the initial learning rate can be found in Section V-D.
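
The optimizer setup might look as follows. The momentum (0.9) and the polynomial decay power (0.9) come from the description above, whereas the default initial learning rate, weight decay, and iteration budget are placeholders, since those values are missing from the extracted text.

```python
import torch

def build_optimizer(model, lr_init=2.5e-4, weight_decay=5e-4, max_iters=30000):
    """SGD with momentum 0.9 and polynomial learning-rate decay (power 0.9).
    lr_init, weight_decay, and max_iters are placeholder values, not from the paper."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_init,
                                momentum=0.9, weight_decay=weight_decay)
    poly = lambda it: max(0.0, 1.0 - it / float(max_iters)) ** 0.9
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)
    return optimizer, scheduler
```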

IV-C Baselines

We compare our one-shot semantic segmentation method with the following state-of-the-art (SOTA) methods: 1-NN [42], LogReg [42], Siamese [42], OSLSM [42], FG-BG [43], Fine-tuning [43], co-FCN [43], PL [44], A-MCG [45], SG-One [46], PANet [47], AMP [48].

IV-D Experimental Results

We compare our approach with state-of-the-art methods on both the PASCAL-5^i and COCO-20^i datasets. Experimental results of one-shot segmentation on the PASCAL-5^i dataset evaluated with the mean-IoU metric are shown in Table I. From Table I, we can observe that our method achieves the best segmentation results compared to other state-of-the-art approaches on split-1, split-2, and split-4. Compared with other parametric methods [42, 43, 48, 46], our method improves the average result over the 4 splits from 46.3% to 52.6% mean-IoU. Our approach also outperforms the non-parametric metric learning method PANet [47] by 4.5% mean-IoU. As shown in Table II, when evaluated with the binary-IoU metric, our approach achieves the best average result over the 4 splits with 68.7% binary-IoU, 2.2% higher than the second-best result reported by PANet [47].

The experimental results of one-shot segmentation on the COCO-20^i dataset using the mean-IoU and binary-IoU metrics are reported in Table III and Table IV, respectively. Our method achieves better or comparable segmentation results compared to other state-of-the-art methods. Fig. 5 presents some qualitative segmentation examples for the one-shot task on the PASCAL-5^i dataset. The first row gives the support image with the binary mask annotation that defines the target object class. The second row shows the query image with the ground-truth label. The third and fourth rows show the query image with the prediction of PANet [47] and of our method, respectively. It can be seen that our proposed approach successfully segments the objects in these query images.

Method binary-IoU
A-MCG [45] 52
PANet [47] 59.2
Ours 58.7
TABLE IV: Results of one-shot segmentation on the COCO-20^i dataset evaluated with the binary-IoU metric. The experiments are conducted on 4 splits and the average results are reported.

V Ablation Studies

V-A Element-Wise Component Analysis

In this part, we demonstrate the contribution of each component of our training framework to the one-shot image segmentation task. The experimental results on PASCAL-5^i are given in Table V. Specifically, experiments are conducted on the four splits, and we report the mean result of the 4 splits. The first row of Table V shows the performance of our base network (described in Section III-B). From Table V, we can notice that introducing the multi-class label information improves the mean-IoU of the base network from 46.4% to 49.7%. Furthermore, utilizing our proposed pyramid feature fusion module yields another 1.1% mean-IoU gain. However, as shown in Table V, if we do not leverage the multi-class label information, the pyramid feature fusion module does not improve the performance of the base network; on the contrary, it brings a 0.7% mean-IoU performance drop. This highlights the importance of introducing the multi-class label information, which encourages the feature encoder to generate more discriminative features. We argue that less semantically meaningful features make the subsequent fusion networks for the guided features more prone to over-fitting, which deteriorates the segmentation result. With our proposed self-prototype training branch, we can further improve the result to 51.3% mean-IoU.

MCL | PFF | SPT | Mean
    |     |     | 46.4
    | ✓   |     | 45.7
✓   |     |     | 49.7
✓   | ✓   |     | 50.8
✓   | ✓   | ✓   | 51.3
TABLE V: Element-wise component analysis for the training framework. The one-shot semantic segmentation results on the PASCAL-5^i dataset are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported. MCL: Multi-Class Label Guidance, PFF: Pyramid Feature Fusion Module, SPT: Self-Prototype Training Branch.
Method Mean
Support Prototype 51.3
Pseudo-Prototype 52.0
Prototype Fusion 52.6
TABLE VI: Ablation study of prototype fusion during testing. The one-shot semantic segmentation results on the PASCAL-5^i dataset are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported.
Method Mean
Ours w/ MCL 52.6
Ours w/o MCL 49.5
Frozen Encoder 50.0
TABLE VII: Ablation studies of the influence of multi-class label guidance on the final performance of one-shot semantic segmentation on the PASCAL-5^i dataset. Results are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported. MCL: Multi-Class Label Guidance.
Fig. 6: The parameter sensitivity of the weight λ for one-shot semantic image segmentation on the PASCAL-5^i dataset. Results are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported.
Fig. 7: The parameter sensitivity of the initial learning rate and batch size for one-shot semantic image segmentation on the PASCAL-5^i dataset. Results are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported.

V-B Prototype Fusion for Testing

In this part, we compare the performance of our prototype fusion strategy with that of using either the query pseudo-prototype or the support prototype only. Table VI gives the one-shot image segmentation results on the PASCAL-5^i dataset. As we can see, compared to using the support prototype, leveraging the pseudo-prototype of the query branch to guide the segmentation of the query image brings a 0.7% performance gain. The explanation is that, due to the difference in the visual appearance of each image, the pseudo-prototype of the query image itself can provide more accurate guidance information about the target class for the query image, even if the pseudo-mask contains noisy labels. It should be noted that our self-prototype training branch also contributes to this improvement by adapting the network to the self-prototype guidance setting. Moreover, with our prototype fusion strategy, we further improve the mean-IoU to 52.6%, which outperforms using either the query pseudo-prototype or the support prototype only.

V-C Influence of Multi-Class Label Guidance

As shown in Table VII, to further investigate the influence of the multi-class label guidance branch on the final segmentation results, we conduct experiments for the one-shot task on the PASCAL-5^i dataset under the following settings. a) Ours w/ MCL: our full network with the multi-class label guidance branch. b) Ours w/o MCL: our network without the multi-class label guidance branch. c) Frozen Encoder: we train the network without the multi-class label guidance branch and also freeze the parameters of the feature encoder; in other words, we directly load the ImageNet pre-trained weights to extract features for further processing, with the last two pooling layers and the fully-connected layers removed. From Table VII, we can notice that our proposed method with the multi-class label guidance branch brings the best segmentation performance. Without the multi-class label guidance, the performance of the network falls from 52.6% to 49.5%. We also notice that, without the multi-class label guidance, the performance of the network is even slightly worse than that obtained by loading the ImageNet pre-trained weights to extract features. We conjecture that the feature encoder trained without the multi-class label guidance generates less semantically meaningful features than the frozen encoder. This highlights the importance of leveraging the multi-class label guidance to extract class-aware semantic feature representations for one-shot semantic image segmentation.

V-D Parameter Analysis

For the parameter sensitivity of the weight λ in the multi-class label guidance branch, we evaluate the segmentation accuracy of the full proposed method on the PASCAL-5^i dataset. We vary λ over a range of values, and the mean results are reported in Fig. 6. As we can see, we obtain stable and better performance when λ is between 0.075 and 0.2, whereas a too small or too large λ does not facilitate the training process very much. According to these results, we empirically set λ = 0.1 in our experiments.

For the parameter sensitivity of the initial learning rate and batch size, we conduct a parameter search to study their effect on the overall performance of one-shot semantic image segmentation. The experimental results on the PASCAL-5^i dataset are shown in Fig. 7. As we can see, when the initial learning rate is relatively small, the better result is achieved with batch size = 1. If we increase the initial learning rate, the batch size should be increased accordingly to obtain a better result; for example, batch sizes of 2 and 4 are suggested for correspondingly larger initial learning rates. According to Fig. 7, we set batch size = 1 with its corresponding initial learning rate, which yields the best final result of 52.6%.

V-E Speed Comparison

In this part, we compare the speed of our approach with state-of-the-art methods. We reproduce SG-One [46] and PANet [47] with their officially released code. We run the one-shot segmentation task for 1,000 episodes during testing and report the average speed in Table VIII. We also report the GPU and CPU consumption during testing for a fair comparison. As can be seen in Table VIII, our approach reaches a speed of 3.5 FPS, which is comparable to the speed of PANet [47]. Though SG-One [46] can reach 11.6 FPS, it requires much more computational resources, especially CPU usage.

Method GPU memory CPU usage FPS
SG-One [46] 7.0G 18 11.6
PANet [47] 1.8G 1 3.7
Ours 2.6G 1 3.5
TABLE VIII: Comparison of the speed of our approach with state-of-the-art methods on one-shot semantic segmentation.

V-F Test with Weak Annotations

We further evaluate our model with weak annotations during testing. We replace the pixel-level dense annotations of the support set with scribbles or bounding boxes. Note that we only change the annotation setting of the support image during testing and use the same model trained with dense annotations. Following PANet [47], these annotations are generated automatically from the dense segmentation masks; for the bounding box setting, the bounding box of a randomly chosen instance mask in each support image is used. The segmentation results on PASCAL-5^i are given in Table IX. We can notice that our method is much more robust than PANet [47], especially in the scribble annotation setting. While our method only experiences a 0.7% mean-IoU performance drop when changing from dense annotation to scribbles, the segmentation result of PANet [47] deteriorates severely from 48.1% to 44.8%. Besides, our result with bounding box annotation is also very close to that with dense annotation. We notice that the performance of using scribble annotations is better than that of using bounding boxes in our experiment. The reason is that bounding box annotations introduce erroneous labels and make the support prototype less accurate, leading to a worse result than scribble annotation. Fig. 8 shows qualitative results of using scribble and bounding box annotations on the PASCAL-5^i dataset. From Fig. 8, we can find that even with weak annotations, our trained network can segment the objects successfully.

Fig. 8: The qualitative results of our method on one-shot segmentation using scribble (left), bounding box (middle) and dense (right) annotations. The scribbles are dilated for better visualization. Best viewed in color.
Annotations PANet [47] Ours
Dense 48.1 52.6
Scribble 44.8 51.9
Bounding box 45.1 50.9
TABLE IX: The performance of one-shot semantic segmentation with weak annotations during testing on the PASCAL-5^i dataset. Results are evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported.
Fig. 9: The results of k-shot segmentation on the PASCAL-5^i dataset evaluated with the mean-IoU metric. The experiments are conducted on 4 splits and the average results are reported.

V-G Extension to K-shot Segmentation

In the case of k-shot image segmentation, k labeled support images that contain the target object are provided to guide the segmentation of the query image. We directly use the network trained in the one-shot manner and test the segmentation performance with k support images. We average the k target class prototype vectors extracted from the support images and then use the averaged vector to guide the query image segmentation. In Fig. 9, we compare the result of averaging the k support prototypes only with the prototype fusion strategy that also leverages the pseudo-prototype of the query image (averaging k+1 prototypes). We notice that the fewer the support images, the more pronounced the advantage of the prototype fusion strategy; with more support images, the effect of prototype fusion is gradually diluted.
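
The k-shot extension only changes how the guiding prototype is formed; a minimal sketch, reusing the prototypes produced by masked average pooling, is given below.

```python
import torch

def k_shot_prototype(support_protos, pseudo_proto=None):
    """Average the k support prototypes; optionally fold in the query pseudo-prototype
    as a (k+1)-th prototype, as in the prototype fusion strategy."""
    protos = list(support_protos) + ([pseudo_proto] if pseudo_proto is not None else [])
    return torch.stack(protos, dim=0).mean(dim=0)
```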

V-H Limitation

In this part, we discuss the limitation of our proposed approach. From Fig. 9, we can find that more support images bring better segmentation results. However, the improvement encounters a bottleneck when the number of support images is larger than two. Therefore, it is essential to fine-tune the current model to take further advantage of more support images. In future work, we will study an efficient way to fine-tune the model when more support images are given. Various data augmentation strategies will also be investigated to facilitate the fine-tuning process.

VI Conclusion

In this work, we proposed to leverage the multi-class label information to constrain the network during the episodic training of one-shot semantic image segmentation. Compared with existing methods, our proposed approach helps the network extract class-aware semantic feature representations from the input images and thus generate a more semantically meaningful class prototype for the target category. In addition, we proposed a pyramid feature fusion module to mine the fused features for the target class in a multi-scale way. To take more advantage of the support image-mask pair, we further proposed a self-prototype guidance branch for support image segmentation. During testing, we proposed to combine the pseudo-prototype of the query image and the prototype from the support image to guide the final segmentation of the query image. Extensive experimental results on the PASCAL-5^i and COCO-20^i datasets validate the superiority of our proposed approach.

References

  • [1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [2] Y. Yao, F. Shen, G. Xie, L. Liu, F. Zhu, J. Zhang, and H. T. Shen, “Exploiting web images for multi-output classification: From category to subcategories,” vol. 31, no. 7, pp. 2348–2360, 2020.
  • [3] C. Zhang, Y. Yao, J. Zhang, J. Chen, P. Huang, J. Zhang, and Z. Tang, “Web-supervised network for fine-grained visual classification,” in Proc. IEEE International Conference on Multimedia and Expo.   IEEE, 2020, pp. 1–6.
  • [4] C. Zhang, Y. Yao, H. Liu, G.-S. Xie, X. Shu, T. Zhou, Z. Zhang, F. Shen, and Z. Tang, “Web-supervised network with softly update-drop training for fine-grained visual classification,” in Proc. AAAI Conference on Artificial Intelligence, 2020, pp. 12 781–12 788.
  • [5] Z. Li, J. Zhang, Y. Gong, Y. Yao, and Q. Wu, “Field-wise learning for multi-field categorical data,” 2020.
  • [6] Y. Yao, Z. Sun, F. Shen, L. Liu, L. Wang, F. Zhu, L. Ding, G. Wu, and L. Shao, “Dynamically visual disambiguation of keyword-based image search,” pp. 996–1002, 2019.
  • [7] B. Hu, R.-J. Song, X.-S. Wei, Y. Yao, X.-S. Hua, and Y. Liu, “Pyretri: A pytorch-based library for unsupervised image retrieval by deep convolutional neural networks,” pp. 4461–4464, 2020.
  • [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Proc. European Conference on Computer Vision, 2016, pp. 21–37.
  • [9] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [10] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234–241.
  • [11] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, 2017.
  • [12] T. Chen, J. Zhang, G.-S. Xie, Y. Yao, X. Huang, and Z. Tang, “Classification constrained discriminator for domain adaptive semantic segmentation,” in Proc. IEEE International Conference on Multimedia and Expo.   IEEE, 2020, pp. 1–6.
  • [13] J. Lu, H. Liu, Y. Yao, S. Tao, Z. Tang, and J. Lu, “Hsi road: A hyper spectral image dataset for road segmentation,” in Proc. IEEE International Conference on Multimedia and Expo.   IEEE, 2020, pp. 1–6.
  • [14] H. Luo, G. Lin, Z. Liu, F. Liu, Z. Tang, and Y. Yao, “Segeqa: Video segmentation based visual attention for embodied question answering,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 9667–9676.
  • [15] T. Zhou, S. Wang, Y. Zhou, Y. Yao, J. Li, and L. Shao, “Motion-attentive transition for zero-shot video object segmentation.” in Proc. AAAI Conference on Artificial Intelligence, 2020, pp. 13 066–13 073.
  • [16] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
  • [17] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, 2017.
  • [18] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.
  • [19] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1925–1934.
  • [20] G. Gao, H. Zhang, H. Hu, Y. Wen, J. Cai, C. Luo, and W. Zeng, “Optimizing quality of experience for adaptive bitrate streaming via viewer interest inference,” IEEE Trans. Multimedia, vol. 20, no. 12, pp. 3399–3413, 2018.
  • [21] C. Zhan, H. Hu, Z. Wang, R. Fan, and D. Niyato, “Unmanned aircraft system aided adaptive video streaming: A joint optimization approach,” IEEE Trans. Multimedia, vol. 22, no. 3, pp. 795–807, 2019.
  • [22] Y. Li, H. Hu, J. Li, Y. Luo, and Y. Wen, “Semi-supervised online multi-task metric learning for visual recognition and retrieval,” in Proc. ACM International Conference on Multimedia, 2020, pp. 3377–3385.
  • [23] P.-T. Jiang, Q. Hou, Y. Cao, M.-M. Cheng, Y. Wei, and H.-K. Xiong, “Integral object mining via online attention accumulation,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 2070–2079.
  • [24] H. Liu, C. Zhang, Y. Yao, X. Wei, F. Shen, J. Zhang, and Z. Tang, “Exploiting web images for fine-grained visual recognition by eliminating open-set noise and utilizing hard examples,” IEEE Trans. Multimedia, 2021.
  • [25] Y. Yao, X. Hua, G. Gao, Z. Sun, Z. Li, and J. Zhang, “Bridging the web data and fine-grained visual recognition via alleviating label noise and domain mismatch,” in Proc. ACM International Conference on Multimedia, 2020, pp. 1735–1744.
  • [26] Z. Sun, X.-S. Hua, Y. Yao, X.-S. Wei, G. Hu, and J. Zhang, “Crssc: salvage reusable samples from noisy data for robust learning,” in Proc. ACM International Conference on Multimedia, 2020, pp. 92–101.
  • [27] C. Zhang, Y. Yao, X. Shu, Z. Li, Z. Tang, and Q. Wu, “Data-driven meta-set based fine-grained visual recognition,” in Proc. ACM International Conference on Multimedia, 2020, pp. 2372–2381.
  • [28] Y. Yao, J. Zhang, F. Shen, L. Liu, F. Zhu, D. Zhang, and H. T. Shen, “Towards automatic construction of diverse, high-quality image datasets,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 6, pp. 1199–1211, 2019.
  • [29] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, “Unsupervised domain adaptation for semantic segmentation via class-balanced self-training,” in Proc. European Conference on Computer Vision, 2018, pp. 289–305.
  • [30] Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang, “Exploiting web images for dataset construction: A domain robust approach,” IEEE Trans. Multimedia, vol. 19, no. 8, pp. 1771–1784, 2017.
  • [31] G.-S. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao, “Attentive region embedding network for zero-shot learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9384–9393.
  • [32] G.-S. Xie, L. Liu, F. Zhu, F. Zhao, Z. Zhang, Y. Yao, J. Qin, and L. Shao, “Region graph embedding network for zero-shot learning,” in Proc. European Conference on Computer Vision.   Springer, 2020, pp. 562–580.
  • [33] Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang, “A new web-supervised method for image dataset constructions,” Neurocomputing, vol. 236, pp. 23–31, 2017.
  • [34] H. Zhang, Y. Gu, Y. Yao, Z. Zhang, L. Liu, J. Zhang, and L. Shao, “Deep unsupervised self-evolutionary hashing for image retrieval,” IEEE Trans. Multimedia, 2021.
  • [35] Y. Yao, X.-s. Hua, F. Shen, J. Zhang, and Z. Tang, “A domain robust approach for image dataset construction,” in Proc. ACM International Conference on Multimedia, 2016, pp. 212–216.
  • [36] W. Wang, Y. Shen, H. Zhang, Y. Yao, and L. Liu, “Set and rebase: determining the semantic graph connectivity for unsupervised cross modal hashing,” in Proc. International Joint Conference on Artificial Intelligence, 2020, pp. 853–859.
  • [37] Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang, “Automatic image dataset construction with multiple textual metadata,” in Proc. IEEE International Conference on Multimedia and Expo.   IEEE, 2016, pp. 1–6.
  • [38] Y. Yao, F. Shen, J. Zhang, L. Liu, Z. Tang, and L. Shao, “Extracting privileged information for enhancing classifier learning,” IEEE Trans. Image Process., vol. 28, no. 1, pp. 436–450, 2018.
  • [39] L. Ding, S. Liao, Y. Liu, L. Liu, F. Zhu, Y. Yao, L. Shao, and X. Gao, “Approximate kernel selection via matrix approximation,” vol. 31, no. 11, pp. 4881–4891, 2020.
  • [40] Y. Yao, J. Zhang, F. Shen, W. Yang, P. Huang, and Z. Tang, “Discovering and distinguishing multiple visual senses for polysemous words,” in Proc. AAAI Conference on Artificial Intelligence, 2018, pp. 523–530.
  • [41] Z. Sun, Y. Yao, J. Xiao, L. Zhang, J. Zhang, and Z. Tang, “Exploiting textual queries for dynamically visual disambiguation,” Pattern Recognition, vol. 110, p. 107620, 2021.
  • [42] A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots, “One-shot learning for semantic segmentation,” in Proc. British Machine Vision Conference, 2017, pp. 167.1–167.13.
  • [43] K. Rakelly, E. Shelhamer, T. Darrell, A. Efros, and S. Levine, “Conditional networks for few-shot semantic segmentation,” in Proc. International Conference on Learning Representations, 2018.
  • [44] N. Dong and E. P. Xing, “Few-shot semantic segmentation with prototype learning.” in Proc. British Machine Vision Conference, vol. 3, no. 4, 2018.
  • [45] T. Hu, P. Yang, C. Zhang, G. Yu, Y. Mu, and C. G. Snoek, “Attention-based multi-context guiding for few-shot semantic segmentation,” in Proc. AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 8441–8448.
  • [46] X. Zhang, Y. Wei, Y. Yang, and T. S. Huang, “Sg-one: Similarity guidance network for one-shot semantic segmentation,” IEEE Transactions on Cybernetics, 2020.
  • [47] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng, “Panet: Few-shot image semantic segmentation with prototype alignment,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 9197–9206.
  • [48] M. Siam, B. N. Oreshkin, and M. Jagersand, “Amp: Adaptive masked proxies for few-shot segmentation,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 5249–5258.
  • [49] C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen, “Canet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5217–5226.
  • [50] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
  • [51] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet: Criss-cross attention for semantic segmentation,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 603–612.
  • [52] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [53] M. Zhen, J. Wang, L. Zhou, S. Li, T. Shen, J. Shang, T. Fang, and L. Quan, “Joint semantic segmentation and boundary detection using iterative pyramid contexts,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 666–13 675.
  • [54] Y. Li, L. Song, Y. Chen, Z. Li, X. Zhang, X. Wang, and J. Sun, “Learning dynamic routing for semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 8553–8562.
  • [55] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
  • [56] T. Munkhdalai and H. Yu, “Meta networks,” in Proc. International Conference on Machine Learning, 2017, pp. 2554–2563.
  • [57] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for few-shot learning,” in Proc. Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
  • [58] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales, “Learning to compare: Relation network for few-shot learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1199–1208.
  • [59] Z. Wu, Y. Li, L. Guo, and K. Jia, “Parn: Position-aware relation networks for few-shot learning,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 6659–6667.
  • [60] Z. Peng, Z. Li, J. Zhang, Y. Li, G.-J. Qi, and J. Tang, “Few-shot image recognition with knowledge transfer,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 441–449.
  • [61] Y. Wang, Q. Yao, J. T. Kwok, and L. M. Ni, “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
  • [62] N. Dvornik, C. Schmid, and J. Mairal, “Diversity with cooperation: Ensemble methods for few-shot classification,” in Proc. IEEE International Conference on Computer Vision, 2019, pp. 3723–3731.
  • [63] C. Zhang, Y. Cai, G. Lin, and C. Shen, “Deepemd: Few-shot image classification with differentiable earth mover’s distance and structured classifiers,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 203–12 213.
  • [64] P. Wang, L. Liu, C. Shen, Z. Huang, A. van den Hengel, and H. Shen, “Multi-attention network for one shot learning,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2721–2729.
  • [65] S. Wang, S. Cao, D. Wei, R. Wang, K. Ma, L. Wang, D. Meng, and Y. Zheng, “Lt-net: Label transfer by learning reversible voxel-wise correspondence for one-shot medical image segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9162–9171.
  • [66] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [67] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [68] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik, “Semantic contours from inverse detectors,” in Proc. IEEE International Conference on Computer Vision, 2011, pp. 991–998.
  • [69] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proc. European Conference on Computer Vision, 2014, pp. 740–755.
  • [70] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proc. International Conference on Computational Statistics, 2010, pp. 177–186.