Semantic segmentation, one of the most challenging tasks in computer vision, aims to assign a categorical label to each pixel of an image according to its enclosing object or region. In the past few years, a number of deep-neural-network-based approaches such as FCN, DeepLab, and PSPNet have been proposed for the semantic segmentation task. However, these approaches typically require large-scale pixel-level annotations to train their model parameters, which are expensive to obtain. Some weakly- and semi-supervised segmentation models have been proposed to reduce the dependence on pixel-level annotated data, but they still suffer from poor model generalization, which makes them hard to apply to unseen categories.
Recently, there has been increasing interest in the study of few-shot learning [30, 5, 6, 27], i.e., learning a novel concept from a few labeled examples, mostly focusing on the image classification task. The purpose of the few-shot segmentation problem is to learn a model that can perform segmentation on novel classes with only a few pixel-level annotated images. Existing studies [3, 31, 33] on few-shot segmentation are based on meta learning, where the model learns a metric space across different training tasks and performs segmentation on new classes with similarity measurements, such as distance-based or deep metric-based methods. However, the current setup of each episode differs from how humans learn new concepts in many dimensions, which may limit the learning ability of the model. For instance, given a few annotated examples and a set of unlabeled images that contain the target object, even if the location of the target object in each image is not specified, humans are still capable of inferring various forms of a new concept and learning extra knowledge from these unlabeled images.
To better mimic human learning behaviors, we consider a new scenario where the segmentation of novel classes is learned from a combination of pixel- and image-level annotations, as shown in Figure 1. Specifically, in an N-way K-shot segmentation task, we aim to perform segmentation on the query images given pixel-level and image-level annotated images from each of the N classes. It is worth emphasizing that mixing strong (pixel-level) and weak (image-level) annotations is a widely used setting for improving model performance in existing semantic segmentation work [19, 8, 16, 28], but this paper is the first to consider doing so in the few-shot segmentation task. Such image-level labeled images can be obtained from crowd-sourcing or from existing public image datasets like ImageNet. In many cases, it is much easier for a human to determine the existence of the target object in an image than to make a pixel-level annotation.
To the best of our knowledge, our proposed approach is the first few-shot semantic segmentation model that can integrate weak image-level annotations with traditional pixel-level labels. Considering both pixel-level and image-level labels simultaneously in the few-shot semantic segmentation task is nontrivial because of 1) the discrepancies between pixel-level segmentation labels and image-level weak annotations; and 2) the distraction caused by non-target objects in image-level annotated images. To tackle these issues, we propose a few-shot semantic segmentation model augmented with image-level labels (FSIL), as shown in Figure 1. To sum up, our main contributions are as follows. First, we propose a class prototype augmentation method that learns the prototype representation in the metric space by utilizing image-level annotations. Second, we propose a soft-masked average pooling strategy for enhanced prototype generation that handles distraction in image-level annotations. Third, extensive empirical results on PASCAL-5i show that our FSIL method achieves 5.1% and 8.2% increases in mIoU score for one-shot settings with pixel-level and scribble annotations, respectively.
The remainder of this paper is organized as follows. Section 2 reviews related work in the few-shot semantic segmentation and weakly-supervised segmentation categories. Section 3 provides a formal problem formulation. The proposed FSIL model is presented in Section 4. In Section 5, the experimental results are analyzed and the paper concludes with a summary of our work in Section 6.
2 Related Work
The prior work related to this paper is summarized below in the categories of semi- and weakly-supervised segmentation, few-shot learning and few-shot semantic segmentation.
2.1 Semi-supervised and Weakly-supervised Segmentation
In light of the expensive cost of pixel-level labels, the segmentation task has recently received increasing interest under semi- and weakly-supervised training schemes. Weakly-supervised methods aim to use weak labels such as image-level class labels [19, 32] and bounding boxes [2, 12] to train the segmentation model, achieving performance similar to fully supervised models. The general idea of existing work [28, 10, 17] on the semi-supervised segmentation task is to employ a GAN-based method that learns the distribution of images and generates additional images to improve the performance of the segmentation network. However, these approaches still require large training sets, which limits their widespread use in real applications.
2.2 Few-shot Learning
Few-shot learning aims to learn general knowledge that can be applied to learn a novel class from a few examples. Most of the proposed methods are deep meta-learning models, which optimize the base learner using learning experiences from multiple similar tasks. One class of existing methods for few-shot learning is gradient-based methods, which adapt the model to a novel class within a few fine-tuning updates [5, 6, 11]. The typical model in this class is MAML, which learns the initial parameters of the base learner so that the model can adapt quickly to a novel task. Finn et al. extend it with a probabilistic algorithm trained with variational approximation. Another class of few-shot learning approaches is metric learning, which learns a metric space across different tasks [13, 30, 27]. Prototypical Network takes the average over all sample embeddings to represent each class and performs nearest-neighbor matching to classify data. Relation Network replaces distance-based prediction with a learned relation module that compares the query with each class.
2.3 Few-shot Semantic Segmentation
The few-shot semantic segmentation task was first addressed by using a support branch to predict the weights of the last layer of the query branch. AMP adopts a similar idea, in which the convolutional filters of the final query segmentation layer are imprinted with the embedding of the support set. Most existing approaches to few-shot semantic segmentation are based on metric learning methods [3, 33, 31, 18]. CANet employs the same embedding module for both the support and query sets and learns a dense comparison module to perform segmentation. Prototype-learning-based methods have also been applied to the few-shot segmentation problem [3, 31]. PANet takes the pixel-wise averaged feature embedding as the class prototype and trains the metric space with prototype alignment constraints between support and query prototypes.
3 Problem Setting
Our goal is to train a model on a large labeled dataset D_train such that it can segment images from a testing dataset D_test given only a few annotated examples. The class set C_train of D_train has no overlap with the class set C_test of D_test, i.e., C_train ∩ C_test = ∅. Following previous work [31, 33], we adopt an episodic paradigm for the few-shot segmentation task. In particular, given an N-way K-shot task, sets of episodes are sampled from D_train and D_test, respectively. Each episode i is composed of 1) a support set S_i, containing K (image, mask) pairs for each of the N foreground categories, where each pair consists of a support image and its corresponding binary mask for the foreground class; 2) a query set Q_i, which contains different query sample pairs from the same N categories; and 3) an auxiliary set A_i, containing image-level labeled images for each of the same N categories, for which no pixel-level annotation is available. The set of all target foreground classes for episode i is denoted C_i, with |C_i| = N. For each episode i, the model is expected to segment the images from Q_i using the combination of S_i and A_i. Since each foreground class has many pixel-level labeled data pairs and image-level labeled images, the model is trained on different combinations of samples, which helps avoid over-fitting.
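The episodic sampling described above can be sketched in a few lines. This is a minimal illustration under an assumed data layout (a dict mapping each class to its pixel-level pairs and its image-level images); the function name `sample_episode` and its parameters are ours, not the paper's.

```python
import random

def sample_episode(dataset, classes, n_way, k_shot, n_query, m_aux):
    """Sample one N-way K-shot episode.

    dataset["pixel"][c] holds (image, mask) pairs for class c;
    dataset["image"][c] holds image-level labeled images for class c.
    Returns (support, query, auxiliary) dicts keyed by episode class.
    """
    episode_classes = random.sample(classes, n_way)  # the N target classes C_i
    support, query, auxiliary = {}, {}, {}
    for c in episode_classes:
        # Draw disjoint pixel-level pairs for support and query.
        pairs = random.sample(dataset["pixel"][c], k_shot + n_query)
        support[c] = pairs[:k_shot]      # K pixel-level annotated pairs
        query[c] = pairs[k_shot:]        # query pairs to be segmented
        # Image-level labeled images: no masks available.
        auxiliary[c] = random.sample(dataset["image"][c], m_aux)
    return support, query, auxiliary
```

Training iterates over many such episodes so the model sees varied combinations of strongly and weakly labeled samples per class.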
To introduce our model, we first show a global view of the framework. Then we describe the first stage of the model: prototype representation learning. Finally, we introduce the second stage of the model, which contains soft-masked average pooling, distilled soft-masked average pooling, and an iterative fusion module.
4.1 Overall Architecture
We propose a new framework that can solve the few-shot segmentation problem with a combination of pixel- and image-level labeled data. The main idea of our model is to learn a better prototype representation of each class by fusing knowledge from the image-level labeled data. Specifically, the original prototypes are first obtained from the support set and are used to classify the image-level annotated images. The most confidently predicted pixels are added to the support set, with the predictions of the original prototypes serving as pseudo labels. To this end, we propose a novel prototype fusion strategy that contains a distilled soft-masked average pooling method and an iterative fusion module. Figure 2 shows an overview of our model. In particular, for each episode, the model first obtains the embedding features of the support, auxiliary, and query images via the feature encoder module. The original prototypes are computed by masked average pooling over the support features and masks. Then the Iterative Fusion Module (IFM) segments the image-level labeled images via the original prototypes and re-feeds the embedding features of those image-level labeled images through the proposed distilled soft-masked average pooling method. Finally, we predict the mask of the query image using the fused prototypes obtained from the Iterative Fusion Module.
4.2 Prototype Representation Learning
Inspired by Wang et al., we represent each category of the segmentation task as a prototype in the metric space. The original class-specific prototypes are obtained by employing masked average pooling over the support set, which averages the features of only those pixels belonging to the support classes, and each pixel of the query image is labeled by its nearest prototype in the metric space. Thus, the prototype p_c of foreground class c is defined as follows:

p_c = MAP(F(I), M_c) = Σ_(x,y) F(I)_(x,y) M_c(x,y) / Σ_(x,y) M_c(x,y), averaged over the support subset S_c,

where S_c is the subset of the support set belonging to class c. We have I ∈ R^(W×H×3) and M_c ∈ {0,1}^(W×H), where W and H denote the width and height of the image, respectively. F is a feature encoder function that maps the image into the feature space, and MAP denotes the masked average pooling function. Moreover, the background prototype p_bg is computed by averaging the features of all pixels that do not belong to any foreground class in C_i.
Accordingly, for each episode i we obtain a prototype set P_i = {p_c | c ∈ C_i} ∪ {p_bg} containing all prototypes. Then we compute the distances between the embedding feature of each pixel (x, y) in the image and every prototype in P_i. The probability of pixel (x, y) belonging to class c is formed as follows:

prob_c(x, y) = exp(−α d(F(I)_(x,y), p_c)) / Σ_(p_j ∈ P_i) exp(−α d(F(I)_(x,y), p_j)),

and the predicted class label of pixel (x, y) is the class with the highest probability. Following previous work, the distance function d is taken to be the cosine distance multiplied by a factor α, and the multiplier α is fixed at 20.
4.3 Prototype Fusion with Soft-masked Average Pooling
Our model enhances the prototypes by extracting additional class representation knowledge from the image-level annotations. The most intuitive way to incorporate those image-level annotations is to obtain their pseudo masks using the segmentation prediction method in Eq. 4 and add them directly to the support set. However, this process may introduce noise into the support set: since the original prototypes can be biased due to data scarcity, the prediction results on the image-level labeled data may be inaccurate. To tackle this issue, we propose a soft-masked pooling method. Instead of assigning the same weight to each pixel belonging to the support class, we give each pixel a partial assignment based on its probability of falling into the class. Pixels with lower predicted confidence receive lower weights, preventing them from distracting the original prototypes. Specifically, for each foreground class c, we first compute the predicted probability map and the pseudo binary mask of each auxiliary image based on Eq. 3 and Eq. 4, where the mask indicator at a pixel is set to 1 if that pixel is predicted as class c. A pseudo mask set is thus obtained for the auxiliary set. Then, we compute the representative vector v_c by averaging the pixels within the object regions on the feature map, weighted by their predicted probabilities. Thus, the soft-masked average pooling can be formed as:
In this way, the original prototypes can be enhanced by incorporating partial information from the image-level labeled samples. The fused prototypes can be computed as follows:

where A_c and the corresponding pseudo masks are the subsets of the auxiliary set and the pseudo mask set with respect to class c, respectively.
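A minimal sketch of soft-masked average pooling as described above: each pixel inside the pseudo mask contributes in proportion to its predicted probability, so low-confidence pixels barely move the prototype. Names and shapes are illustrative assumptions.

```python
import numpy as np

def soft_masked_average_pooling(features, prob_map, pseudo_mask):
    """features: (H, W, D); prob_map: (H, W) predicted probability of the
    class at each pixel; pseudo_mask: (H, W) binary argmax mask.
    Each pixel is weighted by its confidence instead of a hard 0/1 weight."""
    w = (prob_map * pseudo_mask)[..., None]   # soft weight per pixel, (H, W, 1)
    return (features * w).sum(axis=(0, 1)) / (w.sum() + 1e-8)
```

With `prob_map` set to all ones this degenerates to ordinary masked average pooling, which is the intended relationship between the two operators.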
4.4 Prototype Fusion with Distilled Soft-masked Average Pooling
In our problem setting, each image-level labeled image contains at least one foreground class of the episode. Therefore, the categories of the inferred segmentation mask belong to at least two of the classes (including the background class), which means the image-level labeled set A_i contains no pure distractor images. However, when computing the prototype of the background class, we treat all pixels not belonging to any foreground class as the same category. This means we cannot guarantee that two support sets have similar background class representations even if their foreground classes are the same. Moreover, the image-level annotated images may contain unseen objects that neither appear in the support backgrounds nor belong to any foreground class in the support set. For example, in Figure 1, the second image in the auxiliary set has some potted plants in the background, which are never seen in the support set.
Under this circumstance, features of unlabeled pixels could still receive a pseudo label with high confidence even if they are far away from all prototypes in the metric space, so the uncertainty of these unseen objects may reduce the accuracy of the fused prototypes. To alleviate this issue, we apply a filtering strategy for each prototype when applying soft-masked average pooling over those unlabeled images, which we call distilled soft-masked average pooling. We compute a threshold based on the statistics of the distances between pixels and the prototypes. Specifically, we first compute the distance between prototype p and each pixel (x, y), obtaining a distance matrix of the image for prototype p. A normalized distance set is then obtained by normalizing each distance in this matrix. Finally, the filter threshold for prototype p in each episode is defined as follows:
For each foreground class c, the distraction indicator of pixel (x, y) can be computed as follows:
The indicator for each category is applied to filter out pixels that are not worth considering, and these per-pixel indicators form the indicator set. In this way, the model is forced to extract only class-related pixels instead of considering the whole image, which may contain novel object classes in the background. Therefore, Eq. 6 for the fused prototype computation can be updated as follows:

where ⊙ is the element-wise product between two vectors, and the last term is the subset of the indicator set belonging to class c.
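The distillation step can be sketched as follows. Note that the paper derives the threshold with a small MLP over distance statistics; the quantile-based threshold below is an illustrative stand-in, not the paper's learned rule.

```python
import numpy as np

def distraction_indicator(features, prototype, quantile=0.5):
    """features: (H, W, D); prototype: (D,).  Keeps only pixels whose
    cosine distance to the prototype is below a data-dependent threshold.

    The threshold here is a quantile of the normalized per-image distances
    (an assumed stand-in for the paper's MLP-derived threshold).
    Returns an (H, W) indicator: 1 = keep, 0 = likely distractor."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    p = prototype / (np.linalg.norm(prototype) + 1e-8)
    dist = 1.0 - f @ p                             # cosine distance per pixel
    lo, hi = dist.min(), dist.max()
    norm = (dist - lo) / (hi - lo + 1e-8)          # normalized distance set
    tau = np.quantile(norm, quantile)              # filter threshold
    return (norm <= tau).astype(float)
```

The indicator is then multiplied element-wise into the soft-masked pooling weights, so far-away (likely unseen-object) pixels are excluded from the fused prototype.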
4.5 Iterative Fusion Module
Intuitively, if the knowledge extracted from the image-level annotations in the auxiliary set can improve the performance of our model, we can also utilize the image-level annotated images from the query set. Previous work found that the initial prediction provides an important clue about the rough position of the objects. Accordingly, we can update the fused prototypes with the image-level annotations from the query set as follows:
Here the two sets are the pseudo mask set and the indicator set of the query set in episode i, respectively. As the original prototypes are inevitably biased due to data scarcity, the confidence of the initial probability maps of those images may not be high enough to be trusted. Therefore, we iteratively repeat the refinement for several steps to optimize the fused prototypes in the Iterative Fusion Module (IFM). This process is shown in Figure 2. In particular, we first compute the probability maps via the original prototypes and feed the embedding features, processed with distilled soft-masked average pooling, into the IFM. In each subsequent iteration, we use the fused prototypes from the previous iteration to compute the probability maps.
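The IFM loop above can be sketched as follows, reusing scaled-cosine classification and soft-masked pooling on flattened auxiliary features. The fixed mixing weight `mix` is our assumption; the paper's fusion weights come from its Eq. 6.

```python
import numpy as np

def cosine_probs(features, prototypes, alpha=20.0):
    """features: (P, D) flattened pixels; prototypes: (C, D).
    Returns (P, C) softmax probabilities over scaled cosine similarity."""
    f = features / (np.linalg.norm(features, axis=-1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + 1e-8)
    logits = alpha * f @ p.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def iterative_fusion(aux_features, prototypes, n_iters=3, mix=0.5):
    """Repeatedly segment the image-level labeled pixels with the current
    prototypes and fold the confidently assigned features back in.
    `mix` keeps part of the original prototype each step (an assumption)."""
    fused = prototypes.copy()
    for _ in range(n_iters):
        probs = cosine_probs(aux_features, fused)   # pseudo labels, (P, C)
        hard = probs.argmax(axis=-1)
        for c in range(len(prototypes)):
            w = probs[:, c] * (hard == c)           # soft-masked weights
            if w.sum() > 0:
                v = (aux_features * w[:, None]).sum(0) / w.sum()
                fused[c] = mix * prototypes[c] + (1 - mix) * v
    return fused
```

Anchoring each step to the original support prototype (via `mix`) is one simple way to keep refinement from drifting when early pseudo labels are noisy.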
To avoid diluting the knowledge learned from the support set during the IFM, we adopt a similar idea from prior work to compute the support loss. Unlike predicting a segmentation mask on the support image using prototypes obtained from the prediction results of the query set, we compute the probability maps of both the support images and the query images via the fused prototypes. We then compute the support and query losses by applying the standard cross-entropy function to their probability maps and pixel-level ground-truth annotations, respectively; the loss function for training our model combines these two terms. In this way, the model is forced to learn a consistent embedding space and to retain the knowledge from the support set when integrating the prototypes.
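The training objective can be sketched as pixel-wise cross-entropy applied to both branches. Equal weighting of the two terms is our assumption; the paper only states that both losses are combined.

```python
import numpy as np

def segmentation_ce(prob_maps, gt_mask, eps=1e-8):
    """Pixel-wise cross-entropy between predicted class probabilities
    (H, W, C) and integer ground-truth labels (H, W)."""
    h, w, _ = prob_maps.shape
    # Pick the predicted probability of the true class at every pixel.
    picked = prob_maps[np.arange(h)[:, None], np.arange(w)[None, :], gt_mask]
    return -np.log(picked + eps).mean()

def total_loss(query_probs, query_gt, support_probs, support_gt):
    """Query loss plus support loss, both computed from the fused
    prototypes (equal weighting is an assumption)."""
    return segmentation_ce(query_probs, query_gt) + segmentation_ce(support_probs, support_gt)
```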
5.1.1 Datasets
We evaluate the performance of our model on two common few-shot segmentation datasets: PASCAL-5i and COCO-20i. The PASCAL-5i dataset was proposed by Shaban et al. and is created from PASCAL VOC 2012 with SBD augmentation. The 20 categories in PASCAL VOC are evenly divided into 4 splits, each containing 5 categories. We use the remaining images in PASCAL VOC 2012 that have category information but no segmentation labels as the auxiliary set. Similarly, COCO-20i is built from MS COCO, with its 80 categories split into 4 folds. For a fair comparison, we adopt the same split strategy as previous work. As all images in MS COCO have corresponding segmentation labels, we use the images in its validation folder as the auxiliary set. For both datasets, models are trained on 3 splits and evaluated on the remaining one in a cross-validation fashion. Following the same testing scheme, we average the results of 5 runs with different random seeds, each run containing 1,000 episodes, to obtain stable results. The same number of image-level annotated images is used for all experiments.
5.1.2 Evaluation Metric
We adopt two common metrics in the few-shot segmentation task [31, 33, 24] to evaluate the model performance: mean-IoU and binary-IoU. Concretely, mean-IoU calculates the Intersection-over-Union of each class and takes the average IoU over all foreground classes. Binary-IoU ignores the difference between categories and treats all object categories in the support set as one foreground class and averages the IoU of foreground and background. As mean-IoU considers the differences between foreground classes, it can reflect the model performance more accurately than binary-IoU. Thus, we mainly use mean-IoU to report the experiment results.
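The two metrics can be made concrete as follows (a straightforward sketch; function names are ours).

```python
import numpy as np

def mean_iou(pred, gt, classes):
    """Intersection-over-Union per foreground class, averaged over classes.
    pred, gt: (H, W) integer label maps; 0 is background."""
    ious = []
    for c in classes:
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def binary_iou(pred, gt):
    """Treat all foreground classes as one class and average the IoU of
    foreground and background."""
    ious = []
    for fg in (True, False):
        p, g = (pred > 0) == fg, (gt > 0) == fg
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return float(np.mean(ious))
```

Because binary-IoU merges all foreground classes, it cannot penalize confusing one foreground class for another, which is why mean-IoU is the more informative metric here.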
5.1.3 Implementation Details
We adopt a VGG-16 network as the feature extractor, following convention. The first 5 convolutional blocks of VGG-16 are kept for feature extraction, and the layers after the 5th convolutional block are removed. To maintain a large spatial resolution, the stride of the last max-pooling layer is set to 1, and the convolutions in the 5th block are replaced by dilated convolutions with rate 2 to enlarge the receptive field. For the MLP used in distilled soft-masked pooling, we use a single hidden layer with 20 hidden units and a tanh non-linearity. We implement the model in PyTorch. Following previous work [34, 31, 18], we pretrain the CNN on ImageNet. All images are resized to a fixed size and augmented by random horizontal flipping. The network is trained end-to-end by SGD with a learning rate of 1e-3, momentum of 0.9, and weight decay of 5e-4. We train the model for 20,000 iterations with a batch size of 1. The learning rate is multiplied by 0.1 after 10,000 iterations.
5.2 Comparison with the State-of-the-art Methods
5.2.1 PASCAL-5i
We first compare our model with the state-of-the-art methods on the PASCAL-5i dataset in the 1-way setting. Table 1 shows the results under the mean-IoU metric, and Table 3 shows the results under the binary-IoU metric. For a fair comparison, we quote the results produced with the VGG backbone. Our model outperforms the state-of-the-art methods under both evaluation metrics. Specifically, under the mean-IoU metric, our model achieves improvements in both the 1-way 1-shot and 5-shot tasks, which shows that combining strongly and weakly annotated images improves performance in the few-shot segmentation task, especially when only one pixel-level annotated image is available.
We also present the results in 2-way 1-shot and 5-shot settings to validate the effectiveness of the model on multi-way few-shot segmentation tasks, as shown in Table 2. Our FSIL model outperforms previous works, especially on the 1-shot setting, surpassing the state-of-the-art method by 6.7%.
5.2.2 MS COCO
Table 4 shows the evaluation results on the MS COCO dataset. Compared with PASCAL VOC, MS COCO has more object categories, which makes the differences between the two evaluation metrics more significant. Thus, we adopt the mean-IoU score to evaluate performance. Our model outperforms the previous PANet, which indicates that our model is able to extract class-related knowledge from the image-level annotations even when they contain more unseen objects.
5.2.3 Qualitative Results.
5.3 Ablation Study
We conduct extensive ablation experiments on the PASCAL-5i dataset to evaluate the effectiveness of the different components of our network, using the mean-IoU metric in the 1-way 1-shot task. In Table 5, we compare our model with two baseline models. The first does not adopt the distilled strategy when applying soft-masked pooling (DSMP), and is denoted FSIL-Smp. The second does not employ the iterative fusion module for the fused prototypes, i.e., it uses the initial prediction from FSIL (FSIL-Init).
As shown in Table 5, the distilled soft-masked pooling method achieves a 2.2% improvement over the soft-masked pooling method. In addition, the iterative fusion module yields an improvement of 1.1% over the initial prediction. The combination of both modules achieves the best performance.
5.4 Analysis on the Number of Image-Level Annotations
From Fig. 5, we observe improvements in segmentation performance as the number of image-level annotations increases, in both the 1-shot and 5-shot settings. Specifically, the improvement in 1-shot learning is larger than that in 5-shot learning, since the image-level annotations provide more useful information when pixel-level annotation is extremely limited. Moreover, we observe that the performance with 25 annotations is slightly worse than with 20 in 5-shot learning. This is mainly because the image-level annotations may introduce some distraction into the prototype representation.
5.5 Results on Weak Annotations
We evaluate FSIL with two types of weak annotations: scribbles and bounding boxes. The pixel-level annotations of the support set are replaced by scribbles or bounding boxes. For a fair comparison, we use the same annotation generation method as previous work: scribbles are generated automatically from the dense segmentation masks, and bounding boxes are randomly chosen from the instance masks.
As shown in Table 6, for scribble annotations, our model achieves significant improvements in both the 1-shot and 5-shot tasks. This performance is comparable to the results with an expensive pixel-level annotated support set, which means our model works very well with sparse annotations. In addition, with bounding box annotations, our model significantly outperforms the state-of-the-art methods in both the 1-shot and 5-shot tasks. This demonstrates that our model better withstands the noise introduced by the background area within the bounding box. Furthermore, the improved performance with weak annotations validates the robustness of our model. Qualitative results using scribble and bounding box annotations are shown in Fig. 6.
6 Conclusion
In this paper, a novel weak-annotation-augmented few-shot segmentation model was proposed to learn refined prototypes from both pixel-level segmentation labels and weak image-level annotations. To achieve this, we proposed a new framework that learns the class prototype representation in the metric space with the information from image-level annotations. Moreover, a distilled soft-masked average pooling method was designed to handle distraction in image-level weak annotations. Our evaluation results demonstrated that the proposed method outperforms existing few-shot segmentation models by a significant margin.
-  (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §1.
-  (2015) Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1635–1643. Cited by: §2.1.
-  (2018) Few-shot semantic segmentation with prototype learning.. In BMVC, Vol. 3. Cited by: §1, §2.3, Table 1, Table 3.
-  (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §5.1.1.
-  (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §2.2.
-  (2018) Probabilistic model-agnostic meta-learning. In Advances in Neural Information Processing Systems, pp. 9516–9527. Cited by: §1, §2.2.
-  (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. Cited by: §5.1.1.
-  (2015) Decoupled deep neural network for semi-supervised semantic segmentation. In Advances in neural information processing systems, pp. 1495–1503. Cited by: §1.
-  (2019) Attention-based multi-context guiding for few-shot semantic segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8441–8448. Cited by: Table 3.
-  (2018) Adversarial learning for semi-supervised semantic segmentation. arXiv preprint arXiv:1802.07934. Cited by: §2.1.
-  (2018) Learning to learn with conditional class dependencies. In International Conference on Learning Representations, Cited by: §2.2.
-  (2017) Simple does it: weakly supervised instance and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 876–885. Cited by: §2.1.
-  (2015) Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2. Cited by: §2.2.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.1.1.
-  (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §1.
-  (2017) Deep dual learning for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2718–2726. Cited by: §1.
-  (2019) Semi-supervised semantic segmentation with high-and low-level consistency. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.1.
-  (2019) Feature weighting and boosting for few-shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 622–631. Cited by: §2.3, §5.1.3, §5.2.1, Table 1.
-  (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 1742–1750. Cited by: §1, §2.1.
-  (2017) Automatic differentiation in pytorch. Cited by: §5.1.3.
-  (2018) Conditional networks for few-shot semantic segmentation. Cited by: Table 1, Table 3.
-  (2018) Meta-learning for semi-supervised few-shot classification. arXiv preprint arXiv:1803.00676. Cited by: §4.4.
-  (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1, §5.1.3.
-  (2017) One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410. Cited by: §2.3, §5.1.1, §5.1.2, Table 1, Table 3.
-  (2019) AMP: adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5249–5258. Cited by: §2.3, Table 1, Table 3.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §5.1.3.
-  (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §1, §2.2.
-  (2017) Semi supervised semantic segmentation using generative adversarial network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5688–5696. Cited by: §1, §2.1.
-  (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §2.2.
-  (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §1, §2.2.
-  (2019) Panet: few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9197–9206. Cited by: §1, §2.3, §3, §4.2, §4.2, §4.5, §5.1.1, §5.1.2, §5.1.3, §5.2.1, §5.2.1, §5.5, Table 1, Table 2, Table 3, Table 4, Table 6.
-  (2018) Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7268–7277. Cited by: §2.1.
-  (2019) Canet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5217–5226. Cited by: §1, §2.3, §3, §4.5, §5.1.2.
-  (2018) Sg-one: similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091. Cited by: §5.1.3, Table 1, Table 2, Table 3.
-  (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890. Cited by: §1.