Deep learning has significantly improved the performance of many computer vision tasks such as classification [he2016deep], object detection [Ren2015Faster] and segmentation [chen2018encoder]. However, it relies heavily on manual annotations that are very time-consuming to generate, especially for dense prediction tasks such as image segmentation. To this end, weakly-supervised [ahn2019weakly] and semi-supervised [lake2011one, snell2017prototypical, alfassy2019laso] approaches have attracted researchers' attention.
Few-shot segmentation is a semi-supervised segmentation task that predicts the foreground masks of unseen classes from only a few annotations, aided by the annotations of existing classes. The key step of such a task is to learn, from the known classes, general knowledge that can be easily extended to unseen classes. Existing methods focus on transferring segmentation cues from support images (labeled images) to query images (unlabeled images): they try to learn a general transformation module that transfers segmentation cues from support image to query image for various classes, so that the transferred cues can directly guide the segmentation of query images of unseen classes. Based on this strategy, the existing few-shot segmentation framework is built as a two-branch segmentation network, where the two branches, namely the support branch and the query branch, generate features for the support image and the query image respectively, and a transformation module between them transfers segmentation cues from one branch to the other. Within this framework, existing methods aim to design transformation modules that are more general and efficient, and several types of transformation modules have been proposed [boots2017one, levine2018conditional]. It has been shown that segmentation results can be enhanced by improving the transformation module. However, learning a general transformation module has also proven to be a challenging task [boots2017one].
Different from the existing strategy, we approach few-shot segmentation from a new perspective: how to represent unseen classes by existing classes. The idea is based on the assumption that every class can be formed from a basic attribute set. By learning this basic element set from the known classes, a representation of each unseen class can also be obtained. Therefore, a prior for unseen classes can be established and further used to segment them. In other words, an unseen class is first represented by existing classes through a representation module; this representation is then used to segment regions of the unseen class more efficiently.
Motivated by this, this paper proposes a new few-shot segmentation network based on this representation strategy. A new representation module in terms of the class activation map (CAM) is proposed: images of an unseen class are represented by the regions activated by a classification model trained on the known classes. A two-branch few-shot segmentation network is built. Different from the classical support and query branches of the existing framework, the first branch is a prior generation branch that generates the object prior of the query image in terms of a class activation map via the proposed CAM generation module. The second branch is a segmentation branch that segments the foreground of the query image based on this prior. A new CAM generation module aimed at highlighting unseen classes rather than training classes is proposed: it first learns a CAM extraction module from support images of the unseen class, and then applies it to the query image to extract the prior map. We verify the proposed method on the Pascal VOC 2012 dataset; the FB-IoU values for one-shot and five-shot segmentation reach 69.2% and 70.1% respectively, outperforming several recent methods.
II. Proposed Method
II-A. Problem Definition
Let $\mathcal{S} = \{(I_s^k, M_s^k)\}_{k=1}^{K}$ be the support images and manual annotations for an unseen class, where $K$ is the number of support images and $M_s^k$ is the binary annotation mask for image $I_s^k$. The goal of few-shot segmentation is to build a model that outputs the binary mask $\hat{M}_q$ for a query image $I_q$ based on $\mathcal{S}$, aided by a training dataset $\mathcal{D}_{train}$ of existing classes.
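The episodic setting above can be summarized as a simple function signature. This is a minimal sketch, assuming a hypothetical `model` trained on the known-class dataset; the tensor shapes and thresholding are our assumptions, not the paper's specification.

```python
import torch

def few_shot_episode(model, support_imgs, support_masks, query_img):
    """One K-shot episode: predict the query mask for an unseen class.

    support_imgs:  (K, 3, H, W) images of the unseen class
    support_masks: (K, 1, H, W) binary foreground annotations
    query_img:     (1, 3, H, W) image to segment
    Returns a (1, 1, H, W) binary mask. `model` is any network
    trained on the known-class dataset D_train (hypothetical here).
    """
    with torch.no_grad():
        logits = model(support_imgs, support_masks, query_img)
    return (torch.sigmoid(logits) > 0.5).float()
```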
The proposed network is shown in Fig. 1. It consists of two sub-networks: a CAM generation sub-network and a segmentation sub-network. Given a support image and a query image, the first sub-network generates the class activation map of the query image based on a classification model of the known classes, via the proposed CAM generation module. The second sub-network outputs the segmentation mask of the query image. We next detail the two sub-networks.
II-C. CAM Generation Module for Unseen Classes
Our goal is to represent unseen classes by the existing classes in terms of class activation maps, i.e., to generate class activation maps for unseen classes. Note that classical CAM generation methods cannot be used directly, as unseen classes are not covered by the classification model. Therefore, a new CAM generation module is proposed. Different from classical CAM generation methods that use back-propagated gradients to form the CAM, we form and learn the CAM extraction directly. We first learn a weight vector $w = [w_1, \dots, w_N]$ for the unseen class, where $w_n$ is the weight (similarity) of the $n$-th known class to the unseen class. Then, the probability map of the query image is obtained by averaging the CAMs of the query image highlighted by the different known classes, weighted by the vector $w$. The detailed structure can be found in Fig. 1; the proposed module consists of two steps: learning the weight vector $w$ from support images, and CAM extraction for the query image based on $w$.
II-C1. Learning Weights from Support Images
We intend to make full use of the manual annotations and support images to generate accurate CAMs. A new CAM extraction module, derived from a classical classification network, is proposed to extract the CAM of an unseen class. The structure is shown in Fig. 1. The support image is used to obtain the weight vector $w$. Specifically, we first set the background pixels to zero so that only the foreground of the manual mask is considered. Then, Res50 is used to extract the convolution features of the support image. On the last deep convolution feature of Res50, a convolution is applied to reduce the channel dimension to the number of known classes $N$, where the $n$-th channel is the class activation map for the $n$-th class; we denote the obtained feature as $F_s$. Afterwards, a multi-scale feature extraction block produces the final class activation map, denoted $A_s$, with the same size as $F_s$. This block consists of a feature extraction step and a feature combination step: one convolution and two further convolutions with different receptive fields extract multi-scale features, which are then combined to obtain the refined class activation map. Finally, global average pooling is applied to the multi-scale class activation map to obtain the weight vector $w$.
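The weight-learning step above can be sketched as a small PyTorch module. This is a sketch under stated assumptions: the paper does not give the backbone channel count or the kernel sizes of the multi-scale convolutions, so `in_ch=2048` (Res50's last stage) and the 1/3/dilated-3 kernel choices are our guesses, and `WeightVectorHead` is a hypothetical name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightVectorHead(nn.Module):
    """Sketch of the weight-learning step (Sec. II-C1): masked support
    feature -> N-channel CAM -> multi-scale refinement -> GAP -> w."""

    def __init__(self, in_ch=2048, num_classes=15):
        super().__init__()
        # 1x1 conv reduces the backbone feature to N class-activation channels
        self.to_cam = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        # multi-scale refinement: parallel convs with different receptive
        # fields (kernel choices are assumptions, not from the paper)
        self.scale1 = nn.Conv2d(num_classes, num_classes, 1)
        self.scale3 = nn.Conv2d(num_classes, num_classes, 3, padding=1)
        self.scale5 = nn.Conv2d(num_classes, num_classes, 3, padding=2, dilation=2)

    def forward(self, feat, mask):
        # zero out background so only the annotated foreground contributes
        feat = feat * F.interpolate(mask, size=feat.shape[-2:], mode="nearest")
        cam = self.to_cam(feat)                      # (B, N, h, w), i.e. F_s
        cam = self.scale1(cam) + self.scale3(cam) + self.scale5(cam)  # A_s
        # global average pooling -> one similarity weight per known class
        w = cam.mean(dim=(2, 3))                     # (B, N)
        return w, cam
```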
Note that the proposed CAM extraction module is based on a classification network pre-trained on the known classes. We supervise the learning of the classification network with the cross-entropy loss

$$\mathcal{L}_{cls} = -\sum_{n=1}^{N} y_n \log p_n,$$

where $y_n$ is the class-level label and $p_n$ is the classification score for the $n$-th class.
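A minimal sketch of this image-level supervision, assuming single-label images and that class scores are obtained by global-average-pooling the $N$-channel activation map (a multi-label BCE variant would follow the same pattern; the function name is ours):

```python
import torch
import torch.nn.functional as F

def classification_loss(cam, labels):
    """Image-level supervision for the CAM branch: global-average-pool
    the N-channel activation map into class scores and apply
    cross-entropy against the class-level labels."""
    scores = cam.mean(dim=(2, 3))           # (B, N) classification scores
    return F.cross_entropy(scores, labels)  # labels: (B,) class indices
```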
It can be seen that the weight vector $w$ is very important for generating the CAM of an unseen class. Although the extraction of $w$ is learned automatically, it can be simply interpreted as the similarities between the unseen class and the training classes. Therefore, the CAM of an unseen class can be obtained as the sum of the CAMs of the training classes weighted by their similarities.
II-C2. Extracting CAMs for Query Images
Given a query image of an unseen class, we forward it through the classification network (Res50) to obtain the convolution feature $F_q'$. A convolution layer is then applied to $F_q'$ to obtain $F_q$ with channel number $N$. After that, the multi-scale feature extraction block produces the feature $A_q$ with the same size as $F_q$. Each channel of $A_q$ is then weighted by the corresponding entry of the weight vector $w$, and the weighted channels are summed to obtain the foreground prior $P$ of the query image. This process can be represented by

$$P = \sum_{n=1}^{N} w_n A_q^n,$$

where $A_q^n$ is the $n$-th channel of $A_q$.
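The weighted sum above is a one-line broadcast in PyTorch; the function name is ours:

```python
import torch

def unseen_class_prior(cam_q, w):
    """Foreground prior of the query image: a weighted sum of the
    known-class activation maps, P = sum_n w_n * A_q^n.

    cam_q: (B, N, h, w) known-class CAMs of the query image
    w:     (B, N)       similarity weights learned from the support image
    """
    return (w[:, :, None, None] * cam_q).sum(dim=1, keepdim=True)  # (B, 1, h, w)
```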
The foreground prior $P$ is then forwarded to the segmentation network for foreground prediction. Note that the object regions of the unseen class are obtained based on the relationships between the unseen class and the known classes.
Some class activation maps of unseen classes are shown in Fig. 2. The object regions of the unseen classes are highlighted successfully, which demonstrates the effectiveness of the proposed CAM-based representation module.
II-D. The Few-shot Segmentation Network
After obtaining the class activation map $P$, we normalize it into $[0, 1]$ by

$$\hat{P} = \frac{P - \min(P)}{\max(P) - \min(P)}.$$

Then, the normalized class activation map $\hat{P}$ serves as an attention map that weights the feature $F_q'$ of the query image extracted by the backbone network. The filtered feature is forwarded to a simple deconvolution block to obtain the final segmentation result $\hat{M}_q$. This process can be represented as

$$\hat{M}_q = \mathrm{Deconv}\left(\mathrm{Cat}\left(\{\hat{P} \odot F_q'^{\,c}\}_{c=1}^{C}\right)\right),$$

where $C$ is the number of channels of $F_q'$, $\mathrm{Cat}$ is the concatenation operation, and $\odot$ denotes element-wise multiplication. The loss function is the cross-entropy loss.
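A sketch of this segmentation branch, assuming a 2048-channel backbone feature and a two-layer deconvolution block (the layer widths, upsampling factors, and class name `SegmentationHead` are our assumptions). Concatenating the channel-wise weighted maps is equivalent to broadcasting the attention over the feature tensor:

```python
import torch
import torch.nn as nn

def normalize_cam(p):
    """Min-max normalize the prior map into [0, 1] per image."""
    mn = p.amin(dim=(2, 3), keepdim=True)
    mx = p.amax(dim=(2, 3), keepdim=True)
    return (p - mn) / (mx - mn + 1e-6)

class SegmentationHead(nn.Module):
    """Sketch of the segmentation branch: the normalized CAM weights the
    query feature channel-wise, then a small deconvolution block
    upsamples to foreground/background logits."""

    def __init__(self, in_ch=2048):
        super().__init__()
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(in_ch, 256, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 2, 4, stride=2, padding=1),  # fg/bg logits
        )

    def forward(self, feat_q, prior):
        attn = normalize_cam(prior)   # (B, 1, h, w) in [0, 1]
        filtered = feat_q * attn      # broadcast the attention over C channels
        return self.decode(filtered)  # (B, 2, 4h, 4w)
```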
|$S_1$| aeroplane, bicycle, bird, boat, bottle|
|$S_2$| bus, car, cat, chair, cow|
|$S_3$| diningtable, dog, horse, motorbike, person|
|$S_4$| potted plant, sheep, sofa, train, tv/monitor|
TABLE I: The 20 classes divided into four sub-datasets, where $S_i$ represents the $i$-th subset, $i \in \{1, 2, 3, 4\}$.
II-E. Training and Inference
In the training stage, because the proposed network is end-to-end, we train it on the known classes directly. Details of the training settings can be found in Section III-A. It is worth noting that the class activation map is implicitly represented by the features $F$ and $A$ here; therefore, the CAM extraction for unseen classes is learned automatically and directly, without back propagation.
In the inference stage, the segmentation result is obtained directly by the network without fine-tuning.
We implement the proposed network in PyTorch. The Adam optimizer is used to update parameters, on one Nvidia Titan XP GPU. The learning rate is set to 1e-4 and decayed by a factor of 0.7 every 10 epochs. The backbone is Res50 pre-trained on ImageNet, with the top three layers frozen during training. Input images are resized to a fixed size.
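The optimizer and learning-rate schedule above map directly onto PyTorch primitives; a minimal sketch with a stand-in network (the real model and the exact layers frozen are not specified in code here):

```python
import torch

# Training setup as described: Adam, lr 1e-4, decayed by a factor of
# 0.7 every 10 epochs. The one-layer `model` is only a stand-in.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.7)

# Freezing early backbone stages is done by disabling gradients, e.g.
# for a real Res50:  for p in list(resnet50.children())[:3]:
#                        p.requires_grad_(False)
```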
III-A. Implementation Details
We conduct experiments on the Pascal VOC 2012 dataset [Everingham2015The] and its augmentation SBD [hariharan2011semantic]. Following [boots2017one], we split the 20 classes into 4 sub-datasets, each containing 5 classes; details can be found in Table I. Of the four sub-datasets, one is selected as the unseen dataset for evaluation, and the other three are used as known datasets for training. Image pairs for training are randomly selected from the training dataset. For a fair comparison with existing methods, we use the same seed for random sampling and select the same 1000 image pairs in the testing stage. In the training stage, we use two of the three training sub-datasets (ten classes) to train the classification network, and the remaining sub-dataset (five classes) as unseen classes to train the proposed network.
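The fold split of Table I can be written down explicitly; a small helper (the function name is ours) that returns the known/unseen class partition for a chosen evaluation fold:

```python
PASCAL_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "potted plant", "sheep", "sofa", "train", "tv/monitor",
]

def pascal_folds(test_fold):
    """Split the 20 Pascal VOC classes into 4 folds of 5 (Table I);
    one fold is held out as unseen, the other three are known."""
    folds = [PASCAL_CLASSES[i * 5:(i + 1) * 5] for i in range(4)]
    unseen = folds[test_fold]
    known = [c for i, f in enumerate(folds) if i != test_fold for c in f]
    return known, unseen
```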
FB-IoU [levine2018conditional], which averages the intersection-over-union of the foreground and the background, is used for objective evaluation.
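For a single image pair, FB-IoU reduces to averaging two IoUs over the binary mask and its complement; a minimal NumPy sketch (per-image; the benchmark averages over the test set):

```python
import numpy as np

def fb_iou(pred, gt):
    """FB-IoU: mean of foreground IoU and background IoU on binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    ious = []
    for p, g in ((pred, gt), (~pred, ~gt)):  # foreground, then background
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
    return sum(ious) / 2
```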
III-B. Subjective Results
Subjective results of the proposed method are shown in Fig. 3. From left to right, the columns display the support images, the ground truth of the support images, the query images, the ground truth of the query images, and the segmentation results. The first three rows show successful results: the proposed method segments the objects in these images successfully. The last row displays a failure case, where the region of a "dog" is wrongly segmented as "cat". This is caused by the fact that "cat" and "dog" are very similar, so the network tends to segment both object regions as foreground.
III-C. Objective Results and Comparisons with Benchmarks
We next report the objective results in terms of FB-IoU, and compare the proposed method with several recent few-shot segmentation methods. The results are displayed in Table II for both one-shot and five-shot segmentation. The FB-IoU of one-shot segmentation averaged over the four evaluation sub-datasets is 69.2%, which is better than the comparison methods. The FB-IoU of five-shot segmentation is 70.1%, which also outperforms the comparison methods. This demonstrates the effectiveness of the proposed method.
IV. Conclusion

In this paper, a new few-shot segmentation strategy based on class representation is proposed, and a novel few-shot segmentation network is established. The proposed network consists of two branches: a CAM generation branch that obtains the class activation map of the query image based on a classification model pre-trained on the known classes and the support image of the unseen class, and a segmentation branch that segments the foreground of the query image based on this class activation map. A new CAM generation module for unseen classes is proposed. The proposed method is verified on the Pascal VOC dataset, and experimental results demonstrate its effectiveness with larger FB-IoU values.
This work was supported in part by the National Natural Science Foundation of China under Grant 61871087, Grant 61502084, Grant 61831005, and Grant 61601102, and supported in part by Sichuan Science and Technology Program under Grant 2018JY0141.