SG-One: Similarity Guidance Network for One-Shot Semantic Segmentation
One-shot semantic segmentation poses the challenging task of recognizing object regions from unseen categories with only one annotated example as supervision. In this paper, we propose a simple yet effective Similarity Guidance network to tackle the One-shot (SG-One) segmentation problem. We aim at predicting the segmentation mask of a query image with reference to one densely labeled support image. To obtain a robust representative feature of the support image, we first propose a masked average pooling strategy that produces the guidance features using only the pixels belonging to the annotated object in the support image. We then leverage the cosine similarity to build the relationship between the guidance features and the features of pixels from the query image. In this way, the information embedded in the produced similarity maps can be used to guide the process of segmenting objects. Furthermore, SG-One is a unified framework that efficiently processes both support and query images within one network and is learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, SG-One achieves an mIoU score of 46.3%.
Object Semantic Segmentation (OSS) aims at predicting the class label of each pixel. Deep neural networks such as U-Net, FCN and Mask R-CNN have achieved tremendous success on OSS tasks. However, these algorithms are trained with full annotations, which require a large investment in expensive labeling. To reduce the budget, a promising alternative is to learn from weak annotations, e.g. image-level labels [25, 23, 24], scribbles [15, 21], bounding boxes [13, 6] and points. The main disadvantage of weakly supervised methods, however, is their inability to generalize to unseen classes. For example, if a network is trained to segment dogs using thousands of images containing various breeds of dogs, it will not be able to segment bikes without being finetuned on many images containing bikes.
In contrast, humans are very good at recognizing things with little guidance. For instance, it is very easy for a child to recognize various breeds of dogs with reference to only one picture of a dog. Inspired by this, one-shot learning is dedicated to imitating this powerful human ability. In other words, one-shot learning aims to recognize new objects from only one annotated example. This is a great challenge for the standard learning methodology. Instead of using a large number of annotated instances to learn the characteristic patterns of a specific category, our target is to learn one-shot networks that generalize to unseen classes with only one densely annotated example.
Concretely, one-shot segmentation is to discover the object regions of a query image with reference to only one support image. Both the query and support images are sampled from the same unseen category. Caelles et al. propose a method for one-shot video object segmentation. However, this method needs to finetune the parameters during testing, which costs extra computational resources. Shaban et al. are the first to study the one-shot image semantic segmentation problem. They propose a model named OSLSM that adopts a Siamese network for semantic segmentation. A pair of parallel networks is trained to extract the features of labeled support images and query images separately. These features are then fused to generate the probability maps of the target objects, and segmentation masks are finally obtained by thresholding the probability maps. Rakelly et al. and Dong et al. inherit the framework of OSLSM and apply slight changes to obtain better segmentation performance. These methods have the advantage that parameters trained on observed classes can be directly used to test unseen classes without finetuning. Nevertheless, they have some weaknesses: 1) the parameters of the two parallel networks are redundant, which makes the model prone to overfitting and wastes computational resources; 2) combining the features of support and query images by mere multiplication is inadequate for guiding the query images to learn high-quality segmentation masks.
To overcome the above-mentioned weaknesses, we propose a Similarity Guidance Network for One-Shot Semantic Segmentation (SG-One) in this paper. The fundamental idea of SG-One is to guide the segmentation process by effectively incorporating the pixel-wise similarities between the features of support objects and query images. In particular, we propose to apply the masked average pooling operation to extract the representative vector of the support images. The guidance maps are then produced by calculating cosine similarities between the representative vector and the features of query images at each pixel. The generated guidance maps supply the information about desired regions to the segmentation process. In detail, the position-wise feature vectors of query images are multiplied by the corresponding similarity values. Such a strategy effectively contributes to activating the target object regions of query images following the guidance of support images and their masks. Furthermore, we adopt a unified network for producing similarity guidance and predicting segmentation masks of query images. Such a network is more capable of generalizing to unseen classes than the previous methods [20, 17].
Our approach offers multiple appealing advantages over the previous state-of-the-art methods, e.g. OSLSM and co-FCN. First, OSLSM and co-FCN incorporate the segmentation masks of support images by changing the input structure of the network or the statistical distribution of input images. In contrast, we extract the representative vector from the intermediate feature maps with the masked average pooling operation instead of changing the inputs. Our approach alters neither the input structure of the network nor the statistics of the input data. More importantly, our method is more capable of incorporating contextual information: the entire image is fed into the network for learning features of desired regions, so contextual information is exploited thanks to the large receptive field. Averaging only over the object regions avoids the influence of the background. Otherwise, when the background pixels dominate, the learned features will be biased towards the background contents. Second, OSLSM and co-FCN directly multiply the representative vector with the feature maps of query images for predicting the segmentation masks. SG-One instead calculates the similarities between the representative vector and the features at each pixel of the query images, and the similarity maps are employed to guide the segmentation branch in finding the target object regions. Third, both OSLSM and co-FCN adopt a pair of VGGnet-16 networks for processing support and query images separately. We employ a unified network to process them simultaneously. The unified network uses far fewer parameters, which reduces the computational burden and increases the ability to generalize to new classes in testing.
The overview of SG-One is illustrated in Figure 1. We apply two branches, i.e. the similarity guidance branch and the segmentation branch, to produce the guidance maps and segmentation masks. The similarity guidance branch first extracts the representative features of the target object in the support image, and these features together with the features from the query image are then employed to produce the similarity guidance maps. The segmentation process is guided by the similarity maps, which are multiplied with the segmentation features of the query images. The two branches are connected by concatenating their feature maps, so they can exchange information during the forward and backward stages. After the training phase, the SG-One network can predict the segmentation masks of a new class without changing the parameters. For example, a query image of an unseen class, e.g. cow, is processed to discover the pixels belonging to the cow with only one annotated support image provided. The similarity maps are generated by calculating the cosine similarity between the representative vector of the support object and the features of the query image. We multiply the similarity maps element-wise with the query feature maps from the segmentation branch as an attention guidance. The segmentation branch can therefore precisely predict the object-related pixels with the assistance of the guidance.
To sum up, our main contributions are three-fold:
We propose to produce robust object-related representative vectors using masked average pooling for incorporating contextual information without changing the input structure of networks.
We produce the pixel-wise guidance using cosine similarities between representative vectors and query features for predicting the segmentation masks.
We propose a unified network for processing support and query images. Our network achieves a cross-validation mIoU of 46.3% on the PASCAL-$5^i$ dataset in the one-shot segmentation setting, which is a new state-of-the-art.
Object semantic segmentation (OSS) aims at classifying every pixel in a given image. OSS with dense annotations has achieved great success in precisely identifying various kinds of objects. FCN and U-Net abandon fully connected layers and propose to use only convolutional layers in order to preserve the relative positions of pixels. Building on the advantages of FCN, DeepLab, proposed by Chen et al. [4, 5], is one of the best algorithms for segmentation. It employs dilated convolution operations to increase the receptive field while saving parameters in comparison with large-kernel convolutions. He et al. show that segmentation masks and detection bounding boxes can be predicted simultaneously using a unified network.
Weakly supervised object segmentation seeks an alternative approach that reduces the expense of labeling segmentation masks. Zhou et al. and Zhang et al. [26, 27] propose to discover precise object regions using a classification network with only image-level labels. Wei et al. [25, 23] apply a two-stage mechanism to predict segmentation masks. Concretely, confident regions of objects and background are first extracted via the methods of Zhou et al. or Zhang et al. Then, a segmentation network, such as DeepLab, can be trained to segment the target regions. An alternative weakly supervised approach is to use scribble lines to indicate the rough positions of objects and background. Lin et al. and Tang et al. adopt spectral clustering methods to distinguish the object pixels according to the similarity of adjacent pixels and the ground-truth scribble lines.
Few-shot learning algorithms are dedicated to distinguishing the patterns of classes or objects with only a few labeled samples. Networks should generalize to recognize new objects from a few images based on the parameters of base models. The base models are trained on entirely different classes that do not overlap with the testing classes. Finn et al. try to learn internal transferable representations that are broadly applicable to various tasks. Vinyals et al. and Annadani et al. propose to learn embedding vectors such that vectors of the same category are close, while vectors of different categories are far apart.
Suppose we have three datasets: a training set $L_{train} = \{(I_i, Y_i)\}_{i=1}^{N_{train}}$, a support set $L_{support} = \{(I_i, Y_i)\}_{i=1}^{N_{support}}$ and a testing set $L_{test} = \{I_i\}_{i=1}^{N_{test}}$, where $I_i$ is an image, $Y_i$ is the corresponding segmentation mask and $N$ with the corresponding subscript is the number of images in each set. Both the support set and the training set have annotated segmentation masks. The support set and testing set share the same types of objects, which are disjoint from those of the training set. We denote $l \in Y$ as the semantic class of the mask $Y$. Therefore, we have $\{l_{train}\} \cap \{l_{support}\} = \emptyset$. If there are $K$ annotated images for each of $C$ new classes, the target few-shot problem is named $C$-way $K$-shot. Our purpose is to train a network on the training set $L_{train}$ that can precisely predict segmentation masks on the testing set $L_{test}$ with reference to the support set $L_{support}$.
In order to better learn the connection between the support and testing sets, we mimic this mechanism in the training process. For a query image $I_i$, we construct a pair $\{I_i, (I_j, Y_j)\}$ by randomly selecting a support image $I_j$ whose mask $Y_j$ has the same semantic class as $Y_i$. We estimate the segmentation mask $\hat{Y}_i$ with a function $\hat{Y}_i = f_\theta(I_i, (I_j, Y_j))$, where $\theta$ denotes the parameters of the function. In testing, $(I_j, Y_j)$ is picked from the support set $L_{support}$ and $I_i$ is an image from the testing set $L_{test}$.
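The pair construction used during training can be sketched as follows. This is a minimal illustration rather than the authors' data loader: the function names, the `(image_id, class_id)` sample format, and the choice to exclude the query itself from the support candidates are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_class_index(samples):
    """Group image ids by semantic class.

    `samples` is a list of (image_id, class_id) tuples (hypothetical format).
    """
    index = defaultdict(list)
    for image_id, class_id in samples:
        index[class_id].append(image_id)
    return index


def sample_pair(samples, index, rng):
    """Pick a query image, then a support image of the same class.

    Assumption: the support image must differ from the query image,
    so every class needs at least two annotated images.
    """
    query_id, class_id = rng.choice(samples)
    candidates = [i for i in index[class_id] if i != query_id]
    support_id = rng.choice(candidates)
    return query_id, support_id, class_id
```

At test time the same routine would draw the support image from the support set and the query from the testing set, with no parameter updates in between.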
In this section, we firstly present the masked average pooling operation for extracting the object-related representative vector of annotated support images. Then, the similarity guidance method is introduced for combining the representative vectors and features of query images. The generated similarity guidance maps supply the information for precisely predicting the segmentation masks.
Masked Average Pooling. The pairs of support images and their masks are usually encoded into representative vectors. OSLSM proposes to erase the background pixels from the support images by multiplying the support images with their binary masks. co-FCN proposes to construct an input block of five channels by concatenating the support images with their positive and negative masks. However, both methods have a disadvantage. First, setting the background pixels to zero changes the statistical distribution of the support image set. If we apply a unified network to process both the query images and the erased support images, the variance of the input data will greatly increase. Second, concatenating the support images with their masks breaks the input structure of the network, which also prevents the implementation of a unified network.
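We propose to employ masked average pooling for extracting the representative vectors of support objects. Suppose we have a support RGB image $I \in \mathbb{R}^{3 \times w \times h}$ and its binary segmentation mask $Y \in \{0, 1\}^{w \times h}$, where $w$ and $h$ are the width and height of the image. Let the output feature maps of $I$ be $F \in \mathbb{R}^{c \times w' \times h'}$, where $c$ is the number of channels and $w'$ and $h'$ are the width and height of the feature maps. We first resize the feature maps to the same size as the mask $Y$ via bilinear interpolation, and denote the resized feature maps as $F \in \mathbb{R}^{c \times w \times h}$. Then, the $i$-th element $v_i$ of the representative vector $v$ is computed by averaging the pixels within the object regions on the $i$-th feature map:

$$v_i = \frac{\sum_{x=1, y=1}^{w, h} Y_{x,y} \, F_{i,x,y}}{\sum_{x=1, y=1}^{w, h} Y_{x,y}} \quad (1)$$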
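As a rough illustration of masked average pooling, the sketch below averages each feature channel over the object pixels only. It is written in plain Python on nested lists for clarity and assumes the feature maps have already been resized to the mask resolution (the bilinear interpolation step is omitted); the function name is hypothetical.

```python
def masked_average_pooling(features, mask):
    """Average each channel over the pixels where mask == 1.

    features: c x h x w nested lists (already resized to the mask size).
    mask:     h x w nested lists containing 0 (background) or 1 (object).
    Returns a list of c values, one per channel.
    """
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("mask contains no object pixels")
    vector = []
    for channel in features:
        # Background pixels contribute zero because they are multiplied by 0.
        total = sum(
            value * m
            for feat_row, mask_row in zip(channel, mask)
            for value, m in zip(feat_row, mask_row)
        )
        vector.append(total / area)
    return vector
```

Because the mask is applied to intermediate features rather than to the input image, the network still sees the full image, including its context.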
As discussed in FCN, fully convolutional networks are able to preserve the relative positions of input pixels. Therefore, through masked average pooling, we expect to extract the features of object regions while disregarding the background contents. We also argue that feeding contextual regions into the network helps to learn better object features. This point has been discussed in DeepLab, which incorporates contextual information using dilated convolutions. Masked average pooling keeps the input structure of the network unchanged, which enables us to process both the support and query images within a unified network.
Similarity Guidance. One-shot semantic segmentation aims to segment the target object within query images given a support image of the reference object. As discussed above, the masked average pooling method is employed to extract the representative vector $v = (v_1, v_2, \dots, v_c)$ of the reference object, where $c$ is the number of channels. Suppose the feature maps of a query image $I^{que}$ are $F^{que} \in \mathbb{R}^{c \times w' \times h'}$. We employ the cosine similarity to measure the similarity between the representative vector $v$ and each pixel within $F^{que}$ following Eq. (2):

$$s_{x,y} = \frac{v \cdot F^{que}_{x,y}}{\|v\|_2 \, \|F^{que}_{x,y}\|_2} \quad (2)$$

where $s_{x,y}$ is the similarity value at the pixel $(x, y)$ and $F^{que}_{x,y} \in \mathbb{R}^{c}$ is the feature vector of the query image at the pixel $(x, y)$. As a result, the similarity map $S$ integrates the features of the support object and the query image. We use the map $S$ as guidance to teach the segmentation branch to discover the desired object regions. We do not explicitly optimize the cosine similarity. Instead, we multiply the similarity guidance map element-wise with the feature maps of query images from the segmentation branch, and then optimize the guided feature maps to fit the corresponding ground-truth masks.
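The similarity guidance step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, features are nested lists, and a small `eps` is added to the denominator to avoid division by zero (the text does not say how zero-norm features are handled).

```python
import math


def cosine_similarity_map(vector, features, eps=1e-8):
    """Cosine similarity between `vector` (length c) and every pixel of
    `features` (c x h x w). Returns an h x w map with values in [-1, 1]."""
    c = len(vector)
    h, w = len(features[0]), len(features[0][0])
    v_norm = math.sqrt(sum(a * a for a in vector))
    sim = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            pixel = [features[i][y][x] for i in range(c)]
            dot = sum(a * b for a, b in zip(vector, pixel))
            p_norm = math.sqrt(sum(b * b for b in pixel))
            sim[y][x] = dot / (v_norm * p_norm + eps)
    return sim


def apply_guidance(seg_features, sim):
    """Multiply every channel of `seg_features` (c' x h x w) by the
    similarity map `sim` (h x w) at each pixel."""
    return [
        [[val * s for val, s in zip(f_row, s_row)]
         for f_row, s_row in zip(channel, sim)]
        for channel in seg_features
    ]
```

Pixels whose features point in the same direction as the support vector keep their activations, while dissimilar pixels are suppressed or sign-flipped before the final classification layers.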
Figure 2 depicts the structure of our proposed model. SG-One includes three components, Stem, Similarity Guidance Branch, and Segmentation Branch. Different components have different structures and functionalities. Stem is a fully convolutional network for extracting intermediate features of both support and query images.
Similarity Guidance Branch is fed the extracted features of both query and support images. We apply this branch to produce the similarity guidance maps by combining the features of reference objects with the features of query images. For the features of support images, we implement three convolutional blocks to extract the highly abstract and semantic features, followed by a masked averaged pooling layer to obtain representative vectors. The extracted representative vectors of support images are expected to contain the high-level semantic features of a specific object. For the features of query images, we reuse the three blocks and employ the cosine similarity layer to calculate the closeness between the representative vector and the features at each pixel of the query images.
Segmentation Branch discovers the target object regions of query images with the guidance of the generated similarity maps. We employ three convolutional layers with a kernel size of $3 \times 3$ to obtain the features for segmentation. The inputs of the last two convolutional layers are concatenated with the parallel feature maps from Similarity Guidance Branch. Through the concatenation, Segmentation Branch can borrow features from the parallel branch, and the two branches can exchange information during the forward and backward stages. We fuse the generated features with the similarity guidance maps by multiplication at each pixel. Finally, the fused features are processed by two convolutional layers with kernel sizes of $3 \times 3$ and $1 \times 1$, followed by a bilinear interpolation layer. The network finally classifies each pixel as either the same class as the support image or background. We employ the cross-entropy loss function to optimize the network in an end-to-end manner.
One-Shot Testing One annotated support image for each unseen category is provided as guidance to segment the target semantic objects of query images. We do not need to finetune or change any parameters of the entire network. We only need to forward the query and support images through the network for generating the expected segmentation masks.
K-Shot Testing. Suppose there are $K$ support images for each new category. We propose to segment the query image using two approaches. The first one is to ensemble the segmentation masks corresponding to the $K$ support images, following OSLSM and co-FCN, based on Eq. (3):

$$\hat{Y}_{x,y} = \max\left(\hat{Y}^{1}_{x,y}, \hat{Y}^{2}_{x,y}, \dots, \hat{Y}^{K}_{x,y}\right) \quad (3)$$

where $\hat{Y}^{i}_{x,y}$ is the predicted semantic label of the pixel at $(x, y)$ corresponding to the support image $I_i$. The other approach is to average the $K$ representative vectors and then use the averaged vector to guide the segmentation process. Notably, we do not need to retrain the network using $K$-shot support images. We use the network trained in a one-shot manner to test the segmentation performance using $K$-shot support images.
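The two K-shot strategies can be sketched as follows. This is a minimal illustration with hypothetical function names: masks are binary nested lists and representative vectors are plain lists of floats.

```python
def fuse_masks(masks):
    """Pixel-wise maximum over K binary masks (each h x w).

    A pixel is foreground in the result if any of the K predictions
    marks it as foreground.
    """
    return [
        [max(values) for values in zip(*rows)]
        for rows in zip(*masks)
    ]


def average_vectors(vectors):
    """Element-wise mean of K representative vectors (each of length c)."""
    k = len(vectors)
    return [sum(values) / k for values in zip(*vectors)]
```

With the first strategy the network runs K times and only the outputs are merged; with the second it runs once, guided by a single averaged vector.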
Following the evaluation protocol of the previous methods OSLSM and co-FCN, we create PASCAL-$5^i$ using the PASCAL VOC 2012 dataset and the extended SDS dataset. For the 20 object categories in PASCAL VOC, we use cross-validation to evaluate the proposed model by sampling five classes as test categories (Table 1), where $i$ is the fold number, while the remaining 15 classes form the training label set. We follow the same procedure as the baseline methods, e.g. OSLSM, to build the training and testing sets. For a fair comparison, we use the same test set as OSLSM, which has 1,000 support-query tuples for each fold.
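The fold construction can be sketched as below. The split rule here is an assumption: the 20 PASCAL VOC classes are taken in alphabetical order and fold $i$ holds out the $i$-th contiguous block of five classes. This is consistent with the test classes reported for PASCAL-$5^1$ in the one-shot results (bus, car, cat, chair and cow), but the exact rule should be checked against Table 1. The function name is hypothetical.

```python
VOC_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def fold_classes(fold):
    """Return (train_classes, test_classes) for fold 0..3.

    Assumed rule: fold i holds out classes 5i+1 .. 5i+5 (1-based,
    alphabetical order); the remaining 15 classes are used for training.
    """
    if fold not in range(4):
        raise ValueError("fold must be 0, 1, 2 or 3")
    test = VOC_CLASSES[5 * fold: 5 * fold + 5]
    train = [name for name in VOC_CLASSES if name not in test]
    return train, test
```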
Suppose the predicted segmentation mask is $\hat{Y}$ and the corresponding ground-truth annotation is $Y$, given a specific class $l$. We define the Intersection over Union ($IoU_l$) of class $l$ as $IoU_l = \frac{TP_l}{TP_l + FP_l + FN_l}$, where $TP_l$, $FP_l$ and $FN_l$ are the numbers of true positives, false positives and false negatives of the predicted masks. The mIoU is the average of the IoUs over the different classes, i.e. $mIoU = \frac{1}{n_l} \sum_{l} IoU_l$, where $n_l$ is the number of testing classes. We report the averaged mIoU on the four cross-validation datasets.
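A small sketch of this metric is given below. It assumes that true positives, false positives and false negatives are accumulated over all test pairs of a class before the ratio is taken; the text does not state whether counts are pooled per class or averaged per image, so this pooling is an assumption, as are the function names.

```python
def class_iou(pairs):
    """IoU for one class from an iterable of (prediction, ground_truth)
    binary masks, pooling TP/FP/FN counts over all pairs."""
    tp = fp = fn = 0
    for pred, gt in pairs:
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                if p and g:
                    tp += 1
                elif p and not g:
                    fp += 1
                elif g and not p:
                    fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def mean_iou(pairs_by_class):
    """Average of per-class IoUs; `pairs_by_class` maps class -> pairs."""
    ious = [class_iou(pairs) for pairs in pairs_by_class.values()]
    return sum(ious) / len(ious)
```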
We implement the proposed approach based on the VGG-16 network, following the previous works [20, 17]. Stem takes RGB images as input to extract middle-level features and downsamples the images by a factor of 8. We use the first three blocks of the VGG-16 network as Stem. For the first two convolutional blocks of Similarity Guidance Branch, we adopt the structure of conv4 and conv5 of VGGnet-16 and remove the max-pooling layers to maintain the resolution of the feature maps. One $3 \times 3$ convolutional layer with 512 channels is added on top, without a ReLU after this layer. The following module is masked average pooling, which extracts the representative vector of support images. In Segmentation Branch, all of the convolutional layers with $3 \times 3$ kernels have 128 channels. The last layer, with a $1 \times 1$ kernel, has two channels corresponding to the categories of object and background. All of the convolutional layers except for the third and the last one are followed by a ReLU layer. We will justify this choice in section 4.5.
All input images remain the original size without any data augmentations. We implement the network using PyTorch. We train the network with the learning rate of. The batch size is , and the weight decay is 0.0005. We adopt the SGD optimizer with the momentum of 0.9. All networks are trained and tested on NVIDIA TITAN X GPUs with 12 GB memory. Our source code is available at https://github.com/xiaomengyc/SG-One.
|Methods|PASCAL-$5^0$|PASCAL-$5^1$|PASCAL-$5^2$|PASCAL-$5^3$|Mean|
Note: for the details of the baseline methods, e.g. 1-NN, LogReg and Siamese, refer to OSLSM. The results for co-FCN are from our re-implementation. Table 4 reports the evaluation results under an alternative metric for a fairer comparison.
One-shot. Table 2 compares the proposed SG-One approach with baseline methods in one-shot semantic segmentation. We observe that our method outperforms all baseline models. The mIoU of our approach averaged over the four divisions reaches 46.3%, which is better than co-FCN by 5.2% and OSLSM by 5.5%. Compared to the baselines, SG-One earns the largest gain of 7.8% on PASCAL-$5^1$, where the testing classes are bus, car, cat, chair and cow. co-FCN constructs the input of the support network by concatenating the support images with their positive and negative masks, and it obtains 41.1%. OSLSM feeds only the object pixels as input by masking out the background regions, and this method obtains 40.8%.
OSVOS  adopts a strategy of finetuning the network using the support samples in testing, and it achieves only 32.6%. To summarize, SG-One can effectively predict segmentation masks on new classes without changing the parameters. Our similarity guidance method is better than the baseline methods in incorporating the support objects for segmenting unseen objects.
Figure 3 shows the one-shot segmentation results of SG-One on unseen classes. We observe that SG-One can precisely distinguish the object regions from the background with the guidance of the support images, even if some support images and query images do not share much similarity in appearance. We also show some failure cases to benefit future research. We ascribe the failures to two causes: 1) the target object regions are too similar to background pixels, e.g. the side of the bus and the car; 2) the target region has features that differ strongly from the discovered discriminative regions, e.g. the vest of the dog, and may therefore be far from the representative feature of the support objects.
Five-shot Table 3 illustrates the five-shot segmentation results on the four divisions. As we have discussed, we apply two approaches to five-shot semantic segmentation. The approach of averaging the representative vectors from the five support images achieves 47.1% which significantly outperforms the current state-of-the-art, co-FCN, by 5.7%. This result is also better than the corresponding one-shot mIoU of 46.3%. Therefore, the averaged support vector has a better representativeness of the features in guiding the segmentation process. Another approach is to solely fuse the final segmentation results by combining all of the detected object pixels. We do not observe any improvement of this approach, comparing to the one-shot result. It is notable that we do not specifically train a new network for five-shot segmentation. The trained network in a one-shot way is directly applied to predict the five-shot segmentation results.
For a fairer comparison, we also evaluate the proposed model regarding the same metric in co-FCN  and PL+SEG . This metric firstly calculates the IoU of foreground and background, and then obtains the mean IoU of the foreground and background pixels. We still report the averaged mIoU on the four cross-validation datasets. Table 4 compares SG-One with the baseline methods regarding this metric in terms of one-shot and five-shot semantic segmentation. It is obvious that the proposed approach outperforms all previous baselines. SG-One achieves 63.1% of one-shot segmentation and 65.9% of five-shot segmentation, while the most competitive baseline method PL+SEG only obtains 61.2% and 62.3%. The proposed network is trained end-to-end, and our results do not require any pre-processing or post-processing steps.
We conduct experiments to verify the ability of SG-One in segmenting images with multiple classes. We randomly select 1000 entries of the query and support images. Query images may contain objects of multiple classes. For each entry, we sample five annotated images from the five testing classes as support images. For every query image, we predict its segmentation masks with the images of different support classes. We fuse the segmentation masks of the five classes by comparing the classification scores. The mIoU on the four datasets is 29.4%. The proposed SG-One approach can segment the objects of different unseen classes with the guidance of support images.
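One possible way to fuse the per-class predictions is sketched below. The text only states that masks are fused "by comparing the classification scores", so the concrete rule here is an assumption: each pixel takes the class with the highest foreground probability and falls back to background when that probability does not exceed a threshold. The function name, the `threshold` parameter and the dictionary input format are hypothetical.

```python
def fuse_multiclass(fg_prob, threshold=0.5, background="background"):
    """Assign each pixel the class with the highest foreground probability.

    fg_prob: dict mapping class name -> h x w map of foreground
             probabilities, one map per support class.
    Pixels whose best probability is not above `threshold` are
    labelled as background (assumed rule).
    """
    names = list(fg_prob)
    h = len(fg_prob[names[0]])
    w = len(fg_prob[names[0]][0])
    labels = []
    for y in range(h):
        row = []
        for x in range(w):
            best = max(names, key=lambda n: fg_prob[n][y][x])
            row.append(best if fg_prob[best][y][x] > threshold else background)
        labels.append(row)
    return labels
```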
Masked Average Pooling. The masked average pooling method employed in the proposed SG-One network is superior in incorporating the guidance masks of support images. Shaban et al. proposed to multiply the input support RGB images by the binary masks, so that the network extracts features of the target objects only. co-FCN, proposed by Rakelly et al., concatenates the support RGB images with the corresponding positive masks, i.e. object pixels are 1 while background pixels are 0, and negative masks, i.e. object pixels are 0 while background pixels are 1, constituting inputs of 5 channels. We follow the instructions of these two methods and compare them with our masked average pooling approach. Concretely, we first replace the masked average pooling layer with a global average pooling layer. Then, we implement two networks. 1) SG-One-masking adopts the method of OSLSM, in which support images are multiplied by the binary masks to keep only the object regions. 2) SG-One-concat adopts the method of co-FCN, in which we concatenate the positive and negative masks to the support images, forming an input with 5 channels. We add an extra input block (VGGnet-16) with 5 input channels to handle the concatenated inputs, while the rest of the network is exactly the same as in the compared networks.
Table 5 compares the performance of different methods for processing support images and masks. Our masked average pooling approach achieves the best results on every dataset. The mIoU over the four datasets is 46.3% using our method. The masking method (SG-One-masking) proposed in OSLSM obtains an mIoU of 45.0%. The approach of co-FCN (SG-One-concat) only obtains 41.75%, which we ascribe to the modification of the input structure of the network: the modified input block cannot benefit from the pre-trained weights for processing low-level information. We also implement a network that does not use the binary masks of the support images, and this network achieves an mIoU of 42.2%. In total, we conclude that 1) a suitable method of using support masks is crucial for extracting high-quality object features; 2) the proposed masked average pooling method provides a superior way to reuse the structure of a well-designed classification network for extracting object features of support pairs; 3) networks with 5-channel input cannot benefit from the pre-trained weights, and the extra input block cannot be jointly trained with the query images; 4) the masked average pooling layer has superior generalization ability in segmenting unseen classes.
Guidance Similarity Generating Methods. We adopt the cosine similarity to measure the closeness between the object feature vector and the feature maps of query images. The cosine similarity measures the angle between two vectors, and its range is $[-1, 1]$. Correspondingly, we remove the ReLU layers after the third convolutional layers of both the guidance and segmentation branches. By doing so, we increase the variance of the cosine measurement: the cosine similarity is no longer restricted to $[0, 1]$, as it would be for non-negative features, but spans $[-1, 1]$. For comparison, we add the ReLU layers after the third convolutional layers. The mIoU on the four datasets drops to 45.5%, compared with 46.3% for the non-ReLU approach.
We also train a network using the 2-norm distance as the guidance, and obtain 30.7% on the four datasets. This result is far poorer than that of the proposed cosine similarity method. Hence, the 2-norm distance is not a good choice for guiding the query images to discover target object regions.
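The effect of the ReLU on the range of the cosine similarity, discussed above, can be checked with a short sketch. It is only an illustration with hypothetical helper names: when both vectors pass through a ReLU, all components are non-negative, so their dot product cannot be negative and the similarity stays within the non-negative half of the range.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity of two equal-length vectors (eps avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def relu(u):
    """Element-wise max(0, x)."""
    return [max(0.0, a) for a in u]


u = [1.0, -1.0]
v = [-1.0, 1.0]
raw = cosine(u, v)                   # opposite directions: close to -1
clipped = cosine(relu(u), relu(v))   # [1, 0] vs [0, 1]: orthogonal, 0
```

Without the ReLU, dissimilar query pixels can receive negative guidance values, which is the wider spread of the cosine measurement that the ablation refers to.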
The Unified Structure We adopt the proposed unified structure between the guidance and segmentation branches. This structure can benefit each other during the forward and backward stages. We implement two networks for illustrating the effectiveness of this structure. First, we remove the first three convolutional layers of Segmentation Branch, and then multiply the guidance similarity maps directly to the feature maps from Similarity Guidance Branch. The final mIoU of the four datasets decreases to 43.1%. Second, we cut off the connections between the two branches by removing the first and second concatenation operations between the two branches. The final mIoU obtains 45.7%. Therefore, Segmentation Branch in our unified network is very necessary for getting high-quality segmentation masks. Also, Segmentation Branch can borrow some information via the concatenation operation between the two branches.
We also verify the functionality of the proposed unified network in the demand of computational resources and generalization ability. In Table 6, we observe that our SG-One model has only 19.0M parameters, while it achieves the best segmentation results. Following the methods in OSLSM  and co-FCN , we use a separate network (SG-One-separate) to process support images. This network has slightly more parameters (36.1M) than co-FCN(34.2M). The mIoU of SG-One-separate obtains 44.8%, and this result is far better than the 41.1% of co-FCN. This comparison shows that the approach we applied for incorporating the guidance information from support image pairs is superior to OSLSM and co-FCN in segmenting unseen classes. Surprisingly, the proposed unified network can even achieve higher performance of 46.3%. We attribute the gain of 1.5% to the reuse of the network in extracting support and query features. The reuse strategy not only reduces the demand of computational resources and decreases the risk of over-fitting, but offers the network more opportunities to see more training samples. OSLSM requires the most parameters (272.6M), whereas it has the lowest score.
We present that SG-One can effectively segment semantic pixels of unseen categories using only one annotated example. We abandon the previous strategy [20, 17] and propose the masked average pooling approach to produce more robust object-related representative features. Extensive experiments show the masked average pooling approach is more convenient and capable of incorporating contextual information to learn better representative vectors. We also reduce the risks of the overfitting problem by avoiding the utilization of extra parameters through a unified network. We propose that a well-trained network on images of a single class can be directly applied to segment multi-class images. We present a pure end-to-end network, which does not require any pre-processing or post-processing steps. More importantly, SG-One boosts the performance of one-shot semantic segmentation to a new state-of-the-art.
Proceedings of the IEEE International Conference on Computer Vision, pages 1635–1643, 2015.
ICML Deep Learning Workshop, volume 2, 2015.
Learning Deep Features for Discriminative Localization. IEEE CVPR, 2016.