This paper is about few-shot segmentation of foreground objects in images. As Fig. 1 shows, given only a few training examples – called support images – and their ground-truth segmentation of the target object class, our goal is to segment the target class in the query image. This problem is challenging, because the support and query images may significantly differ in the number of instances and 3D poses of the target class, as illustrated in Fig. 1. This important problem arises in many applications dealing with scarce training examples of target classes.
Recently, prior work has addressed this problem by training an object segmenter on a large training set, under the few-shot constraint [26, 6, 21]. The training set is split into many small subsets. In every subset, one image serves as the query and the other(s) as the support image(s) with known ground truth(s). As shown in Fig. 1, their framework uses a CNN, e.g., VGG [27] or ResNet [12], for extracting feature maps from the support and query images. The support's feature maps are first pooled over the known ground-truth foreground. Then, the support's masked-pooled features are used to estimate a cosine similarity map with the query's features. The resulting similarity map and the query's features are finally passed to a few convolutional layers in order to segment the target object class in the query. The incurred loss between the prediction and the query's ground truth is used for the CNN's training.
The above framework has two critical limitations which we address in this paper. First, we experimentally found that the CNN has a tendency to learn non-discriminative features with high activations for different classes. To address this issue, as Fig. 2 shows, our first contribution extends prior work by efficiently estimating feature relevance so as to encourage that their activations are high inside the ground-truth locations of the target class, and low elsewhere in the image. This is formulated as an optimization problem, for which we derive a closed-form solution.
Second, learning from few support images is prone to overfitting and poor generalization to the query in the face of the aforementioned large variations of the target class. To address this issue, as Fig. 3 shows, our second contribution is a new boosted inference, motivated by the traditional ensemble learning methods which are robust to overfitting [9, 10]. We specify an ensemble of experts, where each expert adapts the features initially extracted from the support image. This feature adaptation is guided by the gradient of the loss incurred when segmenting the support image relative to its provided ground truth. The ensemble of experts produces the corresponding ensemble of object segmentations of the query image, whose weighted average is taken as our final prediction. Importantly, while we use the first contribution in both training and testing, our second contribution, similar to the traditional ensemble learning methods, is applied only in testing for boosting the performance of our CNN-based segmenter.
For the $K$-shot setting, both contributions are naturally extended for segmenting the query image by jointly analyzing the provided support images and their ground truths, rather than treating the support images independently as in prior work.
For evaluation, we compare with prior work on the benchmark PASCAL-$5^i$ dataset [26]. Our results demonstrate that we significantly outperform the state of the art. In addition, we perform evaluation on the larger and more challenging COCO-$20^i$ dataset [16]. To the best of our knowledge, we are the first to report results of few-shot object segmentation on COCO-$20^i$.
2 Related Work
This section reviews related work on few-shot image classification and semantic segmentation.
Few-shot classification predicts image class labels with access to few training examples. Prior work can be broadly divided into three groups: transfer learning of models trained on classes similar to the target classes [28, 22, 29, 30], meta-learning approaches that learn how to effectively learn new classes from small datasets [8, 18], and generative approaches aimed at data augmentation [31, 25].
Semantic segmentation labels all pixels in the image. Recently, significant advances have been made by using the fully convolutional network (FCN) [17] and its variants, including SegNet [1], UNet [23], RefineNet [15], PSPNet [33], and DeepLab v1, v2 [2, 3], v3 [4], v3+ [5], all of which are usually evaluated on the PASCAL VOC 2012 [7] and MSCOCO [16] datasets. However, these approaches typically require very large training sets, which limits their applicability across a wide range of domains.
Few-shot semantic segmentation labels pixels of the query image that belong to a target object class, conditioned by the ground-truth segmentation masks of a few support images. Prior work typically draws from the above mentioned approaches to few-shot image classification and semantic segmentation. For example, the one-shot learning method OSLSM [26] and its extensions, namely Co-FCN [21, 20], PL+SEG [6], and SG-One [32], consist of the conditioning and segmentation branches implemented as VGG [27] and FCN-32s [17], respectively. The conditioning branch analyzes the target class in the support image, and conditions the segmentation branch for object segmentation in the query image. Co-FCN improves OSLSM by segmenting the query image based on a concatenation of pooled features from the support image and feature maps from the query image. PL+SEG first estimates a distance between the query's feature maps and prototypes predicted from the support image, and then labels pixels in the query image with the same class as their nearest neighbor prototypes. SG-One also estimates similarity between a pooled feature from the support image and feature maps of the query for predicting the query's segmentation. Our approach extends SG-One with the two contributions specified in the next section.
3 Our Approach
An object class is represented by $K$ support images with ground-truth segmentation masks. Given a query image showing the same object class in the foreground, our goal is to predict the foreground segmentation mask. For $K=1$, this problem is called one-shot semantic segmentation. Below, and in the following Sec. 3.1 and Sec. 3.2, we consider the one-shot setting, for simplicity. Then, we discuss the $K$-shot setting, $K>1$, in Sec. 3.3.
Given a large set of training images showing various object classes and the associated ground-truth segmentation masks, our approach follows the common episodic training strategy. In each training episode, we randomly sample a pair of support and query images, $x_s$ and $x_q$, with binary segmentation masks, $m_s$ and $m_q$, of the target object class in the foreground. Elements of the mask $m_s$ are set to 1, $m_{s,i}=1$, for pixels $i$ occupied by the target class; otherwise, $m_{s,i}=0$. The same holds for $m_q$. We use $m_s$ to condition the target class in the query image.
The standard cross-entropy loss, $L(\hat{m}_q, m_q)$, between the binary ground truth $m_q$ and the predicted query mask $\hat{m}_q$, is used for the end-to-end training of the parameters $\theta$ of our deep architecture, $\theta^* = \arg\min_\theta L(\hat{m}_q, m_q)$.
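As a minimal sketch of the per-pixel loss described above, the following pure-Python function computes the mean cross-entropy between predicted foreground probabilities and a binary mask. The function name `cross_entropy`, the flat-list mask layout, and the clamping constant `eps` are illustrative assumptions, not part of the original implementation.

```python
import math

def cross_entropy(pred_fg, mask, eps=1e-12):
    """Mean per-pixel cross-entropy between predicted foreground
    probabilities and a binary ground-truth mask (both flat lists)."""
    total = 0.0
    for p, m in zip(pred_fg, mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(m * math.log(p) + (1 - m) * math.log(1.0 - p))
    return total / len(mask)

mask = [1, 1, 0, 0]                                   # toy 4-pixel ground truth
good = cross_entropy([0.9, 0.8, 0.1, 0.2], mask)      # prediction close to the mask
bad = cross_entropy([0.4, 0.3, 0.6, 0.7], mask)       # prediction far from the mask
```

A prediction that agrees with the mask yields a lower loss than one that disagrees, which is the signal used to train the network end to end.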
3.1 Training of Our Deep Architecture
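Fig. 2 shows the episodic training of a part of our deep architecture (without our second contribution, which is not trained) on a pair of support and query images. We first use a CNN to extract feature maps $F_s \in \mathbb{R}^{d \times w \times h}$ from $x_s$, and feature maps $F_q \in \mathbb{R}^{d \times w \times h}$ from $x_q$, where $d$ is the feature dimensionality, and $w$ and $h$ denote the width and height of the feature map. Then, we average $F_s$ over the known foreground locations in $m_s$, resulting in the average class feature vector $f_s$ of the support image. For this masked feature averaging, $m_s$ is down-sampled to $\tilde{m}_s \in \{0,1\}^{w \times h}$ with the size of the feature maps, and $f_s$ is estimated as
$$f_s = \frac{1}{|\tilde{m}_s|} \sum_{i=1}^{wh} F_{s,i}\, \tilde{m}_{s,i}, \tag{1}$$
where $|\tilde{m}_s| = \sum_i \tilde{m}_{s,i}$ is the number of foreground locations in $\tilde{m}_s$. Next, we compute the cosine similarity between $f_s$ and every feature vector $F_{q,i}$ from the query feature maps $F_q$. This gives a similarity map, $\sigma_{qs} \in \mathbb{R}^{w \times h}$, between the support and query image:
$$\sigma_{qs,i} = \cos(f_s, F_{q,i}) = \frac{f_s^\top F_{q,i}}{\|f_s\|_2\, \|F_{q,i}\|_2}, \quad i = 1, \dots, wh. \tag{2}$$
We expect that $\sigma_{qs}$ provides informative cues for object segmentation, as high values of $\sigma_{qs,i}$ indicate likely locations of the target class in the query image.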
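The two steps above, masked feature averaging and the cosine similarity map, can be sketched in plain Python as follows. This is an illustration under stated assumptions: feature maps are flattened into a list of $d$-dimensional vectors (one per location), the mask is assumed to contain at least one foreground location, and the names `masked_average`, `cosine`, and `similarity_map` are hypothetical.

```python
import math

def masked_average(features, mask):
    """Average the feature vectors at locations where mask == 1.
    Assumes the mask has at least one foreground location."""
    n_fg = sum(mask)
    d = len(features[0])
    f = [0.0] * d
    for vec, m in zip(features, mask):
        if m:
            for c in range(d):
                f[c] += vec[c] / n_fg
    return f

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two vectors (eps guards against zero norms)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def similarity_map(f_s, query_features):
    """One similarity value per query location."""
    return [cosine(f_s, vec) for vec in query_features]

# Toy support image: 4 locations, 2 feature channels, first two are foreground.
support = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9]]
support_mask = [1, 1, 0, 0]
query = [[0.9, 0.1], [0.0, 1.0]]          # a foreground-like and a background-like location

f_s = masked_average(support, support_mask)
sigma = similarity_map(f_s, query)
```

In this toy case the first query location is aligned with the pooled support feature and receives a similarity close to 1, while the second receives a much lower value.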
Before we explain how to finally predict $\hat{m}_q$ from $\sigma_{qs}$ and $F_q$, in the following, we specify our first technical contribution aimed at extending the above framework.
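Contribution 1: Feature Weighting. For learning more discriminative features of the target class from a single (or very few) support image(s), we introduce a regularization that encourages high feature activations on the foreground and simultaneously low feature activations on the background of the support image. This is formalized as an optimization problem for maximizing a sum of relevant differences between feature activations. Let $\phi_s \in \mathbb{R}^d$ denote a vector of feature differences normalized over the foreground and background areas in the segmentation map of the support image, with $F_{s,i}$, $\tilde{m}_s$, and $|\tilde{m}_s|$ as in (1):
$$\phi_s = \sum_{i=1}^{wh} F_{s,i} \left[ \frac{\tilde{m}_{s,i}}{|\tilde{m}_s|} - \frac{1 - \tilde{m}_{s,i}}{wh - |\tilde{m}_s|} \right]. \tag{3}$$
The relevance $r \in \mathbb{R}^d$ of features in $\phi_s$ is estimated by maximizing a sum of the feature differences:
$$\max_{r}\; \phi_s^\top r, \quad \text{s.t. } \|r\|_2 = 1. \tag{4}$$
The problem in (4) has a closed-form solution:
$$r^* = \frac{\phi_s}{\|\phi_s\|_2}. \tag{5}$$
We use the estimated feature relevance $r^*$ when computing the similarity map between the support and query images. Specifically, we modify the cosine similarity between $f_s$ and $F_{q,i}$, given by (2), as
$$\sigma^*_{qs,i} = \cos(f_s, F_{q,i}, r^*) = \frac{(f_s \odot r^*)^\top (F_{q,i} \odot r^*)}{\|f_s \odot r^*\|_2\, \|F_{q,i} \odot r^*\|_2}, \tag{6}$$
where $\odot$ is the element-wise product between two vectors.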
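The closed-form solution can be justified in one line. The sketch below assumes that the relevance vector, written here as $r$, is constrained to unit $\ell_2$ norm and that the difference vector, written here as $\phi_s$, is nonzero; both symbols are notation assumed for this illustration.

```latex
% Cauchy-Schwarz bound on the objective under the unit-norm constraint:
\phi_s^\top r \;\le\; \|\phi_s\|_2 \, \|r\|_2 \;=\; \|\phi_s\|_2 ,
% with equality if and only if r is a positive multiple of \phi_s, hence
r^* \;=\; \frac{\phi_s}{\|\phi_s\|_2},
\qquad
\phi_s^\top r^* \;=\; \|\phi_s\|_2 .
```

Channels whose mean activation differs most between foreground and background therefore receive the largest relevance magnitude.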
Note that we account for feature relevance in both training and testing. As $r^*$ has a closed-form solution, it can be computed very efficiently. Also, the modification of the similarity map in (6) is simple and cheap to implement in modern deep learning frameworks.
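To illustrate how little computation the feature weighting needs, here is a pure-Python sketch of the relevance estimate and the relevance-weighted cosine similarity. It assumes flattened feature maps (a list of vectors), a mask with at least one foreground and one background location, and a nonzero difference vector; the names `feature_relevance` and `weighted_cosine` and the toy data are hypothetical.

```python
import math

def feature_relevance(features, mask):
    """Unit-norm relevance: normalized difference between the mean foreground
    and the mean background activation of every channel."""
    d = len(features[0])
    n_fg = sum(mask)
    n_bg = len(mask) - n_fg
    phi = [0.0] * d
    for vec, m in zip(features, mask):
        w = 1.0 / n_fg if m else -1.0 / n_bg
        for c in range(d):
            phi[c] += w * vec[c]
    norm = math.sqrt(sum(p * p for p in phi))
    return [p / norm for p in phi]

def cosine(a, b, eps=1e-12):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def weighted_cosine(f_s, f_q, r):
    """Cosine similarity after element-wise weighting of both vectors by r."""
    return cosine([x * w for x, w in zip(f_s, r)],
                  [y * w for y, w in zip(f_q, r)])

# Channel 0 fires on foreground, channel 1 on background, channel 2 everywhere.
support = [[1.0, 0.0, 1.0], [0.8, 0.2, 1.0], [0.0, 1.0, 1.0], [0.2, 0.8, 1.0]]
mask = [1, 1, 0, 0]
r = feature_relevance(support, mask)

f_s = [0.9, 0.1, 1.0]                     # masked average of the two foreground vectors
q_fg, q_bg = [0.9, 0.1, 1.0], [0.1, 0.9, 1.0]
plain_bg = cosine(f_s, q_bg)              # inflated by the non-discriminative channel 2
weighted_bg = weighted_cosine(f_s, q_bg, r)
weighted_fg = weighted_cosine(f_s, q_fg, r)
```

In this toy example the non-discriminative channel gets zero relevance, so the similarity of the background location drops while the foreground location keeps a similarity close to 1.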
As shown in Fig. 2, in the final step of our processing, we concatenate $\sigma^*_{qs}$ and $F_q$ together, and pass them to a network with only two convolutional layers for predicting $\hat{m}_q$.
3.2 Contribution 2: Feature Boosting
In testing, the CNN must address a new object class which has not been seen in training. To improve generalization to the new class, in testing, we use a boosted inference, our second contribution, inspired by gradient boosting [10]. Alg. 1 summarizes our boosted inference in testing. Note that, in testing, the parameters of the CNN and the convolutional layers remain fixed to their trained values.
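As shown in Fig. 3, given a support image with ground truth $m_s$ and a query image, in testing, we predict not only the query mask $\hat{m}_q$, but also the support mask $\hat{m}_s$ using the same deep architecture as specified in Sec. 3.1. $\hat{m}_s$ is estimated in two steps. First, we compute a similarity map, $\sigma^*_{ss}$, as a dot product between $f_s$ and $F_s$, as in (6). Second, we pass $\sigma^*_{ss}$ and $F_s$ to the two-layer convolutional network for predicting $\hat{m}_s$. We then estimate the standard cross-entropy loss $L(\hat{m}_s, m_s)$, and iteratively update the average class features as
$$f_s^{n+1} = f_s^{n} - \nu\, \frac{\partial L(\hat{m}_s^{n}, m_s)}{\partial f_s^{n}}, \tag{7}$$
where $\nu$ is the learning rate. $f_s^n$, $n = 1, \dots, N$, are experts that we use for predicting the corresponding query masks $\hat{m}_q^n$, by first estimating the similarity map $\sigma^{*n}_{qs}$, $n = 1, \dots, N$, as in (6), and then passing $\sigma^{*n}_{qs}$ and $F_q$ to the two-layer network for computing $\hat{m}_q^n$.
Finally, we fuse the ensemble $\{\hat{m}_q^n\}$ into the final segmentation, $\hat{m}_q$, as
$$\hat{m}_q = \sum_{n=1}^{N} \rho^n\, \hat{m}_q^n, \tag{8}$$
where $\rho^n$ denotes our estimate of the expert's confidence in correctly segmenting the target class, computed as the intersection-over-union score between $\hat{m}_s^n$ and $m_s$:
$$\rho^n = \text{IoU}(\hat{m}_s^n, m_s). \tag{9}$$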
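The boosted inference can be sketched end to end on a toy problem. The code below is not the architecture of this paper: as a loudly labeled assumption, a logistic model on the dot product between the class vector and each location's features stands in for the similarity map plus two-layer convolutional network, so that the loss gradient has a simple analytic form. The names `predict`, `loss_gradient`, `iou`, and `boosted_inference`, the threshold of 0.5, and the normalization of the fused map by the sum of weights are all illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(f, features):
    """Toy segmenter (assumption): foreground probability per location."""
    return [sigmoid(sum(a * b for a, b in zip(f, vec))) for vec in features]

def loss_gradient(f, features, mask):
    """Gradient of the mean cross-entropy loss w.r.t. f for the toy segmenter."""
    probs = predict(f, features)
    grad = [0.0] * len(f)
    for p, m, vec in zip(probs, mask, features):
        for c in range(len(f)):
            grad[c] += (p - m) * vec[c] / len(mask)
    return grad

def iou(probs, mask, thr=0.5):
    """Intersection-over-union between a thresholded prediction and a mask."""
    pred = [1 if p > thr else 0 for p in probs]
    inter = sum(1 for a, b in zip(pred, mask) if a and b)
    union = sum(1 for a, b in zip(pred, mask) if a or b)
    return inter / union if union else 0.0

def boosted_inference(f_init, F_s, m_s, F_q, n_experts=5, lr=1.0):
    f = list(f_init)
    fused = [0.0] * len(F_q)
    total_rho = 0.0
    for _ in range(n_experts):
        rho = iou(predict(f, F_s), m_s)            # expert confidence on the support
        for i, p in enumerate(predict(f, F_q)):    # this expert's query prediction
            fused[i] += rho * p                    # confidence-weighted fusion
        total_rho += rho
        g = loss_gradient(f, F_s, m_s)
        f = [a - lr * b for a, b in zip(f, g)]     # gradient step gives the next expert
    return [v / total_rho if total_rho else 0.0 for v in fused]

F_s = [[1.0, -0.5], [0.8, -0.5], [-1.0, -0.5], [-0.8, -0.5]]
m_s = [1, 1, 0, 0]
F_q = [[0.9, -0.5], [-0.9, -0.5]]
fused = boosted_inference([0.9, -0.5], F_s, m_s, F_q)
```

Each expert is one gradient step away from the previous one on the support loss, and experts that segment the support image poorly contribute less to the fused query prediction.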
3.3 $K$-shot Setting
When the number of support images $K > 1$, prior work [21, 20, 6, 32] predicts $\hat{m}_q^k$ for each support image independently, and then estimates $\hat{m}_q$ as an average over these predictions, $\hat{m}_q = \frac{1}{K} \sum_{k=1}^{K} \hat{m}_q^k$. In contrast, our two contributions can be conveniently extended to the $K$-shot setting so as to further improve our robustness, beyond the standard averaging over independent segmentations of the query.
Our contribution 1 is extended by estimating relevance $r \in \mathbb{R}^d$ of a more general difference vector of feature activations defined as
$$\phi = \sum_{k=1}^{K} \sum_{i=1}^{wh} F^k_{s,i} \left[ \frac{\tilde{m}^k_{s,i}}{|\tilde{m}^k_s|} - \frac{1 - \tilde{m}^k_{s,i}}{wh - |\tilde{m}^k_s|} \right]. \tag{10}$$
Similar to (4) and (5), the optimal feature relevance has a closed-form solution $r^* = \phi / \|\phi\|_2$. Note that we estimate $r^*$ jointly over all $K$ support images, rather than as an average of independently estimated feature relevances for each support image. We expect the former (i.e., our approach) to be more robust than the latter.
Our contribution 2 is extended by having a more robust update of $f^{n+1}$ than in (7):
$$f^{n+1} = f^{n} - \nu\, \frac{\partial \sum_{k=1}^{K} L(\hat{m}^k_s, m^k_s)}{\partial f^{n}}, \tag{11}$$
where $L(\hat{m}^k_s, m^k_s)$ is the cross-entropy loss incurred for predicting the segmentation mask $\hat{m}^k_s$ using the unique vector $f^n$ given by (11) for every support image $k$, as explained in Sec. 3.2. Importantly, we do not generate $K$ independent ensembles of experts for each of the $K$ support images. Rather, we estimate a single ensemble of experts more robustly over all $K$ support images, starting with the initial expert $f^1$.
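The difference between the joint relevance estimate and the average of independent estimates can be seen in a small pure-Python sketch. It assumes flattened feature maps, masks with at least one foreground and one background location, and nonzero difference vectors; the function names and the two toy support images are hypothetical.

```python
import math

def difference_vector(features, mask):
    """Mean foreground activation minus mean background activation per channel."""
    d = len(features[0])
    n_fg = sum(mask)
    n_bg = len(mask) - n_fg
    phi = [0.0] * d
    for vec, m in zip(features, mask):
        w = 1.0 / n_fg if m else -1.0 / n_bg
        for c in range(d):
            phi[c] += w * vec[c]
    return phi

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def joint_relevance(supports):
    """Sum the difference vectors of all support images, then normalize once."""
    d = len(supports[0][0][0])
    phi = [0.0] * d
    for features, mask in supports:
        for c, value in enumerate(difference_vector(features, mask)):
            phi[c] += value
    return normalize(phi)

def averaged_relevance(supports):
    """Baseline for comparison: average of per-image unit-norm relevances."""
    rs = [normalize(difference_vector(f, m)) for f, m in supports]
    return [sum(r[c] for r in rs) / len(rs) for c in range(len(rs[0]))]

# Support A shows a strong contrast on channel 0; support B a weak one on channel 1.
support_a = ([[1.0, 0.0], [0.0, 0.0]], [1, 0])
support_b = ([[0.1, 0.1], [0.1, 0.0]], [1, 0])
joint = joint_relevance([support_a, support_b])
averaged = averaged_relevance([support_a, support_b])
```

Here the joint estimate is dominated by the support image with the strong foreground/background contrast, whereas averaging independently normalized relevances gives the weak, possibly noisy, contrast of the second image the same say.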
4 Implementation Details and Complexity
Implementation. Our CNN has the last two convolutional layers modified so as to have the stride equal to 1 instead of 2 in the original networks. This is combined with dilated convolutions with rates 2 and 4, respectively, to enlarge the receptive field. So the final feature maps of our network are 1/8 of the input image size. For the two-layer convolutional network (Conv) aimed at producing the final segmentation, we use a convolution with ReLU and 128 channels, and a convolution with 2 output channels, background and foreground. It is worth noting that we do not use a CRF as a common post-processing step [14].
For implementation, we use PyTorch [19]. Following the baselines [32, 21, 20], we pretrain the CNN on ImageNet [24]. Training images are resized to , while keeping the original aspect ratio. All test images keep their original size. Training is done with SGD, learning rate , batch size 8, and in 10,000 iterations. For contribution 2, the number of experts $N$ is analyzed in Sec. 5. For updating $f_s^n$ in (7), we use the Adam optimizer [13] with .
Complexity. In training, prior work [32, 21, 20]: (1) uses a CNN with complexity $O(\text{CNN})$ for extracting features from the support and query images, (2) computes the similarity map with complexity $O(d\,w\,h)$, and (3) additionally uses a convolutional network for segmenting the query with complexity $O(\text{Conv})$. Note that $O(\text{Conv}) = O(d\,w\,h)$, as both the similarity map and the convolutions in the Conv network are computed over feature maps with the size $d \times w \times h$. Also, note that $O(d\,w\,h)$ is significantly smaller than $O(\text{CNN})$. Our contribution 1 additionally computes the feature relevance using the closed-form solution with a linear complexity in the size of the feature maps, $O(d\,w\,h)$. Therefore, our total training complexity is equal to that of prior work [32, 21, 20]: $O(\text{Train}) = O(\text{CNN}) + O(d\,w\,h)$.
In testing, the complexity of prior work [32, 21, 20] is the same as $O(\text{Train})$. Our contribution 2 increases complexity in testing by additionally estimating the ensemble of $N$ segmentations of the query image. Therefore, in testing, our complexity is $O(\text{Test}) = O(\text{CNN}) + (1+N)\,O(d\,w\,h)$. Thus, in testing, we increase only the smaller term of the total complexity. For small $N$, we have that the first term still dominates the total complexity. As we show in Sec. 5, for , we significantly outperform the state of the art, which justifies our slight increase in testing complexity.
Datasets. For evaluation, we use two datasets: (a) PASCAL-$5^i$, which combines images from the PASCAL VOC 2012 [7] and Extended SDS [11] datasets; and (b) COCO-$20^i$, which is based on the MSCOCO dataset [16]. For PASCAL-$5^i$, we use the same 4-fold cross-validation setup as prior work [26, 21, 6]. Specifically, from the 20 object classes in PASCAL VOC 2012, for each fold $i = 0, \dots, 3$, we sample five as test classes, and use the remaining 15 classes for training. Tab. 1 specifies our test classes for each fold of PASCAL-$5^i$. As in [26], in each fold $i$, we use support-query pairs of test images sampled from the selected five test classes.
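Table 1: Test classes for each fold of PASCAL-$5^i$.
| Fold | Test classes |
|---|---|
| PASCAL-$5^0$ | aeroplane, bicycle, bird, boat, bottle |
| PASCAL-$5^1$ | bus, car, cat, chair, cow |
| PASCAL-$5^2$ | diningtable, dog, horse, motorbike, person |
| PASCAL-$5^3$ | potted plant, sheep, sofa, train, tv/monitor |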
|33||Sports ball||34||Kite||35||B. bat||36||B. glove|
We create COCO-$20^i$ for evaluation on a more challenging dataset than PASCAL-$5^i$, since MSCOCO has 80 object classes and its ground-truth segmentation masks have lower quality than those in PASCAL VOC 2012. To the best of our knowledge, no related work has reported one-shot object segmentation on MSCOCO. For evaluation on COCO-$20^i$, we use 4-fold cross-validation. From the 80 object classes in MSCOCO, for each fold $i = 0, \dots, 3$, we sample 20 as test classes, and use the remaining 60 classes for training. Tab. 2 specifies our test classes for each fold of COCO-$20^i$. In each fold, we sample 1000 support-query pairs of test images from the selected 20 test classes.
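The PASCAL fold structure listed in Tab. 1 corresponds to consecutive groups of five classes, which can be sketched as below. The class list is copied from Tab. 1; the helper name `fold_split` is hypothetical, and the sketch makes no claim about how the COCO folds are assigned.

```python
PASCAL_CLASSES = [
    "aeroplane", "bicycle", "bird", "boat", "bottle",
    "bus", "car", "cat", "chair", "cow",
    "diningtable", "dog", "horse", "motorbike", "person",
    "potted plant", "sheep", "sofa", "train", "tv/monitor",
]

def fold_split(classes, fold, n_folds=4):
    """Return (train_classes, test_classes) for one cross-validation fold,
    taking consecutive groups of len(classes) // n_folds classes as test classes."""
    per_fold = len(classes) // n_folds
    test = classes[fold * per_fold:(fold + 1) * per_fold]
    train = [c for c in classes if c not in test]
    return train, test

train_classes, test_classes = fold_split(PASCAL_CLASSES, fold=1)
```

For fold 1 this yields the five test classes of the second row of Tab. 1 and the remaining 15 classes for training.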
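Table 4: Mean IoU of one-shot segmentation on PASCAL-$5^i$.
| Backbone | Method | PASCAL-$5^0$ | PASCAL-$5^1$ | PASCAL-$5^2$ | PASCAL-$5^3$ | Mean |
|---|---|---|---|---|---|---|
| VGG 16 | OSLSM [26] | 33.60 | 55.30 | 40.90 | 33.50 | 40.80 |
| VGG 16 | B + C1 | 43.34 | 56.72 | 50.64 | 44.01 | 48.68 |
| VGG 16 | B + C2 | 46.49 | 60.27 | 51.45 | 46.67 | 51.22 |
| VGG 16 | B + C1 + C2 | 47.04 | 59.64 | 52.61 | 48.27 | 51.90 |
| ResNet 101 | B + C1 | 46.06 | 61.22 | 54.90 | 48.65 | 52.71 |
| ResNet 101 | B + C2 | 47.46 | 63.76 | 54.11 | 51.50 | 54.21 |
| ResNet 101 | B + C1 + C2 | 51.30 | 64.49 | 56.71 | 52.24 | 56.19 |

Table 5: Mean IoU of five-shot segmentation on PASCAL-$5^i$.
| Backbone | Method | PASCAL-$5^0$ | PASCAL-$5^1$ | PASCAL-$5^2$ | PASCAL-$5^3$ | Mean |
|---|---|---|---|---|---|---|
| VGG 16 | OSLSM [26] | 35.90 | 58.10 | 42.70 | 39.10 | 43.95 |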
Metrics. As in [26, 21, 6], we use the mean intersection-over-union (mIoU) for quantitative evaluation. IoU of class $l$ is defined as $\text{IoU}_l = \frac{TP_l}{TP_l + FP_l + FN_l}$, where $TP_l$, $FP_l$ and $FN_l$ are the number of pixels that are true positives, false positives and false negatives of the predicted segmentation masks, respectively. The mIoU is an average of the IoUs of different classes, $\text{mIoU} = \frac{1}{n_l} \sum_{l} \text{IoU}_l$, where $n_l$ is the number of test classes. We report the mIoU averaged over the four folds of cross-validation.
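The metric can be sketched in a few lines of Python. As an assumption of this illustration, true-positive, false-positive and false-negative pixel counts are accumulated over all predicted masks of a class before the ratio is taken; the names `class_iou` and `mean_iou` and the toy masks are hypothetical.

```python
def class_iou(pred_masks, gt_masks):
    """IoU of one class from pixel counts accumulated over all of its test pairs."""
    tp = fp = fn = 0
    for pred, gt in zip(pred_masks, gt_masks):
        for p, g in zip(pred, gt):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    denom = tp + fp + fn
    return tp / denom if denom else 0.0

def mean_iou(results):
    """results maps a class name to (predicted masks, ground-truth masks)."""
    ious = [class_iou(pred, gt) for pred, gt in results.values()]
    return sum(ious) / len(ious)

results = {
    "bus": ([[1, 1, 0, 0]], [[1, 0, 0, 0]]),   # one true positive, one false positive
    "cat": ([[0, 1, 1]], [[0, 1, 1]]),         # perfect prediction
}
miou = mean_iou(results)
```

With these toy masks the two class IoUs are 0.5 and 1.0, giving an mIoU of 0.75.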
Baselines and ablations. The baseline B denotes the framework of Sec. 3.1 before our contribution 1. We also consider several ablations of our approach: B+C1 extends the baseline B with contribution 1 only; B+C2 extends the baseline B with contribution 2 only; and B+C1+C2 represents our full approach. These ablations are aimed at testing the effect of each of our contributions on performance. In addition, we consider two alternative neural networks, VGG 16 and ResNet 101, as the CNN for extracting image features. VGG 16 has also been used in prior work [26, 21, 6]. We also compare with an approach called Upper-bound that represents a variant of our full approach B+C1+C2 trained such that both training and testing datasets consist of the same classes. As Upper-bound does not encounter new classes in testing, it represents an upper bound of our full approach. Finally, in the $K$-shot setting, we consider another baseline called Average. It represents our full approach B+C1+C2 that first independently predicts segmentations of the query image for each of the $K$ support images, and then averages all of the predictions. Our approach for the $K$-shot setting is called Our-K-shot, and differs from Average in that we jointly analyze all of the $K$ support images rather than treat them independently, as explained in Sec. 3.3.
Training/testing time. The training/testing time is reported in Tab. 3. We can see that contribution 1 adds only a very small computational overhead to the baseline while significantly outperforming it. Additionally, although contribution 2 has a substantially larger testing time (about 40% with the VGG backbone and 35% with the ResNet backbone, compared to the baseline), it yields a more significant performance gain than contribution 1 does.
One-shot Segmentation. Tab. 4 compares our B+C1+C2 with the state of the art, ablations, and aforementioned variants in the one-shot setting on PASCAL-$5^i$. B+C1+C2 gives the best performance for both VGG 16 and ResNet 101, where the latter configuration significantly outperforms the state of the art with the increase in the mIoU averaged over the four folds of cross-validation by 13.49%. Relative to B, our first contribution evaluated with B+C1 gives relatively modest performance improvements. From the results for B+C2, our second contribution produces larger gains in performance relative to B and B+C1, suggesting that contribution 2 in and of itself is more critical than contribution 1. Interestingly, combining both contribution 1 and contribution 2 significantly improves the results relative to using either contribution only. We also observe that performance of our B+C1+C2 for some folds of cross-validation (e.g., PASCAL- and PASCAL-) comes very close to that of Upper-bound, suggesting that our approach is very effective in generalizing to new classes in testing. Fig. 4 shows the mIoU of B+C1+C2 as a function of the number of experts $N$ in the one-shot setting on PASCAL-$5^i$. As can be seen, for our approach is not sensitive to a particular choice of $N$. We use as a good trade-off between complexity and accuracy.
Five-shot Segmentation. Tab. 5 compares Our-K-shot with the state of the art and Average in the five-shot setting on PASCAL-$5^i$. Our-K-shot gives the best performance for both VGG 16 and ResNet 101, where the latter configuration significantly outperforms the state of the art with the increase in the mIoU averaged over the four folds of cross-validation by 15.97%. In comparison with Average, the joint analysis of support images by Our-K-shot appears to be more effective, as Our-K-shot gives superior performance in every fold of cross-validation.
Results on COCO-$20^i$. Tab. 6 and Tab. 7 show our ablations' results in the one-shot and five-shot settings on COCO-$20^i$. The former results are obtained with B+C1+C2 and the latter with Our-K-shot. The lower values of mIoU relative to those in Tab. 4 and Tab. 5 indicate that COCO-$20^i$ is more challenging than PASCAL-$5^i$. Surprisingly, in fold COCO-, B+C1+C2 with VGG 16 outperforms its counterpart with ResNet 101 in the one-shot setting. The same holds for Our-K-shot in the five-shot setting. On average, using ResNet 101 gives higher results. As expected, the increased supervision in the five-shot setting in general gives higher accuracy than the one-shot setting.
Qualitative Results. Fig. 5 shows challenging examples from PASCAL-$5^i$, and our segmentation results obtained with B+C1+C2 with ResNet 101 for the one-shot setting, and Our-K-shot with ResNet 101 for the five-shot setting. In the leftmost column, the bike in the support image has a different pose from the bike in the query image. While this example is challenging for B+C1+C2, our performance improves when using Our-K-shot. In the second column from left, the query image shows a partially occluded target, a part of the bottle. With five support images, Our-K-shot improves performance by capturing the bottle's shadow. The third column from left shows that the bike's features in the support image are insufficiently discriminative, as the person also gets segmented along with the bike. With more examples, the bike is successfully segmented by Our-K-shot. In the rightmost column, the plane in the support image is partially occluded, and thus in the query image B+C1+C2 can only predict the head of the airplane, while Our-K-shot's predicted segment covers most of the airplane.
We have addressed one-shot and few-shot object segmentation, where the goal is to segment a query image, given a support image and the support's ground-truth segmentation. We have made two contributions. First, we have formulated an optimization problem that encourages high feature responses on the foreground and low feature activations on the background for more accurate object segmentation. Second, we have specified the gradient boosting of our model for fine-tuning to new classes in testing. Both contributions have been extended to the few-shot setting for segmenting the query by jointly analyzing the provided support images and their ground truths, rather than treating the support images independently. For evaluation, we have compared with prior work, strong baselines, ablations and variants of our approach on the PASCAL-$5^i$ and COCO-$20^i$ datasets. We significantly outperform the state of the art on both datasets and in both one-shot and five-shot settings. Using only the second contribution gives better results than using only the first contribution. Our integration of both contributions gives a significant gain in performance over each.
Acknowledgement. This work was supported in part by DARPA XAI Award N66001-17-2-4029 and AFRL STTR AF18B-T002.
References
[1] (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[2] (2014) Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062.
[3] (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
[4] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
[5] Encoder-decoder with atrous separable convolution for semantic image segmentation. The European Conference on Computer Vision (ECCV).
[6] (2018) Few-shot semantic segmentation with prototype learning. In BMVC, Vol. 3, pp. 4.
[7] (2010-06) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
[8] Model-agnostic meta-learning for fast adaptation of deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70 (ICML), pp. 1126–1135.
[9] A short introduction to boosting. Journal-Japanese Society For Artificial Intelligence 14 (771-780), pp. 1612.
[10] (2001) Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp. 1189–1232.
[11] (2011) Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998.
[12] Deep residual learning for image recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] (2011) Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 109–117.
[15] (2017-07) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] (2016-06) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV).
[17] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[18] (2018) Reptile: a scalable metalearning algorithm. arXiv preprint arXiv:1803.02999.
[19] (2017) Automatic differentiation in pytorch. In NIPS-W.
[20] (2018) Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373.
[21] (2018) Conditional networks for few-shot semantic segmentation. ICLR Workshop.
[22] (2018) Meta-learning for semi-supervised few-shot classification. In International Conference on Learning Representations.
[23] (2015) U-net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention.
[24] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
[25] (2018) Delta-encoder: an effective sample synthesis method for few-shot object recognition. In Advances in Neural Information Processing Systems (NIPS), pp. 2850–2860.
[26] (2017) One-shot learning for semantic segmentation. In Proceedings of the 28th British Machine Vision Conference (BMVC).
[27] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[28] (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NIPS), pp. 4077–4087.
[29] (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1199–1208.
[30] (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems (NIPS), pp. 3630–3638.
[31] (2018) Low-shot learning from imaginary data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7278–7286.
[32] (2018) SG-one: similarity guidance network for one-shot semantic segmentation. arXiv preprint arXiv:1810.09091.
[33] (2017-07) Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).