One-Shot Weakly Supervised Video Object Segmentation

Conventional few-shot object segmentation methods learn object segmentation from a few labelled support images with strongly labelled segmentation masks. Recent work has been shown to perform on par using weaker levels of supervision, in the form of scribbles and bounding boxes. However, limited attention has been given to few-shot object segmentation with image-level supervision. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism using both visual and word embeddings. It enables our model to achieve a 5.1% improvement over the previously proposed image-level few-shot object segmentation method. Our method compares relatively closely to state-of-the-art methods that use strong supervision, while ours uses the least possible supervision. We further propose a novel setup for few-shot weakly supervised video object segmentation (VOS) that relies on image-level labels for the first frame. The proposed setup uses weak annotation, unlike the semi-supervised VOS setting, which utilizes strongly labelled segmentation masks. The setup evaluates the effectiveness of generalizing to novel classes in the VOS setting by splitting the VOS data into multiple folds with different categories per fold. It provides a potential setup to evaluate how few-shot object segmentation methods can benefit from additional object poses or object interactions that are not available in static frames, as in the PASCAL-$5^i$ benchmark.




1 Introduction

The few-shot learning literature has mainly focused on the classification task, such as Koch et al. (2015), Vinyals et al. (2016), Snell et al. (2017), Qi et al. (2017), Finn et al. (2017), Ravi and Larochelle (2016), Sung et al. (2018), Qiao et al. (2018). Recently, solutions for few-shot object segmentation, which learn pixel-wise classification from a few labelled samples, have emerged Shaban et al. (2017), Rakelly et al. (2018), Dong and Xing (2018), Wang et al. (2019), Zhang et al. (2019a), Zhang et al. (2019b), Siam et al. (2019). Previous literature in few-shot object segmentation relied on manually labelled segmentation masks. A few recent works Rakelly et al. (2018), Zhang et al. (2019b), Wang et al. (2019) started to conduct experiments using weak annotations in the form of scribbles and bounding boxes. However, limited research has been conducted on using image-level supervision for few-shot object segmentation, with only one recent work, Raza et al. (2019).

In order to improve image-level few-shot object segmentation, we propose a multi-modal interaction module that leverages the interaction between support and query visual features and word embeddings. The use of neural word embeddings pretrained on GoogleNews Mikolov et al. (2013) and visual embeddings from networks pretrained on ImageNet Deng et al. (2009) makes it possible to build such a model. We propose a novel approach that uses Stacked Co-Attention to leverage the interaction among visual and word embeddings. We are mainly inspired by Lu et al. (2019), which proposed a co-attention siamese network for unsupervised video object segmentation. However, our setup mainly focuses on the few-shot object segmentation aspect, to assess the model's ability to generalize to novel classes. This aspect motivates why we meta-learn a multi-modal interaction module that leverages the interaction between support and query with image-level supervision.

Concurrent to few-shot object segmentation, video object segmentation (VOS) has been extensively researched Pont-Tuset et al. (2017), Perazzi et al. (2016), Lu et al. (2019), Tokmakov et al. (2016), Jain et al. (2017), Luiten et al. (2018). Two main categories of video object segmentation are studied: semi-supervised VOS and unsupervised VOS. Semi-supervised VOS, also referred to as few-shot, requires an initial strongly labelled segmentation mask to be provided. In a recent work, Khoreva et al. (2018) studied video object segmentation using language expressions. However, their work does not focus on segmenting novel categories, which we try to address since it can be of potential benefit to the few-shot learning community. The additional temporal information in VOS can provide the few-shot object segmentation method with extra unlabelled data about object poses or interactions with other objects. It has the potential to be closer to human-like learning than learning from a single frame. To the best of our knowledge, we are the first to propose a setup for video object segmentation that focuses on the ability to generalize to novel classes by splitting the categories into different folds, which better assesses the generalization ability. The setup also requires only image-level labels for the first frame to segment the corresponding objects, unlike conventional semi-supervised VOS. Although Youtube-VOS Xu et al. (2018) provides a way to evaluate on certain unseen categories, it does not utilize any of the category label information in the segmentation model. Moreover, to ensure that the evaluation of the few-shot method is not biased toward a certain category, it is best to split into multiple folds and evaluate on different ones. To this end we use the Youtube-VOS dataset Xu et al. (2018), a large-scale VOS dataset with 65 different object categories; the originally provided training data is further split into 5 folds with 13 novel classes per fold. The novel setup opens the door towards studying how few-shot object segmentation can benefit from temporal information about different object poses or interactions. The main contributions of this paper are summarized as follows:

  • A novel formulation for learning image-level labelled few-shot segmentation based on a multi-modal interaction module is presented. It utilizes neural word embeddings and attention mechanisms that relate the support and query images. It enables attention to the most relevant spatial locations in the query feature maps.

  • We propose a novel setup called few-shot weakly supervised video object segmentation that can be of potential benefit to few-shot object segmentation.

2 Method

2.1 Stacked Co-Attention

A simple conditioning on the support set image is insufficient when neither sparse nor dense annotations are provided. To leverage the interaction between the support set images and the query image, a conditioned support-to-query attention module is used to learn the correlation between them. Initially, a base network extracts features from the support set image $I_s$ and the query image $I_q$, which we denote as $V_s \in \mathbb{R}^{W \times H \times C}$ and $V_q \in \mathbb{R}^{W \times H \times C}$, where $H$ and $W$ denote the feature map height and width respectively, and $C$ denotes the number of feature channels.

It is important to first capture the semantic representation of the class label using semantic word embeddings Mikolov et al. (2013) before performing the co-attention. The main reason is that the support set and query set can contain multiple common objects from different classes, so depending solely on the support-to-query attention will fail in this case. A projection layer is applied to the semantic word embeddings to construct $z \in \mathbb{R}^{d}$, where $d$ is 256. It is then spatially tiled and concatenated with the visual features of the support and query images, resulting in $\tilde{V}_s$ and $\tilde{V}_q$. An affinity matrix $S$ is then computed to capture the similarity between them using equation 1.

$$S = \tilde{V}_s^{\top} W_{co} \tilde{V}_q \qquad (1)$$

The feature maps are flattened into matrix representations where $\tilde{V}_q \in \mathbb{R}^{C \times WH}$ and $\tilde{V}_s \in \mathbb{R}^{C \times WH}$ (here $C$ counts the channels after concatenation), while $W_{co} \in \mathbb{R}^{C \times C}$ learns the correlation between feature channels. We use a vanilla co-attention similar to Lu et al. (2019), where $W_{co}$ is learned using a fully connected layer. The resulting affinity matrix $S \in \mathbb{R}^{WH \times WH}$ relates each column of $\tilde{V}_q$ and $\tilde{V}_s$. A softmax operation is applied to $S$ row-wise or column-wise, depending on which relation direction is of interest, following equations 2a and 2b.

$$S^{c} = \mathrm{softmax}(S) \qquad (2a)$$
$$S^{r} = \mathrm{softmax}(S^{\top}) \qquad (2b)$$

The column $S^{c}_{*,j}$ contains the relevance of the $j$-th spatial location in $V_q$ with respect to all spatial locations of $V_s$, where $j = 1, \dots, WH$. The normalized affinity matrix is used to compute $U_q$ using equation 3, and $U_s$ similarly. $U_q$ and $U_s$ act as the attention summaries.

$$U_q = \tilde{V}_s S^{c} \qquad (3)$$

The attention summaries are further gated using a gating function $f_g$ with learnable weights $W_g$ and bias $b_g$, following equations 4a and 4b. The gating function uses a sigmoid activation function $\sigma$, so its output lies in the interval [0, 1] and can be used to mask the attention summaries. The operator $\circ$ denotes the Hadamard product, or element-wise multiplication.

$$f_g = \sigma(W_g U_q + b_g) \qquad (4a)$$
$$U_q = f_g \circ U_q \qquad (4b)$$

The gated attention summaries are concatenated with the original visual features and reshaped back to construct the final output of the attention module. Figure 1 illustrates our proposed method. We utilize a ResNet-50 He et al. (2016) encoder pre-trained on ImageNet Deng et al. (2009) to extract visual features. The segmentation decoder comprises an iterative optimization module (IOM) Zhang et al. (2019b) and atrous spatial pyramid pooling (ASPP) Chen et al. (2017a), Chen et al. (2017b). To improve our model we use two stacked co-attention modules. Stacking allows the model to learn a better representation by letting the support set guide the attention on the query image, and the reverse with respect to the support set, through multiple iterations.
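The co-attention steps described in this section (tiling the word embedding, computing the affinity matrix, softmax normalization, attention summaries, and sigmoid gating) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random weights, the tiny tensor sizes, and the shape chosen for the gating weights are all hypothetical, the word vector is treated as already projected, and the stacking of two modules and the decoder are not covered.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def tile_and_concat(visual, word_vec):
    """Tile a (d,) word vector over space and concatenate it with
    (C, H, W) visual features; returns a flattened (C + d, H * W) matrix."""
    c, h, w = visual.shape
    d = word_vec.shape[0]
    tiled = np.broadcast_to(word_vec[:, None, None], (d, h, w))
    return np.concatenate([visual, tiled], axis=0).reshape(c + d, h * w)


def gated_co_attention(v_s, v_q, w_co, w_g, b_g):
    """v_s, v_q: flattened support/query features of shape (C', N).
    w_co: (C', C') affinity weights. w_g, b_g: gating weights and bias
    (their shapes are an assumption made for this sketch)."""
    s = v_s.T @ w_co @ v_q          # affinity, rows: support, cols: query
    s_c = softmax(s, axis=0)        # each query column sums to 1
    s_r = softmax(s.T, axis=0)      # each support column sums to 1
    u_q = v_s @ s_c                 # attention summary for the query
    u_s = v_q @ s_r                 # attention summary for the support
    u_q = sigmoid(w_g @ u_q + b_g) * u_q   # gate in [0, 1], element-wise
    u_s = sigmoid(w_g @ u_s + b_g) * u_s
    return u_q, u_s, s_c


# Toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
C, H, W, d = 8, 4, 5, 3
word = rng.normal(size=d)           # stands in for the projected embedding
v_s = tile_and_concat(rng.normal(size=(C, H, W)), word)
v_q = tile_and_concat(rng.normal(size=(C, H, W)), word)
Cp = C + d
w_co = 0.1 * rng.normal(size=(Cp, Cp))
w_g = 0.1 * rng.normal(size=(Cp, Cp))
b_g = np.zeros((Cp, 1))
u_q, u_s, s_c = gated_co_attention(v_s, v_q, w_co, w_g, b_g)
```

In this sketch the gate has the same shape as the summary, so the masking is a plain element-wise product; a single-channel gate broadcast over channels would be an equally plausible reading of the text.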

Figure 1: Architecture of the few-shot object segmentation model with co-attention. The operator symbols in the figure denote concatenation, element-wise multiplication, and matrix multiplication.

2.2 Few-Shot Weakly Supervised Video Object Segmentation

We propose a novel setup for few-shot video object segmentation in which an image-level labelled first frame is used to learn object segmentation in the sequence, rather than manual segmentation masks. More importantly, the setup splits the categories into multiple folds to assess the generalization ability to novel classes. We utilize the Youtube-VOS training data, which has 65 categories, and split it into 5 folds. Each fold has 13 classes that are used as novel classes, while the rest are used in the meta-training phase. A randomly sampled category and sequence are used to construct the support set and query images.
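The fold construction can be sketched as below. The text does not say which categories are assigned to which fold, so the contiguous assignment and the placeholder category names here are assumptions made purely for illustration.

```python
def make_folds(categories, num_folds=5):
    """Split categories into equally sized folds. For each fold, the held-out
    categories are the novel classes and the remainder are used for
    meta-training. Contiguous assignment is an assumption of this sketch."""
    if len(categories) % num_folds != 0:
        raise ValueError("categories must divide evenly into folds")
    per_fold = len(categories) // num_folds
    folds = []
    for i in range(num_folds):
        start, end = i * per_fold, (i + 1) * per_fold
        folds.append({
            "novel": categories[start:end],
            "train": categories[:start] + categories[end:],
        })
    return folds


# Placeholder names standing in for the 65 Youtube-VOS categories.
categories = [f"class_{i:02d}" for i in range(65)]
folds = make_folds(categories)
```

With 65 categories and 5 folds, each fold holds out 13 novel classes and meta-trains on the remaining 52.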

3 Experimental Results

In this section we present results from experiments conducted on the PASCAL-$5^i$ dataset Shaban et al. (2017), which is the most widely used dataset for evaluating few-shot segmentation. We also conduct experiments on our novel few-shot weakly supervised video object segmentation setup.

3.1 Experimental Setup

We report two evaluation metrics. The mean-IoU computes the intersection over union for each of the 5 classes within the fold and averages them, neglecting the background Shaban et al. (2017). The binary-IoU metric proposed in Rakelly et al. (2018), Dong and Xing (2018) instead computes the mean of the foreground and background IoU in a class-agnostic manner. Both metrics are reported as an average of 5 different runs to ensure a stable result, following Wang et al. (2019). We have also noticed some deviation in the validation schemes used in previous works. Zhang et al. (2019b) follows a procedure where validation is performed on the test classes to save the best model, whereas Wang et al. (2019) does not perform validation and instead trains for a fixed number of iterations. We choose the approach followed in Wang et al. (2019), since we consider that the test classes are not available during the initial meta-training phase.
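The difference between the two metrics can be illustrated with a short sketch. It assumes that pixel counts are accumulated over all evaluation episodes before the IoU is taken; the cited works may differ in such details, and the function names and the toy masks are hypothetical.

```python
from collections import defaultdict

import numpy as np


def iou_from_counts(tp, fp, fn):
    """Intersection over union from accumulated pixel counts."""
    denom = tp + fp + fn
    return tp / denom if denom > 0 else float("nan")


def evaluate(episodes):
    """episodes: iterable of (class_id, pred_mask, gt_mask) with binary masks.
    Returns (mean_iou, binary_iou)."""
    per_class = defaultdict(lambda: np.zeros(3))  # fg tp, fp, fn per class
    fg = np.zeros(3)  # class-agnostic foreground counts
    bg = np.zeros(3)  # class-agnostic background counts
    for cls, pred, gt in episodes:
        pred, gt = pred.astype(bool), gt.astype(bool)
        fg_counts = np.array([(pred & gt).sum(), (pred & ~gt).sum(), (~pred & gt).sum()])
        bg_counts = np.array([(~pred & ~gt).sum(), (~pred & gt).sum(), (pred & ~gt).sum()])
        per_class[cls] += fg_counts
        fg += fg_counts
        bg += bg_counts
    # mean-IoU: foreground IoU per class, averaged over classes.
    mean_iou = float(np.mean([iou_from_counts(*c) for c in per_class.values()]))
    # binary-IoU: average of foreground and background IoU, ignoring class.
    binary_iou = 0.5 * (iou_from_counts(*fg) + iou_from_counts(*bg))
    return mean_iou, binary_iou


# Two toy episodes from two different classes.
episodes = [
    (0, np.array([[1, 0], [0, 0]]), np.array([[1, 1], [0, 0]])),
    (1, np.array([[1, 1], [0, 0]]), np.array([[1, 0], [0, 0]])),
]
mean_iou, binary_iou = evaluate(episodes)
```

In this toy case each class has a foreground IoU of 0.5, so the mean-IoU is 0.5, while the binary-IoU is higher because the background, which is segmented more accurately, contributes half of the score.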

3.2 Few-shot Weakly Supervised Object Segmentation

We compare the results of our proposed method with stacked co-attention against other state-of-the-art methods for 1-way 1-shot segmentation on PASCAL-$5^i$ in Table 1, using the mean-IoU and binary-IoU metrics. We report results for both validation schemes, where V1 follows Zhang et al. (2019b) and V2 follows the validation scheme of Wang et al. (2019). Without using segmentation masks or even sparse annotations, our method, with the least supervision of image-level labels, reaches 53.5% mean-IoU, relatively close to the 56% of the current state-of-the-art methods, showing the efficacy of our proposed algorithm. It outperforms the previous one-shot weakly supervised segmentation method Raza et al. (2019) by 5.1%. Our proposed model also outperforms the baseline method, which utilizes a co-attention module without word embeddings. Figure 2 shows qualitative results for our proposed approach.

Method                  W   fold 0  fold 1  fold 2  fold 3  mean-IoU  binary-IoU
FG-BG                       -       -       -       -       -         55.1
Shaban et al. (2017)        33.6    55.3    40.9    33.5    40.8      -
Rakelly et al. (2018)       36.7    50.6    44.9    32.4    41.1      60.1
Dong and Xing (2018)        -       -       -       -       -         61.2
Siam et al. (2019)          41.9    50.2    46.7    34.7    43.4      62.2
Wang et al. (2019)          42.3    58.0    51.1    41.2    48.1      66.5
Zhang et al. (2019b)        52.5    65.9    51.3    51.9    55.4      66.2
Zhang et al. (2019a)        56.0    66.9    50.6    50.4    56.0      69.9
Raza et al. (2019)          -       -       -       -       -         58.7
Baseline                    38.6    56.6    43.8    38.2    44.3      60.2
Ours-V1                     49.4    65.5    50.0    49.2    53.5      65.6
Ours-V2                     42.1    65.1    47.9    43.8    49.7      63.8
Table 1: Quantitative results for 1-way, 1-shot segmentation on the PASCAL-$5^i$ dataset showing mean-IoU Shaban et al. (2017) and binary-IoU Rakelly et al. (2018), Dong and Xing (2018). W: stands for using weak supervision from image-level labels.
Figure 2: Qualitative evaluation on PASCAL-$5^i$ 1-way 1-shot for (a) 'bicycle', (b) 'bottle', and (c) 'bird'. The support set and prediction on the query image are shown in pairs.

3.3 Few-shot Weakly Supervised Video Object Segmentation

Table 2 shows results on our proposed novel setup, comparing our method with a baseline that uses the co-attention module without word embeddings, similar to Lu et al. (2019). It shows the potential benefit of utilizing neural word embeddings to guide the attention module.

Method    fold 0  fold 1  fold 2  fold 3  fold 4  Mean
Baseline  40.1    33.7    47.1    36.4    36.6    38.8
Ours      41.6    40.8    51.4    41.5    39.1    42.9
Table 2: Quantitative results on the Youtube-VOS one-shot weakly supervised setup.
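As a quick arithmetic check on Table 2, the Mean column is consistent with an unweighted average over the five folds. The snippet below only recomputes the reported numbers; the variable names are illustrative.

```python
# Per-fold scores and reported means, copied from Table 2.
table2 = {
    "Baseline": ([40.1, 33.7, 47.1, 36.4, 36.6], 38.8),
    "Ours": ([41.6, 40.8, 51.4, 41.5, 39.1], 42.9),
}

# Unweighted average over the folds, rounded to one decimal like the table.
recomputed = {
    name: round(sum(scores) / len(scores), 1)
    for name, (scores, _) in table2.items()
}

# Gain of the proposed method over the baseline on each fold.
per_fold_gain = [
    round(ours - base, 1)
    for ours, base in zip(table2["Ours"][0], table2["Baseline"][0])
]
```

The recomputed means match the reported column, and the per-fold gains are positive on every fold, ranging from 1.5 to 7.1 points.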

4 Conclusions

Our proposed method shows promise for few-shot object segmentation based on gated co-attention that leverages the interaction between the support set and query image. Our model utilizes neural word embeddings to guide the attention mechanism, which improves the segmentation accuracy compared to the baseline. We demonstrate promising results on PASCAL-$5^i$, where we outperform the previously proposed image-level labelled one-shot segmentation method by 5.1% and perform closer to methods that use strongly labelled masks. Our novel setup provides a means to experiment with the effect of capturing different object viewpoints and appearance changes in few-shot object segmentation. It closely mimics human learning of novel objects from few labelled samples by aggregating information from different viewpoints and capturing different object interactions.


References

  • L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017a) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §2.1.
  • L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017b) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2.1.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §2.1.
  • N. Dong and E. P. Xing (2018) Few-shot semantic segmentation with prototype learning. In BMVC, Vol. 3, pp. 4. Cited by: §1, §3.1, Table 1.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.1.
  • S. D. Jain, B. Xiong, and K. Grauman (2017) Fusionseg: learning to combine motion and appearance for fully automatic segmention of generic objects in videos. arXiv preprint arXiv:1701.05384. Cited by: §1.
  • A. Khoreva, A. Rohrbach, and B. Schiele (2018) Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, pp. 123–141. Cited by: §1.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2. Cited by: §1.
  • X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli (2019) See more, know more: unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3623–3632. Cited by: §1, §1, §2.1, §3.3.
  • J. Luiten, P. Voigtlaender, and B. Leibe (2018) PReMVOS: proposal-generation, refinement and merging for video object segmentation. In Asian Conference on Computer Vision, pp. 565–580. Cited by: §1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2.1.
  • F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, Cited by: §1.
  • J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv:1704.00675. Cited by: §1.
  • H. Qi, M. Brown, and D. G. Lowe (2017) Learning with imprinted weights. arXiv preprint arXiv:1712.07136. Cited by: §1.
  • S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7229–7238. Cited by: §1.
  • K. Rakelly, E. Shelhamer, T. Darrell, A. Efros, and S. Levine (2018) Conditional networks for few-shot semantic segmentation. Cited by: §1, §3.1, Table 1.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. Cited by: §1.
  • H. Raza, M. Ravanbakhsh, T. Klein, and M. Nabi (2019) Weakly supervised one shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §1, §3.2, Table 1.
  • A. Shaban, S. Bansal, Z. Liu, I. Essa, and B. Boots (2017) One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410. Cited by: §1, §3.1, Table 1, §3.
  • M. Siam, B. N. Oreshkin, and M. Jagersand (2019) AMP: adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5249–5258. Cited by: §1, Table 1.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4077–4087. Cited by: §1.
  • F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §1.
  • P. Tokmakov, K. Alahari, and C. Schmid (2016) Learning motion patterns in videos. arXiv preprint arXiv:1612.07217. Cited by: §1.
  • O. Vinyals, C. Blundell, T. Lillicrap, and D. Wierstra (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638. Cited by: §1.
  • K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng (2019) PANet: few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9197–9206. Cited by: §1, §3.1, §3.2, Table 1.
  • N. Xu, L. Yang, Y. Fan, D. Yue, Y. Liang, J. Yang, and T. Huang (2018) Youtube-vos: a large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327. Cited by: §1.
  • C. Zhang, G. Lin, F. Liu, J. Guo, Q. Wu, and R. Yao (2019a) Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9587–9595. Cited by: §1, Table 1.
  • C. Zhang, G. Lin, F. Liu, R. Yao, and C. Shen (2019b) CANet: class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5217–5226. Cited by: §1, §2.1, §3.1, §3.2, Table 1.