Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation

by Yu Liu, et al.

This paper tackles the problem of video object segmentation. We are specifically concerned with the task of segmenting all pixels of a target object in all frames, given the annotation mask in the first frame. Even when such annotation is available, this remains a challenging problem because the appearance and shape of the object change over time. We tackle this task by formulating it as a meta-learning problem, in which the base learner grasps the semantic scene understanding for a general type of object, and the meta learner quickly adapts to the appearance of the target object from a few examples. Our proposed meta-learning method uses a closed-form optimizer, ridge regression, which has been shown to be conducive to fast and better training convergence. Moreover, we propose a mechanism, named "block splitting", to further speed up the training process and to reduce the number of learned parameters. In comparison with state-of-the-art methods, our proposed framework achieves a significant boost in processing speed while delivering performance that is very competitive with the best-performing methods on widely used datasets.



Code Repositories


Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation, IROS2020


1 Introduction

Fast and accurate video object segmentation plays an important role in many real-world applications, including, but not limited to, film making [12], public surveillance [44] and robotic vision [20].

Figure 2: Example result of our technique: The segmentation of the first frame (red) is used to learn the model of the specific object to track, which is segmented in the rest of the frames independently (green). One of every 10 frames is shown, out of 50 in total.

The goal of video object segmentation is to distinguish an object of interest over video frames from its background at the pixel level.

In contrast to many vision tasks such as image classification [17], face recognition [28] and object detection [32, 13], for which the performance of algorithms has reached the point of being suitable for real-world applications, the performance of video object segmentation algorithms is still far from that of annotations performed by humans [30]. This is mainly because this problem does not benefit from the availability of a massive corpus of training data, unlike the other aforementioned tasks.

Recently, deep learning-based approaches have shown promising progress on the video object segmentation task [3, 42, 24, 37]. However, they still struggle to deliver both good accuracy and fast inference. In this paper, we aim to bridge this gap.

Inspired by the meta-learning method of [2], we propose an intuitive yet powerful algorithm for video object segmentation in which a reference frame is available together with its annotated mask. Our objective is to train a system that can “adapt” this annotation information to subsequent frames in a fast yet flexible way at inference time. Specifically, at inference time the reference frame (i.e. the one with ground-truth annotation) is mapped to vectors in a high-dimensional embedding space using a CNN. We then determine, using ridge regression [25], the coefficients of a matrix W that best maps these embedded features to the ground-truth mask. W is then the video-specific “adaptor”, and it maps the feature vectors of every query image (every other image in the video sequence) to their predicted segmentation masks. Training comprises the process of learning the embedding by presenting the network with pairs of images (from a variety of videos, but with each pair coming from the same video), each with ground-truth annotation, and back-propagating the loss through the CNN. This is illustrated in Figure 3 and described in more detail later in the paper.

We observe that a limitation of the proposed approach is that ridge regression scales poorly with the dimension of the feature produced by the CNN, because the optimization requires a huge matrix inversion. We address this through the use of a “block splitting” method that approximates the matrix in block diagonal form, meaning the inversion can be done much more efficiently.

Our main contributions are three-fold:

  • A meta-learning based method for video object segmentation is developed, using a closed-form solver (ridge regression) as the internal optimizer. It supports fast gradient back-propagation and can adapt to previously unseen objects quickly with very few samples. Inference (i.e. segmentation of the video) is a single forward pass per frame with no need for fine-tuning or post-processing.

  • Ridge regression in high-dimensional feature spaces can be very slow, because of the need to invert a large matrix. We address this using a novel block splitting mechanism, which we show greatly speeds up the training process without damaging the performance.

  • We demonstrate state-of-the-art video segmentation accuracy relative to all other methods of comparable processing time, and even better accuracy than many slower ones (see Figure 1).

Figure 3: Workflow of the proposed method. An image pair sampled from the same video is the input to the network. The first image and its annotation serve as the reference frame, and the second image and its annotation (or its prediction during inference) serve as the query frame. The image pair first passes through the feature extractor (DeepLabv2 [4] with ResNet101 [14]) to compute an 800D embedding tensor for each image. Then a mapping matrix W between the reference features and the reference annotation is calculated in the reference frame (Eq. 1) using ridge regression. After that, the prediction for the query frame is obtained by multiplying the query features by W (Eq. 2). During training, the loss between the prediction and the query annotation is back-propagated to enhance the network’s ability to adapt between the reference frame and the query frame. During inference, the reference frame (image and annotation) is always the first frame, and the query images are the remaining frames of the same video. Through iterative meta-learning, our network is capable of quickly adapting to unseen target object(s) from a few examples.

2 Related Works

2.1 Semi-supervised Video Object Segmentation

The goal of video object segmentation is to ‘cut out’ the target object(s) from the entire input video sequence. According to the amount of supervision used, methods can be roughly divided into two groups, i.e. semi-supervised and unsupervised methods.

For semi-supervised video object segmentation, the annotated mask of the first frame is given, and the algorithm is designed to predict the masks of the remaining frames in the video. There are three categories in this group. The first, which includes MSK [29] and MPNVOS [37], uses optical flow to track the mask from the previous frame to the current frame. Similarly, the second category formulates optical flow and segmentation as two parallel branches and uses the predicted mask from the previous frame as guidance; representatives include Segflow [7], VSOF [40], RGMP [43] and OSNM [45]. The final class, which holds the state-of-the-art performance on the DAVIS benchmark [30], tries to over-fit to the appearance of the target object(s) and expects the model to generalize to the subsequent frames. Specifically, OSVOS [3] uses a one-shot learning mechanism to fine-tune on the first frame of the test video to capture the appearance of the target object(s), and then conducts inference on the remaining frames. The drawbacks of OSVOS are that (1) it cannot adapt to unseen parts, and (2) when dramatic changes of appearance occur in subsequent frames, its performance degrades significantly. Inspired by the overall design of OSVOS, several follow-up methods employ additional ingredients to improve segmentation accuracy. In particular, OSVOS-S [24] incorporates semantic instance information to remove noisy objects from the same category. OnVOS [42] uses an online adaptation mechanism to overcome the limits of OSVOS when drastic appearance changes occur. CINM [1] uses a CNN-based Markov random field (MRF) to estimate the probabilities of pixels belonging to the target object(s) in the spatial domain, and employs optical flow to track segmented pixels in the temporal domain. Although these methods improve the segmentation performance of OSVOS to some extent, they are still time-consuming during inference because online fine-tuning is necessary. Usually, at least a dense CRF [16] and further techniques are applied as post-processing steps to obtain better segmentation results.

In this paper, we mainly target fast video object segmentation. Since neither optical flow nor fine-tuning is used, the proposed method is appropriate for real-world applications.

2.2 Meta Learning

Meta learning is also called learning to learn [35, 26, 38], because its goal is to enable a machine to learn quickly, especially when very few samples are available for the new task(s). Generally speaking, meta-learning algorithms are composed of two components: a base learner and a meta learner. The base learner is mainly in charge of handling individual tasks, while the meta learner acts much like a coordinator: by learning across individual tasks, it can boost the performance of base learners across tasks.

Meta-learning is an alternative to the de facto solution that has emerged in deep learning, namely pre-training a network on a large, generic dataset (e.g. ImageNet [8]) followed by fine-tuning on a problem-specific dataset. Meta-learning aims to replace the fine-tuning stage (which can still be very expensive) by training a network that has a degree of plasticity, so that it can adapt rapidly to new tasks. For this reason it has become a very active area recently, especially with regard to one-shot and few-shot learning problems [18, 9].

Recent approaches to meta-learning can be roughly divided into three categories: (i) metric learning for acquiring similarities [41, 36, 11]; (ii) learning optimizers for gaining update rules [10, 31]; and (iii) recurrent networks for retaining memory [33, 15]. In this work, we adopt a meta-learning algorithm that belongs to the category of learning optimizers. Specifically, inspired by [2], which was originally designed for image classification, we adopt ridge regression, which provides a closed-form solution to the optimization problem. We use it because, compared with the SGD [19] widely used in CNNs, ridge regression can propagate gradients efficiently, which matches the goal of fast mapping. Through extensive experiments, we demonstrate that the proposed method is in the first echelon with regard to speed for fast video object segmentation, while obtaining more accurate results without any post-processing.

2.3 Fast Video Object Segmentation

A few previous methods have been proposed to tackle fast video object segmentation. In particular, FAVOS [6] first tracks part-based detections. Then, based on the tracked boxes, it generates part-based segments and merges those parts according to a similarity score to form the final segmentation results. The drawback of FAVOS is that it cannot be learned in an end-to-end manner and relies heavily on the part-based detection performance. OSNM [45] proposes a model composed of a modulator and a segmentation network. By encoding the mask prior, the modulator helps the segmentation network adapt quickly to the target object. RGMP [43] shares the same spirit as OSNM. Specifically, it employs a Siamese encoder-decoder structure for mask propagation, and further boosts the performance with synthetic data. The work most similar to ours is PML [5], which formulates the problem as a pixel-wise metric learning problem. Using an FCN [23], it maps pixels to a high-dimensional space and uses a revised triplet loss to encourage pixels belonging to the same object to lie much closer together than those belonging to different objects. Nearest neighbor (NN) retrieval is required during inference. In contrast, our meta-learning approach acquires a mapping matrix between the high-dimensional features and the annotated mask in the reference image using ridge regression, and can then be adapted rapidly to generate the prediction mask. Compared to the baseline method PML [5], our method is more accurate and twice as fast. At the same efficiency, the J mean of our method is 3.4 percent better than that of OSNM [45] on the DAVIS2016 [30] validation set.

3 Methodology

3.1 Overview

We formulate video object segmentation as a meta-learning problem. For each image pair, which comes from the same video, ridge regression is used as the optimizer to learn the base learner. The meta learner is built naturally through the training process. Once the meta learner is learned, it possesses the ability to map quickly between image features and object masks, and it can be adapted to unseen objects quickly with the help of the reference image.

According to the phase at which user input is involved in the training loop, existing methods can be classified into three categories.

User input outside the network training loop. This category uses the user input to fine-tune the network so that it over-fits to the appearance cues of the target object(s) during inference. Representatives are OSVOS [3] and its follow-up works [24, 1, 42]. Since online fine-tuning is required during inference, these algorithms are time-consuming, usually taking seconds per image, and are thus not practical for real-world applications.

User input within the network training loop. This category of work injects the user input as an additional input for training the network. In this way, no online fine-tuning is needed. These algorithms incorporate the user input either by using a parallel network or by concatenating the image with the user input [43, 45]. One drawback of these methods is that the model needs to be recalculated whenever the user input changes, so they are not practical for adaptation, especially for long videos.

User input detached from the network training loop. In contrast to the previous methods, our algorithm shares the same spirit as PML [5] in design. The network and the user input are detached, so the user input can be more flexible. Moreover, once the user input is given (for example, the annotation in the reference image), the network can quickly adapt to the target objects without any extra operations.

3.2 Segmentation as Meta-Learning

For simplicity, we assume the single-object segmentation case, in which the annotation of the first frame is given as the user input. Note that our method can also be applied to multiple objects and easily extended to other types of user input, e.g. scribbles or clicks.

The method uses the following quantities:

  • The number of feature channels (in our case 800).
  • The spatial resolution of the extracted features (in our case 1/8 of the original image size).
  • The feature tensors of the reference and query frames, produced by the feature extractor.
  • The flattened form of either feature tensor.
  • The flattened annotation mask of the reference or query frame.
  • W, the mapping matrix between the feature space and the annotation mask.

As noted above, there are two components to the learner: (i) an embedding model that maps images to a high-dimensional feature space; and (ii) an adaptor W, found using ridge regression, that maps the embedded features to a (flattened) segmentation mask.

Embedding Model. We adopt DeepLabV2 [4] built on the ResNet-101 [14] backbone as our feature extractor. This choice allows a direct comparison of our method with the baseline, PML [5]. First, we use the model pretrained on the COCO [22] dataset as the initialization for semantic segmentation. Then the ASPP [4] layer for classification is removed and replaced by our video-specific mapping W.

Ridge Regression

Ridge regression is a closed-form solver that is widely used in the machine learning community [34, 27]. The learner seeks the mapping matrix W that minimizes the squared error between the mapped features and the annotation mask, plus a regularization penalty on W.

The features and the annotation mask are as defined above, and the regularization parameter is set to 5.0 in all of our experiments. As can be seen in Figure 3, during training an image pair and the corresponding annotations are sampled from the same video sequence. The features extracted from the reference image (in the figure, the first image) and its annotation are used to calculate the mapping matrix W.


(where we abuse notation and use the unflattened feature tensors for clarity)

For the query image, we likewise compute its features and map them to the predicted segmentation mask using Equation 2, in which W is the matrix computed from the reference image and its ground truth. The loss between the predicted mask and the query annotation provides the back-propagation signal that improves the feature extractor’s ability to produce adaptable features.
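For reference, the standard ridge-regression objective, its closed-form solution, and the resulting query prediction can be written as below. The symbols are assumptions chosen for illustration and may differ from the paper's own notation: $X_r$ and $X_q$ are the flattened reference and query features (one row per spatial location, one column per channel), $Y_r$ is the flattened reference mask, $\lambda$ is the regularization parameter, and $I$ is the identity matrix.

```latex
\begin{align}
W^{*} &= \arg\min_{W} \; \lVert X_r W - Y_r \rVert^{2} + \lambda \lVert W \rVert^{2} \\
W^{*} &= \left( X_r^{\top} X_r + \lambda I \right)^{-1} X_r^{\top} Y_r \\
\hat{Y}_q &= X_q W^{*}
\end{align}
```

Under these assumptions, the matrix being inverted has one row and one column per feature channel, which is why the cost of the solve grows with the feature dimension.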

During inference, in our case, the reference frame is always the first frame, for which the annotation mask is provided, and the query frames are the remaining frames of the same video.
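The inference procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`flatten_features`, `fit_adaptor`, `predict_scores`), the (H, W, C) feature layout, and the random stand-in features are all hypothetical, and a real system would take the features from the CNN feature extractor.

```python
import numpy as np


def flatten_features(feat):
    """Reshape an (H, W, C) feature tensor into an (H*W, C) matrix."""
    h, w, c = feat.shape
    return feat.reshape(h * w, c)


def fit_adaptor(ref_feat, ref_mask, lam=5.0):
    """Closed-form ridge regression on the reference frame.

    Solves (X^T X + lam * I) W = X^T Y for W, which is equivalent to
    W = (X^T X + lam * I)^-1 X^T Y but avoids forming the inverse explicitly.
    """
    x = flatten_features(ref_feat)                    # (H*W, C)
    y = ref_mask.reshape(-1, 1).astype(x.dtype)       # (H*W, 1)
    gram = x.T @ x + lam * np.eye(x.shape[1])         # (C, C)
    return np.linalg.solve(gram, x.T @ y)             # (C, 1)


def predict_scores(query_feat, w):
    """Apply the adaptor W to a query frame; returns an (H, W) score map."""
    h, wd, _ = query_feat.shape
    return (flatten_features(query_feat) @ w).reshape(h, wd)


# Toy usage with random stand-in features (hypothetical sizes: 6x6 grid, 8 channels).
rng = np.random.default_rng(0)
ref_feat = rng.normal(size=(6, 6, 8))
ref_mask = rng.integers(0, 2, size=(6, 6))
adaptor = fit_adaptor(ref_feat, ref_mask)             # computed once, on the first frame
query_feats = [rng.normal(size=(6, 6, 8)) for _ in range(3)]
score_maps = [predict_scores(f, adaptor) for f in query_feats]
```

The adaptor is computed once from the first frame and then reused for every query frame, so each query costs only a feature extraction and one matrix product.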

3.3 Block Splitting

Thanks to ridge regression, the computation of the mapping matrix and the gradient back-propagation are already very fast compared with other algorithms that also focus on video object segmentation.


During the experiments, we found that the higher the dimension of the features fed to the meta-learning module, the more accurate the segmentation results are likely to be. However, we also observed that the higher the feature dimension, the slower the training process. Specifically, the computation of the mapping matrix W involves a matrix inversion, as denoted by Equation 3, which becomes the bottleneck for fast propagation when very high-dimensional features are used.

In order to further speed up the training of the proposed network, we introduce a block splitting mechanism, whose working principle is shown in Figure 4. In particular, our motivation is that the matrix inverse computation for a very high-dimensional feature (e.g. 800D) can be approximated by the sum of the computations for relatively low-dimensional features (e.g. 200D × 4). In terms of the working principle, the matrix is approximated by four independent blocks on its diagonal.

The advantages of the proposed block splitting mechanism are twofold. First, it greatly speeds up the matrix inversion involved in ridge regression, and thus saves training time. Second, through the matrix approximation step described above, both the number of network parameters involved in the ridge regression and the memory used by our network are reduced. The experimental evidence can be found in the Ablation Study (Section 5).
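One possible reading of block splitting can be sketched as follows: the channels are divided into groups, a small ridge problem is solved for each group, and the per-group solutions are stacked. This is equivalent to keeping only the diagonal blocks of the regularized Gram matrix before inverting it. The function names and the toy sizes are hypothetical, and the sketch is an interpretation of the description above rather than the authors' code.

```python
import numpy as np


def ridge_solve(x, y, lam=5.0):
    """Full ridge solution: solve (X^T X + lam * I) W = X^T Y."""
    gram = x.T @ x + lam * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)


def ridge_solve_block_split(x, y, lam=5.0, num_blocks=4):
    """Approximate ridge solution that ignores correlations between channel groups.

    Each group of channels gets its own small solve, so the work is
    num_blocks solves of size (C / num_blocks) instead of one solve of size C.
    """
    w = np.zeros((x.shape[1], y.shape[1]))
    for idx in np.array_split(np.arange(x.shape[1]), num_blocks):
        w[idx] = ridge_solve(x[:, idx], y, lam)
    return w


# Toy data: 50 spatial locations, 8 channels split into 4 blocks of 2 channels.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
y = rng.integers(0, 2, size=(50, 1)).astype(float)
w_full = ridge_solve(x, y)
w_split = ridge_solve_block_split(x, y, num_blocks=4)
```

The two solutions coincide exactly only when features in different groups are uncorrelated, i.e. when the off-diagonal blocks of the Gram matrix vanish; otherwise the split version is an approximation.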

3.4 Training

Training Strategy. For training, the optimizer is SGD with momentum 0.9 and weight decay 5e-4. We use DeepLabV2 [4] with a ResNet-101 [14] backbone as the feature extractor, and a constant learning rate of 1.0e-5 is used during the whole training process. The feature extractor outputs features of dimension 800, which are used as the input to the meta-learning module.
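To make the optimizer settings concrete, the sketch below shows a single SGD update with momentum and weight decay using the hyper-parameters listed above. The update convention (weight decay added to the gradient, then a momentum buffer) is an assumption modeled on common deep learning frameworks; the source does not specify it, and the function name is hypothetical.

```python
def sgd_momentum_step(params, grads, velocity,
                      lr=1.0e-5, momentum=0.9, weight_decay=5e-4):
    """One SGD step over flat lists of scalar parameters.

    Assumed convention: g <- g + weight_decay * p, v <- momentum * v + g,
    p <- p - lr * v.
    """
    new_params, new_velocity = [], []
    for p, g, v in zip(params, grads, velocity):
        g = g + weight_decay * p      # L2 weight decay folded into the gradient
        v = momentum * v + g          # momentum buffer update
        new_params.append(p - lr * v)
        new_velocity.append(v)
    return new_params, new_velocity


# Toy usage: one parameter with value 1.0 and gradient 2.0, starting from zero velocity.
params, velocity = sgd_momentum_step([1.0], [2.0], [0.0])
```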

Loss. BCEWithLogitsLoss is employed for training the proposed network. It is essentially a combination of a Sigmoid layer and the binary cross entropy (BCE) loss, and it benefits from the log-sum-exp trick for numerical stability. Compared to the plain BCE loss, it is more robust and less likely to cause numerical problems when computing the inverse matrix in the ridge regression step.

Figure 4: Illustration of the proposed block splitting. During the matrix inverse calculation of ridge regression, the computation for the higher-dimensional feature is approximated by the sum of the computations for lower-dimensional features, which effectively speeds up the training process and reduces the parameters and memory.

In this loss, the number of per-element terms equals the batch size, each term is computed from the input of the loss calculation (the predicted logit) and the ground-truth label, and a rescaling weight can be given to the loss of each batch element.
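The numerical-stability point can be illustrated with a small pure-Python sketch of binary cross entropy computed directly from logits. It is a stand-in for the framework loss named above, not that implementation; the function name and the mean reduction over elements are assumptions.

```python
import math


def bce_with_logits(logits, targets, weights=None):
    """Mean binary cross entropy computed directly from logits.

    Uses the identity
        -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
            = max(x, 0) - x * y + log(1 + exp(-|x|)),
    which never exponentiates a large positive number.
    """
    if weights is None:
        weights = [1.0] * len(logits)
    total = 0.0
    for x, y, w in zip(logits, targets, weights):
        total += w * (max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x))))
    return total / len(logits)


# A confident wrong prediction yields a large but finite loss instead of overflowing.
large_loss = bce_with_logits([1000.0], [0.0])
```

A naive version that first applies the sigmoid and then takes the logarithm would evaluate log(0) for such inputs, which is the failure mode the combined formulation avoids.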

Figure 5: Qualitative results: Homogeneous sample of DAVIS sequences with our result overlaid
Method DAVIS16 Online-Tuning OptFlow CRF BS Speed(s)
OFL 68.0 - 42.2
BVS 60.0 - 0.37
ConvGRU 70.1 20
VPN 70.2 0.63
MaskTrack-B 63.2 - 0.24
SFL-B 67.4 0.30
OSVOS-B 52.5 0.14
OSNM 72.2 0.14
PML 75.5 0.28
Ours 75.8 0.145
PLM 70.0 0.50
SFL 74.8 7.9
MaskTrack 69.8 12
OSVOS 79.8 10
Table 1: Performance comparison of our approach with recent approaches on DAVIS 2016. Performance is measured in mean IoU.

4 Experiments

Figure 6: Per-sequence results of mean region similarity (J). Sequences are sorted by our performance.

4.1 Dataset

We verify the proposed method on both the DAVIS2016 [30] and SegTrack v2 [21] datasets.

DAVIS2016 contains 50 video sequences with pixel-level annotation, and each video contains only one target object to segment. Of these 50 video sequences, 30 form the training set, for which the annotated mask is provided for every frame. The other 20 form the validation set, for which only the annotation of the first frame may be accessed.

SegTrack v2 [21] is an extension of the SegTrack [39] dataset. Both contain dense pixel-level annotation for each frame of each video. For the SegTrack v2 dataset, we test our algorithm on all sequences that contain one target object.

4.2 Results on DAVIS2016

Quantitative Results. Table 1 shows the experimental results of different methods on DAVIS2016 [30]. Apart from the performance (measured by J mean), the table also indicates whether each method uses online fine-tuning, optical flow, dense CRF (CRF) and boundary snapping (BS), and it lists the inference time. In particular, compared with most of the competitors, our algorithm has the same or a much shorter processing time, with superior segmentation accuracy. Please note that some methods that use much stronger backbone networks are not listed, for the purpose of fair comparison. Compared with OSVOS [3], for which online fine-tuning is necessary, our method takes only a small fraction of the time to do inference. Compared to the baseline method PML [5], which uses the same feature extractor, our method is twice as fast and performs better. Compared to OSNM [45], at the same efficiency, our method achieves a 3.4 percent improvement in segmentation accuracy.
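The J measure used in these comparisons is the region similarity, i.e. the intersection over union (IoU) between a predicted mask and its ground truth, averaged over frames. A minimal sketch follows; the function names are hypothetical, and treating two empty masks as a perfect match is a convention assumed here rather than something stated in the text.

```python
import numpy as np


def region_similarity(pred_mask, gt_mask):
    """Intersection over union (Jaccard index) of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def j_mean(pred_masks, gt_masks):
    """Average region similarity over the frames of one sequence."""
    return float(np.mean([region_similarity(p, g)
                          for p, g in zip(pred_masks, gt_masks)]))


# Toy usage: the prediction covers two pixels, one of which is correct.
example_j = region_similarity([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```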

Qualitative Results

Figure 5 shows some visualized results of our method. As shown in Figure 5, our method is not only good at recovering object details (e.g. the results on the blackswan sequence), but also robust against heavy occlusion (e.g. the results on the bmx-bumps and libby sequences) and against dramatic movement and abrupt rotation (e.g. the results on the motocross-bumps sequence). However, a few scenarios lead to failure cases (denoted by the red box). These are mainly caused by (noisy) objects that do not appear in the first frame of the video, and they can be easily cured by post-processing steps such as tracking [6] or online adaptation [42, 5].

Figure 7: Visual comparison between the proposed method and other methods. Red boxes denote error regions.

In Figure 7, we show some visualized results compared with OSVOS [3] and PML [5]. For the breakdance, scooter-black and dance-jump sequences, which contain fast motion and abrupt rotation, OSVOS [3] performs worse than PML [5]. For the dog sequence, PML [5] cannot achieve a satisfactory result because of the dramatic change in lighting conditions. In both of these scenarios, however, the proposed method performs better than both OSVOS and PML, which it owes to the robust adaptation ability of our network.

4.3 Results on SegTrack Dataset

Figure 8: Qualitative results: Homogeneous sample of SegTrack sequences with our result overlaid

In Figure 8, some visualized results on the SegTrack [39] dataset are shown. They are obtained by directly using the model trained on the DAVIS2016 dataset. As can be seen, in most cases our model maintains good segmentation accuracy, with a few failure cases (denoted by the red box) that are mainly due to dramatic changes in lighting conditions and to identical appearance of the background and the target object. These results indicate that our method generalizes well and can be quickly adapted to other unseen objects with very few examples (here, only the annotation of the first frame is provided).

5 Ablation Study

5.1 Feature Dimension and Block Splitting

Split No   Feature   Speed   Memory   Computation Cost
1          800       1.50    11590    640k
2          400       1.23    11720    320k
4          200       0.75    11580    160k
8          100       0.86    11584    80k

Table 2: Ablation study on block splitting. The splitting number, feature dimension, running speed, memory and computation cost under different settings are listed.

As mentioned in Section 3.3, since our meta-learning module (ridge regression) requires the computation of a matrix inverse, the training speed varies significantly with the dimension of the features used in this step. Low-dimensional features are usually faster to process but lose some details of the image information. High-dimensional features, by contrast, are time-consuming but carry much richer information. We therefore propose a block splitting mechanism to train the meta learner. Table 2 lists the splitting number (of the feature), the feature dimension, the running speed (per iteration), the memory cost (of the whole network), and the computation cost (of the matrix inverse) under different settings. As can be seen, as the feature dimension decreases, the overall trend is that running becomes faster and the computation cost decreases dramatically. However, the memory cost is reduced only slightly, mainly because the backbone feature extractor takes up most of the memory usage. All numbers are measured on a single GPU (GTX 1080).
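The Computation Cost column of Table 2 appears consistent with counting the entries of the matrices that must be inverted. The short sketch below reproduces that arithmetic under the assumption that the 800 channels are split into 1, 2, 4 or 8 equal blocks; both this interpretation of the column and the function name are assumptions.

```python
def block_inverse_entries(total_dim, num_splits):
    """Number of matrix entries across all diagonal blocks to be inverted."""
    block_dim = total_dim // num_splits
    return num_splits * block_dim * block_dim


# Map each assumed split count to (per-block dimension, total entries).
costs = {s: (800 // s, block_inverse_entries(800, s)) for s in (1, 2, 4, 8)}
for splits, (dim, entries) in costs.items():
    print(splits, dim, entries)
```

Under this reading, doubling the number of blocks halves the number of entries, which matches the 640k, 320k, 160k and 80k values in the table.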

5.2 Per Sequence Performance Analysis

In Figure 6, the per-sequence J mean of different methods is shown. The sequences are sorted according to our algorithm’s performance on each sequence, which provides a more intuitive understanding of the proposed algorithm. First, the proposed method achieves better video segmentation accuracy than many other methods. Second, our algorithm works quite well on most sequences; even on the most challenging sequences, e.g. breakdance and bmx-tree, the J mean is above 0.5. Third, benefiting from the quick adaptation ability of meta-learning, around half of the sequences achieve a J mean over 0.8. Moreover, our method recovers object details well and is robust against fast movement and heavy occlusion, which is in line with our conclusions in Section 4.2.

6 Conclusion

In this paper, we explore applying meta-learning to a video object segmentation system. A closed-form optimizer, i.e. ridge regression, is used to update the meta learner, which achieves fast speed while maintaining superior accuracy. Through iterative meta-learning, the network is capable of conducting fast mapping on unseen objects with only a few examples available. Compared to fine-tuning methods, our algorithm achieves similar performance while requiring only a small fraction of the time, which is appealing for real-world applications. In addition, a block splitting mechanism is introduced to speed up the training process, which also has the benefits of reducing parameters and saving memory. In future work, we would like to use other basic optimizers, such as Newton’s method and logistic regression. Meanwhile, given the flexible design of our meta-learner, instead of inferring the remaining frames from the full annotation of the first frame, inferring the whole object from only a partial annotation or from user feedback is also worth investigating.


  • [1] L. Bao, B. Wu, and W. Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
  • [2] L. Bertinetto, J. F. Henriques, P. H. Torr, and A. Vedaldi. Meta-learning with differentiable closed-form solvers. arXiv preprint arXiv:1805.08136, 2018.
  • [3] S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR 2017. IEEE, 2017.
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [5] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1189–1198, 2018.
  • [6] J. Cheng, Y.-H. Tsai, W.-C. Hung, S. Wang, and M.-H. Yang. Fast and accurate online video object segmentation via tracking parts. arXiv preprint arXiv:1806.02323, 2018.
  • [7] J. Cheng, Y.-H. Tsai, S. Wang, and M.-H. Yang. Segflow: Joint learning for video object segmentation and optical flow. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 686–695. IEEE, 2017.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [9] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence, 28(4):594–611, 2006.
  • [10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
  • [11] V. Garcia and J. Bruna. Few-shot learning with graph neural networks. arXiv preprint arXiv:1711.04043, 2017.
  • [12] J. P. Gee. Deep learning properties of good digital games: how far can they go? In Serious Games, pages 89–104. Routledge, 2009.
  • [13] R. Girshick. Fast r-cnn. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [15] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. arXiv preprint arXiv:1703.03129, 2017.
  • [16] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In Advances in neural information processing systems, pages 109–117, 2011.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [18] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
  • [19] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [20] I. Lenz, H. Lee, and A. Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705–724, 2015.
  • [21] F. Li, T. Kim, A. Humayun, D. Tsai, and J. M. Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
  • [22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [24] K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. arXiv preprint arXiv:1709.06031, 2017.
  • [25] R. H. Myers. Classical and modern regression with applications, volume 2. Duxbury Press Belmont, CA, 1990.
  • [26] D. K. Naik and R. Mammone. Meta-neural networks that learn by learning. In Neural Networks, 1992. IJCNN., International Joint Conference on, volume 1, pages 437–442. IEEE, 1992.
  • [27] I. Nouretdinov, T. Melluish, and V. Vovk. Ridge regression confidence machine. In ICML, pages 385–392, 2001.
  • [28] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
  • [29] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In Computer Vision and Pattern Recognition, volume 2, 2017.
  • [30] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [31] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
  • [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [33] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
  • [34] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. 1998.
  • [35] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. PhD thesis, Technische Universität München, 1987.
  • [36] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
  • [37] J. Sun, D. Yu, Y. Li, and C. Wang. Mask propagation network for video object segmentation. arXiv preprint arXiv:1810.10289, 2018.
  • [38] S. Thrun and L. Pratt. Learning to learn. Springer Science & Business Media, 2012.
  • [39] D. Tsai, M. Flagg, A. Nakazawa, and J. M. Rehg. Motion coherent tracking using multi-label mrf optimization. International journal of computer vision, 100(2):190–202, 2012.
  • [40] Y.-H. Tsai, M.-H. Yang, and M. J. Black. Video segmentation via object flow. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [41] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
  • [42] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364, 2017.
  • [43] S. Wug Oh, J.-Y. Lee, K. Sunkavalli, and S. Joo Kim. Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7376–7385, 2018.
  • [44] Z. Xu, C. Hu, and L. Mei. Video structured description technology based intelligence analysis of surveillance videos for public security applications. Multimedia Tools and Applications, 75(19):12155–12172, 2016.
  • [45] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.