MetaSeg
Meta Learning with Differentiable Closed-form Solver for Fast Video Object Segmentation, IROS2020
This paper tackles the problem of video object segmentation. We are specifically concerned with the task of segmenting all pixels of a target object in all frames, given the annotation mask in the first frame. Even when such annotation is available, this remains a challenging problem because the appearance and shape of the object change over time. We tackle this task by formulating it as a meta-learning problem, in which the base learner captures the semantic scene understanding for general types of objects, and the meta learner quickly adapts to the appearance of the target object from a few examples. Our proposed meta-learning method uses a closed-form optimizer, ridge regression, which has been shown to be conducive to fast and better training convergence. Moreover, we propose a mechanism, named "block splitting", to further speed up the training process and to reduce the number of learned parameters. In comparison with state-of-the-art methods, our proposed framework achieves a significant boost in processing speed while remaining very competitive with the best-performing methods on widely used datasets.
Fast and accurate video object segmentation plays an important role in many real-world applications, including, but not limited to, film making [12], public surveillance [44], and robotic vision [20].
The goal of video object segmentation is to distinguish an object of interest over video frames from its background at the pixel level.
In contrast to many vision tasks such as image classification [17, 28] and object detection [32, 13], for which the performance of algorithms has reached the point of being suitable for real-world applications, the performance of video object segmentation algorithms is still far below that of human annotation [30]. This is mainly because, unlike the aforementioned tasks, this problem does not benefit from the availability of a massive corpus of training data.
Recently, deep learning-based approaches have shown promising progress on the video object segmentation task [3, 42, 24, 37]. However, they still struggle to achieve both good accuracy and fast inference. In this paper, we aim to bridge this gap.
Inspired by the meta-learning method of [2], we propose an intuitive yet powerful algorithm for video object segmentation, in which the reference frame is available with its annotated mask. Our objective is to train a system that can “adapt” this annotation information to subsequent frames in a fast yet flexible way at inference time. Specifically, at inference time the reference frame (i.e. the one with ground-truth annotation) is mapped to a feature in a high-dimensional embedding space using a CNN. We then determine, using ridge regression [25], the coefficients of a matrix W that best maps this feature to the ground truth. W is then the video-specific “adaptor”, and it maps the features of every query image (every other image in the video sequence) to their predicted segmentation masks. Training comprises the process of learning the embedding by presenting the network with pairs of images (from a variety of videos, but with each pair coming from the same video), each with ground-truth annotation, and back-propagating the loss. This is illustrated in Figure 3 and described in more detail later in the paper.
We observe that a limitation of the proposed approach is that ridge regression scales poorly with the dimension of the feature produced by the CNN, because the optimization requires a huge matrix inversion. We address this through the use of a “block splitting” method that approximates the matrix in block diagonal form, meaning the inversion can be done much more efficiently.
Our main contributions are three-fold:
A meta-learning based method for video object segmentation is developed, using a closed form solver (ridge regression) as the internal optimizer. This is capable of performing fast gradient back-propagation and can adapt to previously unseen objects quickly with very few samples. Inference (i.e. segmentation of the video) is a single forward pass per frame with no need for fine-tuning or post-processing.
Ridge regression in high-dimensional feature spaces can be very slow, because of the need to invert a large matrix. We address this using a novel block splitting mechanism which we show greatly speeds the training process without damaging the performance.
We demonstrate state-of-the-art video segmentation accuracy relative to all other methods of comparable processing time, and even better accuracy than many slower ones (see Figure 1).
) to compute an 800D embedding tensor. Then a mapping matrix W between the reference feature and its annotation is calculated in the reference frame (Eq. 1) using ridge regression. After that, the prediction for the query frame is acquired by multiplying the query feature and W (Eq. 2). During training, the loss between the prediction and the query annotation is back-propagated to enhance the network’s ability to adapt between the reference frame and the query frame. During inference, the reference frame (with its annotation) is always the first frame, and the query images are the remaining frames of the same video. Through iterative meta-learning, our network is capable of quickly adapting to unseen target object(s) with a few examples.
The goal of video object segmentation is to ‘cut out’ the target object(s) from the entire input video sequence. Regarding the amount of supervision utilized, methods can be roughly divided into two groups: semi-supervised and unsupervised methods.
For semi-supervised video object segmentation, the annotated mask of the first frame is given, and the algorithm is designed to predict the masks of the remaining frames in the video. There are three categories of such methods. The first, which includes MSK [29], MPNVOS [37] etc., uses optical flow to track the mask from the previous frame to the current frame. Similarly, the second category formulates optical flow and segmentation as two parallel branches and utilizes the predicted mask from the previous frame as guidance; representatives include Segflow [7], VSOF [40], RGMP [43], OSNM [45] etc. The final category, which holds the state-of-the-art performance on the DAVIS benchmark [30], tries to over-fit the appearance of the target object(s) and expects the model to generalize to subsequent frames. Specifically, OSVOS [3] uses a one-shot learning mechanism, fine-tuning on the first frame of the test video to capture the appearance of the target object(s) and then conducting inference on the remaining frames. The drawbacks of OSVOS are: (1) it cannot adapt to unseen parts; and (2) when dramatic changes of appearance happen in subsequent frames, its performance degrades significantly. Inspired by the overall design of OSVOS, several follow-up methods employ various additional ingredients to improve the segmentation accuracy. In particular, OSVOS-S [24] incorporates semantic instance information to remove noisy objects coming from the same category. OnVOS [42] utilizes an online adaptation mechanism to overcome the limits of OSVOS when drastic appearance changes occur. CINM [1] utilizes a CNN-based Markov random field (MRF) to estimate the probabilities of pixels belonging to the target object(s) in the spatial domain, and employs optical flow to track segmented pixels in the temporal domain. Although these methods improve the segmentation performance of OSVOS to some extent, they are still time-consuming during inference, since online fine-tuning is necessary. Moreover, at least a dense CRF [16], and often further techniques, are usually applied as post-processing to obtain better segmentation results.
In this paper, we mainly target fast video object segmentation. Since no optical flow or fine-tuning is used, the proposed method is appropriate for real-world applications.
Meta learning is also called learning to learn [35, 26, 38], because its goal is to enable the machine to learn quickly, especially when very few samples are available for the new task(s). Generally speaking, meta learning algorithms are composed of two components: a base learner and a meta learner. The base learner is mainly in charge of handling individual tasks, while the meta learner acts much like a coordinator: by learning across individual tasks, it boosts the performance of the base learners on all of them.
Meta-learning is an alternative to the de facto solution that has emerged in deep learning of pre-training a network using a large, generic dataset (e.g. ImageNet [8]) followed by fine-tuning with a problem-specific dataset. Meta-learning aims to replace the fine-tuning stage (which can still be very expensive) by training a network that has a degree of plasticity, so that it can adapt rapidly to new tasks. For this reason it has become a very active area recently, especially with regard to one-shot and few-shot learning problems [18, 9].
Recent approaches to meta-learning can be roughly put into three categories: (i) metric learning for acquiring similarities [41, 36, 11]; (ii) learning optimizers for gaining update rules [10, 31]; and (iii) recurrent networks for preserving memory [33, 15]. In this work, we adopt a meta-learning algorithm that belongs to the category of learning optimizers. Specifically, inspired by [2], which was originally designed for image classification, we adopt ridge regression, which provides a closed-form solution to the optimization problem. The reason for using it is that, compared with the widely used SGD [19] in CNNs, ridge regression can propagate gradients efficiently, which matches the goal of fast mapping. Through extensive experiments, we demonstrate that the proposed method is among the fastest methods for video object segmentation, while obtaining more accurate results without any post-processing.
A few previous methods have been proposed to tackle fast video object segmentation. In particular, FAVOS [6] first tracks part-based detections. Then, based on the tracked boxes, it generates part-based segments and merges those parts according to a similarity score to form the final segmentation results. The drawback of FAVOS is that it cannot be learned in an end-to-end manner, and it relies heavily on the part-based detection performance. OSNM [45] proposes a model composed of a modulator and a segmentation network. By encoding the mask prior, the modulator helps the segmentation network quickly adapt to the target object. RGMP [43] shares the same spirit as OSNM. Specifically, it employs a Siamese encoder-decoder structure for mask propagation, and further boosts the performance with synthetic data. The most similar work to ours is PML [5], which formulates the problem as a pixel-wise metric learning problem. Through an FCN [23], it maps the pixels to a high-dimensional space, and utilizes a revised triplet loss to encourage pixels belonging to the same object to lie much closer together than those belonging to different objects. Nearest neighbor (NN) retrieval is required during inference. In contrast, our meta-learning approach acquires a mapping matrix between the high-dimensional feature and the annotated mask in the reference image using ridge regression, and can then be adapted rapidly to generate the prediction mask. Compared to the baseline method PML [5], our method is more accurate and twice as fast. At the same efficiency, the J mean of our method is 3.4 percent better than that of OSNM [45] on the DAVIS2016 [30] validation set.
We formulate video object segmentation as a meta-learning problem. For each image pair, which comes from the same video, ridge regression is used as the optimizer to learn the base learner. The meta learner is built naturally through the training process. Once the meta learner is learned, it possesses the ability of fast mapping between image features and object masks, and can be adapted to unseen objects quickly with the help of the reference image.
According to how the user input is involved in the training loop, existing methods can be classified into three categories.
User input outside the network training loop This category utilizes the user input to fine-tune the network to over-fit the appearance cues of the target object(s) during inference. Representatives are OSVOS [3] and its follow-up works [24, 1, 42]. Since online fine-tuning is required during inference, these algorithms are time-consuming, usually taking seconds per image, and are thus not practical for real-world applications.
User input within the network training loop This category of work injects the user input as an additional input for training the network. In this way, no online fine-tuning is needed. These algorithms incorporate the user input either by using a parallel network or by concatenating the image with the user input [43, 45]. One drawback of this kind of method is that the model needs to be recalculated once the user input changes, so it is not practical for adaptation, especially for long videos.
User input is detached from the network training loop In contrast to the previous methods, our algorithm shares the same spirit as PML [5] in design. The network and the user input are detached, so the user input can be more flexible. Moreover, once the user input is given (for example, the annotation in the reference image), the network can quickly adapt to the target objects without any extra operations.
For simplicity, we assume the single-object segmentation case, with the annotation of the first frame given as the user input. Note that our method can also be applied to multiple objects and easily extended to other types of user input, e.g. scribbles or clicks.
We adopt the following notation:
denotes the number of feature channels (in our case 800).
denote the spatial resolution of the extracted features (in our case 1/8th of the original image size).
are the feature tensors of size produced by
is a flattened tensor of or , with shape
is the flattened tensor of annotation mask or , with shape
W denotes the mapping matrix between the feature space and annotation mask.
As noted above, there are two components to the learner: (i) an embedding model that maps images to a high-dimensional feature space; and (ii) an adaptor W, found using ridge regression, that maps the embedded features to a (flattened) segmentation mask.
Embedding Model We adopt DeeplabV2 [4] built on the ResNet-101 [14] backbone structure as our feature extractor. This choice allows a direct comparison of our method with the baseline, PML [5]. First, we use the model pretrained on the COCO [22] dataset as the initialization for semantic segmentation. Then the ASPP [4] layer for classification is removed and replaced by our video-specific mapping W.
Ridge Regression
Ridge regression is a closed-form solver that is widely used in the machine learning community [34, 27]. The learner seeks the mapping matrix W that minimizes the following objective:
(1)
Here the features and mask are as defined above, and the regularization parameter is set to 5.0 in all of our experiments. As can be seen in Figure 3, during training an image pair and the corresponding annotations are sampled from the same video sequence. The feature extracted from the reference image (in the figure this is the first image) and its annotation are used to calculate the mapping matrix W.
(2)
(where we abuse notation and use the unflattened feature tensors for clarity)
For the query image, we likewise compute the feature and map it to the predicted segmentation mask using Equation 2, in which W is the matrix computed from the reference image and its ground truth. The loss between the predicted mask and the annotation of the query provides the back-propagation signal that improves the feature extractor’s ability to produce adaptable features.
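For reference, the standard ridge regression problem that this description corresponds to can be written out explicitly. The symbol names below are assumptions introduced for illustration: $X_r \in \mathbb{R}^{n \times C}$ and $Y_r \in \mathbb{R}^{n \times 1}$ stand for the flattened reference features and mask ($n$ pixels, $C$ channels), $X_q$ for the flattened query features, $\lambda$ for the regularization parameter, and $I$ for the $C \times C$ identity matrix.

```latex
\begin{aligned}
W^{*} &= \arg\min_{W} \; \lVert X_r W - Y_r \rVert_2^2 + \lambda \lVert W \rVert_2^2 \\
      &= \left( X_r^{\top} X_r + \lambda I \right)^{-1} X_r^{\top} Y_r, \\
\hat{Y}_q &= X_q W^{*}.
\end{aligned}
```

In this form the matrix being inverted is $C \times C$, so the cost of the solve is governed by the feature dimension rather than by the number of pixels.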
During inference, the reference frame is always the first frame, for which the annotation mask is provided, and the query frames are the remaining frames of the same video.
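The reference-to-query adaptation described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `ridge_fit` and `predict_logits` are hypothetical, random arrays stand in for CNN features, and the toy feature dimension is far smaller than the 800 channels used in the paper. Only the regularization value 5.0 is taken from the text.

```python
import numpy as np


def ridge_fit(x_ref, y_ref, lam=5.0):
    """Closed-form ridge regression: W = (X^T X + lam * I)^-1 X^T Y.

    x_ref: (n_pixels, channels) flattened reference features.
    y_ref: (n_pixels, 1) flattened reference mask.
    """
    channels = x_ref.shape[1]
    gram = x_ref.T @ x_ref + lam * np.eye(channels)
    # Solving the linear system avoids forming the inverse explicitly.
    return np.linalg.solve(gram, x_ref.T @ y_ref)


def predict_logits(x_query, w):
    """Map flattened query features to per-pixel mask scores."""
    return x_query @ w


# Toy stand-ins for embedded frames (assumed sizes, random values).
rng = np.random.default_rng(0)
n_pixels, channels = 64, 8
x_ref = rng.normal(size=(n_pixels, channels))
y_ref = (rng.random((n_pixels, 1)) > 0.5).astype(np.float64)
x_query = rng.normal(size=(n_pixels, channels))

w = ridge_fit(x_ref, y_ref)                # adaptor from the reference frame
mask_logits = predict_logits(x_query, w)   # scores for the query frame
```

At inference time the adaptor would be computed once from the first frame and reused for every remaining frame, which is why no fine-tuning step is needed.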
Thanks to ridge regression, the computation of the mapping matrix and the gradient back-propagation are already very fast compared with other video object segmentation algorithms.
(3)
During the experiments, we found that the higher the dimension of the feature used as input to the meta-learning module, the more accurate the segmentation results are likely to be. However, we also observed that the higher the feature dimension, the slower the training process. Specifically, the computation of the mapping matrix W involves a matrix inverse, as denoted by Equation 3, which becomes the bottleneck for fast propagation when a very high-dimensional feature is used.
In order to further speed up the training process of the proposed network, we introduce a block splitting mechanism, whose working principle is shown in Figure 4. Our motivation is that the matrix inverse computation for a high-dimensional feature (e.g. 800D) can be approximated by the sum of the computations for relatively low-dimensional features (e.g. 200D × 4). In effect, the matrix is approximated by four independent diagonal blocks.
The advantages of the proposed block splitting mechanism are twofold. Firstly, it greatly speeds up the matrix inverse involved in ridge regression, which reduces the training time. Secondly, through the matrix approximation step described above, the number of network parameters involved in the ridge regression and the memory used by our network are reduced. The experimental evidence can be found in the Ablation Study (Section 5).
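One plausible reading of block splitting is sketched below: the feature channels are divided into equal groups, an independent ridge regression is solved per group, and the per-group solutions are stacked, which is equivalent to replacing the regularized Gram matrix by its block-diagonal part. This is an interpretation of the description above rather than the authors' code; the function name `block_ridge_fit` and the toy sizes are assumptions.

```python
import numpy as np


def block_ridge_fit(x_ref, y_ref, num_blocks=4, lam=5.0):
    """Ridge regression with a block-diagonal approximation of the Gram matrix.

    The channels are split into `num_blocks` equal groups and each group is
    solved independently, so only small matrices are ever inverted.
    """
    channels = x_ref.shape[1]
    if channels % num_blocks != 0:
        raise ValueError("channels must be divisible by num_blocks")
    size = channels // num_blocks
    w_blocks = []
    for b in range(num_blocks):
        xb = x_ref[:, b * size:(b + 1) * size]
        gram_b = xb.T @ xb + lam * np.eye(size)
        w_blocks.append(np.linalg.solve(gram_b, xb.T @ y_ref))
    # Stacking the blocks gives a (channels, 1) matrix, so the prediction
    # x_query @ w equals the sum of the per-block predictions.
    return np.concatenate(w_blocks, axis=0)


rng = np.random.default_rng(0)
x_ref = rng.normal(size=(128, 16))   # toy features; the paper uses 800 channels
y_ref = (rng.random((128, 1)) > 0.5).astype(np.float64)

w_block = block_ridge_fit(x_ref, y_ref, num_blocks=4)
w_full = block_ridge_fit(x_ref, y_ref, num_blocks=1)  # no splitting
```

With `num_blocks=1` the function reduces to ordinary ridge regression, which makes it easy to compare the approximate and exact solutions on the same data.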
Training Strategy For training, the optimizer is SGD with momentum 0.9 and weight decay 5e-4. We use DeepLabV2 [4] with a ResNet-101 [14] backbone as the feature extractor, and a constant learning rate of 1.0e-5 is used during the whole training process. The feature extractor outputs an 800-dimensional feature, which is used as the input to the meta-learning module.
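To make the stated hyperparameters concrete, the following sketch applies a single SGD update with momentum and weight decay to a toy parameter vector. The update rule shown (weight decay added to the gradient, then a momentum buffer) is the convention commonly used by deep learning libraries and is assumed here; the function name `sgd_momentum_step` and the toy values are hypothetical.

```python
import numpy as np

LEARNING_RATE = 1.0e-5   # constant learning rate from the text
MOMENTUM = 0.9           # momentum from the text
WEIGHT_DECAY = 5e-4      # weight decay from the text


def sgd_momentum_step(param, grad, velocity):
    """One SGD step with momentum and L2 weight decay (assumed convention)."""
    grad = grad + WEIGHT_DECAY * param          # weight decay as an L2 term
    velocity = MOMENTUM * velocity + grad       # momentum buffer update
    param = param - LEARNING_RATE * velocity    # parameter update
    return param, velocity


param = np.array([1.0, -2.0])
velocity = np.zeros_like(param)
grad = np.array([2.0, 0.5])
param, velocity = sgd_momentum_step(param, grad, velocity)
```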
Loss BCEWithLogitsLoss (https://pytorch.org/docs/stable/nn.html) is employed for training the proposed network. It is essentially a combination of a Sigmoid layer and the binary cross entropy (BCE) loss, and it benefits from the log-sum-exp trick for numerical stability. Compared to the plain BCE loss, it is more robust and less likely to cause numerical problems when computing the inverse matrix in the ridge regression step.
(4)
Here the loss is computed over the batch: its inputs are the logits produced by the network and the ground truth labels, and a rescaling weight is given to the loss of each batch element.
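The numerical-stability argument can be illustrated with a small NumPy version of a sigmoid-plus-BCE loss. The per-element loss is -w_n [y_n log σ(x_n) + (1 - y_n) log(1 - σ(x_n))], the form documented for PyTorch's BCEWithLogitsLoss; the rewriting below evaluates it without ever computing σ(x) directly. The function name `bce_with_logits` is hypothetical, and averaging over elements is an assumption (the text does not state the reduction).

```python
import numpy as np


def bce_with_logits(logits, targets, weight=None):
    """Numerically stable binary cross entropy on raw logits.

    Uses the identity
        -[y*log(s(x)) + (1-y)*log(1-s(x))] = max(x, 0) - x*y + log(1 + exp(-|x|)),
    which never exponentiates a large positive number.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    per_elem = (np.maximum(logits, 0.0) - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    if weight is not None:
        per_elem = per_elem * np.asarray(weight, dtype=np.float64)
    return float(per_elem.mean())  # mean reduction is an assumption


moderate = bce_with_logits([0.5, -1.2], [1.0, 0.0])
extreme = bce_with_logits([1000.0], [0.0])  # a naive sigmoid + log would overflow
```

A naive implementation that first applies the sigmoid and then takes the logarithm would return infinity for the second call, whereas the stable form stays finite.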
Method | DAVIS16 | Online-Tuning | OptFlow | CRF | BS | Speed(s) |
---|---|---|---|---|---|---|
OFL | 68.0 | - | ✗ | ✓ | ✗ | 42.2 |
BVS | 60.0 | - | ✗ | ✗ | ✗ | 0.37 |
ConvGRU | 70.1 | ✗ | ✓ | ✗ | ✗ | 20 |
VPN | 70.2 | ✗ | ✗ | ✗ | ✗ | 0.63 |
MaskTrack-B | 63.2 | - | ✗ | ✗ | ✗ | 0.24 |
SFL-B | 67.4 | ✗ | ✓ | ✗ | ✗ | 0.30 |
OSVOS-B | 52.5 | ✗ | ✗ | ✗ | ✗ | 0.14 |
OSNM | 72.2 | ✗ | ✗ | ✗ | ✗ | 0.14 |
PML | 75.5 | ✗ | ✗ | ✗ | ✗ | 0.28 |
Ours | 75.8 | ✗ | ✗ | ✗ | ✗ | 0.145 |
PLM | 70.0 | ✓ | ✗ | ✗ | ✗ | 0.50 |
SFL | 74.8 | ✓ | ✗ | ✗ | ✗ | 7.9 |
MaskTrack | 69.8 | ✓ | ✗ | ✗ | ✗ | 12 |
OSVOS | 79.8 | ✓ | ✓ | ✗ | ✓ | 10 |
We evaluate on DAVIS2016, which contains 50 pixel-level annotated video sequences, each containing a single target object to segment. Of these 50 sequences, 30 form the training set, for which an annotated mask is provided for every frame, and the other 20 form the validation set, for which only the annotation of the first frame may be accessed.
Quantitative Results Table 1 shows the experimental results of different methods on DAVIS2016 [30]. Apart from the performance (measured by J mean), the table indicates whether each method uses online fine-tuning, optical flow, dense CRF (CRF) and boundary snapping (BS), together with the inference time. Compared with most of the competitors, our algorithm has the same or a much faster processing time with superior segmentation accuracy. Please note that some methods that use much stronger backbone networks are not listed, for the purpose of fair comparison. Compared with OSVOS [3], for which online fine-tuning is necessary, our method takes only a small fraction of the time for inference. Compared to the baseline method PML [5], which uses the same feature extractor, our method is twice as fast and performs better. Compared to OSNM [45], at the same efficiency, our method achieves a 3.4 percent improvement in segmentation accuracy.
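The J measure used in Table 1 is the region similarity of the DAVIS benchmark, i.e. the intersection-over-union between the predicted and ground-truth masks, averaged to give the J mean. A minimal sketch follows; the function names are hypothetical, the toy masks are made up for illustration, and treating two empty masks as a perfect match is an assumed convention.

```python
import numpy as np


def jaccard(pred_mask, gt_mask):
    """Intersection-over-union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(np.logical_and(pred, gt).sum() / union)


def j_mean(pred_masks, gt_masks):
    """Average the per-frame Jaccard index over a sequence of masks."""
    return float(np.mean([jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]))


# Two toy 2x2 frames: the first overlaps on 1 of 2 pixels, the second is exact.
pred = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
gt = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
score = j_mean(pred, gt)
```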
Qualitative Results
Figure 5 shows some visualized results of our method. As shown in Figure 5, our method is not only good at recovering object details (e.g. the results on the blackswan sequence), but also robust against heavy occlusions (e.g. the results on the bmx-bumps and libby sequences), dramatic movement and abrupt rotation (e.g. the results on the motocross-bumps sequence). However, there are a few scenarios that lead to failure cases (denoted by the red box). These are mainly caused by (noisy) objects that do not appear in the first frame of the video, and can be addressed by post-processing steps such as tracking [6] or online adaptation [42, 5].
In Figure 7, we show some visualized results compared with OSVOS [3] and PML [5]. For the breakdance, scooter-black and dance-jump sequences, which contain fast movement and abrupt rotation, OSVOS [3] performs worse than PML [5]. For the dog sequence, PML [5] cannot achieve a satisfactory result due to the dramatic change in lighting conditions. In both scenarios, however, the proposed method performs better than both OSVOS and PML, which benefits from the robust adaptation ability of our network.
In Figure 8, some visualized results on the SegTrack [39] dataset are shown. These are acquired by directly applying the model trained on the DAVIS2016 dataset. As can be seen, in most cases our model maintains good segmentation accuracy, with a few failure cases (denoted by the red box), which are mainly due to dramatic changes in lighting conditions and the same appearance being shared by the background and the target object. These results indicate that our method generalizes well and can be quickly adapted to other unseen objects with very few examples (here, only the annotation in the first frame is provided).
Split No | Feature | Speed | Memory | Computation Cost |
---|---|---|---|---|
1 | 800 | 1.50 | 11590 | 640k |
2 | 400 | 1.23 | 11720 | 320k |
4 | 200 | 0.75 | 11580 | 160k |
8 | 100 | 0.86 | 11584 | 80k |
As mentioned in Section 3.3, since our meta-learning module (ridge regression) requires the computation of a matrix inverse, the training speed varies significantly with the dimension of the features used in this step. Low-dimensional features are usually faster to process but lose some details of the image information, whereas high-dimensional features are time-consuming but carry much richer information. We therefore propose a block splitting mechanism to train the meta learner. Table 2 lists the splitting number (of the feature), feature dimension, running speed (per iteration), memory cost (of the whole network), and computation cost (of the matrix inverse) under different settings. As can be seen, as the feature dimension decreases, the overall trend is that the running speed increases and the computation cost decreases dramatically. However, the memory cost reduces only slightly, mainly because the backbone feature extractor takes up most of the memory usage. All numbers were measured on a single GPU (GTX 1080).
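The computation cost column of Table 2 is consistent with counting the entries of the matrices being inverted, i.e. k blocks of size (800/k) × (800/k) for a split number k. This reading is inferred from the numbers in the table rather than stated in the text, and the function name `inverse_cost` is hypothetical.

```python
def inverse_cost(feature_dim, num_splits):
    """Total number of entries in the per-block matrices that are inverted."""
    block = feature_dim // num_splits
    return num_splits * block * block


# Split numbers from Table 2 with the 800-dimensional feature.
costs = {k: inverse_cost(800, k) for k in (1, 2, 4, 8)}
for k, cost in costs.items():
    print(f"splits={k}: block size {800 // k}, cost {cost // 1000}k")
```

Under this reading the cost halves each time the split number doubles, which matches the 640k, 320k, 160k and 80k entries reported in the table.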
In Figure 6, the per-sequence J mean of different methods is shown. The sequences are sorted according to our algorithm’s performance on each sequence, which provides a more intuitive understanding of the proposed algorithm. Firstly, the proposed method achieves better video segmentation accuracy than many other methods. Secondly, our algorithm works quite well on most sequences; even on the most challenging sequences, e.g. breakdance and bmx-tree, the J mean is above 0.5. Thirdly, benefiting from the quick adaptation ability of meta-learning, around half of the sequences achieve a J mean over 0.8. Moreover, our method can recover object details well and is robust against fast movement and heavy occlusion, which is aligned with our conclusion in Section 4.2.
In this paper, we explore applying meta-learning to video object segmentation. A closed-form optimizer, i.e. ridge regression, is utilized to update the meta learner, which achieves fast speed while maintaining superior accuracy. Through iterative meta-learning, the network is capable of conducting fast mapping on unseen objects with only a few examples available. Compared to fine-tuning methods, our algorithm achieves similar performance while requiring only a small fraction of the time, which is appealing for real-world applications. In addition, a block splitting mechanism is introduced to speed up the training process, which also has the benefits of reducing parameters and saving memory. In future work, we would like to use other basic optimizers, such as Newton’s method and logistic regression. Meanwhile, given the flexible design of our meta-learner, instead of inferring the remaining frames from the complete annotation of the first frame, inferring the whole object from only a partial annotation or user feedback is also worth investigating.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.