One-shot Video Object Segmentation (VOS) is the task of segmenting an object of interest throughout a video sequence, with many applications in areas such as autonomous systems and robotics. In this task, the first mask of the object's appearance is provided, and the model is supposed to segment that specific object during the rest of the sequence. VOS is a fundamental task in Computer Vision, dealing with various challenges such as handling occlusion, tracking objects of different sizes and speeds, and drastic motion of either the camera or the object [yao2019video]. Within the last few years, video object segmentation has received a lot of attention from the community [caelles2017one, perazzi2017learning, voigtlaender2017online, tokmakov2017learning]. Although VOS has a long history [chang2013video, grundmann2010efficient, marki2016bilateral], only recently has it resurfaced due to the release of large-scale and specialized datasets [pont20172017, perazzi2016benchmark].
To solve VOS, a wide variety of approaches has been proposed in the literature, ranging from training with static images without using temporal information [caelles2017one] to using optical flow to exploit motion information and achieve better temporal consistency [tokmakov2017learning]. However, methods relying solely on static images lack temporal consistency, while using optical flow is computationally expensive and imposes additional constraints.
With the release of YoutubeVOS [xu2018youtub], the largest video object segmentation dataset to date, the authors demonstrated that having enough labeled data makes it possible to train a sequence-to-sequence (S2S) model for video object segmentation. In S2S, an encoder-decoder architecture similar to [badrinarayanan2017segnet] is used. Furthermore, a recurrent neural network (RNN) is employed after the encoder (referred to as the bottleneck) to track the object of interest in a temporally coherent manner.
In this work, we build on top of the S2S architecture due to its simple and elegant design that reaches impressive results compared to the state of the art while remaining efficient [xu2018youtub]. In the YoutubeVOS dataset, there are sequences with up to five objects of various sizes to be segmented. By taking a close look at the S2S behavior, we noticed that it often loses track of smaller objects. In these failure cases, the network predicts the segmentation masks early in the sequence with low confidence (the output of the final sigmoid layer is around 0.5). This uncertainty increases and propagates rapidly to the next frames, resulting in the model losing the object, as shown in Figure 1. The segmentation score of that object is then zero for the rest of the sequence, which has a strong negative impact on the overall performance. We argue that this is partly due to a lack of information in the bottleneck, especially for small objects.
To improve the capacity of the model for segmenting smaller objects, we propose utilizing spatio-temporal information at multiple scales. To this end, we add skip connections that incorporate a memory module (henceforth referred to as skip-memory). Our intuition is based on the role of the ConvLSTM in the architecture, namely remembering the area of interest in the image. Using skip-memory allows the model to track the target object at multiple scales. This way, even if the object is lost at the bottleneck, there is still a chance to track it using information from lower scales.
Our next contribution is the introduction of an auxiliary task for improving the performance of video object segmentation. The effectiveness of multi-task learning has been shown in different scenarios [ruder2017overview], but it has received less attention in video object segmentation. We borrow ideas from Bischke et al. [bischke2019multi], developed for satellite image segmentation, and adapt them to the task at hand. The auxiliary task defined here is distance classification. For this purpose, border classes based on the distance to the edge of the target object are assigned to each pixel of the ground-truth mask.
We adapt the decoder network with an extra branch for the additional distance classification mask and use its output as an additional training signal for predicting the segmentation mask. The distance classification objective provides more specific information about the precise location of each pixel, resulting in a significant improvement of the contour accuracy score (which measures the quality of the segmentation boundaries). The overall architecture is shown in Figure 2. In the rest of the paper, we refer to our method as S2S++.
II Related Work
One-shot video object segmentation can be seen as pixel-wise tracking of target objects throughout a video sequence, where the first segmentation mask is provided, as shown in Figure 1. This field has a long history in the literature [brox2010object]; however, with the rise of deep learning, classical methods based on energy minimization, super-voxels, and graph-based formulations [papazoglou2013fast, jain2014supervoxel, shankar2015video, faktor2014video] were replaced by deep-learning-based approaches. In the following, we provide a brief overview of the state-of-the-art approaches in this domain.
Having the first segmentation mask at hand, two training mechanisms exist in the literature: offline training, which is the standard training process, and online training, which is performed at test time. In online training, heavy augmentation is applied to the first frame of the sequence in order to generate more data, and the network is further trained on that specific sequence [perazzi2017learning, caelles2017one, voigtlaender2017online, Man+18b]. Using online training as an additional training step leads to better results; however, it makes the inference phase quite slow and computationally more demanding.
Regarding offline training, various approaches have been suggested in the literature. Some are based on using static images and extending image segmentation to video [caelles2017one, perazzi2017learning]. In [caelles2017one], the authors use a VGG architecture [simonyan2014very] pre-trained on ImageNet [krizhevsky2012imagenet] and adapt it for video object segmentation. Further offline and online training, accompanied by contour snapping, allows the model to keep the object of interest and discard the rest of the image (classified as background). [perazzi2017learning] treats the task as guided instance segmentation. In this case, the previously predicted mask (the first mask in the beginning, followed by the predicted masks at the next time steps) is used as an additional input, serving as the guidance signal. Moreover, the authors experiment with different types of signals such as bounding boxes and optical flow, demonstrating that even a weak guidance signal such as a bounding box can be effective. In OSMN [yang2018efficient], the authors propose using a visual and a spatial modulator network to adapt the base network to segmenting only the object of interest. The main problem with these methods is that they do not utilize the sequential data and therefore suffer from a lack of temporal consistency.
Another approach taken in the literature is using object proposals based on RCNN-based techniques [he2017mask]. In [luiten2018premvos], the task is divided into two steps: first, generating the object proposals, and second, selecting and fusing the promising mask proposals while trying to enforce temporal consistency by utilizing optical flow. In [li2017video], the authors incorporate a re-identification module based on patch matching to recover from failure cases where the object is lost during segmentation (e.g., as a result of accumulated error and drift in long sequences). These methods achieve a good performance, with the downside of being quite complex and slow.
Before the introduction of a comprehensive dataset for VOS, it was customary to pre-train the model parameters on image segmentation datasets such as PASCAL VOC [everingham2010pascal] and then fine-tune them on video datasets [voigtlaender2017online, wug2018fast]. Khoreva et al. suggest an advanced data augmentation method for video segmentation, including non-rigid motion, to address the lack of labeled data in this domain [khoreva2019lucid]. However, with the release of the YoutubeVOS dataset [xu2018youtub], its authors show that it is possible to train an end-to-end, sequence-to-sequence model for video object segmentation when enough labeled data is available. They deploy a ConvLSTM module [xingjian2015convolutional] to process the sequential data and to maintain temporal consistency.
In [tokmakov2017learning], the authors propose a two-stream architecture composed of an appearance network and a motion network. The results of these two branches are merged and fed to a ConvGRU module before the final segmentation. [ventura2019rvos] extends the spatial recurrence proposed for image instance segmentation [salvador2017recurrent] with temporal recurrence, designing an architecture for zero-shot video object segmentation (without using the first ground-truth mask).
In this paper, we focus on using ConvLSTM for processing sequential information at multiple scales, following the ideas in [xu2018youtub]. In the next sections, we elaborate on our method and proceed with the implementation details followed by experiments as well as an ablation study on the impact of different components and the choice of hyper-parameters.
III Method

In this section, we describe our method, including the modifications to the S2S architecture and the use of our multi-task loss for video object segmentation. The S2S model is illustrated with yellow blocks in Figure 2, where the segmentation mask is computed as (adapted from [xu2018youtub]):

h_0, c_0 = Initializer(x_0, y_0)
x~_t = Encoder(x_t)
h_t, c_t = RNN(x~_t, h_{t-1}, c_{t-1})
y^_t = Decoder(h_t)

with x_t referring to the RGB image and y_t to the binary mask.
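The recurrence above can be sketched with toy stand-ins for each module. The functions below are illustrative placeholders only (not the actual VGG encoder, ConvLSTM, or decoder); they merely show how the initializer seeds the memory state and how the state is carried frame to frame:

```python
import numpy as np

def initializer(x0, y0):
    # Toy stand-in: derive the initial hidden/cell state from the first
    # frame and its ground-truth mask (here simply the masked frame).
    h0 = x0 * y0
    return h0, h0.copy()

def encoder(x):
    # Toy stand-in for the VGG encoder: identity features.
    return x

def rnn(x_tilde, h_prev, c_prev):
    # Toy stand-in for the bottleneck ConvLSTM: an exponential moving
    # average that carries the object information forward in time.
    c = 0.5 * c_prev + 0.5 * x_tilde
    return c, c

def decoder(h):
    # Toy stand-in for the decoder: threshold the state into a mask.
    return (h > 0.5).astype(np.float32)

def s2s_forward(frames, first_mask):
    # y^_t = Decoder(RNN(Encoder(x_t), h_{t-1}, c_{t-1}))
    h, c = initializer(frames[0], first_mask)
    masks = []
    for x in frames[1:]:
        h, c = rnn(encoder(x), h, c)
        masks.append(decoder(h))
    return masks
```

The point of the sketch is the data flow: only the first mask enters the model, and all later masks are produced from the recurrent state.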
III-A Integrating Skip-Memory Connections
We aim to better understand the role of the memory module placed at the center of the encoder-decoder architecture in the S2S method. To this end, we replaced the ConvLSTM with simply feeding the previous mask as guidance for predicting the next one, similar to [perazzi2017learning]. Doing so, we observed a drastic performance drop of about ten percent in the overall segmentation accuracy. This suggests that the guidance signal from the previous segmentation mask alone is not enough, and that features from the previous time step have to be aligned to the current time step. As a result, we hypothesize that the role of the ConvLSTM in the architecture is twofold: first, to remember the object of interest through the recurrent connections and the hidden state, masking out the rest of the scene; and second, to align the features from the previous step to the current step, a role similar to optical flow.
As mentioned earlier, the S2S model incorporates a memory module at the bottleneck of the encoder-decoder network. By inspecting the output of this approach, we noticed that the predicted masks for small objects are often worse than those for other objects (see Figure 4 and Figure 5 for visual examples). The issue is that the target object often gets lost early in the sequence, as shown in Figure 1. We reason that this is partially due to the lack of information for smaller objects in the bottleneck. For image segmentation, this issue is addressed by introducing skip connections between the encoder and the decoder [ronneberger2015u, badrinarayanan2017segnet]. This way, information about small objects and fine details is passed directly to the decoder. Using skip connections is very effective in image segmentation; however, when working with video, if the information in the bottleneck (the input to the memory) is lost, the memory concludes that there is no object of interest in the scene anymore (since the memory provides the information about the target object and its location). As a result, the information in the simple skip connections is not very helpful in this failure mode.
As a solution, we propose a system that keeps track of features at different scales of the spatio-temporal data by using a ConvLSTM in the skip connection, as shown in Figure 2. We note that some technical considerations should be taken into account when employing a ConvLSTM at higher image resolutions. As we move to higher resolutions (lower scales) in the video, the motion is larger and the receptive field of the memory is smaller. As stated in [reda2018sdc], capturing the displacement is limited to the kernel size in kernel-based methods such as ConvLSTMs. Therefore, adding ConvLSTMs at lower scales in the decoder without paying attention to this aspect might have a negative impact on the segmentation accuracy. Moreover, during our experiments we observed that it is important to keep the simple skip connections (without ConvLSTM) intact in order to preserve an uninterrupted flow of gradients. Therefore, we add the ConvLSTM in an additional skip connection (RNN2 in Figure 2) and merge the information from the different branches using a weighted average with learnable weights. Hence, the network can access the information from the different branches in an optimal way.
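The branch merging can be illustrated with a small sketch. A softmax over learnable scalar weights is one plausible normalization; the paper does not specify the exact scheme, so both the function name and the normalization are assumptions:

```python
import numpy as np

def merge_branches(features, weights):
    # Merge same-shaped feature maps from the simple skip connection,
    # the skip-memory branch, and the decoder path with a normalized
    # weighted average; `weights` are learnable scalars (one per branch).
    w = np.exp(weights - np.max(weights))   # softmax normalization
    w = w / w.sum()
    return sum(wi * f for wi, f in zip(w, features))
```

Because the simple skip connection enters the average unmodified, gradients can always flow through it even when the skip-memory branch carries little signal.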
For the training objective of the segmentation branch in Figure 2, we use the sum of the balanced binary-cross-entropy loss [caelles2017one] over a sequence of length T, defined as:

L_seg = - sum_{t=1}^{T} [ beta * sum_{j in Y+} log P(y_j = 1 | X; W) + (1 - beta) * sum_{j in Y-} log P(y_j = 0 | X; W) ]

where X is the input, W are the learned weights, Y+ and Y- are the foreground and background labeled pixels, beta = |Y-| / |Y|, and |Y| is the total number of pixels.
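As a rough NumPy illustration of the balanced weighting for a single frame (the function name `balanced_bce` is ours; the class-balance weight beta is the fraction of background pixels, following [caelles2017one]):

```python
import numpy as np

def balanced_bce(pred, target, eps=1e-7):
    # Balanced binary cross-entropy: background pixels usually dominate,
    # so foreground terms are weighted by beta = |Y-|/|Y| and background
    # terms by 1 - beta = |Y+|/|Y|.
    pred = np.clip(pred, eps, 1.0 - eps)     # numerical stability
    n = target.size
    n_pos = target.sum()
    beta = (n - n_pos) / n                   # fraction of background pixels
    loss_pos = -beta * np.sum(target * np.log(pred))
    loss_neg = -(1.0 - beta) * np.sum((1.0 - target) * np.log(1.0 - pred))
    return loss_pos + loss_neg
```

The weighting keeps the loss from being dominated by the (much larger) background region, which matters most for small objects.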
III-B Border Distance Mask and Multi-Task Objective
As the second extension, we build upon the previous work of Bischke et al. [bischke2019multi] and train the network to predict, in addition to the object segmentation mask, an image representation based on a distance transform (see Figure 3 for an example). This representation was successfully used in a multi-task learning setup to explicitly bias the model to focus more on those pixels which are close to the object boundary and more prone to misclassification than the ones further away from the edge of the object.
In order to derive this representation, we first apply the distance transform to the object segmentation mask. We truncate the distance at a given threshold to only incorporate the nearest pixels to the border. Let Q denote the set of pixels on the object boundary and O the set of pixels belonging to the object mask. For every pixel p we compute the truncated distance as:

d(p) = delta_p * min( min_{q in Q} ||p - q||_2 , R )

where ||p - q||_2 is the Euclidean distance between pixels p and q, and R is the maximal radius (truncation threshold). The pixel distances are additionally weighted by the sign delta_p, which is +1 for pixels inside the object and -1 for pixels outside, to represent whether a pixel lies inside or outside the object. The continuous distance values are then uniformly quantized with a bin size s into k bins per side. Considering both inside and outside border pixels, this yields 2k binned distance classes, as well as two additional classes for pixel distances that exceed the threshold R. We one-hot encode every pixel of this image representation into the classification maps, one per border-distance class.
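A minimal NumPy sketch of this representation follows. The function name `distance_class_map` and the brute-force nearest-boundary search are ours (an efficient implementation would use a proper distance transform); `R` and `bin_size` stand for the truncation radius and bin width:

```python
import numpy as np

def distance_class_map(mask, R=20, bin_size=1):
    # Map a binary object mask to per-pixel border-distance classes.
    # Distances to the object boundary are truncated at R, signed by
    # inside/outside, and quantized with `bin_size`; pixels farther
    # than R fall into one of two extra classes (inside / outside).
    h, w = mask.shape
    # Boundary pixels: object pixels with at least one background 4-neighbor.
    padded = np.pad(mask, 1, constant_values=0)
    neigh_min = np.minimum.reduce([padded[:-2, 1:-1], padded[2:, 1:-1],
                                   padded[1:-1, :-2], padded[1:-1, 2:]])
    boundary = np.argwhere((mask == 1) & (neigh_min == 0))
    k = R // bin_size                       # bins per side
    classes = np.empty((h, w), dtype=np.int64)
    for i in range(h):
        for j in range(w):
            d = np.sqrt(((boundary - [i, j]) ** 2).sum(axis=1)).min()
            b = min(int(min(d, R) // bin_size), k - 1) if d < R else k
            # Classes 0..k: outside (k = beyond R); k+1..2k+1: inside.
            classes[i, j] = b if mask[i, j] == 0 else k + 1 + b
    return classes, 2 * k + 2               # class map, total class count
```

With R = 20 and bin size 1 this yields 2 * 20 + 2 = 42 classes, matching the configuration used in our experiments.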
We optimize the parameters of the network with a multi-task objective, combining the loss for the segmentation mask and the loss for the border distance mask as a weighted sum:

L_total = L_seg + lambda * L_dist

Since we consider a multi-class classification problem for the distance prediction task, we use the cross-entropy loss: L_dist is defined as the cross-entropy between the derived distance output representation and the network output. The loss of the object segmentation task, L_seg, is the balanced binary-cross-entropy loss as defined in Equation 5. The network can be trained end-to-end.
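The combined objective can be sketched as follows (a plain per-pixel cross-entropy over distance classes plus a fixed task weight; the function names and the `(C, H, W)` probability layout are illustrative assumptions):

```python
import numpy as np

def distance_ce(probs, classes, eps=1e-7):
    # Mean cross-entropy over pixels; `probs` has shape (C, H, W) with
    # per-pixel class probabilities, `classes` the target distance class.
    p = np.clip(probs, eps, 1.0)
    h, w = classes.shape
    picked = p[classes, np.arange(h)[:, None], np.arange(w)]
    return -np.log(picked).mean()

def multitask_loss(loss_seg, loss_dist, lam):
    # Fixed task weighting: L_total = L_seg + lambda * L_dist.
    return loss_seg + lam * loss_dist
```

Since both terms are differentiable, the segmentation and distance branches can be trained jointly end-to-end.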
IV Implementation Details
In this section we describe implementation details of our approach.
IV-A Initializer and Encoder Networks
The backbone of the initializer and the encoder networks in Figure 2 is a VGG16 [simonyan2014very] pre-trained on ImageNet [krizhevsky2012imagenet]. The last layer of VGG is removed and the fully-connected layers are adapted to a convolution layer to form a fully convolutional architecture, as suggested in [long2015fully]. The number of input channels of the initializer network is changed to four, as it receives the RGB image and the binary mask of the object as input. The initializer network has two additional convolution layers to generate the initial hidden and cell states of the ConvLSTM at the bottleneck (RNN1 in Figure 2). For initializing the ConvLSTMs at higher scales, up-sampling followed by convolution layers is utilized, in the same fashion as in the decoder. The additional convolution layers are initialized with Xavier initialization [glorot2010understanding].
The ConvLSTMs shown as RNN1 and RNN2 in Figure 2 share the same kernel size and number of channels. The ConvLSTM at the next level uses a bigger kernel size, chosen to account for capturing larger displacements at lower scales in the image pyramid.
The decoder consists of five up-sampling layers with bi-linear interpolation, each followed by a convolution layer with Xavier initialization [glorot2010understanding] and a decreasing number of channels. The features from the skip connections and the skip-memory are merged using a convolution layer. To adapt the decoder to the multi-task loss, an additional convolution layer is used to map the features to the number of distance classes. This layer is followed by a softmax to generate the distance class probabilities. The distance scores are merged into the segmentation branch, where a sigmoid layer is used to generate the binary segmentation masks.
We use the Adam optimizer [kingma2014adam] with a fixed initial learning rate; the task weight lambda in Equation 7 is kept constant across our experiments. When the training loss has stabilized, we decrease the learning rate in regular steps. Due to GPU memory limitations, we train our model with a small batch size and short sequence lengths.
|Method|J seen|J unseen|F seen|F unseen|Overall|
|base model + multi-task loss|67.65|44.62|70.81|49.84|58.23|
|base model + one skip-memory|66.89|46.82|69.22|50.08|58.25|
|base model + one skip-memory + multi-task loss|67.18|47.04|70.24|52.30|59.19|
|base model + two skip-memory + multi-task loss|68.68|48.89|72.03|54.42|61.00|
|border size|bin size|number of classes|overall|
IV-D Border Output Representation
The number of border pixels and the bin size per class are hyper-parameters which determine the resulting number of distance classes. In our internal experiments (see Section V-B), we noticed that better results can be achieved if the number of distance classes is increased. In the following experiments, we set border_pixels=20 and bin_size=1. Thereby we obtain for each object a segmentation mask with 42 distance classes. Having the edge as the center, we have 20 classes at each of the inner and outer borders, plus two additional classes for pixels which do not lie within the borders (inside and outside of the object), as shown in Figure 3.
IV-E Data Pre- and Post-Processing
In line with previous work on multiple-object video segmentation, we follow a training pipeline in which every object is tracked independently, and at the end the binary masks from the different objects are merged into a single mask. For pixels with overlapping predictions, the label from the object with the highest probability is taken. For data loading during the training phase, each batch consists of a single object from a different sequence. We noticed that processing multiple objects of the same sequence degrades the performance. The images and the masks are resized to a fixed resolution as suggested in [xu2018youtub]. For data augmentation, we use random horizontal flipping and affine transformations. For the results provided in Section V, we have not used any refinement step (e.g. CRF [krahenbuhl2011efficient]) or inference-time augmentation. Moreover, we note that pre-training on image segmentation datasets can greatly improve the results due to the variety of object categories present in these datasets. However, in this work we have solely relied on pre-trained weights from ImageNet [krizhevsky2012imagenet].
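The merging step can be sketched as follows (per-object foreground probabilities are stacked and resolved by argmax; the function name and the background `threshold` parameter are illustrative assumptions):

```python
import numpy as np

def merge_object_masks(prob_maps, threshold=0.5):
    # `prob_maps`: one foreground-probability map per tracked object,
    # each of shape (H, W). Overlapping pixels are assigned to the
    # object with the highest probability; pixels below `threshold`
    # for every object become background (label 0).
    probs = np.stack(prob_maps)             # (N, H, W)
    best = probs.argmax(axis=0)             # winning object per pixel
    return np.where(probs.max(axis=0) > threshold, best + 1, 0)
```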
V Experiments and Results
In this section a comparison with the state-of-the-art methods is provided in Table I. Additionally, we perform an ablation study in Table II to examine the impact of skip-memory and multi-task loss in our approach.
We evaluate our method on the YoutubeVOS dataset [xu2018youtub], which is currently the largest dataset for video object segmentation. We use the standard evaluation metrics [perazzi2016benchmark], reporting Region Similarity (J) and Contour Accuracy (F). J corresponds to the average intersection over union between the predicted segmentation masks and the ground-truth, and F is defined as the F-measure of the contour precision and recall, computed over the boundary pixels after applying sufficient dilation to the object edges. For overall comparability, we use the overall metric of the dataset [xu2018youtub], which refers to the average of the J and F scores for seen and unseen object categories.
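As a sketch, the region similarity and the overall score reduce to the following (the official F computation with boundary dilation is omitted here; note that averaging the four sub-scores of a row of Table II reproduces its overall value):

```python
import numpy as np

def region_similarity(pred, gt):
    # J: intersection over union between predicted and ground-truth masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def overall_score(j_seen, j_unseen, f_seen, f_unseen):
    # YoutubeVOS overall metric: mean of the four J/F sub-scores.
    return (j_seen + j_unseen + f_seen + f_unseen) / 4.0
```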
V-A Comparison to State-of-the-Art Approaches
In Table I, we provide a comparison to state-of-the-art methods with and without online training. As mentioned in Section II, online training is the process of further training at test time, applying heavy data augmentation to the first mask to generate more data. This phase greatly improves the performance at the expense of slowing down the inference phase. As can be seen in Table I, the scores are measured separately for seen and unseen object categories. This is a difference between other datasets and YoutubeVOS [xu2018youtub], whose validation set contains new object categories. Specifically, the validation set of the YoutubeVOS dataset includes videos with 65 seen and 26 unseen categories. The score for unseen categories serves as a measure of the generalization of different models. As expected, the unseen object categories achieve a higher score when using online training (since the object is effectively seen by the network during online training). However, despite not using online training (and therefore also having lower computational demands at test time), S2S and S2S++ achieve a higher overall performance. It is worth mentioning that both the J and F scores improve by more than 4 percentage points in our approach.
V-B Ablation Study
Since the S2S method is the basis of our work and its source code is not available, we provide a comparison between our implementation of S2S and S2S++ in Table II, adding one component at a time. As can be seen from the results, the best performance is achieved when using two skip-memory modules together with the multi-task loss. We then take this model and experiment with different hyper-parameters for the multi-task loss, as shown in Table III. The results show that a higher number of border classes, which brings the formulation closer to regression, yields a higher overall score.
It is worth mentioning that the distance loss (L_dist) has less impact on small objects, especially if the diameter of the object is below the border size (in this case, no extra distance classes are added). Hence, we suspect the improvement in segmenting small objects (shown in Figure 4 and Figure 5) is mainly due to the use of skip-memory connections.
VI Conclusion

In this work, we observed that the S2S method often fails when segmenting small objects. We build on top of this approach and propose using skip-memory connections to utilize multi-scale spatio-temporal information of the video data. Moreover, we incorporate a distance-based multi-task loss to improve the predicted object masks for video object segmentation. In our experiments, we demonstrate that this approach outperforms state-of-the-art methods on the YoutubeVOS dataset [xu2018youtub]. Our extensions to the S2S model require minimal changes to the architecture and greatly improve the contour accuracy score (F) and the overall metric.
One of the limitations of the current model is a performance drop for longer sequences, especially in the presence of multiple objects in the scene. In future work, we would like to study this aspect and investigate the effectiveness of attention as a potential remedy. Furthermore, we would like to study the multi-task loss in more detail. One interesting direction is to learn separate task weights for the segmentation and distance prediction tasks as in [bischke2019multi], rather than using fixed task weights as in our work. In this context, we would also like to examine the use of a regression task rather than a classification task for predicting the distance to the object border.
This work was supported by the TU Kaiserslautern CS PhD scholarship program, the BMBF project DeFuseNN (Grant 01IW17002) and the NVIDIA AI Lab (NVAIL) program. Further, we thank all members of the Deep Learning Competence Center at the DFKI for their feedback and support.