Semi-Supervised Video Object Segmentation is the task of segmenting specific objects of interest in a video sequence, given their segmentation in the first frame. This poses an interesting challenge for standard supervised methods: the model cannot be trained to discriminate between a fixed set of classes based on semantics, but rather has to learn to segment unseen objects from a single annotated instance. Classic supervised techniques fail to generalize easily to new objects whose traits are potentially very different from the training data, a problem known in the literature as “domain adaptation”.
Most of the proposed approaches cast the problem as a one-shot learning task: after a first general pre-training on the entire dataset, at inference time the generic model is adapted into an object-specific one by fine-tuning on transformations of the first frame for each test sequence [2, 15]. This procedure is computationally expensive, requiring several extra steps of back-propagation for each sequence. Furthermore, it heavily depends on the ability of the data augmentation procedure to produce realistic sequences. Indeed, generating high-quality sequences is clearly a very hard and ill-posed task that adds an extra layer of complexity to the problem.
In this work we propose a system that adapts to the specific object of interest at inference time without the need of an expensive fine-tuning stage. This mechanism is inspired by fast-weights: a slow network generates, on-line, the weights of a second network, called the fast network, to quickly adapt to a new task or to a change in the environment. More recently, HyperNetworks proposed to use a network to infer a transformation of the weights rather than the weights themselves. Similarly, FiLM proposes an extension of conditional batch-normalization where a module produces an affine transformation that is applied to the features of each layer of the main neural network, imposing a more explicit conditioning.
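As an illustration of the FiLM-style conditioning described above, here is a minimal numpy sketch of a per-channel affine modulation (the function name and tensor shapes are our own illustrative choices, not taken from any of the cited implementations):

```python
import numpy as np

def film_modulate(features, gamma, beta):
    """Apply a FiLM-style affine transformation to one layer's features.

    features: (C, H, W) activations of one layer.
    gamma, beta: (C,) per-channel scale and shift predicted by the
    conditioning network.
    """
    return gamma[:, None, None] * features + beta[:, None, None]

# With identity parameters the features pass through unchanged.
feats = np.random.randn(4, 8, 8)
assert np.allclose(film_modulate(feats, np.ones(4), np.zeros(4)), feats)
```

The conditioning network only has to predict 2C scalars per layer, which is what makes this form of adaptation cheap compared to predicting full weight tensors.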
In the context of object segmentation, feature modulation has been investigated on the DAVIS dataset by OSMN, which extends OSVOS by specializing part of the architecture to condition the predictions on the target object. Specifically, this is achieved by modulating the activations of the Segmentation Network (SN) via a single forward pass of a Modulating Network (MN) that outputs a scale parameter for each channel based on the object appearance, and a shift parameter for each location that acts as a spatial attention mechanism. Here the SN learns to perform generic segmentation and the MN quickly adapts it to attend to the object of interest.
In this paper we introduce ReConvNet, a recurrent convolutional architecture for semi-supervised video object segmentation. Our architecture self-adapts to segment unseen objects without the need of extra supervised fine-tuning steps. The main contributions of this work are the following:
We extend the successful OSMN architecture with the possibility to model highly non-linear intrinsic temporal correlations between consecutive frames via convLSTM units. This modification outperforms the baseline.
On DAVIS2016 we show comparable performance to methods that make use of online fine-tuning and we outperform them on the more challenging DAVIS2017.
We achieve a competitive placement in the DAVIS challenge 2018 without resorting to online fine-tuning or other post-processing steps.
We show that feature modulation is orthogonal to online fine-tuning and that indeed combining the two results in a further performance boost.
ReConvNet is composed of three main components: the Segmentation Network (SN), the Visual Modulator (VM) and the Spatial Modulator (SM).
The SN builds on the OSMN segmentation network, which combines features at multiple scales to recover fine details. OSMN, however, processes each frame independently, often resulting in segmentation masks that lack temporal consistency and exhibit high variance across the sequence.
A natural way to incorporate temporal structure into the model is to add recurrent units. Here we use convLSTM layers, an adaptation of the original LSTM cell that takes into account the spatial structure of its input: the matrix multiplications of each gate transformation are replaced with convolution operations. The convLSTM blocks are interleaved with the last three VGG layers to endow the network with multi-scale spatio-temporal processing capability.
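The gating equations of a single convLSTM step can be sketched as follows. For brevity this toy version uses 1×1 kernels, so the convolutions reduce to per-pixel channel mixing via `einsum`; the real cell uses larger spatial kernels (e.g. 3×3), and all names here are our own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h, c, Wx, Wh, b):
    """One convLSTM step with 1x1 kernels (toy simplification).

    x: (Cin, H, W) input features; h, c: (Ch, H, W) previous state.
    Wx: (4*Ch, Cin), Wh: (4*Ch, Ch), b: (4*Ch,) stacked gate weights.
    """
    # 1x1 convolution == channel mixing applied at every spatial location.
    gates = (np.einsum('oc,chw->ohw', Wx, x)
             + np.einsum('oc,chw->ohw', Wh, h)
             + b[:, None, None])
    ch = h.shape[0]
    i = sigmoid(gates[:ch])          # input gate
    f = sigmoid(gates[ch:2 * ch])    # forget gate
    o = sigmoid(gates[2 * ch:3 * ch])  # output gate
    g = np.tanh(gates[3 * ch:])      # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

At inference the hidden state `(h, c)` is simply carried over from frame to frame, which is how the recurrent units propagate the segmentation history along the sequence.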
The visual modulator is in charge of biasing the activations of the segmentation network towards the object of interest. This strong conditioning is achieved by a VGG network that takes as input the first frame, cropped around the target object and resized to a fixed resolution, and produces a set of vectors of scaling coefficients, one for each of the last three convolutional layers of the SN. In addition to the coefficients computed by OSMN, we also compute those for the convLSTM modules. All the visual modulation coefficients are multiplied channel-wise with the corresponding feature maps. This has the effect of enhancing the maps related to the target object and suppressing the least useful, potentially distracting ones, allowing the segmentation network to quickly adapt to the object of interest without resorting to expensive steps of back-propagation at inference time.
To help discriminate between multiple instances of the same object and, more generally, to provide a loose prior on the location of the target object, the network is also enriched with a spatial attention mechanism. A rough estimate of the position of the target object is obtained by fitting a “gaussian blob” on the segmentation predicted at the previous frame, which is fed to a Spatial Modulator component. This, in turn, produces a set of shift coefficients, one for each of the last three VGG layers, via a convolution applied on the blob downsampled to the layer’s resolution. Note that, as opposed to the VM case, we do not generate modulation coefficients for the convLSTM layers. The spatial coefficients are summed pixel-wise to the activations of the corresponding layers, shifting the focus to the parts of the image where the object is more likely located. The combination of VM and SM, a channel-wise scaling followed by a pixel-wise shift, is then applied to the features.
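Fitting a Gaussian blob to a predicted mask can be sketched as follows; this is a minimal numpy version where the function name and the widening factor `scale` are our own illustrative choices, not the exact procedure from the paper:

```python
import numpy as np

def gaussian_blob(mask, scale=2.0):
    """Fit an axis-aligned Gaussian to a non-empty binary mask and render it.

    The blob is centered on the mask's centroid, with a spread
    proportional to the mask's per-axis standard deviation. It serves
    as a loose localization prior, not as a segmentation.
    """
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    sy = max(ys.std(), 1.0) * scale
    sx = max(xs.std(), 1.0) * scale
    yy, xx = np.mgrid[:mask.shape[0], :mask.shape[1]]
    return np.exp(-((yy - cy) ** 2 / (2 * sy ** 2)
                    + (xx - cx) ** 2 / (2 * sx ** 2)))
```

Because the blob only encodes a rough position and extent, it is robust to imperfections in the previous frame's prediction, which is exactly what a loose prior should be.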
In this section we describe the experimental settings used to evaluate the ReConvNet model.
To ensure a fair comparison between ReConvNet and the OSMN baseline we initialize the components of the OSMN model in our architecture with the pretrained weights as provided by the authors. This is done on both DAVIS2016 and DAVIS2017, to ensure that any improvement can be clearly attributed to the introduction of a recurrent architecture.
For the initialization of the remaining modules, in an effort to minimize the factors of variation with respect to the baseline, the extra channels of the visual modulator are initialized as in the baseline and, similarly, the input-to-hidden convolutions in the convLSTM layers use the same initialization as the convolutional layers in the baseline. Lastly, the hidden-to-hidden convolutions are initialized to be orthogonal.
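The orthogonal initialization mentioned above can be obtained from a QR decomposition of a random Gaussian matrix; a minimal numpy sketch (for convolutional kernels the weight tensor would first be flattened to a 2-D matrix and reshaped back afterwards):

```python
import numpy as np

def orthogonal_init(shape, rng=None):
    """Orthogonal initialization for (square) hidden-to-hidden weights.

    Draws a random Gaussian matrix and orthogonalizes it with a QR
    decomposition; the sign correction on R's diagonal makes the
    decomposition unique.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))  # fix column signs for a canonical result
    return q

W = orthogonal_init((64, 64))
assert np.allclose(W @ W.T, np.eye(64), atol=1e-6)
```

Orthogonal recurrent weights preserve the norm of the hidden state at initialization, which helps gradients flow through time in the convLSTM layers.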
The model is trained with a lower learning rate for the non-recurrent components than for the recurrent ones. We found it beneficial to train with the Lovász loss, which directly optimizes the IoU measure. Finally, to prevent overfitting we employed early stopping and data augmentation of the inputs of the visual and spatial modulators with random shift, scale, and rotation transformations. When online fine-tuning is used, the model is further trained on each test sequence with random transformations of the first frame, using a single learning rate for all components.
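The Lovász hinge of Berman et al. sorts the per-pixel hinge errors in decreasing order and weights them with the discrete gradient of the Jaccard index; below is a minimal numpy sketch of the binary case (variable names are ours):

```python
import numpy as np

def lovasz_grad(gt_sorted):
    """Gradient of the Jaccard loss w.r.t. sorted errors."""
    gts = gt_sorted.sum()
    intersection = gts - np.cumsum(gt_sorted)
    union = gts + np.cumsum(1.0 - gt_sorted)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]  # discrete derivative
    return jaccard

def lovasz_hinge(logits, labels):
    """Binary Lovász hinge: a convex surrogate of 1 - IoU.

    logits: (N,) real-valued pixel scores; labels: (N,) in {0, 1}.
    """
    signs = 2.0 * labels - 1.0
    errors = 1.0 - logits * signs          # per-pixel hinge errors
    order = np.argsort(-errors)            # sort errors descending
    errors_sorted = errors[order]
    grad = lovasz_grad(labels[order])
    return np.dot(np.maximum(errors_sorted, 0.0), grad)
```

A confidently correct prediction incurs zero loss, while errors on pixels that most hurt the IoU are weighted most heavily, which is why this surrogate tracks the evaluation metric better than per-pixel cross-entropy.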
In order to make use of the relative abundance of segmented static images for pre-training, we split the training process in two phases. First, we train the non-recurrent components of the model on MSCOCO  to learn segmentation coupled with modulation, then we train the full network on DAVIS2017 to account for the spatio-temporal recurrent component as well as to make use of the modulation to focus on the target object throughout the sequence.
The single-frame pre-training procedure proved to be an essential proxy to bootstrap temporally-consistent segmentation. In fact, DAVIS contains only a few examples for most semantic classes, making it very easy for the network to overfit on the training examples.
As can be expected, the problem is exacerbated by models with high capacity, and even more by those that exploit a visual modulator to tackle semi-supervised segmentation. The set of scaling parameters produced by the visual modulator pushes the model to learn a semantic mapping into an embedding space where visually similar objects are close to each other. Learning this mapping requires a sufficiently large amount of diverse examples; we choose the MSCOCO dataset for its wide range of classes and intra-class variations.
[Table 1: quantitative comparison with the state of the art, including the OSMN baseline, on DAVIS2016 and DAVIS2017.]
3.1 Single Object Segmentation
We first evaluate our model on DAVIS2016, which focuses on single objects. This is a hard task that allows us to validate the model and to compare with the OSMN baseline. As shown in Table 1, thanks to the combination of spatio-temporal consistency given by the convLSTM units and their feature modulation, ReConvNet outperforms OSMN on the mean IoU (J-mean) metric and ranks among the top semi-supervised approaches in the public leaderboard (https://davischallenge.org/davis2016/soa_compare.html) when comparing on the average of the J and F scores.
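The J (region similarity) score used throughout this section is the intersection-over-union between the predicted and ground-truth masks, averaged over frames; a minimal sketch for a single frame (the function name is ours):

```python
import numpy as np

def region_similarity(pred, gt):
    """DAVIS region similarity J: IoU between predicted and GT masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                      # both masks empty: perfect match
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```

DAVIS additionally reports F, a boundary-accuracy measure, and the J&F-mean averages the two; the boundary matching step is omitted here for brevity.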
It is important to highlight that OSVOS , OSVOS-S  and onAVOS  perform online fine-tuning on the first frame of the video sequence at inference time. Moreover, OSVOS utilizes a boundary snapping approach, onAVOS makes use of a CRF post-processing step, and OSVOS-S incorporates instance-aware semantic information from a state-of-the-art instance segmentation method to further improve the performance.
Most of these methods introduce expensive computation steps at inference time that are normally not needed when resorting to feature modulation. Nothing prevents, though, pairing this technique with online fine-tuning or CRF post-processing to improve performance. Indeed, with a few steps of fine-tuning at inference time ReConvNet further improves its J&F-mean, placing itself just below onAVOS, which scored 2nd in the public leaderboard.
3.2 Multiple Objects Segmentation
The most recent version of DAVIS introduces the challenging task of multiple-object segmentation. On this dataset ReConvNet is trained by feeding the visual modulator with one randomly picked object from the scene at a time and using the segmentation of the same object in the current frame as target. Table 1 shows that ReConvNet adapts very well to the multi-object task, outperforming the OSMN baseline on the J&F-mean metric. Remarkably, our method also outperforms the state-of-the-art OSVOS and onAVOS without the need of expensive online fine-tuning. Introducing online fine-tuning, the J&F-mean improves further, surpassing OSVOS-S, the current state of the art in the public leaderboard on the DAVIS2017 validation set.
DAVIS Challenge 2018
We participated in the DAVIS Challenge 2018, retraining on the training set augmented with the validation set. Our preliminary evaluation on the test-dev set placed us among the top entries of the public leaderboard on the J&F-mean metric, both without and with online fine-tuning.
On the test-challenge set ReConvNet achieved a competitive J&F-mean in the final DAVIS Challenge 2018 evaluation. This is an encouraging result considering that no online fine-tuning was employed: by adding gradient steps at inference time it is reasonable to expect a performance boost similar to the one consistently witnessed in the previous experiments.
4 Conclusions and Future work
We presented ReConvNet, a powerful and efficient recurrent convolutional model to perform semi-supervised video object segmentation. The model is able to learn spatio-temporal features that self-adapt to focus on the object of interest without the need of extra fine-tuning at inference time. ReConvNet outperforms the baseline by a considerable margin, proving the effectiveness of incorporating temporal consistency into the model. Our results reinforce the conjecture that feature modulation is a valid approach to semi-supervised video object segmentation. We plan to perform a more in-depth analysis of the interaction between the temporal components and the feature modulation, since we believe it is crucial to better understand the potential of the proposed model.
We thank Jürgen Schmidhuber and Imanol Schlag for helpful discussions on fast weights and Razvan Pascanu for insightful comments on the model. We are also grateful to AGS SpA for providing the NVIDIA 1080Ti machine to run all the experiments. Finally, our thoughts go to Aaron, Adriana and Michal, for giving the initial thrust to this work.
-  M. Berman, A. R. Triki, and M. B. Blaschko. The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. arXiv preprint arXiv:1705.08790, 2017.
-  S. Caelles, K.-K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. One-shot video object segmentation. In CVPR, 2017.
-  D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
-  B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In CVPR, pages 447–456, 2015.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  K.-K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool. Video object segmentation without temporal information. TPAMI, 2018.
-  E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville. Film: Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.
-  A. M. Saxe, J. L. McClelland, and S. Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.
-  J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131–139, 1992.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In Advances in Neural Information Processing Systems 27. 2014.
-  F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. ReSeg: A recurrent neural network-based model for semantic segmentation. In CVPR Workshops, June 2016.
-  F. Visin and A. Romero. Dataset loaders: a python library to load and preprocess datasets. https://github.com/fvisin/dataset_loaders, 2017.
-  P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
-  S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.
-  L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos. Efficient video object segmentation via network modulation. arXiv preprint arXiv:1802.01218, 2018.