1 Introduction
Video understanding remains largely unsolved despite significant improvements in image understanding over the past few years. The accuracy of current image classification and semantic segmentation models is not yet matched in action recognition and video segmentation, to some extent due to the lack of large-scale benchmarks, but also due to the complexity introduced by the time variable. Combined with the increase in memory and computation demands, video understanding poses additional challenges that call for novel methods.
Our objective in this work is to couple the decisions taken by a neural network in time, in a manner that allows information to flow across frames and thereby results in decisions that are consistent both spatially and temporally. Towards this goal we pursue a structured prediction approach, where the structure of the output space is exploited in order to train classifiers of higher accuracy. For this we introduce VideoGCRF, an extension to video segmentation of the Deep Gaussian Random Field (DGRF) technique recently proposed for single-frame structured prediction in [6, 7]. We show that our algorithm can be used for a variety of video segmentation tasks: semantic segmentation (CamVid dataset), instance tracking (DAVIS dataset), and a combination of instance segmentation with Mask-RCNN-style object detection, customized in particular for the person class (DAVIS Person dataset).
Our work inherits all favorable properties of the DGRF method: in particular, our method has the advantage of delivering (a) exact inference results through the solution of a linear system, rather than relying on approximate mean-field inference, as in [25, 26], (b) allowing for exact computation of the gradient during backpropagation, thereby alleviating the need for the memory-demanding backpropagation-through-time used in [42], (c) making it possible to use non-parametric terms for the pairwise term, rather than confining ourselves to pairwise terms of a predetermined form, as in [25, 26], and (d) facilitating inference on both densely- and sparsely-connected graphs, as well as blends of both graph topologies.
Within the literature on spatiotemporal structured prediction, the closest work in spirit to ours is the Feature Space Optimization of [26]. Even though our works share several conceptual similarities, our method is entirely different at the technical level. In our case spatiotemporal inference is implemented as a structured, 'lateral connection' layer that is trained jointly with the feedforward CNNs, while the method of [26] is applied at a post-processing stage to refine a classifier's results.
1.1 Previous work
Structured prediction is commonly used by semantic segmentation algorithms [6, 7, 8, 10, 11, 36, 39, 42]
to capture spatial constraints within an image frame. These approaches may be extended naively to videos, by making predictions individually for each frame. However, in doing so, we ignore the temporal context, and thereby the tendency of consecutive video frames to be similar to each other. To address this shortcoming, a number of deep learning methods employ some form of structured prediction to ensure temporal coherence in the predictions. Initial attempts to capture spatiotemporal context involved designing deep learning architectures [22] that implicitly learn interactions between consecutive image frames. A number of subsequent approaches used Recurrent Neural Networks (RNNs) [1, 13] to capture interdependencies between image frames. Other approaches have exploited optical flow computed from state-of-the-art approaches [17] as additional input to the network [15, 18]. Finally, [26] explicitly captures temporal constraints via pairwise terms over probabilistic graphical models, but operates post-hoc, i.e. it is not trained jointly with the underlying network.

In this work, we focus on three problems, namely (i) semantic and (ii) instance video segmentation, as well as (iii) semantic instance tracking. Semantic instance tracking refers to the problem where we are given the instance ground truth for the first frame of a video, and the goal is to predict these instance masks for the subsequent video frames. The first set of approaches to address this task start with a deep network pretrained for image classification on large datasets such as ImageNet or COCO, and fine-tune it on the first frame of the video with labeled ground truth
[5, 38], optionally leveraging a variety of data-augmentation regimes [24] to increase robustness to scale/pose variation and occlusion/truncation in the subsequent frames of the video. The second set of approaches poses this task as a warping problem [31], where the goal is to warp the segmentation of the first frame using the images and optical flow as additional inputs [19, 24, 27].

A number of approaches have attempted to exploit temporal information to improve over static image segmentation approaches for video segmentation. Clockwork convnets [34] were introduced to exploit the persistence of features across time and schedule the processing of some layers at different update rates according to their semantic stability. Similar feature-flow propagation ideas were employed in [26, 43]. In [29] segmentations are warped using the flow and spatial transformer networks. Rather than using optical flow, the prediction of future segmentations [21] may also temporally smooth results obtained frame-by-frame. Finally, the state-of-the-art on this task [15] improves over PSPNet [41] by warping the feature maps of a static segmentation CNN to emulate a video segmentation network.

2 VideoGCRF
In this work we introduce VideoGCRF, extending the Deep Gaussian CRF approach introduced in [6, 7] to operate efficiently for video segmentation. Introducing a CRF allows us to couple the decisions between sets of variables that should influence each other; spatial connections were already explored in [6, 7] and can be understood as propagating information from distinctive image positions (e.g. the face of a person) to more ambiguous regions (e.g. the person's clothes). In this work we also introduce temporal connections to integrate information over time, allowing us, for instance, to correctly segment frames in which the object is not clearly visible by propagating information from other time frames.
We consider that the input to our system is a video $\mathcal{I}$ containing $V$ frames. We denote our network's prediction as $x$, where at any frame $v \in \{1, \ldots, V\}$ the prediction $x_v$ provides a real-valued vector of scores for the $L$ classes for each of the $P$ image patches; for brevity, we denote by $N = P \cdot L$ the number of prediction variables per frame. The scores corresponding to a patch can be understood as inputs to a softmax function that yields the label posteriors.

The Gaussian-CRF (or, G-CRF) model defines a joint posterior distribution through a Gaussian multivariate density for a video as:

$p(x \mid \mathcal{I}) \propto \exp\left(-\tfrac{1}{2}\, x^\top A\, x + b^\top x\right),$

where $b$, $A$ denote the 'unary' and 'pairwise' terms respectively, with $b \in \mathbb{R}^{VN}$ and $A \in \mathbb{R}^{VN \times VN}$. In the rest of this work we assume that $b, A$ depend on the input video and we omit the conditioning on $\mathcal{I}$ for convenience.
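As a concrete illustration of how the unary scores yield label posteriors, the softmax mentioned above can be sketched in a few lines of numpy; the patch and class counts below are toy assumptions, not values from the paper:

```python
import numpy as np

def label_posteriors(scores):
    """Map per-patch class scores of shape (P, L) to softmax posteriors."""
    # Subtract the per-patch max for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 0.5]])   # P = 2 patches, L = 3 classes (toy)
post = label_posteriors(scores)
```

Each row of `post` is a distribution over the $L$ labels for one patch.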
What is particular about the G-CRF is that, assuming the matrix of pairwise terms $A$ is positive-definite, Maximum-A-Posteriori (MAP) inference merely amounts to solving the system of linear equations $Ax = b$. In fact, as in [6], we can drop the probabilistic formulation and treat the G-CRF as a structured prediction module that is part of a deep network. In the forward pass, the unary and pairwise terms $b$ and $A$, delivered by a feedforward CNN described in Sec. 2.1, are fed to the G-CRF module which performs inference to recover the prediction $x$ by solving the system of linear equations

$(A + \lambda I)\, x = b,$   (1)

where $\lambda$ is a small positive constant added to the diagonal entries of $A$ to make it positive definite.
For the single-frame case ($V = 1$) the iterative conjugate gradient [35] algorithm was used to rapidly solve the resulting system for both sparse [6] and fully-connected [7] graphs; in particular the speed of the resulting inference is in the order of 30ms on the GPU, almost two orders of magnitude faster than the implementation of DenseCRF [25], while at the same time giving more accurate results.
Our first contribution in this work consists in designing the structure of the matrix $A$ so that the resulting system solution remains manageable as the number of frames increases. Once we describe how we structure $A$, we will turn to learning our network in an end-to-end manner.
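To make the inference step concrete, the sketch below solves $(A + \lambda I)x = b$ with conjugate gradients for a single frame, using a low-rank pairwise matrix $A = S^\top S$ built from hypothetical $D$-dimensional embeddings; the sizes are illustrative assumptions and this is not the paper's Caffe2 implementation:

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=500):
    """Solve M x = b for symmetric positive-definite M, given only x -> M x."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Mp = matvec(p)
        alpha = rs / (p @ Mp)
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
N, D, lam = 200, 16, 1e-2           # N variables, D-dim embeddings (assumed sizes)
S = rng.standard_normal((D, N))     # embedding matrix; A = S^T S has rank <= D
b = rng.standard_normal(N)          # unary terms

# (A + lam I) x computed in O(N D), without ever forming the N x N matrix A.
matvec = lambda x: lam * x + S.T @ (S @ x)
x = conjugate_gradient(matvec, b)
```

Because $\lambda I + S^\top S$ has at most $D + 1$ distinct eigenvalues, conjugate gradients converges in roughly $D + 1$ iterations here, which is what makes the low-rank parameterization attractive.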
2.1 Spatiotemporal connections
In order to capture the spatiotemporal context, we are interested in capturing two kinds of pairwise interactions: (a) pairwise terms between patches in the same frame and (b) pairwise terms between patches in different frames.
Denoting the spatial pairwise terms at frame $t$ by $A_t$ and the temporal pairwise terms between frames $t, t'$ by $T_{t,t'}$, we can rewrite Eq. 1 as follows:

$\begin{bmatrix} A_1 + \lambda I & T_{1,2} & \cdots & T_{1,V} \\ T_{2,1} & A_2 + \lambda I & \cdots & T_{2,V} \\ \vdots & \vdots & \ddots & \vdots \\ T_{V,1} & T_{V,2} & \cdots & A_V + \lambda I \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_V \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_V \end{bmatrix}$   (2)

where we group the variables by frames. Solving this system allows us to couple predictions across all video frames $t$, positions $p$, and labels $l$. If furthermore each $A_t$ is symmetric positive semi-definite and $T_{t,t'} = T_{t',t}^\top$, the system matrix is symmetric and is rendered positive definite by a sufficiently large positive $\lambda$.
We now describe how the pairwise terms are constructed through our CNN, and then discuss acceleration of the linear system in Eq. 2 by exploiting its structure.
Spatial Connections: We define the spatial pairwise terms in terms of inner products of pixel-wise embeddings, as in [7]. At frame $t$ we couple the scores for a pair of patches $p_i, p_j$ taking the labels $l_m, l_n$ respectively as follows:

$A_t[(p_i, l_m), (p_j, l_n)] = \langle \mathcal{A}^{l_m}_t(p_i),\, \mathcal{A}^{l_n}_t(p_j) \rangle,$   (3)

where $p_i, p_j \in \{1, \ldots, P\}$, $l_m, l_n \in \{1, \ldots, L\}$, and $\mathcal{A}^{l}_t(p) \in \mathbb{R}^{D}$ is the embedding associated to point $p$ for label $l$. In Eq. 3 the terms $\mathcal{A}$ are image-dependent and delivered by a fully-convolutional "embedding" branch that feeds from the same CNN backbone architecture, denoted by $\mathcal{A}$ in Fig. 2.
The implication of this form is that we can afford inference with a fully-connected graph. In particular the rank of the block matrix $A_t$ equals the embedding dimension $D$, which means that both the memory and time complexity of solving the linear system drop from $O(N^2)$ to $O(ND)$, which can be several orders of magnitude smaller.
Temporal Connections: Turning to the temporal pairwise terms, we couple patches $p_i, p_j$ coming from different frames $t, t'$ taking the labels $l_m, l_n$ respectively as

$T_{t,t'}[(p_i, l_m), (p_j, l_n)] = \langle \mathcal{T}^{l_m}_t(p_i),\, \mathcal{T}^{l_n}_{t'}(p_j) \rangle,$   (4)

where $\mathcal{T}^{l}_t(p) \in \mathbb{R}^{D}$. The respective embedding terms are delivered by a separate, temporal embedding branch of the network, denoted by $\mathcal{T}$ in Fig. 2.
In short, both the spatial and the temporal pairwise terms are composed as Gram matrices of spatial and temporal embeddings, $A_t = \mathcal{A}_t^\top \mathcal{A}_t$ and $T_{t,t'} = \mathcal{T}_t^\top \mathcal{T}_{t'}$, where $\mathcal{A}_t, \mathcal{T}_t \in \mathbb{R}^{D \times N}$ stack the embeddings of frame $t$ column-wise. We visualize our spatiotemporal pairwise terms in Fig. 3.
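Under this Gram-matrix construction the full multi-frame system can be assembled explicitly for small problems. The sketch below (toy sizes, random embeddings — all assumptions for illustration) builds the block matrix and verifies that it is symmetric, and that it becomes positive definite once $\lambda$ exceeds the largest temporal-Gram eigenvalue; that sufficient condition is part of this sketch, not a claim taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N, D = 3, 40, 8                        # frames, variables/frame, embedding dim (toy)
A_emb = [rng.standard_normal((D, N)) for _ in range(V)]  # spatial embeddings per frame
T_emb = [rng.standard_normal((D, N)) for _ in range(V)]  # temporal embeddings per frame

# lambda chosen to dominate the temporal self-terms absent from the diagonal blocks,
# which guarantees positive definiteness of the assembled system matrix.
lam = max(np.linalg.norm(T, 2) ** 2 for T in T_emb) + 0.1

G = np.zeros((V * N, V * N))
for t in range(V):
    for s in range(V):
        blk = np.s_[t * N:(t + 1) * N, s * N:(s + 1) * N]
        if t == s:
            G[blk] = A_emb[t].T @ A_emb[t] + lam * np.eye(N)  # spatial Gram + lam I
        else:
            G[blk] = T_emb[t].T @ T_emb[s]                    # temporal Gram

symmetric = bool(np.allclose(G, G.T))
min_eig = np.linalg.eigvalsh(G).min()
```

In practice the matrix is of course never formed; the low-rank structure is exploited through matrix-vector products, as described in Sec. 2.2.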
VideoGCRF in Deep Learning: Our proposed spatiotemporal Gaussian CRF (VideoGCRF) can be viewed as generic deep learning modules for spatiotemporal structured prediction, and as such can be plugged in at any stage of a deep learning pipeline: either as the last layer, i.e. classifier, as in our semantic segmentation experiments (Sec. 3.3), or even in the lowlevel feature learning stage, as in our instance segmentation experiments (Sec. 3.1).
2.2 Efficient ConjugateGradient Implementation
We now describe an efficient implementation of the conjugate gradient method [35], described in Algorithm 1, that is customized for our VideoGCRFs. The computational complexity of the conjugate gradient algorithm is determined by the computation of the matrix-vector product $(A + \lambda I)\,p$, corresponding to line 7 of Algorithm 1 (we drop the iteration subscript for convenience).
We now discuss how to efficiently compute this product in a manner that is customized for this work. In our case, the matrix-vector product is expressed in terms of the spatial ($\mathcal{A}$) and temporal ($\mathcal{T}$) embeddings as follows:

$(A + \lambda I)\, p = \lambda p + \begin{bmatrix} \mathcal{A}_1^\top \mathcal{A}_1\, p_1 + \sum_{t' \neq 1} \mathcal{T}_1^\top \mathcal{T}_{t'}\, p_{t'} \\ \vdots \\ \mathcal{A}_V^\top \mathcal{A}_V\, p_V + \sum_{t' \neq V} \mathcal{T}_V^\top \mathcal{T}_{t'}\, p_{t'} \end{bmatrix}$   (5)

From Eq. 5, we can express the $t$-th block $q_t$ of the product as follows:

$q_t = \lambda p_t + \mathcal{A}_t^\top \mathcal{A}_t\, p_t + \sum_{t' \neq t} \mathcal{T}_t^\top \mathcal{T}_{t'}\, p_{t'}$   (6)
One optimization that we exploit in computing $q_t$ efficiently is that we never explicitly compute the matrix-matrix products $\mathcal{A}_t^\top \mathcal{A}_t$ or $\mathcal{T}_t^\top \mathcal{T}_{t'}$. We note that $\mathcal{A}_t^\top \mathcal{A}_t\, p_t$ can be decomposed into two matrix-vector products as $\mathcal{A}_t^\top (\mathcal{A}_t\, p_t)$, where the expression in the brackets is evaluated first and yields a vector, which can then be multiplied with the matrix outside the brackets. This simplification alleviates the need to keep $N \times N$ terms in memory, and is computationally cheaper.
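The bracketing trick can be checked directly: $\mathcal{A}^\top(\mathcal{A}\,p)$ yields the same vector as $(\mathcal{A}^\top\mathcal{A})\,p$ while only ever touching $D \times N$ arrays (the sizes below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 8, 500
A = rng.standard_normal((D, N))   # embedding matrix for one frame
p = rng.standard_normal(N)

naive = (A.T @ A) @ p             # forms an N x N matrix: O(N^2 D) time, O(N^2) memory
fast = A.T @ (A @ p)              # two matrix-vector products: O(N D) time and memory
```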
Further, from Eq. 6, we note that the computation of every $q_t$ requires the matrix-vector products $\mathcal{T}_{t'}\, p_{t'}$. A black-box implementation would therefore involve redundant computations, which we eliminate by rewriting Eq. 6 as:

$q_t = \lambda p_t + \mathcal{A}_t^\top (\mathcal{A}_t\, p_t) + \mathcal{T}_t^\top \Big( \sum_{t'=1}^{V} \mathcal{T}_{t'}\, p_{t'} - \mathcal{T}_t\, p_t \Big)$   (7)

This rephrasing allows us to precompute and cache $\sum_{t'=1}^{V} \mathcal{T}_{t'}\, p_{t'}$, thereby eliminating redundant calculations.
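The caching of the shared temporal sum can likewise be sketched in a few lines; `naive` below recomputes every cross-frame product, while `cached` computes the shared $D$-vector sum once per frame and subtracts each frame's own contribution (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
V, D, N = 5, 8, 300
T = [rng.standard_normal((D, N)) for _ in range(V)]   # temporal embeddings per frame
p = [rng.standard_normal(N) for _ in range(V)]

# Naive: O(V^2) frame-pair products.
naive = [sum(T[t].T @ (T[s] @ p[s]) for s in range(V) if s != t) for t in range(V)]

# Cached: one shared sum of D-vectors, then O(V) cheap corrections.
total = sum(T[s] @ p[s] for s in range(V))            # computed once, reused by all frames
cached = [T[t].T @ (total - T[t] @ p[t]) for t in range(V)]
```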
While so far we have assumed dense connections between the image frames, if we have sparse temporal connections (Sec. 3.1), i.e. each frame is connected to a subset of neighbouring frames in the temporal domain, the linear system matrix is sparse, and the per-frame product is written as

$q_t = \lambda p_t + \mathcal{A}_t^\top (\mathcal{A}_t\, p_t) + \sum_{t' \in \mathcal{N}(t)} \mathcal{T}_t^\top (\mathcal{T}_{t'}\, p_{t'})$   (8)

where $\mathcal{N}(t)$ denotes the temporal neighbourhood of frame $t$. For very sparse connections caching may not be necessary because these involve little or no redundant computation.
2.3 Backward Pass
Since we rely on the Gaussian CRF we can obtain closed-form backpropagation equations for the gradient of the loss $\mathcal{L}$ with respect to the unary terms $b$ and the spatial/temporal embedding terms. Thanks to this we do not have to perform backpropagation in time, which was needed e.g. in [42] for DenseCRF inference. Following [7], the gradients of the unary terms are obtained from the solution of the following system:

$(A + \lambda I)\, \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial x}$   (9)

Once these are computed, the gradients of the spatial embeddings can be computed as follows:

$\frac{\partial \mathcal{L}}{\partial \mathcal{A}_t} = -\,\mathcal{A}_t \left( x_t\, \frac{\partial \mathcal{L}}{\partial b_t}^\top + \frac{\partial \mathcal{L}}{\partial b_t}\, x_t^\top \right)$   (10)

while the gradients of the temporal embeddings are given by the following form:

$\frac{\partial \mathcal{L}}{\partial \mathcal{T}_t} = -\sum_{t' \neq t} \mathcal{T}_{t'} \left( x_{t'}\, \frac{\partial \mathcal{L}}{\partial b_t}^\top + \frac{\partial \mathcal{L}}{\partial b_{t'}}\, x_t^\top \right)$   (11)

where the derivation, detailed in the appendix, uses a permutation matrix, as in [7].
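These closed forms can be verified numerically on a toy problem. The sketch below builds a two-frame system from random embeddings, attaches a quadratic loss to the solution (the loss, sizes, and $\lambda$ are assumptions for illustration only), and compares the analytic spatial-embedding gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(4)
V, N, D = 2, 12, 4

def solve(A_emb, T_emb, b, lam):
    """Assemble the block system of Eq. 2 and solve it directly."""
    G = np.zeros((V * N, V * N))
    for t in range(V):
        for s in range(V):
            blk = np.s_[t * N:(t + 1) * N, s * N:(s + 1) * N]
            G[blk] = (A_emb[t].T @ A_emb[t] + lam * np.eye(N)) if t == s \
                     else T_emb[t].T @ T_emb[s]
    return G, np.linalg.solve(G, b)

A_emb = [rng.standard_normal((D, N)) for _ in range(V)]
T_emb = [0.1 * rng.standard_normal((D, N)) for _ in range(V)]
b = rng.standard_normal(V * N)
target = rng.standard_normal(V * N)
lam = 5.0                                   # large enough for positive definiteness

G, x = solve(A_emb, T_emb, b, lam)
dL_dx = x - target                          # loss L = 0.5 * ||x - target||^2
y = np.linalg.solve(G, dL_dx)               # gradients of the unaries (the system of Eq. 9)
x0, y0 = x[:N], y[:N]
grad_A0 = -A_emb[0] @ (np.outer(x0, y0) + np.outer(y0, x0))   # spatial gradient (Eq. 10)

# Finite-difference check of one entry of the gradient w.r.t. A_emb[0].
eps = 1e-6
A_pert = [a.copy() for a in A_emb]
A_pert[0][1, 2] += eps
_, x_eps = solve(A_pert, T_emb, b, lam)
fd = (0.5 * np.sum((x_eps - target) ** 2) - 0.5 * np.sum((x - target) ** 2)) / eps
```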
2.4 Implementation and Inference Time
Our implementation is GPU-based and exploits fast CUDA BLAS linear algebra routines. It is implemented as a module in the Caffe2 library. For spatial and temporal embeddings of size , classes (Sec. 3.3), a input image, and network stride of , our frame inferences take s, s and s on average respectively. Without the caching procedure described in Sec. 2.2, the frame inference takes s on average. This is orders of magnitude faster than the DenseCRF method [25], which takes s on average for a spatial CRF on a single input frame. These timing statistics were estimated on a GTX 1080 GPU.

3 Experiments
Experimental Setup. We describe the basic setup followed for our experiments. As in [7], we use a phase training strategy for our methods. We first train the unary network without the spatiotemporal embeddings. We next train the subnetwork delivering the spatiotemporal embeddings with the softmax cross-entropy loss to enforce the following objectives: , and , where are the ground truth labels for pixels . Finally, we combine the unary and pairwise networks, and train them together in end-to-end fashion. Unless otherwise stated, we use stochastic gradient descent to train our networks with a momentum of and a weight decay of . For segmentation experiments, we use a base learning rate of for training the unaries, for training the embeddings, and for fine-tuning the unary and embeddings together, using a polynomial decay with power of . For the instance segmentation network, we use a single-stage training for the unary and pairwise streams: we train the network for K iterations, with a base learning rate of which is reduced to after K iterations. The weight decay is . For our instance tracking experiments, we use unaries from [38] and do not refine them, rather using them as an input to our network. We employ horizontal flipping and scaling by factors between and during training/testing for all methods, except in the case of instance segmentation experiments (Sec. 3.1).

Datasets. We use three datasets for our experiments:
DAVIS. The DAVIS dataset [32] consists of training and validation videos containing and frames respectively. Each video comes with manually annotated segmentation masks for foreground object instances.

DAVIS-Person. While the DAVIS dataset [33] provides densely annotated frames for instance segmentation, it lacks object category labels. For category prediction tasks such as semantic and instance segmentation, we create a subset of the DAVIS dataset containing videos from the category person. By means of visual inspection, we select and video sequences from the training and validation sets respectively, containing training and validation images, each containing at least one person. Since the DAVIS dataset comes with only the foreground instances labeled, we manually annotate the image regions containing unannotated person instances with the do-not-care label. These image regions do not participate in the training or the evaluation. We call this the DAVIS-Person dataset.

CamVid. The CamVid dataset [3, 4] contains videos of driving scenarios for urban scene understanding. It comes with images annotated with pixel-level category labels at fps. Although the original dataset comes with class labels, as in [2, 26, 20], we predict semantic classes and use the train-val-test split of , and frames respectively.

3.1 Ablation Study on Semantic and Instance Segmentation Tasks
In these experiments, we use the DAVIS Person dataset described in Sec. 3. The aim here is to explore the various design choices available to us when designing networks for spatiotemporal structured prediction for semantic segmentation, and proposalbased instance segmentation tasks.
Semantic Segmentation Experiments. Our first set of experiments studies the effect of varying the sizes of the spatial and temporal embeddings, the degree of the temporal connections, and multi-scale temporal connections for VideoGCRF. For this set of experiments, our baseline network, or basenet, is a single-resolution ResNet-101 network, with altered network strides as in [9] to produce a spatial downsampling factor of . The evaluation metric used is the mean pixel Intersection over Union (IoU).
In Table 1 we study the effect of varying the sizes of the spatial and temporal embeddings for frame inference. Our best results are achieved at spatiotemporal embeddings of size . The improvement over the basenet is . In subsequent experiments we fix the size of our embeddings to . We next study the effect of varying the size of the temporal context and temporal neighbourhoods. The temporal context is defined as the number of video frames which are considered simultaneously in one linear system (Eq. 2). The temporal context is limited by the GPU RAM: for a ResNet101 network, an input image of size , embeddings of size , we can currently fit frames on GB of GPU RAM. Since is smaller than the number of frames in the video, we divide the video into overlapping sets of frames, and average the predictions for the common frames.
The temporal neighbourhood for a frame (Fig. 5) is defined as the number of frames it is directly connected to via pairwise connections. A fully connected neighbourhood (fc) is one in which there are pairwise terms between every pair of frames available in the temporal context. We experiment with , , multiscale and fc connections. The neighbourhood connects a frame to neighbours at distances of , and (or ) frames on either side. Table 2 reports our results for different combinations of temporal neighbourhood and context. It can be seen that dense connections improve performance for smaller temporal contexts, but for a temporal context of frames, an increase in the complexity of temporal connections leads to a moderate decrease in performance. This could be a consequence of the longrange interactions having the same weight as shortrange interactions. In the future we intend to mitigate this issue by complementing our embeddings with the temporal distance between frames.
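Since the temporal context is smaller than the video length, predictions from overlapping windows are averaged on the shared frames, as described above. A small sketch of that bookkeeping (window size, stride, and prediction shapes are assumptions for illustration):

```python
import numpy as np

def average_overlapping_windows(window_preds, starts, num_frames):
    """Average per-window predictions (each of shape (K, ...)) over a whole video."""
    acc = None
    counts = np.zeros(num_frames)
    for pred, s in zip(window_preds, starts):
        if acc is None:
            acc = np.zeros((num_frames,) + pred.shape[1:])
        k = pred.shape[0]
        acc[s:s + k] += pred        # accumulate this window's predictions
        counts[s:s + k] += 1        # count how many windows cover each frame
    return acc / counts.reshape((-1,) + (1,) * (acc.ndim - 1))

# Toy example: 6-frame video, 3-frame windows with stride 1, scalar "predictions".
K, num_frames = 3, 6
starts = list(range(num_frames - K + 1))               # [0, 1, 2, 3]
windows = [np.full((K, 2), float(s)) for s in starts]  # window s predicts the value s
avg = average_overlapping_windows(windows, starts, num_frames)
```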
basenet | 81.16

VideoGCRF | spatial dimension
temporal dimension | 64 | 128 | 256 | 512
64 | | | |
128 | | | |
256 | | | |
512 | | | |
basenet | 81.16

VideoGCRF | temporal neighbourhood
temporal context | | | | fc
2 | | | |
3 | | | |
4 | | | |
7 | | | |
Instance Segmentation Experiments. We now demonstrate the utility of our VideoGCRF method for the task of proposal-based instance segmentation. Our hypothesis is that coupling predictions across frames is advantageous for instance segmentation methods. We show that the performance of instance segmentation methods improves as we increase the temporal context via VideoGCRF, and obtain our best results with fully-connected temporal neighbourhoods. Our baseline for this task is the Mask-RCNN framework of [16] using the ResNet-50 network as the convolutional body. The Mask-RCNN framework uses precomputed bounding box proposals for this task. It computes convolutional features on the input image using the convolutional body network, crops out the features corresponding to image regions in the proposed bounding boxes via Region-Of-Interest (RoI) pooling, and then has head networks to predict (i) class scores and bounding box regression parameters, (ii) keypoint locations, and (iii) instance masks. Structured prediction coupling the predictions of all the proposals over all the video frames is a computationally challenging task, since typically we have s of proposals per image, and it is not obvious which proposals from one frame should influence which proposals in the other frame. To circumvent this issue, we use our VideoGCRF before the RoI pooling stage as shown in Fig. 5. Instead of coupling final predictions, we couple mid-level features over the video frames, thereby improving the features which are ultimately used to make predictions.
For evaluation, we use the standard COCO performance metrics: , , and AP (averaged over IoU thresholds), evaluated using mask IoU. Table 3 reports our instance segmentation results. We note that the performance of the MaskRCNN framework increases consistently as we increase the temporal context for predictions. Qualitative results are available in Fig. 7.
Method | | | AP
ResNet-50 baseline | 0.610 | 0.305 | 0.321
spatial CRF [7] | 0.618 | 0.310 | 0.329
2-frame VideoGCRF | 0.619 | 0.310 | 0.331
3-frame VideoGCRF | 0.631 | 0.321 | 0.330
4-frame VideoGCRF | 0.647 | 0.336 | 0.349
3.2 Instance Tracking
We use the DAVIS dataset described in Sec. 3. Instance tracking involves predicting foreground segmentation masks for each video frame given the foreground segmentation for the first video frame. We demonstrate that incorporating temporal context helps improve performance in instance tracking methods. To this end we extend the online adaptation approach of [38], which is the state-of-the-art approach on the DAVIS benchmark, with our VideoGCRF. We use their publicly available software based on the TensorFlow library to generate the unary terms for each of the frames in the video, and keep them fixed. We use a ResNet-50 network to generate spatiotemporal embeddings and use these alongside the unaries computed from [38]. The results are reported in Table 4. We compare performance of VideoGCRF against that of just the unaries from [38], and also with spatial CRFs from [7]. The evaluation criterion is the mean pixel-IoU. It can be seen that temporal context improves performance. We hypothesize that reimplementing the software from [38] in Caffe2 and backpropagating on the unary branch of the network would yield further improvements.
Model | Building | Tree | Sky | Car | Sign | Road | Pedestrian | Fence | Pole | Sidewalk | Cyclist | mIoU
DeconvNet [30]
SegNet [2]
Bayesian SegNet [23]
Visin et al. [37]
FCN8 [28]
DeepLab-LFOV [8]
Dilation8 [40]
Dilation8 + FSO [26]
Tiramisu [20]
Gadde et al. [15]
Results with our ResNet-101 Implementation
Basenet ResNet-101 (Ours)
Basenet + Spatial CRF [7]
Basenet + 2-Frame VideoGCRF
Basenet + 3-Frame VideoGCRF
Results after Cityscapes Pretraining
Basenet ResNet-101 (Ours)
Basenet + DenseCRF post-processing [25]
Basenet + Spatial CRF [7]
Basenet + 2-Frame VideoGCRF
Basenet + 3-Frame VideoGCRF
3.3 Semantic Segmentation on CamVid Dataset
We now employ our VideoGCRF for the task of semantic video segmentation on the CamVid dataset. Our base network here is our own implementation of ResNet-101 with pyramid spatial pooling, as in [41]. Additionally, we pretrain our networks on the Cityscapes dataset [12], and report results both with and without pretraining on Cityscapes. We report improvements over the baseline networks in both settings. Without pretraining, we see an improvement of over the basenet, and with pretraining we see an improvement of . The qualitative results are shown in Fig. 6. We notice that VideoGCRF benefits from temporal context, yielding smoother predictions across video frames.
4 Conclusion
In this work, we propose VideoGCRF, an endtoend trainable Gaussian CRF for efficient spatiotemporal structured prediction. We empirically show performance improvements on several benchmarks thanks to an increase of the temporal context. This additional functionality comes at negligible computational overhead owing to efficient implementation and the strategies to eliminate redundant computations. In future work we want to incorporate optical flow techniques in our framework as they provide a natural means to capture temporal correspondence. Further, we also intend to use temporal distance between frames as an additional term in the expression of the pairwise interactions alongside dotproducts of our embeddings. We would also like to use VideoGCRF for dense regression tasks such as depth estimation. Finally, we believe that our method for spatiotemporal structured prediction can prove useful in the unsupervised and semisupervised setting.
Gradient Expressions for SpatioTemporal GCRF Parameters
As described in the manuscript, to capture the spatiotemporal context, we propose two kinds of pairwise interactions: (a) pairwise terms between patches in the same frame (spatial pairwise terms), and (b) pairwise terms between patches in different frames (temporal pairwise terms).
Denoting the spatial pairwise terms at frame $t$ by $A_t$ and the temporal pairwise terms between frames $t, t'$ by $T_{t,t'}$, our inference equation is written as

$\begin{bmatrix} A_1 + \lambda I & T_{1,2} & \cdots & T_{1,V} \\ T_{2,1} & A_2 + \lambda I & \cdots & T_{2,V} \\ \vdots & \vdots & \ddots & \vdots \\ T_{V,1} & T_{V,2} & \cdots & A_V + \lambda I \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_V \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_V \end{bmatrix}$   (12)

where we group the variables by frames. Solving this system allows us to couple predictions across all video frames $t$, positions $p$, and labels $l$. If furthermore each $A_t$ is symmetric positive semi-definite and $T_{t,t'} = T_{t',t}^\top$, the system matrix is symmetric and is rendered positive definite by a sufficiently large positive $\lambda$.
As in the manuscript, at frame $t$ we couple the scores for a pair of patches $p_i, p_j$ taking the labels $l_m, l_n$ respectively as follows:

$A_t[(p_i, l_m), (p_j, l_n)] = \langle \mathcal{A}^{l_m}_t(p_i),\, \mathcal{A}^{l_n}_t(p_j) \rangle,$   (13)

where $p_i, p_j \in \{1, \ldots, P\}$, $l_m, l_n \in \{1, \ldots, L\}$, and $\mathcal{A}^{l}_t(p) \in \mathbb{R}^{D}$ is the embedding associated to point $p$ for label $l$.
Thus, $A_t = \mathcal{A}_t^\top \mathcal{A}_t$, where $\mathcal{A}_t \in \mathbb{R}^{D \times N}$ stacks the spatial embeddings of frame $t$ column-wise. Further, to design the temporal pairwise terms, we couple patches $p_i, p_j$ coming from different frames $t, t'$ taking the labels $l_m, l_n$ respectively as

$T_{t,t'}[(p_i, l_m), (p_j, l_n)] = \langle \mathcal{T}^{l_m}_t(p_i),\, \mathcal{T}^{l_n}_{t'}(p_j) \rangle,$   (14)

where $\mathcal{T}^{l}_t(p) \in \mathbb{R}^{D}$.
In short, both the spatial and the temporal pairwise terms are composed as Gram matrices of spatial and temporal embeddings, $A_t = \mathcal{A}_t^\top \mathcal{A}_t$ and $T_{t,t'} = \mathcal{T}_t^\top \mathcal{T}_{t'}$.
From Eq. 15, we can express the $t$-th block of the product $(A + \lambda I)\,x$ as follows:

$\lambda x_t + \mathcal{A}_t^\top \mathcal{A}_t\, x_t + \sum_{t' \neq t} \mathcal{T}_t^\top \mathcal{T}_{t'}\, x_{t'} = b_t$   (16)

which can be compactly written as

$\big(\lambda I + \mathcal{A}_t^\top \mathcal{A}_t\big)\, x_t + \sum_{t' \neq t} \mathcal{T}_t^\top \mathcal{T}_{t'}\, x_{t'} = b_t.$   (17)

We will use Eq. 17 to derive gradient expressions for $\frac{\partial \mathcal{L}}{\partial \mathcal{A}_t}$ and $\frac{\partial \mathcal{L}}{\partial \mathcal{T}_t}$.
A Gradients of the Unary Terms
As in [6, 7], the gradients of the unary terms are obtained from the solution of the following system of linear equations:

$(A + \lambda I)\, \frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial x},$   (18)

where $\mathcal{L}$ is the network loss. Once we have $\frac{\partial \mathcal{L}}{\partial b}$, we use it to compute the gradients of the spatiotemporal embeddings.
B Gradients of the Spatial Embeddings
We begin with the observation that computing $\frac{\partial \mathcal{L}}{\partial \mathcal{A}_t}$ requires us to first derive the expression for $\frac{\partial \mathcal{L}}{\partial \hat{A}}$, where $\hat{A} = \mathcal{A}_t^\top \mathcal{A}_t$. To this end, we ignore terms from Eq. 17 that do not depend on $\mathcal{A}_t$ and write it as $\hat{A}\, x_t = b_t$. We now use the result from [6, 7] that the gradients of $\hat{A}$ are expressed as

$\frac{\partial \mathcal{L}}{\partial \mathrm{vec}(\hat{A})} = -\, x_t \otimes \frac{\partial \mathcal{L}}{\partial b_t},$   (19)

where $\otimes$ denotes the Kronecker product operator.

To compute $\frac{\partial \mathcal{L}}{\partial \mathcal{A}_t}$, we use the chain rule of differentiation as follows:

$\frac{\partial \mathcal{L}}{\partial \mathrm{vec}(\mathcal{A}_t)} = \left( \frac{\partial\, \mathrm{vec}(\hat{A})}{\partial\, \mathrm{vec}(\mathcal{A}_t)} \right)^{\!\top} \frac{\partial \mathcal{L}}{\partial \mathrm{vec}(\hat{A})},$   (20)

where $\hat{A} = \mathcal{A}_t^\top \mathcal{A}_t$, by definition. We know the expression for $\frac{\partial \mathcal{L}}{\partial \mathrm{vec}(\hat{A})}$ from Eq. 19, but to obtain the expression for $\frac{\partial\, \mathrm{vec}(\hat{A})}{\partial\, \mathrm{vec}(\mathcal{A}_t)}$ we define a permutation matrix $K$ of size $DN \times DN$ (as in [14, 7]) as follows:

$K\, \mathrm{vec}(M) = \mathrm{vec}(M^\top),$   (21)

where vec is the vectorization operator that vectorizes a matrix by stacking its columns. Thus, the operator $K$ is a permutation matrix, composed of $0$s and $1$s, and has a single $1$ in each row and column. When premultiplied with another matrix, $K$ rearranges the ordering of rows of that matrix, while when postmultiplied with another matrix, $K$ rearranges its columns. Using this matrix, we can form the following expression [14]:

$\frac{\partial\, \mathrm{vec}(\hat{A})}{\partial\, \mathrm{vec}(\mathcal{A}_t)} = (I_N \otimes \mathcal{A}_t^\top) + (\mathcal{A}_t^\top \otimes I_N)\, K,$   (22)

where $I_N$ is the $N \times N$ identity matrix. Substituting Eq. 19 and Eq. 22 into Eq. 20, we obtain:

$\frac{\partial \mathcal{L}}{\partial \mathrm{vec}(\mathcal{A}_t)} = -\left[ (I_N \otimes \mathcal{A}_t^\top) + (\mathcal{A}_t^\top \otimes I_N)\, K \right]^{\top} \left( x_t \otimes \frac{\partial \mathcal{L}}{\partial b_t} \right),$   (23)

which, written in matrix form, is $\frac{\partial \mathcal{L}}{\partial \mathcal{A}_t} = -\,\mathcal{A}_t \left( x_t\, \frac{\partial \mathcal{L}}{\partial b_t}^\top + \frac{\partial \mathcal{L}}{\partial b_t}\, x_t^\top \right)$.
C Gradients of Temporal Embeddings
As in the last section, from Eq. 17, we ignore any terms that do not depend on or and write it as .
Using the strategies in the previous section and the sum rule of differentiation, the gradients of the temporal embeddings are given by the following form:
(24) 
References
 [1] Y. Adi, J. Keshet, E. Cibelli, and M. Goldrick. Sequence segmentation using joint RNN and structured prediction models. In ICASSP, pages 2422–2426, 2017.
 [2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.
 [3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A highdefinition ground truth database. Pattern Recognition Letters, 2008.
 [4] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, 2008.
 [5] S. Caelles, K.K. Maninis, J. PontTuset, L. LealTaixé, D. Cremers, and L. Van Gool. Oneshot video object segmentation. In CVPR, 2017.
 [6] S. Chandra and I. Kokkinos. Fast, exact and multiscale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, 2016.
 [7] S. Chandra and I. Kokkinos. Dense and lowrank Gaussian CRFs using deep embeddings. In ICCV, 2017.
 [8] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. ICLR, 2015.
 [9] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016.

 [10] L.-C. Chen, G. Papandreou, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. ICCV, 2015.
 [11] L.-C. Chen, A. G. Schwing, A. L. Yuille, and R. Urtasun. Learning deep structured models. In ICML, 2015.
 [12] M. Cordts, M. Omran, S. Ramos, T. Scharwachter, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. CVPR, 2016.
 [13] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Longterm recurrent convolutional networks for visual recognition and description. In CVPR, pages 2625–2634, 2015.
 [14] P. L. Fackler. Notes on matrix calculus. 2005.
 [15] R. Gadde, V. Jampani, and P. V. Gehler. Semantic video CNNs through representation warping. In ICCV, 2017.
 [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. ICCV, 2017.
 [17] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, Jul 2017.
 [18] S. Jain, B. Xiong, and K. Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. arXiv:1701.05384, 2017.
 [19] V. Jampani, R. Gadde, and P. V. Gehler. Video propagation networks. In CVPR, 2017.
 [20] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1175–1183. IEEE, 2017.
 [21] X. Jin, X. Li, H. Xiao, X. Shen, Z. Lin, J. Yang, Y. Chen, J. Dong, L. Liu, Z. Jie, J. Feng, and S. Yan. Video scene parsing with predictive feature learning. CoRR, abs/1612.00119, 2016.

 [22] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
 [23] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv:1511.02680, 2015.
 [24] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele. Lucid data dreaming for object tracking. arXiv preprint arXiv:1703.09554, 2017.
 [25] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with gaussian edge potentials. In NIPS, 2011.
 [26] A. Kundu, V. Vineet, and V. Koltun. Feature space optimization for semantic video segmentation. In CVPR, pages 3168–3175, 2016.
 [27] X. Li, Y. Qi, Z. Wang, K. Chen, Z. Liu, J. Shi, P. Luo, X. Tang, and C. C. Loy. Video object segmentation with reidentification. CVPR workshops  The 2017 DAVIS Challenge on Video Object segmentation, 2017.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
 [29] D. Nilsson and C. Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. CoRR, abs/1612.08871, 2016.
 [30] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In arXiv preprint arXiv:1505.04366, 2015.
 [31] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung. Learning video object segmentation from static images. In CVPR, 2017.
 [32] F. Perazzi, J. PontTuset, B. McWilliams, L. Van Gool, M. Gross, and A. SorkineHornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
 [33] J. PontTuset, F. Perazzi, S. Caelles, P. Arbeláez, A. SorkineHornung, and L. Van Gool. The 2017 davis challenge on video object segmentation. arXiv:1704.00675, 2017.
 [34] E. Shelhamer, K. Rakelly, J. Hoffman, and T. Darrell. Clockwork convnets for video semantic segmentation. CoRR, abs/1608.03609, 2016.
 [35] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf, 1994.
 [36] R. Vemulapalli, O. Tuzel, M.Y. Liu, and R. Chellapa. Gaussian conditional random field network for semantic segmentation. In CVPR, June 2016.
 [37] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. Reseg: A recurrent neural networkbased model for semantic segmentation. In CVPR workshop, 2016.
 [38] P. Voigtlaender and B. Leibe. Online adaptation of convolutional neural networks for video object segmentation. In BMVC, 2017.
 [39] T.H. Vu, A. Osokin, and I. Laptev. Contextaware CNNs for person head detection. In ICCV, pages 2893–2901, 2015.
 [40] F. Yu and V. Koltun. Multiscale context aggregation by dilated convolutions. ICLR, 2016.
 [41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. CoRR, abs/1612.01105, 2016.
 [42] S. Zheng, S. Jayasumana, B. RomeraParedes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
 [43] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. CoRR, abs/1611.07715, 2016.