Deep Spatio-Temporal Random Fields for Efficient Video Segmentation

07/03/2018 ∙ by Siddhartha Chandra, et al. ∙ Facebook Inria 0

In this work we introduce a time- and memory-efficient method for structured prediction that couples neuron decisions across both space and time. We show that we are able to perform exact and efficient inference on a densely connected spatio-temporal graph by capitalizing on recent advances in deep Gaussian Conditional Random Fields (GCRFs). Our method, called VideoGCRF, (a) is efficient, (b) has a unique global minimum, and (c) can be trained end-to-end alongside contemporary deep networks for video understanding. We experiment with multiple connectivity patterns in the temporal domain, and present empirical improvements over strong baselines on the tasks of both semantic and instance segmentation of videos.




1 Introduction

Video understanding remains largely unsolved despite significant improvements in image understanding over the past few years. The accuracy of current image classification and semantic segmentation models is not yet matched in action recognition and video segmentation, to some extent due to the lack of large-scale benchmarks, but also due to the complexity introduced by the time variable. Combined with the increase in memory and computation demands, video understanding poses additional challenges that call for novel methods.

Our objective in this work is to couple the decisions taken by a neural network in time, in a manner that allows information to flow across frames and thereby result in decisions that are consistent both spatially and temporally. Towards this goal we pursue a structured prediction approach, where the structure of the output space is exploited in order to train classifiers of higher accuracy. For this we introduce VideoGCRF, an extension into video segmentation of the Deep Gaussian Random Field (DGRF) technique recently proposed for single-frame structured prediction in [6, 7].

Figure 1: Overview of our VideoGCRF approach: we jointly segment multiple images by passing them firstly through a fully convolutional network to obtain per-pixel class scores (‘unary’ terms U), alongside with spatial (S) and temporal (T) embeddings. We couple predictions at different spatial and temporal positions in terms of the inner product of their respective embeddings, shown here as arrows pointing to a graph edge. The final prediction is obtained by solving a linear system; this can eliminate spurious responses, e.g. on the left pavement, by diffusing the per-pixel node scores over the whole spatio-temporal graph. The CRF and CNN architecture is jointly trained end-to-end, while CRF inference is exact and particularly efficient.

We show that our algorithm can be used for a variety of video segmentation tasks: semantic segmentation (CamVid dataset), instance tracking (DAVIS dataset), and a combination of instance segmentation with Mask-RCNN-style object detection, customized in particular for the person class (DAVIS Person dataset).

Our work inherits all favorable properties of the DGRF method. In particular, our method (a) delivers exact inference results through the solution of a linear system, rather than relying on approximate mean-field inference as in [25, 26]; (b) allows exact computation of the gradient during back-propagation, thereby alleviating the need for the memory-demanding back-propagation-through-time used in [42]; (c) makes it possible to use non-parametric pairwise terms, rather than confining ourselves to pairwise terms of a predetermined form as in [25, 26]; and (d) facilitates inference on both densely- and sparsely-connected graphs, as well as blends of both graph topologies.

Within the literature on spatio-temporal structured prediction, the work that is closest in spirit to ours is the work of [26] on Feature Space Optimization. Even though our works share several conceptual similarities, our method is entirely different at the technical level. In our case spatio-temporal inference is implemented as a structured, ‘lateral connection’ layer that is trained jointly with the feed-forward CNNs, while the method of [26] is applied at a post-processing stage to refine a classifier’s results.

1.1 Previous work

Structured prediction is commonly used by semantic segmentation algorithms [6, 7, 8, 10, 11, 36, 39, 42] to capture spatial constraints within an image frame. These approaches may be extended naively to videos, by making predictions individually for each frame. However, in doing so, we ignore the temporal context, thereby ignoring the tendency of consecutive video frames to be similar to each other. To address this shortcoming, a number of deep learning methods employ some kind of structured prediction strategy to ensure temporal coherence in the predictions. Initial attempts to capture spatio-temporal context involved designing deep learning architectures that implicitly learn interactions between consecutive image frames. A number of subsequent approaches used Recurrent Neural Networks (RNNs) [1, 13] to capture interdependencies between the image frames. Other approaches have exploited optical flow computed from state-of-the-art approaches [17] as additional input to the network [15, 18]. Finally, [26] explicitly capture temporal constraints via pairwise terms over probabilistic graphical models, but operate post-hoc, i.e. are not trained jointly with the underlying network.

In this work, we focus on three problems, namely (i) semantic and (ii) instance video segmentation as well as (iii) semantic instance tracking. Semantic instance tracking refers to the problem where we are given the ground truth for the first frame of a video, and the goal is to predict these instance masks on the subsequent video frames. The first set of approaches to address this task starts with a deep network pretrained for image classification on large datasets such as ImageNet or COCO, and finetunes it on the first frame of the video with labeled ground truth [5, 38], optionally leveraging a variety of data augmentation regimes [24] to increase robustness to scale/pose variation and occlusion/truncation in the subsequent frames of the video. The second set of approaches poses this problem as a warping problem [31], where the goal is to warp the segmentation of the first frame using the images and optical flow as additional inputs [19, 24, 27].

A number of approaches have attempted to exploit temporal information to improve over static image segmentation approaches for video segmentation. Clockwork convnets [34] were introduced to exploit the persistence of features across time and schedule the processing of some layers at different update rates according to their semantic stability. Similar feature flow propagation ideas were employed in [26, 43]. In [29] segmentations are warped using the flow and spatial transformer networks. Rather than using optical flow, the prediction of future segmentations [21] may also temporally smooth results obtained frame-by-frame. Finally, the state-of-the-art on this task [15] improves over PSPNet [41] by warping the feature maps of a static segmentation CNN to emulate a video segmentation network.

Figure 2: VideoGCRF schematic. Our network takes in the input video frames and, in the feed-forward mode, delivers the per-frame unaries, spatial embeddings, and temporal embeddings. Our VideoGCRF module collects these and solves the inference problem in Eq. 2 to recover the predictions. During the backward pass, the gradients of the predictions are delivered to the VideoGCRF module, which uses them to compute the gradients of the unary terms as well as the spatio-temporal embeddings and back-propagates them through the network.

2 VideoGCRF

In this work we introduce VideoGCRF, extending the Deep Gaussian CRF approach introduced in [6, 7] to operate efficiently for video segmentation. Introducing a CRF allows us to couple the decisions between sets of variables that should be influencing each other; spatial connections were already explored in [6, 7] and can be understood as propagating information from distinctive image positions (e.g. the face of a person) to more ambiguous regions (e.g. the person’s clothes). In this work we also introduce temporal connections to integrate information over time, allowing us for instance to correctly segment frames where the object is not clearly visible by propagating information from different time frames.

We consider that the input to our system is a video $\mathcal{V} = \{I_1, I_2, \ldots, I_V\}$ containing $V$ frames. We denote our network's prediction as $x_v$, $v = 1, \ldots, V$, where at any frame the prediction $x_v \in \mathbb{R}^{PL}$ provides a real-valued vector of scores for the $L$ classes for each of the $P$ image patches; for brevity, we denote by $N = P \times L$ the number of prediction variables. The scores corresponding to a patch can be understood as inputs to a softmax function that yields the label posteriors.

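The Gaussian-CRF (or G-CRF) model defines a joint posterior distribution through a Gaussian multivariate density for a video $\mathcal{V}$ as:

$$p(x \mid \mathcal{V}) \propto \exp\left(-\tfrac{1}{2}\, x^\top A_{\mathcal{V}}\, x + B_{\mathcal{V}}^\top x\right),$$

where $B_{\mathcal{V}}$, $A_{\mathcal{V}}$ denote the ‘unary’ and ‘pairwise’ terms respectively, with $B_{\mathcal{V}} \in \mathbb{R}^{NV}$ and $A_{\mathcal{V}} \in \mathbb{R}^{NV \times NV}$ for $N$ prediction variables per frame and $V$ frames. In the rest of this work we assume that $A$, $B$ depend on the input video and we omit the conditioning on $\mathcal{V}$ for convenience.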

What is particular about the G-CRF is that, assuming the matrix of pairwise terms $A$ is positive-definite, Maximum-A-Posteriori (MAP) inference merely amounts to solving the system of linear equations $Ax = B$. In fact, as in [6], we can drop the probabilistic formulation and treat the G-CRF as a structured prediction module that is part of a deep network. In the forward pass, the unary and pairwise terms $B$ and $A$, delivered by a feed-forward CNN described in Sec. 2.1, are fed to the G-CRF module, which performs inference to recover the prediction $x$ by solving the system of linear equations

$$(A + \lambda I)\, x = B, \qquad (1)$$

where $\lambda$ is a small positive constant added to the diagonal entries of $A$ to make it positive definite.

For the single-frame case ($V = 1$ frame), the iterative conjugate gradient [35] algorithm was used to rapidly solve the resulting system for both sparse [6] and fully connected [7] graphs. In particular, the speed of the resulting inference is on the order of 30ms on the GPU, almost two orders of magnitude faster than the implementation of DenseCRF [25], while at the same time giving more accurate results.
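The forward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the standard G-CRF form in which the prediction solves a linear system whose matrix is the pairwise matrix with a small constant added to its diagonal. The function name `gcrf_inference`, the parameter `lam`, the random stand-ins for CNN outputs, and the toy sizes are all assumptions made for illustration, and a direct dense solver stands in for the conjugate gradient solver used in the paper.

```python
import numpy as np

def gcrf_inference(A, b, lam):
    """Solve (A + lam * I) x = b for the prediction x (dense direct solve)."""
    n = A.shape[0]
    return np.linalg.solve(A + lam * np.eye(n), b)

rng = np.random.default_rng(0)
N, D, lam = 12, 4, 0.5            # hypothetical sizes and regulariser
E = rng.standard_normal((D, N))   # stand-in for CNN embeddings
A = E.T @ E                       # symmetric positive semi-definite pairwise matrix
b = rng.standard_normal(N)        # stand-in for CNN unary scores

x = gcrf_inference(A, b, lam)
# Residual of the linear system; it should be at round-off level.
residual = np.linalg.norm((A + lam * np.eye(N)) @ x - b)
```

Because the pairwise matrix is positive semi-definite and `lam` is positive, the system matrix is positive definite, so the solution is unique.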

Our first contribution in this work consists in designing the structure of the matrix $A$ so that the resulting system solution remains manageable as the number of frames increases. Once we describe how we structure $A$, we will turn to learning our network in an end-to-end manner.

2.1 Spatio-temporal connections

In order to capture the spatio-temporal context, we are interested in capturing two kinds of pairwise interactions: (a) pairwise terms between patches in the same frame and (b) pairwise terms between patches in different frames.

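Denoting the spatial pairwise terms at frame $v$ by $A_v$, the temporal pairwise terms between frames $u, v$ as $T_{u,v}$, and the per-frame predictions and unary terms by $x_v$ and $b_v$, we can rewrite Eq. 1 as follows:

$$\begin{bmatrix} A_1 + \lambda I & T_{1,2} & \cdots & T_{1,V} \\ T_{2,1} & A_2 + \lambda I & \cdots & T_{2,V} \\ \vdots & \vdots & \ddots & \vdots \\ T_{V,1} & T_{V,2} & \cdots & A_V + \lambda I \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_V \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_V \end{bmatrix}, \qquad (2)$$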
where we group the variables by frames. Solving this system allows us to couple predictions across all video frames , positions, and labels . If furthermore and then the resulting system is positive definite for any positive .

We now describe how the pairwise terms are constructed through our CNN, and then discuss acceleration of the linear system in Eq. 2 by exploiting its structure.

Spatial Connections: We define the spatial pairwise terms in terms of inner products of pixel-wise embeddings, as in [7]. At frame we couple the scores for a pair of patches taking the labels respectively as follows:


where and , , and is the embedding associated to point . In Eq. 3 the terms are image-dependent and delivered by a fully-convolutional “embedding” branch that feeds from the same CNN backbone architecture, and is denoted by in Fig. 2.

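The implication of this form is that we can afford inference with a fully-connected graph. In particular, the spatial pairwise matrix is a Gram matrix, $A_v = \mathcal{A}_v^\top \mathcal{A}_v$, where $\mathcal{A}_v \in \mathbb{R}^{D \times N}$ stacks the $D$-dimensional embeddings of the $N$ prediction variables, so the rank of the block matrix $A_v$ equals the embedding dimension $D$. This means that both the memory- and time-complexity of solving the linear system drop from $O(N^2)$ to $O(ND)$, which can be several orders of magnitude smaller.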

Temporal Connections: Turning to the temporal pairwise terms, we couple patches coming from different frames taking the labels respectively as


where . The respective embedding terms are delivered by a separate temporal-embedding branch of the network, shown in Fig. 2.

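In short, both the spatial and the temporal pairwise terms are composed as Gram matrices of spatial and temporal embeddings, as $A_v = \mathcal{A}_v^\top \mathcal{A}_v$ and $T_{u,v} = \mathcal{T}_u^\top \mathcal{T}_v$, where $\mathcal{A}_v$ and $\mathcal{T}_v$ collect the spatial and temporal embeddings of frame $v$ as columns. We visualize our spatio-temporal pairwise terms in Fig. 3.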
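To make the benefit of the Gram structure concrete, the sketch below multiplies a Gram matrix with a vector without ever forming the full pairwise matrix, and checks that the rank of the Gram matrix equals the embedding dimension. The name `gram_matvec`, the random embeddings, and the sizes `D` and `N` are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def gram_matvec(E, p, lam=0.0):
    """Return (E.T @ E + lam * I) @ p without forming the N x N Gram matrix.

    E has shape (D, N): one D-dimensional embedding per prediction variable.
    Cost and memory are O(N * D) instead of O(N ** 2).
    """
    return E.T @ (E @ p) + lam * p

rng = np.random.default_rng(0)
D, N = 8, 200                      # hypothetical embedding size and variable count
E = rng.standard_normal((D, N))
p = rng.standard_normal(N)

fast = gram_matvec(E, p, lam=0.1)
# Reference computation that does build the dense N x N matrix.
dense = (E.T @ E + 0.1 * np.eye(N)) @ p
gram_rank = np.linalg.matrix_rank(E.T @ E)
```

The bracketed product `E @ p` is a vector of length `D`, which is why the dense pairwise matrix never needs to be stored.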

VideoGCRF in Deep Learning: Our proposed spatio-temporal Gaussian CRF (VideoGCRF) can be viewed as a generic deep learning module for spatio-temporal structured prediction, and as such can be plugged in at any stage of a deep learning pipeline: either as the last layer, i.e. the classifier, as in our semantic segmentation experiments (Sec. 3.3), or in the low-level feature learning stage, as in our instance segmentation experiments (Sec. 3.1).

2.2 Efficient Conjugate-Gradient Implementation

We now describe an efficient implementation of the conjugate gradient method [35] (Algorithm 1) that is customized for our VideoGCRFs.

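1: procedure ConjugateGradient
2:     Input: $A$, $B$, $x_0$    Output: $x$ such that $Ax = B$
3:     $r_0 := B - A x_0$;    $p_0 := r_0$;    $k := 0$
4:     repeat
5:         $\alpha_k := \dfrac{r_k^\top r_k}{p_k^\top A p_k}$
6:         $x_{k+1} := x_k + \alpha_k p_k$
7:         $r_{k+1} := r_k - \alpha_k A p_k$
8:         if $r_{k+1}$ is sufficiently small, then exit loop
9:         $\beta_k := \dfrac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$
10:        $p_{k+1} := r_{k+1} + \beta_k p_k$
11:        $k := k + 1$
12:     end repeat
Algorithm 1 Conjugate Gradient Algorithm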
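The conjugate gradient iteration can be sketched in NumPy as follows. This is the textbook method written against a user-supplied `matvec` callable, not the authors' Caffe2/CUDA implementation; the function name, the tolerance, the iteration cap, and the toy system are assumptions made for illustration.

```python
import numpy as np

def conjugate_gradient(matvec, b, x0=None, tol=1e-10, max_iter=1000):
    """Solve M x = b for a symmetric positive-definite M given only v -> M v."""
    x = np.zeros_like(b, dtype=float) if x0 is None else np.array(x0, dtype=float)
    r = b - matvec(x)            # initial residual
    p = r.copy()                 # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break                # residual is sufficiently small
        Mp = matvec(p)           # the only place the system matrix is used
        alpha = rs_old / (p @ Mp)
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

rng = np.random.default_rng(0)
D, N, lam = 5, 30, 0.5                         # hypothetical sizes
E = rng.standard_normal((D, N))
b = rng.standard_normal(N)

# The system matrix E.T @ E + lam * I is applied implicitly.
x_cg = conjugate_gradient(lambda v: E.T @ (E @ v) + lam * v, b)
x_direct = np.linalg.solve(E.T @ E + lam * np.eye(N), b)
```

Passing a callable rather than a matrix is a deliberate choice here: the cost of each iteration is then dominated by one matrix-vector product, which is exactly the operation the rest of this section speeds up.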

The computational complexity of the conjugate gradient algorithm is determined by the computation of the matrix-vector product $A p_k$, corresponding to line 7 of Algorithm 1 (we drop the subscript $k$ for convenience).

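We now discuss how to efficiently compute $q = Ap$ in a manner that is customized for this work. In our case, the matrix-vector product $q = Ap$ is expressed in terms of the spatial ($\mathcal{A}_v$) and temporal ($\mathcal{T}_v$) embeddings as follows:

$$\begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_V \end{bmatrix} = \begin{bmatrix} \mathcal{A}_1^\top \mathcal{A}_1 + \lambda I & \mathcal{T}_1^\top \mathcal{T}_2 & \cdots & \mathcal{T}_1^\top \mathcal{T}_V \\ \mathcal{T}_2^\top \mathcal{T}_1 & \mathcal{A}_2^\top \mathcal{A}_2 + \lambda I & \cdots & \mathcal{T}_2^\top \mathcal{T}_V \\ \vdots & \vdots & \ddots & \vdots \\ \mathcal{T}_V^\top \mathcal{T}_1 & \mathcal{T}_V^\top \mathcal{T}_2 & \cdots & \mathcal{A}_V^\top \mathcal{A}_V + \lambda I \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_V \end{bmatrix}, \qquad (5)$$

where $p_v$ and $q_v$ denote the blocks of $p$ and $q$ belonging to frame $v$.

From Eq. 5, we can express $q_v$ as follows:

$$q_v = \mathcal{A}_v^\top \mathcal{A}_v\, p_v + \lambda p_v + \sum_{u \neq v} \mathcal{T}_v^\top \mathcal{T}_u\, p_u. \qquad (6)$$

One optimization that we exploit in computing $q_v$ efficiently is that we do not ‘explicitly’ compute the matrix-matrix products $\mathcal{A}_v^\top \mathcal{A}_v$ or $\mathcal{T}_v^\top \mathcal{T}_u$. We note that $\mathcal{A}_v^\top \mathcal{A}_v\, p_v$ can be decomposed into two matrix-vector products as $\mathcal{A}_v^\top (\mathcal{A}_v p_v)$, where the expression in the brackets is evaluated first and yields a vector, which can then be multiplied with the matrix outside the brackets. This simplification alleviates the need to keep the full pairwise matrices in memory, and is computationally cheaper.

Further, from Eq. 6, we note that the computation of every $q_v$ requires the matrix-vector products $\mathcal{T}_u p_u$. A black-box implementation would therefore involve redundant computations, which we eliminate by rewriting Eq. 6 as:

$$q_v = \mathcal{A}_v^\top (\mathcal{A}_v p_v) + \lambda p_v + \mathcal{T}_v^\top \Big( \sum_{u} \mathcal{T}_u p_u \Big) - \mathcal{T}_v^\top (\mathcal{T}_v p_v). \qquad (7)$$

This rephrasing allows us to precompute and cache $\sum_u \mathcal{T}_u p_u$, thereby eliminating redundant calculations.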
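The caching idea can be illustrated with a small NumPy sketch. It assumes the block structure described in this section: diagonal blocks are Gram matrices of spatial embeddings plus a constant on the diagonal, and off-diagonal blocks are products of temporal embeddings of two different frames. The names `cached_matvec` and `dense_system_matrix`, the letters `S` and `T` for the per-frame embedding lists, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def cached_matvec(S, T, p, lam):
    """Matrix-vector product with the spatio-temporal system matrix.

    S, T: lists of (D, N) spatial / temporal embeddings, one per frame.
    p:    list of (N,) vectors, one per frame.
    """
    # Sum of temporal projections over all frames, computed once and reused.
    cached = sum(T_u @ p_u for T_u, p_u in zip(T, p))
    q = []
    for S_v, T_v, p_v in zip(S, T, p):
        spatial = S_v.T @ (S_v @ p_v)             # bracket first: two matvecs
        temporal = T_v.T @ (cached - T_v @ p_v)   # remove the frame's own term
        q.append(spatial + lam * p_v + temporal)
    return q

def dense_system_matrix(S, T, lam):
    """Reference: build the full (V*N) x (V*N) matrix explicitly."""
    V, N = len(S), S[0].shape[1]
    M = np.zeros((V * N, V * N))
    for v in range(V):
        for u in range(V):
            block = S[v].T @ S[v] + lam * np.eye(N) if u == v else T[v].T @ T[u]
            M[v * N:(v + 1) * N, u * N:(u + 1) * N] = block
    return M

rng = np.random.default_rng(0)
V, N, D, lam = 3, 6, 4, 0.1                      # hypothetical sizes
S = [rng.standard_normal((D, N)) for _ in range(V)]
T = [rng.standard_normal((D, N)) for _ in range(V)]
p = [rng.standard_normal(N) for _ in range(V)]

q_fast = np.concatenate(cached_matvec(S, T, p, lam))
q_dense = dense_system_matrix(S, T, lam) @ np.concatenate(p)
```

With the cache, each frame needs a constant number of small matrix-vector products, so the total work grows linearly rather than quadratically with the number of frames.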

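While so far we have assumed dense connections between the image frames, if we have sparse temporal connections (Sec. 3.1), i.e. each frame is connected to a subset of neighbouring frames in the temporal domain, the linear system matrix is sparse, and the matrix-vector product $q = Ap$ is written per frame as

$$q_v = \mathcal{A}_v^\top (\mathcal{A}_v p_v) + \lambda p_v + \sum_{u \in \mathcal{N}(v)} \mathcal{T}_v^\top \mathcal{T}_u\, p_u, \qquad (8)$$

where $\mathcal{A}_v$ and $\mathcal{T}_v$ are the spatial and temporal embeddings of frame $v$, $p_v$ and $q_v$ are the corresponding blocks of $p$ and $q$, and $\mathcal{N}(v)$ denotes the temporal neighbourhood of frame $v$. For very sparse connections caching may not be necessary because these involve little or no redundant computations.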
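A sparse temporal neighbourhood changes only which temporal projections are summed for each frame. The sketch below assumes a simple chain neighbourhood in which every frame is linked to the frames within a fixed radius; `chain_neighbours`, `sparse_matvec`, the letters `S` and `T`, and the sizes are illustrative assumptions, and the neighbourhoods used in the experiments may differ.

```python
import numpy as np

def chain_neighbours(num_frames, radius):
    """Neighbour lists: frame v is linked to frames within `radius` of v."""
    return [[u for u in range(max(0, v - radius), min(num_frames, v + radius + 1))
             if u != v]
            for v in range(num_frames)]

def sparse_matvec(S, T, p, lam, neighbours):
    """System matrix-vector product when only neighbouring frames are coupled."""
    Tp = [T_u @ p_u for T_u, p_u in zip(T, p)]   # one projection per frame
    q = []
    for v, nbrs in enumerate(neighbours):
        message = np.zeros_like(Tp[v])
        for u in nbrs:
            message += Tp[u]
        q.append(S[v].T @ (S[v] @ p[v]) + lam * p[v] + T[v].T @ message)
    return q

rng = np.random.default_rng(0)
V, N, D, lam = 5, 6, 3, 0.1                      # hypothetical sizes
S = [rng.standard_normal((D, N)) for _ in range(V)]
T = [rng.standard_normal((D, N)) for _ in range(V)]
p = [rng.standard_normal(N) for _ in range(V)]

neighbours = chain_neighbours(V, radius=1)
q = sparse_matvec(S, T, p, lam, neighbours)

# Reference computed with explicit N x N blocks, for checking only.
q_ref = []
for v in range(V):
    ref = (S[v].T @ S[v] + lam * np.eye(N)) @ p[v]
    for u in neighbours[v]:
        ref = ref + (T[v].T @ T[u]) @ p[u]
    q_ref.append(ref)
```

The per-frame projections are still computed only once here; with very few neighbours per frame the saving over recomputing them is small, in line with the remark above.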

2.3 Backward Pass

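Since we rely on the Gaussian CRF, we can obtain the back-propagation equations for the gradient of the loss with respect to the unary terms and the spatial/temporal embedding terms in closed form. Thanks to this we do not have to perform back-propagation through time, which was needed e.g. in [42] for DenseCRF inference. Following [7], the gradients of the unary terms are obtained from the solution of the following system:

$$(A + \lambda I)\, \frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial x}, \qquad (9)$$

where $\mathcal{L}$ is the network loss, $x$ the prediction, and $B$ the unary terms.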
Once these are computed, the gradients of the spatial embeddings can be computed as follows:


while the gradients of the temporal embeddings are given by the following form:


where is a permutation matrix, as in [7].
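In the matrix-calculus literature, the permutation matrix that appears in gradient expressions of this kind is usually the commutation matrix, which maps the column-stacked vectorisation of a matrix to that of its transpose. Assuming that this is the matrix meant here, the sketch below builds it explicitly and checks its defining property. The names `commutation_matrix` and `vec` are illustrative.

```python
import numpy as np

def vec(M):
    """Vectorise a matrix by stacking its columns."""
    return M.reshape(-1, order="F")

def commutation_matrix(m, n):
    """Permutation matrix K with K @ vec(M) == vec(M.T) for any (m, n) matrix M."""
    K = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # M[i, j] sits at index j*m + i in vec(M) and at i*n + j in vec(M.T).
            K[i * n + j, j * m + i] = 1.0
    return K

rng = np.random.default_rng(0)
m, n = 3, 4
M = rng.standard_normal((m, n))
K = commutation_matrix(m, n)

lhs = K @ vec(M)
rhs = vec(M.T)
row_sums = K.sum(axis=1)
col_sums = K.sum(axis=0)
```

In practice such a matrix is rarely materialised; a reshape and transpose of the underlying array achieves the same reordering at no extra memory cost.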

2.4 Implementation and Inference Time

Our implementation is GPU based and exploits fast CUDA-BLAS linear algebra routines. It is implemented as a module in the Caffe2 library. For spatial and temporal embeddings of size , classes (Sec. 3.3), a

input image, and network stride of

, our frame inferences take s, s and s on average respectively. Without the caching procedure described in Sec. 2.2, the frame inference takes s on average. This is orders of magnitude faster than the DenseCRF method [25] which takes

s on average for spatial CRF for a single input frame. These timing statistics were estimated on a

GTX-1080 GPU.

Figure 3: Visualization of instance segmentation through VideoGCRF: In row 1 we focus on a single point of the CRF graph, shown as a cross, and show as a heatmap its spatial (intra-frame) and temporal (inter-frame) affinities to all other graph nodes. These correspond to a single column of the linear system in Eq. 2. In row 2 we show the predictions that would be obtained by frame-by-frame segmentation, relying exclusively on the FCN’s unary terms, while in row 3 we show the results obtained after solving the VideoGCRF inference problem. We observe that in frame-by-frame segmentation a second camel is incorrectly detected due to its similar appearance properties. However, VideoGCRF inference exploits temporal context and focuses solely on the correct object.

3 Experiments

Figure 4: Temporal neighbourhoods in our ablation study: boxes denote video frames and the arcs connecting them are pairwise connections. The frame in red has all neighbours present in the temporal context.
Figure 5: Spatio-temporal structured prediction in Mask-RCNN. Here we use CRFs in the feature learning stage before the ROI-Pooling (and not as the final classifier). This helps learn mid-level features which are better aware of the spatio-temporal context.

Experimental Setup. We describe the basic setup followed for our experiments. As in [7], we use a phase training strategy for our methods. We first train the unary network without the spatio-temporal embeddings. We next train the subnetwork delivering the spatio-temporal embeddings with the softmax cross-entropy loss to enforce the following objectives: , and , where are the ground truth labels for pixels

. Finally, we combine the unary and pairwise networks, and train them together in end-to-end fashion. Unless otherwise stated, we use stochastic gradient descent to train our networks with a momentum of

and a weight decay of . For segmentation experiments, we use a base-learning rate of for training the unaries, for training the embeddings, and for finetuning the unary and embeddings together, using a polynomial-decay with power of . For the instance segmentation network, we use a single stage training for the unary and pairwise streams: we train the network for K iterations, with a base learning rate of which is reduced to after K iterations. The weight decay is . For our instance tracking experiments, we use unaries from [38] and do not refine them, rather use them as an input to our network. We employ horizontal flipping and scaling by factors between and during training/testing for all methods, except in the case of instance segmentation experiments (Sec. 3.1).
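For readers unfamiliar with polynomial learning-rate decay, the sketch below uses its common definition, in which the base rate is scaled by one minus the training progress, raised to a fixed power. The function name `poly_lr` and every numeric value in the example are placeholders chosen for illustration; they are not the settings used in these experiments.

```python
def poly_lr(base_lr, iteration, max_iter, power):
    """Polynomial decay: base_lr * (1 - iteration / max_iter) ** power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power

# Placeholder values only: base rate 1e-3, 1000 iterations, power 0.9.
schedule = [poly_lr(1e-3, it, 1000, 0.9) for it in (0, 500, 1000)]
```

The schedule starts at the base rate and reaches zero at the final iteration; a power below one keeps the rate higher for longer than a linear ramp would.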

Datasets. We use three datasets for our experiments:

DAVIS. The DAVIS dataset [32] consists of training and validation videos containing and frames respectively. Each video comes with manually annotated segmentation masks for foreground object instances.

DAVIS-Person. While the DAVIS dataset [33] provides densely annotated frames for instance segmentation, it lacks object category labels. For category prediction tasks such as semantic and instance segmentation, we create a subset of the DAVIS dataset containing videos from the category person. By means of visual inspection, we select and video sequences from the training and validation sets respectively containing training and validation images, each containing at least one person. Since the DAVIS dataset comes with only the foreground instances labeled, we manually annotate the image regions containing unannotated person instances with the do-not-care label. These image regions do not participate in the training or the evaluation. We call this the DAVIS-person dataset.

CamVid. The CamVid dataset [4, 3]

, is a dataset containing videos of driving scenarios for urban scene understanding. It comes with

images annotated with pixel-level category labels at fps. Although the original dataset comes with class-labels, as in [2, 26, 20], we predict semantic classes and use the train-val-test split of , and frames respectively.

3.1 Ablation Study on Semantic and Instance Segmentation Tasks

In these experiments, we use the DAVIS Person dataset described in Sec. 3. The aim here is to explore the various design choices available to us when designing networks for spatio-temporal structured prediction for semantic segmentation, and proposal-based instance segmentation tasks.

Semantic Segmentation Experiments. Our first set of experiments studies the effect of varying the sizes of the spatial and temporal embeddings, the degree of the temporal connections, and multi-scale temporal connections for VideoGCRF. For these set of experiments, our baseline network, or base-net is a single resolution ResNet-101 network, with altered network strides as in [9] to produce a spatial down-sampling factor of

. The evaluation metric used is the mean pixel Intersection over Union (IoU).
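As a reference for the metric, here is a minimal sketch of mean pixel IoU over label maps. It assumes that the per-class IoU values are averaged over the classes that occur in the prediction or the ground truth, and that pixels carrying a do-not-care label are excluded; the function name `mean_iou` and the label value 255 are assumptions, and the exact averaging protocol used in the experiments is not specified here.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_label=255):
    """Mean over classes of intersection / union, skipping ignored pixels."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                      # class absent from both maps
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious))

# Toy 2 x 3 label maps; the last ground-truth pixel is a do-not-care pixel.
gt = np.array([[0, 0, 1],
               [1, 1, 255]])
pred = np.array([[0, 1, 1],
                 [1, 1, 0]])
miou = mean_iou(pred, gt, num_classes=2)
```

In this toy case the background class has IoU 1/2 and the foreground class has IoU 3/4, so the mean is 0.625.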

In Table 1 we study the effect of varying the sizes of the spatial and temporal embeddings for frame inference. Our best results are achieved at spatio-temporal embeddings of size . The improvement over the base-net is . In subsequent experiments we fix the size of our embeddings to . We next study the effect of varying the size of the temporal context and temporal neighbourhoods. The temporal context is defined as the number of video frames which are considered simultaneously in one linear system (Eq. 2). The temporal context is limited by the GPU RAM: for a ResNet-101 network, an input image of size , embeddings of size , we can currently fit frames on GB of GPU RAM. Since is smaller than the number of frames in the video, we divide the video into overlapping sets of frames, and average the predictions for the common frames.
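The overlapping-window scheme described above can be sketched as follows. The window stride, the function names `sliding_windows` and `average_overlapping`, and the toy scores are assumptions for illustration; the actual overlap used in the experiments is not stated here.

```python
import numpy as np

def sliding_windows(num_frames, context, stride):
    """(start, end) frame ranges of length `context` covering the whole video."""
    starts = list(range(0, max(num_frames - context, 0) + 1, stride))
    if starts[-1] + context < num_frames:
        starts.append(num_frames - context)   # make sure the tail is covered
    return [(s, min(s + context, num_frames)) for s in starts]

def average_overlapping(num_frames, windows, window_scores):
    """Average per-frame scores over every window that contains the frame."""
    total = np.zeros((num_frames,) + window_scores[0].shape[1:])
    count = np.zeros(num_frames)
    for (start, end), scores in zip(windows, window_scores):
        total[start:end] += scores
        count[start:end] += 1
    return total / count.reshape((-1,) + (1,) * (total.ndim - 1))

# Toy example: 7 frames, 3-frame context, stride 2, two scores per frame.
windows = sliding_windows(num_frames=7, context=3, stride=2)
# Stand-in for per-window inference output: window i predicts the constant i.
window_scores = [np.full((end - start, 2), float(i))
                 for i, (start, end) in enumerate(windows)]
averaged = average_overlapping(7, windows, window_scores)
```

Frames covered by two windows receive the mean of both predictions, while frames covered by a single window keep that window's prediction unchanged.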

The temporal neighbourhood for a frame (Fig. 4) is defined as the number of frames it is directly connected to via pairwise connections. A fully connected neighbourhood (fc) is one in which there are pairwise terms between every pair of frames available in the temporal context. We experiment with , , multiscale and fc connections. The neighbourhood connects a frame to neighbours at distances of , and (or ) frames on either side. Table 2 reports our results for different combinations of temporal neighbourhood and context. It can be seen that dense connections improve performance for smaller temporal contexts, but for a temporal context of frames, an increase in the complexity of temporal connections leads to a moderate decrease in performance. This could be a consequence of the long-range interactions having the same weight as short-range interactions. In the future we intend to mitigate this issue by complementing our embeddings with the temporal distance between frames.

base-net 81.16
VideoGCRF spatial dimension
temporal dimension 64 128 256 512
Table 1: Ablation study: mean IoU on the DAVIS-person dataset using frame fc connections. We study the effect of varying the size of the spatial & temporal embeddings.
base-net 81.16
VideoGCRF temporal neighbourhood
temporal context fc
Table 2: Ablation study: mean IoU on the DAVIS-person dataset. Here we study the effect of varying the size of the temporal context and neighbourhood.

Instance Segmentation Experiments. We now demonstrate the utility of our VideoGCRF method for the task of proposal-based instance segmentation. Our hypothesis is that coupling predictions across frames is advantageous for instance segmentation methods. We actually show that the performance of the instance segmentation methods improves as we increase the temporal context via VideoGCRF, and obtain our best results with fully-connected temporal neighbourhoods. Our baseline for this task is the Mask-RCNN framework of [16] using the ResNet-50 network as the convolutional body. The Mask-RCNN framework uses precomputed bounding box proposals for this task. It computes convolutional features on the input image using the convolutional body network, crops out the features corresponding to image regions in the proposed bounding boxes via Region-Of-Interest (RoI) pooling, and then has head networks to predict (i) class scores and bounding box regression parameters, (ii) keypoint locations, and (iii) instance masks. Structured prediction coupling the predictions of all the proposals over all the video frames is a computationally challenging task, since typically we have s of proposals per image, and it is not obvious which proposals from one frame should influence which proposals in the other frame. To circumvent this issue, we use our VideoGCRF before the RoI pooling stage as shown in Fig. 5. Instead of coupling final predictions, we thereby couple mid-level features over the video frames, thereby improving the features which are ultimately used to make predictions.

For evaluation, we use the standard COCO performance metrics: $AP_{50}$, $AP_{75}$, and AP (averaged over IoU thresholds), evaluated using mask IoU. Table 3 reports our instance segmentation results. We note that the performance of the Mask-RCNN framework increases consistently as we increase the temporal context for predictions. Qualitative results are available in Fig. 7.

| Method | $AP_{50}$ | $AP_{75}$ | AP |
|---|---|---|---|
| ResNet50-baseline | 0.610 | 0.305 | 0.321 |
| spatial CRF [7] | 0.618 | 0.310 | 0.329 |
| 2-frame VideoGCRF | 0.619 | 0.310 | 0.331 |
| 3-frame VideoGCRF | 0.631 | 0.321 | 0.330 |
| 4-frame VideoGCRF | 0.647 | 0.336 | 0.349 |

Table 3: Instance Segmentation using ResNet-50 Mask R-CNN on the DAVIS Person Dataset

| Method | mean IoU |
|---|---|
| Mask Track [31] | 79.7 |
| OSVOS [5] | 79.8 |
| Online Adaptation [38] | 85.6 |
| Online Adaptation + Spatial CRF [7] | 85.9 |
| Online Adaptation + 2-Frame VideoGCRF | 86.3 |
| Online Adaptation + 3-Frame VideoGCRF | 86.5 |

Table 4: Instance Tracking on the DAVIS val Dataset

3.2 Instance Tracking

We use the DAVIS dataset described in Sec. 3. Instance tracking involves predicting foreground segmentation masks for each video frame given the foreground segmentation for the first video frame. We demonstrate that incorporating temporal context helps improve performance in instance tracking methods. To this end we extend the online adaptation approach of [38], which is the state-of-the-art approach on the DAVIS benchmark, with our VideoGCRF. We use their publicly available software based on the TensorFlow library to generate the unary terms for each of the frames in the video, and keep them fixed. We use a ResNet-50 network to generate spatio-temporal embeddings and use these alongside the unaries computed from [38]. The results are reported in Table 4. We compare the performance of VideoGCRF against that of just the unaries from [38], and also with spatial CRFs from [7]. The evaluation criterion is the mean pixel-IoU. It can be seen that temporal context improves performance. We hypothesize that re-implementing the software from [38] in Caffe2 and back-propagating on the unary branch of the network would yield further improvements.














DeconvNet [30]
SegNet [2]
Bayesian SegNet [23]
Visin et al. [37]
FCN8 [28]
DeepLab-LFOV [8]
Dilation8 [40]
Dilation8 + FSO [26]
Tiramisu [20]
Gadde et al. [15]
Results with our ResNet-101 Implementation
Basenet ResNet-101 (Ours)
Basenet + Spatial CRF [7]
Basenet + 2-Frame VideoGCRF
Basenet + 3-Frame VideoGCRF
Results after Cityscapes Pretraining
Basenet ResNet-101 (Ours)
Basenet + denseCRF post-processing [25]
Basenet + Spatial CRF [7]
Basenet + 2-Frame VideoGCRF
Basenet + 3-Frame VideoGCRF
Table 5: Results on CamVid dataset. We compare our results with some of the previously published methods, as well as our own implementation of the ResNet-101 network which serves as our base network.
Figure 6: Qualitative results on the CamVid dataset. We note that the temporal context from neighbouring frames helps improve the prediction of the truck on the right in the first video, and helps distinguish between the road and the pavement in the second video, overall giving us smoother predictions in both cases.

3.3 Semantic Segmentation on CamVid Dataset

We now employ our VideoGCRF for the task of semantic video segmentation on the CamVid dataset. Our base network here is our own implementation of ResNet-101 with pyramid spatial pooling as in [41]. Additionally, we pretrain our networks on the Cityscapes dataset [12], and report results both with and without pretraining on Cityscapes. We report improvements over the baseline networks in both settings. Without pretraining, we see an improvement of over the base-net, and with pretraining we see an improvement of . The qualitative results are shown in Fig. 6. We notice that VideoGCRF benefits from temporal context, yielding smoother predictions across video frames.

4 Conclusion

In this work, we propose VideoGCRF, an end-to-end trainable Gaussian CRF for efficient spatio-temporal structured prediction. We empirically show performance improvements on several benchmarks thanks to an increase of the temporal context. This additional functionality comes at negligible computational overhead owing to efficient implementation and the strategies to eliminate redundant computations. In future work we want to incorporate optical flow techniques in our framework as they provide a natural means to capture temporal correspondence. Further, we also intend to use temporal distance between frames as an additional term in the expression of the pairwise interactions alongside dot-products of our embeddings. We would also like to use VideoGCRF for dense regression tasks such as depth estimation. Finally, we believe that our method for spatio-temporal structured prediction can prove useful in the unsupervised and semi-supervised setting.

Figure 7: Instance Segmentation results on the DAVIS Person Dataset. We observe that prediction based on unary terms alone leads to missing instances and some false predictions. These errors are corrected by VideoGCRFs, which smooth the predictions by taking into account the temporal context.

Gradient Expressions for Spatio-Temporal G-CRF Parameters

As described in the manuscript, to capture the spatio-temporal context, we propose two kinds of pairwise interactions: (a) pairwise terms between patches in the same frame (spatial pairwise terms), and (b) pairwise terms between patches in different frames (temporal pairwise terms).

Denoting the spatial pairwise terms at frame by and the temporal pairwise terms between frames as , our inference equation is written as


where we group the variables by frames. Solving this system allows us to couple predictions across all video frames , positions, and labels . If furthermore and then the resulting system is positive definite for any positive .

As in the manuscript, at frame we couple the scores for a pair of patches taking the labels respectively as follows:


where and , , and is the embedding associated to point .

Thus, , where . Further, to design the temporal pairwise terms, we couple patches coming from different frames taking the labels respectively as


where .

In short, both the spatial pairwise and the temporal pairwise terms are composed as Gram matrices of spatial and temporal embeddings as , and .

Using the definitions from Eq. 13 and Eq. 14, we can rewrite the inference equation as

= (15)

From Eq. 15, we can express as follows:


which can be compactly written as


We will use Eq. 17 to derive gradient expressions for and .

A Gradients of the Unary Terms

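As in [6, 7], the gradients of the unary terms are obtained from the solution of the following system of linear equations:

$$(A + \lambda I)\, \frac{\partial \mathcal{L}}{\partial B} = \frac{\partial \mathcal{L}}{\partial x}, \qquad (18)$$

where $\mathcal{L}$ is the network loss, $x$ the prediction, $B$ the unary terms, and $A + \lambda I$ the system matrix of the inference equation. Once we have $\frac{\partial \mathcal{L}}{\partial B}$, we use it to compute the gradients of the spatio-temporal embeddings.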
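The closed-form gradient of the unary terms can be checked numerically. The sketch below assumes a symmetric system matrix, so that the gradient with respect to the unaries solves the same linear system as the forward pass with the loss gradient on the right-hand side, and compares it with central finite differences. The squared-error loss, the variable names, and the sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 8, 3, 0.5                       # hypothetical sizes
E = rng.standard_normal((D, N))
M = E.T @ E + lam * np.eye(N)               # symmetric positive-definite system matrix
b = rng.standard_normal(N)                  # unary terms
target = rng.standard_normal(N)             # stand-in regression target

def loss(b_vec):
    """Toy loss on the G-CRF output: 0.5 * ||x - target||^2 with M x = b_vec."""
    x = np.linalg.solve(M, b_vec)
    return 0.5 * np.sum((x - target) ** 2)

# Forward pass, then the gradient of the loss with respect to the prediction.
x = np.linalg.solve(M, b)
grad_x = x - target

# Backward pass: one more solve with the same matrix gives the unary gradient.
grad_b = np.linalg.solve(M, grad_x)

# Central finite differences along each coordinate, for checking only.
eps = 1e-6
grad_b_fd = np.array([(loss(b + eps * e) - loss(b - eps * e)) / (2 * eps)
                      for e in np.eye(N)])
```

No intermediate solver iterations need to be stored for this backward pass, which is the practical advantage over unrolling an iterative inference procedure.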

B Gradients of the Spatial Embeddings

We begin with the observation that computing requires us to first derive the expression for . To this end, we ignore terms from Eq. 17 that do not depend on or and write it as . We now use the result from [6, 7] that when

the gradients of are expressed as


where denotes the Kronecker product operator.

To compute

, we use the chain rule of differentiation as follows:


where , by definition. We know the expression for from Eq. 19, but to obtain the expression for we define a permutation matrix of size (as in [14, 7]) as follows:


where vec is the vectorization operator that vectorizes a matrix by stacking its columns. Thus, the operator is a permutation matrix, composed of 0s and 1s, and has a single 1 in each row and column. When premultiplied with another matrix, it rearranges the ordering of rows of that matrix, while when postmultiplied with another matrix, it rearranges its columns. Using this matrix, we can form the following expression [14]:


where is the identity matrix. Substituting Eq. 19 and Eq. 22 into Eq. 20, we obtain:


C Gradients of Temporal Embeddings

As in the last section, from Eq. 17, we ignore any terms that do not depend on or and write it as .

Using the strategies in the previous section and the sum rule of differentiation, the gradients of the temporal embeddings are given by the following form: