Recurrent Fully Convolutional Networks for Video Segmentation

by   Sepehr Valipour, et al.

Image segmentation is an important step in most visual tasks. While convolutional neural networks have shown to perform well on single image segmentation, to our knowledge, no study has been been done on leveraging recurrent gated architectures for video segmentation. Accordingly, we propose a novel method for online segmentation of video sequences that incorporates temporal data. The network is built from fully convolutional element and recurrent unit that works on a sliding window over the temporal data. We also introduce a novel convolutional gated recurrent unit that preserves the spatial information and reduces the parameters learned. Our method has the advantage that it can work in an online fashion instead of operating over the whole input batch of video frames. The network is tested on the change detection dataset, and proved to have 5.5% improvement in F-measure over a plain fully convolutional network for per frame segmentation. It was also shown to have improvement of 1.4% for the F-measure compared to our baseline network that we call FCN 12s.


page 2

page 5

page 6


Convolutional Gated Recurrent Networks for Video Segmentation

Semantic segmentation has recently witnessed major progress, where fully...

Recurrent neural networks for aortic image sequence segmentation with sparse annotations

Segmentation of image sequences is an important task in medical image an...

Face Mask Extraction in Video Sequence

Inspired by the recent development of deep network-based methods in sema...

Detection and Tracking of Liquids with Fully Convolutional Networks

Recent advances in AI and robotics have claimed many incredible results ...

Beyond One Glance: Gated Recurrent Architecture for Hand Segmentation

As mixed reality is gaining increased momentum, the development of effec...

Hierarchical Recurrent Filtering for Fully Convolutional DenseNets

Generating a robust representation of the environment is a crucial abili...

Separable Convolutional LSTMs for Faster Video Segmentation

Semantic Segmentation is an important module for autonomous robots such ...

1 Introduction

The recent trend in convolutional neural networks has dramatically changed the landscape in computer vision. The first task that was improved with this trend was object recognition

[13][21][23]. An even harder task that greatly progressed is semantic segmentation, which provides per pixel labelling as introduced in [16] [29][26]. In [16] fully convolutional network was introduced. These networks yield a coarse segmentation map for any given image, and it is followed by upsampling within the network to get dense predictions. This method enabled an end-to-end training for the task of semantic segmentation of images. However, one missing element in this recent trend is that real-world is not a set of static images. Large portion of the information that we infer from the environment comes from motion. For example, in an activity recognition task, the difference between walking and standing is only profound if you consider a sequence of images.

Figure 1: Overview of the Proposed Method of Recurrent FCN. The recurrent part is unrolled for better visualisation

The conventional Convolutional Neural Networks (CNN) are not designed to include temporal changes. The simplest way to include temporal information in CNN is to concatenate multiple frames and feed it as a single input. Small variations of this method are used for context classification on one million youtube videos[12]. Surprisingly, it could not improve on single frame prediction by much which can indicate the inefficiency of this approach. Michalski et al.[17] created a network which learns transformations between frames of a long video sequence. In [24]

convolutional restricted Boltzman machine is introduced which learns optical-flow-like features from input image sequence. Another proposed method


uses Recurrent Neural Networks (RNN) which have shown their power in different tasks. It introduces a combination of CNN and RNN for video segmentation. However, their RNN architecture is sensitive to initialization, and training is difficult due to the vanishing gradient problem. Their design does not allow usage of a pre-trained network and it cannot process large input images as the number of parameters in their network grows exponentially with the input size.

Several architectures are proposed that aim to solve the main bottleneck of recurrent networks namely, vanishing or exploding gradients. In [9]Long Short Term Memory (LSTM) is presented. It is used for various applications such as generating sequences for text [8], dense image captioning[11] and video captioning [6]. Another recently proposed architecture is the Gated Recurrent Unit (GRU) [4]. It was shown in [5]

that LSTM and GRU outperform other traditional recurrent architectures, and that GRU showed similar performance to LSTM but with reduced number of parameters. One problem with these previous architectures is that they only work with vectorized sequences as input. Thus, they are incapable of handling data where spatial information is critical like images or feature maps. Another work

[1] uses convolutional GRU for learning spatiotemporal features from videos that tackle this issue. In their work, experiments on video captioning and human action recognition were conducted.

The usage of fully convolutional networks (FCN) combined with recurrent gated units can solve many of the pitfalls of the previous approaches. In this paper, we present : (1) A novel architecture that can incorporate temporal data directly into FCN for video segmentation. We have chosen the Recurrent Neural Network as the foundation of our structure since it is shown to be effective in learning temporal dynamics. (2) An end-to-end training method for online video segmentation that does not need to process data offline. Overview of the suggested method is presented in Figure 1, where a sliding window over the frames is used and passed through the recurrent fully convolutional network(RFCN). To our knowledge, this is the first work that presents a recurrent fully convolutional network for pixel-wise labelling.

The paper is structured as follows. In Section 2 we will discuss the preliminary topics, then the proposed method will be presented in details in section 3. It is followed by the experiments section and discussion of the results in section 4. Finally, section 5 concludes the paper and presents potential future directions.

2 Background

This section will review FCN and RNN which will be repeatedly referred to through the article.

2.1 Fully Convolutional Networks (FCN)

In convolutional neural networks that are used for classification, the few last fully connected layers are responsible for the classification part. But, with pixel-wise labelling, there is a need for dense predictions on all the pixels. In [15] the idea of using a fully convolutional neural network that is trained for pixel-wise semantic segmentation is presented. It is shown that it surpasses the state of the art in semantic segmentation on PASCAL VOC, NUYDv2, and SIFT Flow datasets. The FCN method is briefly discussed in what follows.

FCN architecture is based on VGG[21] architecture due to its success in classification tasks. However, due to the fully connected layers that these networks have, they can only accept fixed size input and produce a classification label. To overcome this problem, it is possible to convert a fully connected layer into a convolutional layer. Accordingly, this network can yield coarse maps pixel wise prediction instead of one classification output.

In order to have dense prediction from this coarse map, it needs to be up-sampled to the original size of the input image. The up-sampling method can be a simple bi-linear interpolation. But in

[15] a new layer that applies upsampling within the network was presented. It makes it efficient to learn the up-sampling weights within the network using back-propagation. The filters of the deconvolution layer act as the basis to reconstruct the input image. Another idea for up-sampling is to stitch together output maps from shifted version of the input. But It was mentioned in [15] that using up-sampling with deconvolution is more effective. In [18] the idea of having a full deconvolution network with both deconvolution layers and unpooling layers is presented.

The FCN architecture has been tried in different applications. In [10] it is used for object localization. In [27] a modified architecture was used for visual object tracking. Finally for semantic segmentation in [18] a full deconvolution network is presented with stacked deconvolution layers.

2.2 Recurrent Neural Networks

Recurrent Neural Networks[25] are designed to incorporate sequential information into a neural network framework. These networks are capable of learning complex dynamics by utilizing a hidden unit in each recurrent cell. This unit works like a dynamic memory that can be changed based on the state that the unit is in. Accordingly, the process of each unit yields to two outcomes. Firstly, an output is computed from the current input and the hidden units values (the networks memory). Secondly, the network updates its memory based on, again, current input and hidden units value. The simplest recurrent unit can be modeled as1.


Here, is the hidden layer, is the input layer and is the output layer and

is the activation function.

Recurrent networks were successful in many tasks in speech recognition and text understanding[22] but they come with their challenges. Unrestricted data flow between units causes problems with vanishing and exploding gradients [3]. During the back propagation through recurrent units, the derivative of each node is dependent of all the nodes which processed earlier. This is shown in equations 2,3 and 4 where is the loss of the layer. To compute a series of multiplication from to is required. Assume that is bounded by then


A solution to this problem is to use gated structures. The gates can control back propagation flow between each node. Long-Short Term Memory [9] is the first such proposed architecture and it is still popular. A more recent architecture is Gated Recurrent Unit [4] which has simpler cells yet with competent performance [5].

2.2.1 Long Short Term Memory (LSTM)

As mentioned, LSTM uses a gated structure where each gate controls the flow of a particular signal. Each LSTM node has three gates that are input, output and forget gate each with learnable weights. These gates can learn the optimal way to remember useful information from previous states and decide the current state. In equations 5 the procedure of computing different gates and hidden states is shown, where , and are input, forget and output gates respectively. While denote the cell internal state, and is the hidden state.


2.2.2 Gated Recurrent Unit (GRU)

The Gated Recurrent Unit, similar to LSTM, utilizes a gated structure for flow-control. However, it has a simpler architecture which makes it both faster and less memory consuming. The model is shown in Figure 2 and described in 6 where , is the reset and update gate respectively. While is the hidden state.


GRU does not have direct control over memory content exposure while LSTM has it by having an output gate. These two are also different in the way that they update the memory nodes. LSTM updates its hidden state by summation over flow after input gate and forget gate. GRU however, assumes a correlation between how much to keep from the current state and how much to get from the previous state and it models this with the gate.

Figure 2: GRU Architecture.

3 Method

In an abstract view, we use a recurrent fully convolutional network (RFCN) that utilizes the temporal as well as spatial information for segmentation. The general theme in the design is to use recurrent nodes that combine fully convolutional network with a Recurrent unit(RU). The recurrent unit denotes either LSTM, GRU or Conv-GRU (which is explained in 3.2). In all of our networks, we aim for online segmentation in contrast to batch/offline version which needs the whole video as input. This is done by using a sliding window over the frames. Then, each window is propagated through the RFCN and yields a segmentation corresponding to the last frame in the sliding window. The recurrent layer can be employed to a sequence of feature maps or heat maps where each element in the series is the result of an image forward pass through an FCN network (Figure 1). The whole network is trained end-to-end using pixel-wise classification logarithmic loss. We designed different network architectures to this end using conventional and also convolutional recurrent unit 1.

3.1 Conventional Recurrent Unit for Segmentation

Our first architecture uses the Lenet network converted to a fully convolutional network as the base. Lenet is a well known and shallow network and as it is common, we used it for early experiments. We embed this model in recurrent node to allow the network to use temporal data. In Table 1, RFC-Lenet is delineating this architecture. The output of deconvolution inside the fcn is a 2D map of dense predictions that is then flattened into 1D vector as input to the recurrent unit. The recurrent unit takes a vector from each frame in the sliding window and outputs the segmentation of the last frame (Figure 1).

Note that utilizing the first architecture requires using large weight matrix in the RU layer since it operates on the flattened vectors of the full sized image. To lessen this problem, we can apply the deconvolution layer after the recurrent node. This will lead to our second architecture. The RU receives a vector of the flattened coarse map as its input and outputs the coarse predictions of the last frame in the sliding window. Then, deconvolution of this coarse map is used to get dense predictions. This proved to be useful with larger input images that lead to large number of parameters in the RU. Decreasing their size, allows the optimizer to search in a smaller state space and find a more generalized local minima in a faster training time. RFC-12s in the Table 1 is an example of this approach. This network is slightly modified version of the Lenet FCN. The main difference is that now the deconvolution comes as the last step after recursion being done.

3.2 Convolutional Gated Recurrent Unit (Conv-GRU) for Segmentation

Figure 3: The architecture of RFC-VGG. Images are fed frame by frame into a recurrent FCN. A Conv-GRU layer is applied on the feature maps produced by the preceding network at each frame. The output of this layer goes to one more convolutional layer to generate heat maps. Finally, a deconvolution layer up-samples the heat map to the desired spatial size.

Conventional recurrent units are capable of processing temporal data however, their architecture is not suitable for working on images/feature maps for two reasons. 1) weights matrix size, 2) ignoring spatial connectivity. Assume a case where a recurrent unit is placed after a feature map with the spatial size of and have a number of channels . After flattening, it will turn into a long matrix. Therefore, weights of the recurrent unit will be of size

which is power four of spatial dimension. These matrices for weights can only be maintained for small feature maps. Even if the computation was not an issue, such design introduces too much variance in the network which prevents generalization. In Convolutional recurrent units, similar to regular convolutional layer, weights are three dimensional and they convolve with the input instead of dot product. Accordingly, the cell’s model, in the case of a GRU architecture, will turn into equations

7 where the dot products are replaced with convolutions. In this design, weights matrices are of size where , , and are kernel’s height, kernel’s width, number of input channels, and number of filters, respectively. In Figure 2 the operations applied on the input and the previous step will all be convolutions instead. Since we can assume spatial connectivity in feature maps, kernel size can be very small compared to feature map’s spatial size. Therefore, this architecture is much more efficient and weights are easier to learn due to smaller search space.

We employ this approach for segmentation in a fully convolutional network. It is possible to apply this layer on either heat maps or feature maps. In the first case, the output of this layer will directly feed into the deconvolution layer and produces the pixel-wise probability map. In the latter case, at least one CNN layer needs to be used after the recurrent layer to convert its output feature maps to a heat map.

RFC-VGG in the Table 1 is an example of the second case. It is based on VGG-F [21]

network. Initializing weights of our filters by VGG-F trained weights, alleviates over-fitting problems as these weights are the result of extensive training on the imagenet. The network is cast to a fully convolutional one by replacing the fully connected layers with convolutional layers. The last two pooling layers are dropped from VGG-F to allow a finer segmentation. Then a convolutional gated recurrent unit is used followed by one convolutional layer and then deconvolution for up-sampling. Figure

3 shows the detailed architecture of RFC-VGG.


4 Experiments

This section presents our experiments and results. First, we describe the datasets that we used then, we discuss our training methods and hyper-parameters settings. Finally, quantitative and qualitative results are shown.

Figure 4: Qualtitative results over Segtrack V2 and Davis datasets, where top image has overlay of FC-VGG and bottom has RFC-VGG segmentation.
Network Architectures
input: 2828 input: 120180 input: 240360

Recurrent Node

Conv: F(5), P(10), D(20)

Recurrent Node

Conv: F(5), S(3), P(10), D(20)

Recurrent Node

Conv: F(11), S(4), P(40), D(64)
Relu Relu Relu
Pool 22 Pool 22 Pool 33
Conv: F(5), D(50) Conv: F(5), D(50) Conv: F(5), P(2) D(256)
Relu Relu Relu
Pool(22) Pool(22) Pool(33)
Conv: F(3), D(500) Conv: F(3), D(500) Conv: F(3), P(1) D(256)
Relu Relu Relu
Conv: F(1), D(1) Conv: F(1), D(1) Conv: F(3), P(1) D(256)
- - Relu
Conv: F(3), P(1) D(256)
Conv: F(3), D(512)
Conv: F(3), D(128)
DeConv: F(10), S(4) Flatten ConvGRU: F(3), D(128)
Flatten GRU: W(100100) Conv: F(1), D(1)
GRU: W(784784) DeConv: F(10), S(4) DeConv: F(20), S(8)
Table 1: Details of proposed networks. denotes filter size of . denotes total of

zero padding around the feature map.

denotes stride of length

for the convolution. denotes number of output feature maps from a particular layer for a layer (number of feature maps is same as previous layer if is not mentioned).

All the experiments are performed on our own implemented library. To our knowledge, there is no open source framework that accommodates RFCNN architecture. Therefore, we built our own library on top of Theano

[2] to create arbitrary networks that have can have FCN as a recursive module. Key features of this implementation are: (1) Supports networks with the temporal operation for images. The architecture can be any arbitrary CNN and an arbitrary number of recurrent layers. The network supports any length input. (2) Three gated architecture, LSTM, GRU, and Conv-GRU is available for the recurrent layer. (3) Deconvolution layer and skip architecture to support segmentation with FCN.

4.1 Datasets

In this paper four datasets are used: 1) Moving MNIST. 2) Change detection[7]. 3) Segtrack version 2[14]. 4) Densely Annotated VIdeo Segmentation (Davis) [20]. Figure 4 shows samples from the latter two.

Moving MNIST dataset is synthesized from original MNIST by moving the characters in random but consistent directions. The labels for segmentation is generated by thresholding input images after translation. We consider each translated image as a new frame. Therefore we can have arbitrary length sequence of images.

Change Detection Dataset[7] This dataset provides realistic, diverse set of videos with pixel-wise labeling of moving objects. The dataset includes both indoor and outdoor scenes. It focuses on moving object segmentation. In the motion detection, we were looking for videos that have similar moving objects, e.g. cars or humans so that there would be semantic correspondences among sequences. Accordingly, we chose six videos: Pedestrians, PETS2006, Badminton, CopyMachine, Office, and Sofa.

SegTrack V2[14] is a collection of fourteen video sequences with objects of interest manually segmented. The dataset has sequences with both single or multiple objects. In the latter case, we consider all the segmented objects as one and we perform foreground segmentation.

Davis[20] dataset has fifty high resolution and densely annotated videos with pixel accurate groundtruth. The videos include multiple challenges such as occlusions, fast motion, nonlinear deformation and motion blur.

4.2 Results

The main experiments are conducted using Adadelta [28]

for optimization that practically gave much faster convergence than standard stochastic gradient descent. The logistic loss function is used and the maximum number of epochs used for the training is 500. The evaluation metrics used are precision, recall, F-measure and IoU, shown in equation

8 and 10 where tp, fp, fn denote true positives, false positives, and false negatives respectively.

Precision Recall F-measure IoU
SegTrack V2 FC-VGG 0.7759 0.6810 0.7254 0.7646
RFC-VGG 0.8325 0.7280 0.7767 0.8012
DAVIS FC-VGG 0.6834 0.5454 0.6066 0.6836
RFC-VGG 0.7233 0.5586 0.6304 0.6984
Table 2: Comparison of RFC-VGG with its baseline counterpart on DAVIS and SegTrack

In this set of experiments, a fully convolutional VGG is used as a baseline denoted as FC-VGG. It is compared against the recurrent version RFC-VGG. The initial five convolutional layers are not finetuned and are from the pretrained VGG architecture to avoid overfitting the data. The results of the experiments on SegTrackV2 and Densely Annotated VIdeo Segmentation (DAVIS) datasets are provided in 2. In these experiments, the data of each sequence is divided into two splits with half as training data and the other half as keep out test data. It is apparent in the results that RFC-VGG outperforms the FC-VGG architecture on both datasets with about 3% and 5% on DAVIS and SegTrack respectively.

Figure 4 shows the qualitative analysis of RFC-VGG against FC-VGG. It shows that utilizing temporal information through the recurrent unit gives better segmentation for the object. This can be explained due to the implicit learning of the motion that segmented objects undergo in the recurrent units. It also shows that using conv-GRU as the recurrent unit can enable the extraction of temporal information from feature maps due to the reduced parameter set. Thus the recurrent unit can learn the motion pattern of the segmented objects by working on richer information from these feature maps. It’s also worth noting that the performance of the RFCN network depends on its baseline fully convolutional counter part. Thus with an enhanced fully convolutional network using skip architecture, to get finer segmentation, the recurrent version should improve as well.

4.3 Further Analysis

In this section, we present our experiments using conventional recurrent layers for segmentation. These experiments provide further analysis on different recurrent units. They also compare end-to-end architecture versus the decoupled one. We use moving MNIST and change detection datasets for this part.
Images of MNIST dataset are relatively small (2828) which allows us to test our RFC-Lenet(a) network 1. A fully convolutional Lenet is compared against RFC-Lenet(a). Table 3 shows the results that were obtained. The results of RFC-Lenet with GRU is better than FC-Lenet with 2% improvement. But segmentation of MNIST characters is an easy task as it ends up to learning thresholding. Note also that GRU gave better results than LSTM.

Precision Recall F-measure
FC-Lenet 0.868 0.922 0.894
LSTM 0.941 0.786 0.856
GRU 0.955 0.877 0.914
RFC-Lenet 0.96 0.877 0.916
Table 3: Precision, Recall, and F-measure on FC-Lenet, LSTM, GRU, and RFC-Lenet tested on synthesized MNIST dataset

The second set of experiments is conducted on real data from the motion detection benchmark using RFC-12s. Throughout these experiments, the training constitutes 70% of the sequences and 30% for the test set. The experiments compared the recurrent convolutional network trained in an end-to-end fashion denoted as RFC-12s with its baseline, the fully convolutional network FC-12s. It is also compared against the decoupled training of the FC-12s and the recurrent unit. Where GRU is trained on the probability maps output from FC-12s. Table 4 shows the results of these experiments, where the RFC-12s network had a 1.4% improvement over FC-12s. We observe less relative improvement compared to using Conv-GRU because in regular GRU spatial connectivities are ignored. However, incorporating the temporal data still helped the segmentation accuracy.

Precision Recall F-measure
FC-12s 0.827 0.585 0.685
RFC-12s (D) 0.835 0.587 0.69
RFC-12s (EE) 0.797 0.623 0.7
Table 4: Precision, Recall, and F-measure on architectures FCN-12s, GRU pretraining on coarse map from FCN-12s, RFC-12s on six sequences from motion detection benchmark on the test set. (D) and (EE) indicate the decoupled and the end-to-end integration of recurrent units with the FCN, respectively.

5 Conclusion and Future Work

In this paper, we presented a novel approach in incorporating temporal information for video segmentation. This approach utilizes recurrent units on top of a fully convolutional network as a mean to include previously seen frames to decide the segmentation for the current frame. The paper also proposes using convolutional recurrent units (conv-gru in particular) for segmentation. We tested the method on both synthesized and real data for the segmentation task for different architectures. We showed that by having a recurrent layer after either probability map or feature map can improve the performance of the segmentation. The proposed architecture can process arbitrary length videos which makes it suitable for both batch and online scenarios.

For the future work, We want to extend our framework to semantic video segmentation using multiple recurrent layers for richer dynamic representation. We also want to apply our method on newer networks that are proposed for per image segmentation as in [26].