The recent trend in convolutional neural networks has dramatically changed the landscape in computer vision. The first task that was improved with this trend was object recognition. An even harder task that greatly progressed is semantic segmentation, which provides per pixel labelling as introduced in  . In  fully convolutional network was introduced. These networks yield a coarse segmentation map for any given image, and it is followed by upsampling within the network to get dense predictions. This method enabled an end-to-end training for the task of semantic segmentation of images. However, one missing element in this recent trend is that real-world is not a set of static images. Large portion of the information that we infer from the environment comes from motion. For example, in an activity recognition task, the difference between walking and standing is only profound if you consider a sequence of images.
The conventional Convolutional Neural Networks (CNN) are not designed to include temporal changes. The simplest way to include temporal information in CNN is to concatenate multiple frames and feed it as a single input. Small variations of this method are used for context classification on one million youtube videos. Surprisingly, it could not improve on single frame prediction by much which can indicate the inefficiency of this approach. Michalski et al. created a network which learns transformations between frames of a long video sequence. In 
convolutional restricted Boltzman machine is introduced which learns optical-flow-like features from input image sequence. Another proposed method
uses Recurrent Neural Networks (RNN) which have shown their power in different tasks. It introduces a combination of CNN and RNN for video segmentation. However, their RNN architecture is sensitive to initialization, and training is difficult due to the vanishing gradient problem. Their design does not allow usage of a pre-trained network and it cannot process large input images as the number of parameters in their network grows exponentially with the input size.
Several architectures are proposed that aim to solve the main bottleneck of recurrent networks namely, vanishing or exploding gradients. In Long Short Term Memory (LSTM) is presented. It is used for various applications such as generating sequences for text , dense image captioning and video captioning . Another recently proposed architecture is the Gated Recurrent Unit (GRU) . It was shown in 
that LSTM and GRU outperform other traditional recurrent architectures, and that GRU showed similar performance to LSTM but with reduced number of parameters. One problem with these previous architectures is that they only work with vectorized sequences as input. Thus, they are incapable of handling data where spatial information is critical like images or feature maps. Another work uses convolutional GRU for learning spatiotemporal features from videos that tackle this issue. In their work, experiments on video captioning and human action recognition were conducted.
The usage of fully convolutional networks (FCN) combined with recurrent gated units can solve many of the pitfalls of the previous approaches. In this paper, we present : (1) A novel architecture that can incorporate temporal data directly into FCN for video segmentation. We have chosen the Recurrent Neural Network as the foundation of our structure since it is shown to be effective in learning temporal dynamics. (2) An end-to-end training method for online video segmentation that does not need to process data offline. Overview of the suggested method is presented in Figure 1, where a sliding window over the frames is used and passed through the recurrent fully convolutional network(RFCN). To our knowledge, this is the first work that presents a recurrent fully convolutional network for pixel-wise labelling.
The paper is structured as follows. In Section 2 we will discuss the preliminary topics, then the proposed method will be presented in details in section 3. It is followed by the experiments section and discussion of the results in section 4. Finally, section 5 concludes the paper and presents potential future directions.
This section will review FCN and RNN which will be repeatedly referred to through the article.
2.1 Fully Convolutional Networks (FCN)
In convolutional neural networks that are used for classification, the few last fully connected layers are responsible for the classification part. But, with pixel-wise labelling, there is a need for dense predictions on all the pixels. In  the idea of using a fully convolutional neural network that is trained for pixel-wise semantic segmentation is presented. It is shown that it surpasses the state of the art in semantic segmentation on PASCAL VOC, NUYDv2, and SIFT Flow datasets. The FCN method is briefly discussed in what follows.
FCN architecture is based on VGG architecture due to its success in classification tasks. However, due to the fully connected layers that these networks have, they can only accept fixed size input and produce a classification label. To overcome this problem, it is possible to convert a fully connected layer into a convolutional layer. Accordingly, this network can yield coarse maps pixel wise prediction instead of one classification output.
In order to have dense prediction from this coarse map, it needs to be up-sampled to the original size of the input image. The up-sampling method can be a simple bi-linear interpolation. But in a new layer that applies upsampling within the network was presented. It makes it efficient to learn the up-sampling weights within the network using back-propagation. The filters of the deconvolution layer act as the basis to reconstruct the input image. Another idea for up-sampling is to stitch together output maps from shifted version of the input. But It was mentioned in  that using up-sampling with deconvolution is more effective. In  the idea of having a full deconvolution network with both deconvolution layers and unpooling layers is presented.
2.2 Recurrent Neural Networks
Recurrent Neural Networks are designed to incorporate sequential information into a neural network framework. These networks are capable of learning complex dynamics by utilizing a hidden unit in each recurrent cell. This unit works like a dynamic memory that can be changed based on the state that the unit is in. Accordingly, the process of each unit yields to two outcomes. Firstly, an output is computed from the current input and the hidden units values (the networks memory). Secondly, the network updates its memory based on, again, current input and hidden units value. The simplest recurrent unit can be modeled as1.
Here, is the hidden layer, is the input layer and is the output layer and
is the activation function.
Recurrent networks were successful in many tasks in speech recognition and text understanding but they come with their challenges. Unrestricted data flow between units causes problems with vanishing and exploding gradients . During the back propagation through recurrent units, the derivative of each node is dependent of all the nodes which processed earlier. This is shown in equations 2,3 and 4 where is the loss of the layer. To compute a series of multiplication from to is required. Assume that is bounded by then
A solution to this problem is to use gated structures. The gates can control back propagation flow between each node. Long-Short Term Memory  is the first such proposed architecture and it is still popular. A more recent architecture is Gated Recurrent Unit  which has simpler cells yet with competent performance .
2.2.1 Long Short Term Memory (LSTM)
As mentioned, LSTM uses a gated structure where each gate controls the flow of a particular signal. Each LSTM node has three gates that are input, output and forget gate each with learnable weights. These gates can learn the optimal way to remember useful information from previous states and decide the current state. In equations 5 the procedure of computing different gates and hidden states is shown, where , and are input, forget and output gates respectively. While denote the cell internal state, and is the hidden state.
2.2.2 Gated Recurrent Unit (GRU)
The Gated Recurrent Unit, similar to LSTM, utilizes a gated structure for flow-control. However, it has a simpler architecture which makes it both faster and less memory consuming. The model is shown in Figure 2 and described in 6 where , is the reset and update gate respectively. While is the hidden state.
GRU does not have direct control over memory content exposure while LSTM has it by having an output gate. These two are also different in the way that they update the memory nodes. LSTM updates its hidden state by summation over flow after input gate and forget gate. GRU however, assumes a correlation between how much to keep from the current state and how much to get from the previous state and it models this with the gate.
In an abstract view, we use a recurrent fully convolutional network (RFCN) that utilizes the temporal as well as spatial information for segmentation. The general theme in the design is to use recurrent nodes that combine fully convolutional network with a Recurrent unit(RU). The recurrent unit denotes either LSTM, GRU or Conv-GRU (which is explained in 3.2). In all of our networks, we aim for online segmentation in contrast to batch/offline version which needs the whole video as input. This is done by using a sliding window over the frames. Then, each window is propagated through the RFCN and yields a segmentation corresponding to the last frame in the sliding window. The recurrent layer can be employed to a sequence of feature maps or heat maps where each element in the series is the result of an image forward pass through an FCN network (Figure 1). The whole network is trained end-to-end using pixel-wise classification logarithmic loss. We designed different network architectures to this end using conventional and also convolutional recurrent unit 1.
3.1 Conventional Recurrent Unit for Segmentation
Our first architecture uses the Lenet network converted to a fully convolutional network as the base. Lenet is a well known and shallow network and as it is common, we used it for early experiments. We embed this model in recurrent node to allow the network to use temporal data. In Table 1, RFC-Lenet is delineating this architecture. The output of deconvolution inside the fcn is a 2D map of dense predictions that is then flattened into 1D vector as input to the recurrent unit. The recurrent unit takes a vector from each frame in the sliding window and outputs the segmentation of the last frame (Figure 1).
Note that utilizing the first architecture requires using large weight matrix in the RU layer since it operates on the flattened vectors of the full sized image. To lessen this problem, we can apply the deconvolution layer after the recurrent node. This will lead to our second architecture. The RU receives a vector of the flattened coarse map as its input and outputs the coarse predictions of the last frame in the sliding window. Then, deconvolution of this coarse map is used to get dense predictions. This proved to be useful with larger input images that lead to large number of parameters in the RU. Decreasing their size, allows the optimizer to search in a smaller state space and find a more generalized local minima in a faster training time. RFC-12s in the Table 1 is an example of this approach. This network is slightly modified version of the Lenet FCN. The main difference is that now the deconvolution comes as the last step after recursion being done.
3.2 Convolutional Gated Recurrent Unit (Conv-GRU) for Segmentation
Conventional recurrent units are capable of processing temporal data however, their architecture is not suitable for working on images/feature maps for two reasons. 1) weights matrix size, 2) ignoring spatial connectivity. Assume a case where a recurrent unit is placed after a feature map with the spatial size of and have a number of channels . After flattening, it will turn into a long matrix. Therefore, weights of the recurrent unit will be of size
which is power four of spatial dimension. These matrices for weights can only be maintained for small feature maps. Even if the computation was not an issue, such design introduces too much variance in the network which prevents generalization. In Convolutional recurrent units, similar to regular convolutional layer, weights are three dimensional and they convolve with the input instead of dot product. Accordingly, the cell’s model, in the case of a GRU architecture, will turn into equations7 where the dot products are replaced with convolutions. In this design, weights matrices are of size where , , and are kernel’s height, kernel’s width, number of input channels, and number of filters, respectively. In Figure 2 the operations applied on the input and the previous step will all be convolutions instead. Since we can assume spatial connectivity in feature maps, kernel size can be very small compared to feature map’s spatial size. Therefore, this architecture is much more efficient and weights are easier to learn due to smaller search space.
We employ this approach for segmentation in a fully convolutional network. It is possible to apply this layer on either heat maps or feature maps. In the first case, the output of this layer will directly feed into the deconvolution layer and produces the pixel-wise probability map. In the latter case, at least one CNN layer needs to be used after the recurrent layer to convert its output feature maps to a heat map.
RFC-VGG in the Table 1 is an example of the second case. It is based on VGG-F 
network. Initializing weights of our filters by VGG-F trained weights, alleviates over-fitting problems as these weights are the result of extensive training on the imagenet. The network is cast to a fully convolutional one by replacing the fully connected layers with convolutional layers. The last two pooling layers are dropped from VGG-F to allow a finer segmentation. Then a convolutional gated recurrent unit is used followed by one convolutional layer and then deconvolution for up-sampling. Figure3 shows the detailed architecture of RFC-VGG.
This section presents our experiments and results. First, we describe the datasets that we used then, we discuss our training methods and hyper-parameters settings. Finally, quantitative and qualitative results are shown.
|input: 2828||input: 120180||input: 240360|
|Conv: F(5), P(10), D(20)||
|Conv: F(5), S(3), P(10), D(20)||
|Conv: F(11), S(4), P(40), D(64)|
|Pool 22||Pool 22||Pool 33|
|Conv: F(5), D(50)||Conv: F(5), D(50)||Conv: F(5), P(2) D(256)|
|Conv: F(3), D(500)||Conv: F(3), D(500)||Conv: F(3), P(1) D(256)|
|Conv: F(1), D(1)||Conv: F(1), D(1)||Conv: F(3), P(1) D(256)|
|Conv: F(3), P(1) D(256)|
|Conv: F(3), D(512)|
|Conv: F(3), D(128)|
|DeConv: F(10), S(4)||Flatten||ConvGRU: F(3), D(128)|
|Flatten||GRU: W(100100)||Conv: F(1), D(1)|
|GRU: W(784784)||DeConv: F(10), S(4)||DeConv: F(20), S(8)|
zero padding around the feature map.
denotes stride of lengthfor the convolution. denotes number of output feature maps from a particular layer for a layer (number of feature maps is same as previous layer if is not mentioned).
All the experiments are performed on our own implemented library. To our knowledge, there is no open source framework that accommodates RFCNN architecture. Therefore, we built our own library on top of Theano to create arbitrary networks that have can have FCN as a recursive module. Key features of this implementation are: (1) Supports networks with the temporal operation for images. The architecture can be any arbitrary CNN and an arbitrary number of recurrent layers. The network supports any length input. (2) Three gated architecture, LSTM, GRU, and Conv-GRU is available for the recurrent layer. (3) Deconvolution layer and skip architecture to support segmentation with FCN.
In this paper four datasets are used: 1) Moving MNIST. 2) Change detection. 3) Segtrack version 2. 4) Densely Annotated VIdeo Segmentation (Davis) . Figure 4 shows samples from the latter two.
Moving MNIST dataset is synthesized from original MNIST by moving the characters in random but consistent directions. The labels for segmentation is generated by thresholding input images after translation. We consider each translated image as a new frame. Therefore we can have arbitrary length sequence of images.
Change Detection Dataset This dataset provides realistic, diverse set of videos with pixel-wise labeling of moving objects. The dataset includes both indoor and outdoor scenes. It focuses on moving object segmentation. In the motion detection, we were looking for videos that have similar moving objects, e.g. cars or humans so that there would be semantic correspondences among sequences. Accordingly, we chose six videos: Pedestrians, PETS2006, Badminton, CopyMachine, Office, and Sofa.
SegTrack V2 is a collection of fourteen video sequences with objects of interest manually segmented. The dataset has sequences with both single or multiple objects. In the latter case, we consider all the segmented objects as one and we perform foreground segmentation.
Davis dataset has fifty high resolution and densely annotated videos with pixel accurate groundtruth. The videos include multiple challenges such as occlusions, fast motion, nonlinear deformation and motion blur.
The main experiments are conducted using Adadelta 
for optimization that practically gave much faster convergence than standard stochastic gradient descent. The logistic loss function is used and the maximum number of epochs used for the training is 500. The evaluation metrics used are precision, recall, F-measure and IoU, shown in equation8 and 10 where tp, fp, fn denote true positives, false positives, and false negatives respectively.
In this set of experiments, a fully convolutional VGG is used as a baseline denoted as FC-VGG. It is compared against the recurrent version RFC-VGG. The initial five convolutional layers are not finetuned and are from the pretrained VGG architecture to avoid overfitting the data. The results of the experiments on SegTrackV2 and Densely Annotated VIdeo Segmentation (DAVIS) datasets are provided in 2. In these experiments, the data of each sequence is divided into two splits with half as training data and the other half as keep out test data. It is apparent in the results that RFC-VGG outperforms the FC-VGG architecture on both datasets with about 3% and 5% on DAVIS and SegTrack respectively.
Figure 4 shows the qualitative analysis of RFC-VGG against FC-VGG. It shows that utilizing temporal information through the recurrent unit gives better segmentation for the object. This can be explained due to the implicit learning of the motion that segmented objects undergo in the recurrent units. It also shows that using conv-GRU as the recurrent unit can enable the extraction of temporal information from feature maps due to the reduced parameter set. Thus the recurrent unit can learn the motion pattern of the segmented objects by working on richer information from these feature maps. It’s also worth noting that the performance of the RFCN network depends on its baseline fully convolutional counter part. Thus with an enhanced fully convolutional network using skip architecture, to get finer segmentation, the recurrent version should improve as well.
4.3 Further Analysis
In this section, we present our experiments using conventional recurrent layers for segmentation. These experiments provide further analysis on different recurrent units. They also compare end-to-end architecture versus the decoupled one. We use moving MNIST and change detection datasets for this part.
Images of MNIST dataset are relatively small (2828) which allows us to test our RFC-Lenet(a) network 1. A fully convolutional Lenet is compared against RFC-Lenet(a). Table 3 shows the results that were obtained. The results of RFC-Lenet with GRU is better than FC-Lenet with 2% improvement. But segmentation of MNIST characters is an easy task as it ends up to learning thresholding. Note also that GRU gave better results than LSTM.
The second set of experiments is conducted on real data from the motion detection benchmark using RFC-12s. Throughout these experiments, the training constitutes 70% of the sequences and 30% for the test set. The experiments compared the recurrent convolutional network trained in an end-to-end fashion denoted as RFC-12s with its baseline, the fully convolutional network FC-12s. It is also compared against the decoupled training of the FC-12s and the recurrent unit. Where GRU is trained on the probability maps output from FC-12s. Table 4 shows the results of these experiments, where the RFC-12s network had a 1.4% improvement over FC-12s. We observe less relative improvement compared to using Conv-GRU because in regular GRU spatial connectivities are ignored. However, incorporating the temporal data still helped the segmentation accuracy.
5 Conclusion and Future Work
In this paper, we presented a novel approach in incorporating temporal information for video segmentation. This approach utilizes recurrent units on top of a fully convolutional network as a mean to include previously seen frames to decide the segmentation for the current frame. The paper also proposes using convolutional recurrent units (conv-gru in particular) for segmentation. We tested the method on both synthesized and real data for the segmentation task for different architectures. We showed that by having a recurrent layer after either probability map or feature map can improve the performance of the segmentation. The proposed architecture can process arbitrary length videos which makes it suitable for both batch and online scenarios.
For the future work, We want to extend our framework to semantic video segmentation using multiple recurrent layers for richer dynamic representation. We also want to apply our method on newer networks that are proposed for per image segmentation as in .
-  Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
-  Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
-  Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. Neural Networks, IEEE Transactions on, 5(2):157–166, 1994.
-  Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.
-  Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach,
Subhashini Venugopalan, Kate Saenko, and Trevor Darrell.
Long-term recurrent convolutional networks for visual recognition and
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.
-  Nil Goyette, Pierre-Marc Jodoin, Fatih Porikli, Janusz Konrad, and Prakash Ishwar. Changedetection. net: A new change detection benchmark dataset. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, pages 1–8. IEEE, 2012.
-  Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
-  Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.
-  Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.
-  Guosheng Lin, Chunhua Shen, Ian Reid, et al. Efficient piecewise training of deep structured models for semantic segmentation. arXiv preprint arXiv:1504.01013, 2015.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
-  Vincent Michalski, Roland Memisevic, and Kishore Konda. Modeling deep temporal dependencies with recurrent grammar cells””. In Advances in neural information processing systems, pages 1925–1933, 2014.
-  Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
-  Mircea Serban Pavel, Hannes Schulz, and Sven Behnke. Recurrent convolutional neural networks for object-class segmentation of rgb-d video. In Neural Networks (IJCNN), 2015 International Joint Conference on, pages 1–8. IEEE, 2015.
-  F Perazzi, J Pont-Tuset1 B McWilliams, L Van Gool, M Gross, and A Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Ilya Sutskever, James Martens, and Geoffrey E Hinton.
Generating text with recurrent neural networks.
Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024, 2011.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
-  Graham W Taylor, Rob Fergus, Yann LeCun, and Christoph Bregler. Convolutional learning of spatio-temporal features. In Computer Vision–ECCV 2010, pages 140–153. Springer, 2010.
-  Oriol Vinyals, Suman V Ravuri, and Daniel Povey. Revisiting recurrent neural networks for robust asr. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4085–4088. IEEE, 2012.
-  Francesco Visin, Kyle Kastner, Aaron Courville, Yoshua Bengio, Matteo Matteucci, and Kyunghyun Cho. Reseg: A recurrent neural network for object segmentation. arXiv preprint arXiv:1511.07053, 2015.
-  Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Visual tracking with fully convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 3119–3127, 2015.
-  Matthew D Zeiler. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
-  Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.