We propose an approach to learn spatio-temporal features in videos from intermediate visual representations we call "percepts" using Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts that are extracted from all levels of a deep convolutional network trained on the large ImageNet dataset. While high-level percepts contain highly discriminative information, they tend to have a low spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. Using low-level percepts can lead to high-dimensional video representations. To mitigate this effect and control the number of model parameters, we introduce a variant of the GRU model that leverages convolution operations to enforce sparse connectivity of the model units and share parameters across the input spatial locations. We empirically validate our approach on both Human Action Recognition and Video Captioning tasks. In particular, we achieve results equivalent to the state of the art on the YouTube2Text dataset using a simpler text-decoder model and without extra 3D CNN features.
Video analysis and understanding represent a major challenge for computer vision and machine learning research. While previous work has traditionally relied on hand-crafted and task-specific representations (Wang et al., 2011; Sadanand & Corso, 2012), there is a growing interest in designing general video representations that could help solve tasks in video understanding such as human action recognition, video retrieval or video captioning (Tran et al., 2014).
Two-dimensional Convolutional Neural Networks (CNNs) have exhibited state-of-the-art performance in still image tasks such as classification or detection (Simonyan & Zisserman, 2014b). However, such models discard temporal information that has been shown to provide important cues in videos (Wang et al., 2011). On the other hand, recurrent neural networks (RNNs) have demonstrated the ability to understand temporal sequences in various learning tasks such as speech recognition (Graves & Jaitly, 2014) or machine translation (Bahdanau et al., 2014). Consequently, Recurrent Convolution Networks (RCNs) (Srivastava et al., 2015; Donahue et al., 2014; Ng et al., 2015) that leverage both recurrence and convolution have recently been introduced for learning video representations. Such approaches typically extract “visual percepts” by applying a 2D CNN on the video frames and then feed the CNN activations to an RNN in order to characterize the temporal variation of the video.
Previous work on RCNs has tended to focus on high-level visual percepts extracted from the top layers of the 2D CNN. CNNs, however, hierarchically build up spatial invariance through pooling layers (LeCun et al., 1998; Simonyan & Zisserman, 2014b), as Figure 2 highlights. While CNNs tend to discard local information in their top layers, frame-to-frame temporal variation is known to be smooth. The motion of video patches tends to be restricted to a local neighborhood (Brox & Malik, 2011). For this reason, we argue that current RCN architectures are not well suited for capturing fine motion information. Instead, they are more likely to focus on global appearance changes such as shot transitions. To address this issue, we introduce a novel RCN architecture that applies an RNN not solely on the 2D CNN top layer but also on the intermediate convolutional layers. Convolutional layer activations, or convolutional maps, preserve a finer spatial resolution of the input video from which local spatio-temporal patterns are extracted.
Applying an RNN directly on intermediate convolutional maps, however, inevitably results in a very large number of parameters characterizing the input-to-hidden transformation due to the size of the convolutional maps. On the other hand, convolutional maps preserve the spatial topology of the frame. We propose to leverage this topology by introducing sparsity and locality in the RNN units to reduce the memory requirement. We extend the GRU-RNN model (Cho et al., 2014) and replace the fully-connected RNN linear product operation with a convolution. Our GRU extension therefore encodes the locality and temporal smoothness priors of videos directly in the model structure.
We evaluate our solution on UCF101 human action recognition from Soomro et al. (2012) as well as the YouTube2Text video captioning dataset from Chen & Dolan (2011). Our experiments show that leveraging “percepts” at multiple resolutions to model temporal variation improves performance over our baseline model with respective gains of for action recognition and for video captioning.
In this section, we review Gated-Recurrent-Unit (GRU) networks, which are a particular type of RNN. An RNN model is applied to a sequence of inputs, which can have variable lengths. It defines a recurrent hidden state whose activation at each time step depends on that of the previous time step. Specifically, given a sequence $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T)$, the RNN hidden state at time $t$ is defined as $\mathbf{h}_t = \phi(\mathbf{h}_{t-1}, \mathbf{x}_t)$, where $\phi$ is a nonlinear activation function. RNNs are known to be difficult to train due to the exploding or vanishing gradient effect (Bengio et al., 1994). However, variants of RNNs such as Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) or Gated Recurrent Units (GRU) (Cho et al., 2014) have empirically demonstrated their ability to model long-term temporal dependencies in various tasks such as machine translation or image/video caption generation. In this paper, we mainly focus on GRU networks as they have shown similar performance to LSTMs but with a lower memory requirement (Chung et al., 2014).
GRU networks allow each recurrent unit to adaptively capture dependencies of different time scales. The activation of the GRU is defined by the following equations:
where $\odot$ is an element-wise multiplication. $\mathbf{z}_t$ is an update gate that decides the degree to which the unit updates its activation, or content. $\mathbf{r}_t$ is a reset gate. $\sigma$ is the sigmoid function. When a unit $r_t^i$ is close to 0, the reset gate forgets the previously computed state, and makes the unit act as if it is reading the first symbol of an input sequence. $\tilde{\mathbf{h}}_t$ is a candidate activation which is computed similarly to that of the traditional recurrent unit in an RNN.
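For reference, a standard way to write the GRU update that matches the roles described above (update gate, reset gate, candidate activation) is sketched below. The weight names $\mathbf{W}$, $\mathbf{U}$ and the $(1 - \mathbf{z}_t)$ interpolation convention follow Chung et al. (2014) and are assumptions about notation; Cho et al. (2014) swap the roles of $\mathbf{z}_t$ and $1 - \mathbf{z}_t$.

```latex
\begin{aligned}
\mathbf{z}_t &= \sigma(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1}),\\
\mathbf{r}_t &= \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1}),\\
\tilde{\mathbf{h}}_t &= \tanh\big(\mathbf{W} \mathbf{x}_t + \mathbf{U}(\mathbf{r}_t \odot \mathbf{h}_{t-1})\big),\\
\mathbf{h}_t &= (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
```

With this convention, $\mathbf{z}_t$ close to 0 copies the previous state forward, while $\mathbf{z}_t$ close to 1 overwrites it with the candidate activation.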
This section delves into the main contributions of this work. We aim at leveraging visual percepts from different convolutional levels in order to capture temporal patterns that occur at different spatial resolutions.
Let’s consider $(\mathbf{x}_t^1, \ldots, \mathbf{x}_t^{L-1}, \mathbf{x}_t^L)_{t=1,\ldots,T}$, a set of 2D convolutional maps extracted from $L$ layers at different time steps in a video. We propose two alternative RCN architectures, GRU-RCN and Stacked-GRU-RCN (illustrated in Figure 2), that combine information extracted from those convolutional maps.
In the first RCN architecture, we propose to apply RNNs independently on each convolutional map. We define $L$ RNNs as $\phi^1, \ldots, \phi^L$, such that $\mathbf{h}_t^l = \phi^l(\mathbf{h}_{t-1}^l, \mathbf{x}_t^l)$. The hidden representations of the final time step, $\mathbf{h}_T^1, \ldots, \mathbf{h}_T^L$, are then fed to a classification layer in the case of action recognition, or to a text-decoder RNN for caption generation.
To implement the RNN recurrent function $\phi^l$, we propose to leverage Gated Recurrent Units (Cho et al., 2014). GRUs were originally introduced for machine translation. They model input-to-hidden and hidden-to-hidden transitions using fully connected units. However, convolutional map inputs are 3D tensors (spatial dimensions and input channels). Applying a GRU directly can lead to a very large number of parameters. Let $N_1 \times N_2$ and $O_x$ be the spatial size and number of channels of the input convolutional map. Applying a GRU directly would require the input-to-hidden parameters $\mathbf{W}^l$, $\mathbf{W}_z^l$ and $\mathbf{W}_r^l$ to be of size $N_1 \times N_2 \times O_x \times O_h$, where $O_h$ is the dimensionality of the GRU hidden representation.
Fully-connected GRUs do not take advantage of the underlying structure of convolutional maps. Indeed, convolutional maps are extracted from images that are composed of patterns with strong local correlation which are repeated over different spatial locations. In addition, videos vary smoothly over time, i.e. the motion associated with a given patch in successive frames is restricted to a local spatial neighborhood. We embed such a prior in our model structure and replace the fully-connected units in the GRU with convolution operations. We therefore obtain recurrent units that have sparse connectivity and share their parameters across different input spatial locations:
where $*$ denotes a convolution operation. In this formulation, the model parameters $\mathbf{W}^l, \mathbf{W}_z^l, \mathbf{W}_r^l$ and $\mathbf{U}^l, \mathbf{U}_z^l, \mathbf{U}_r^l$ are 2D-convolutional kernels. Our model results in a hidden recurrent representation $\mathbf{h}_t^l$ that preserves the spatial topology, where $\mathbf{h}_t^l(i, j)$ is a feature vector defined at the location $(i, j)$. To ensure that the spatial size of the hidden representation remains fixed over time, we use zero-padding in the recurrent convolutions.
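A form of the convolutional GRU update consistent with the description above, obtained by replacing the matrix products of a standard GRU with convolutions, is sketched below. The layer superscript $l$, the kernel names and the $(1 - \mathbf{z})$ interpolation convention are notational assumptions.

```latex
\begin{aligned}
\mathbf{z}_t^l &= \sigma(\mathbf{W}_z^l * \mathbf{x}_t^l + \mathbf{U}_z^l * \mathbf{h}_{t-1}^l),\\
\mathbf{r}_t^l &= \sigma(\mathbf{W}_r^l * \mathbf{x}_t^l + \mathbf{U}_r^l * \mathbf{h}_{t-1}^l),\\
\tilde{\mathbf{h}}_t^l &= \tanh\big(\mathbf{W}^l * \mathbf{x}_t^l + \mathbf{U}^l * (\mathbf{r}_t^l \odot \mathbf{h}_{t-1}^l)\big),\\
\mathbf{h}_t^l &= (1 - \mathbf{z}_t^l) \odot \mathbf{h}_{t-1}^l + \mathbf{z}_t^l \odot \tilde{\mathbf{h}}_t^l.
\end{aligned}
```

Every quantity here is a 3D tensor (channels and two spatial dimensions), so the gates act independently at each spatial location and channel.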
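Using convolution, the parameters $\mathbf{W}^l$, $\mathbf{W}_z^l$ and $\mathbf{W}_r^l$ have a size of $k_1 \times k_2 \times O_x \times O_h$, where $k_1 \times k_2$ is the convolutional kernel spatial size (usually ), chosen to be significantly lower than the convolutional map size $N_1 \times N_2$. The candidate hidden representation $\tilde{\mathbf{h}}_t(i, j)$, the update gate $\mathbf{z}_t(i, j)$ and the reset gate $\mathbf{r}_t(i, j)$ are defined based on a local neighborhood of size $k_1 \times k_2$ at the location $(i, j)$ in both the input data $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$. In addition, the size of the receptive field associated with $\mathbf{h}_t^l(i, j)$ increases in the previous hidden representations $\mathbf{h}_{t-1}^l, \mathbf{h}_{t-2}^l, \ldots$ as we go back further in time. Our model is therefore capable of characterizing spatio-temporal patterns with high spatial variation in time.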
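The following NumPy sketch illustrates one way such a convolutional GRU step could be implemented. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the parameter dictionary layout, the random initialization and the $(1 - z)$ interpolation convention are all hypothetical, and the "convolution" is the cross-correlation commonly used in deep learning libraries.

```python
import numpy as np


def conv2d_same(x, k):
    """Zero-padded 2D cross-correlation.

    x: (C_in, H, W) input map; k: (C_out, C_in, kh, kw) kernel with odd kh, kw.
    Returns a (C_out, H, W) map, so the spatial size is preserved.
    """
    c_out, _, kh, kw = k.shape
    _, h, w = x.shape
    xp = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c_out, h, w))
    for di in range(kh):
        for dj in range(kw):
            # Shifted view of the padded input for this kernel offset.
            patch = xp[:, di:di + h, dj:dj + w]
            out += np.einsum("oc,chw->ohw", k[:, :, di, dj], patch)
    return out


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def conv_gru_step(x_t, h_prev, p):
    """One recurrent step; every matrix product of a GRU becomes a convolution."""
    z = sigmoid(conv2d_same(x_t, p["Wz"]) + conv2d_same(h_prev, p["Uz"]))
    r = sigmoid(conv2d_same(x_t, p["Wr"]) + conv2d_same(h_prev, p["Ur"]))
    h_tilde = np.tanh(conv2d_same(x_t, p["W"]) + conv2d_same(r * h_prev, p["U"]))
    return (1.0 - z) * h_prev + z * h_tilde


def init_params(o_x, o_h, k=3, seed=0):
    """Hypothetical small random initialization of the six kernels."""
    rng = np.random.default_rng(seed)
    shapes = {}
    for gate in ("", "z", "r"):
        shapes["W" + gate] = (o_h, o_x, k, k)  # input-to-hidden kernels
        shapes["U" + gate] = (o_h, o_h, k, k)  # hidden-to-hidden kernels
    return {name: 0.1 * rng.standard_normal(shape) for name, shape in shapes.items()}


def run_conv_gru(xs, params, o_h):
    """Run the recurrence over xs of shape (T, O_x, H, W); return the last state."""
    h = np.zeros((o_h,) + xs[0].shape[1:])
    for x_t in xs:
        h = conv_gru_step(x_t, h, params)
    return h
```

Because each step only mixes information within a kernel-sized neighborhood, a perturbation at one location reaches distant locations only after several time steps, which is the growth of the receptive field over time described above.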
A GRU-RCN layer applies six 2D convolutions at each time step (2 per GRU gate and 2 for computing the candidate activation). If we assume for simplicity that the input-to-hidden and hidden-to-hidden convolutions have the same kernel size and preserve the input dimension, GRU-RCN requires multiplications. The sparse connectivity of GRU-RCN therefore saves computation compared to a fully-connected RNN that would require computations. Memory-wise, GRU-RCN needs to store the parameters for all 6 convolution kernels, leading to parameters.
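To make the savings concrete, the sketch below counts parameters and multiply-accumulates for a convolutional GRU layer and compares the input-to-hidden parameter count with a fully-connected GRU applied to the flattened map. The counting conventions are assumptions made for illustration (padded positions are counted as multiplications, and biases are ignored), and the example sizes in the usage note are hypothetical rather than taken from the experiments.

```python
def conv_gru_costs(n1, n2, k1, k2, o_x, o_h, t):
    """Parameter and multiply-accumulate counts for a convolutional GRU layer.

    Three input-to-hidden kernels (k1 x k2 x o_x x o_h) and three
    hidden-to-hidden kernels (k1 x k2 x o_h x o_h) are applied at each of the
    n1 x n2 locations for t time steps.
    """
    params = 3 * k1 * k2 * o_h * (o_x + o_h)
    mults = t * n1 * n2 * params  # every weight is used once per location and step
    return params, mults


def conv_gru_input_params(k1, k2, o_x, o_h):
    """Input-to-hidden parameters of the convolutional GRU (three kernels)."""
    return 3 * k1 * k2 * o_x * o_h


def fc_gru_input_params(n1, n2, o_x, o_h):
    """Input-to-hidden parameters of a fully-connected GRU on the flattened map."""
    return 3 * n1 * n2 * o_x * o_h
```

For a hypothetical 28 x 28 map with 256 input and 256 hidden channels and 3 x 3 kernels, `fc_gru_input_params(28, 28, 256, 256) / conv_gru_input_params(3, 3, 256, 256)` equals `(28 * 28) / (3 * 3)`, roughly 87: the saving is the ratio of the map area to the kernel area.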
In the second RCN architecture, we investigate the importance of bottom-up connections across RNNs. While GRU-RCN applies each layer-wise GRU-RNN independently, Stacked GRU-RCN preconditions each GRU-RNN on the output of the previous GRU-RNN at the current time step: $\mathbf{h}_t^l = \phi^l(\mathbf{h}_{t-1}^l, \mathbf{h}_t^{l-1}, \mathbf{x}_t^l)$. The previous RNN hidden representation is given as an extra input to the GRU convolutional units:
Adding this extra connection brings more flexibility and gives the model the opportunity to leverage representations with different resolutions.
Deep learning approaches have recently been used to learn video representations and have produced state-of-the-art results (Karpathy et al., 2014; Simonyan & Zisserman, 2014a; Wang et al., 2015b; Tran et al., 2014). Karpathy et al. (2014) and Tran et al. (2014) proposed to use 3D CNNs to learn video representations, leveraging large training datasets such as Sports-1M. However, unlike in image classification (Simonyan & Zisserman, 2014b), CNNs did not yield large improvements over traditional methods (Lan et al., 2014), highlighting the difficulty of learning video representations even with large training datasets. Simonyan & Zisserman (2014a) introduced a two-stream framework where they train CNNs independently on RGB and optical flow inputs. While the flow stream focuses only on motion information, the RGB stream can leverage 2D CNNs pre-trained on image datasets. Based on the two-stream representation, Wang et al. (2015a) extracted deep features and conducted trajectory-constrained pooling to aggregate convolutional features as video representations.
RNN models have also been used to encode temporal information for learning video representations in conjunction with 2D CNNs. Ng et al. (2015) and Donahue et al. (2014) applied an RNN on top of the two-stream framework, while Srivastava et al. (2015) proposed, in addition, to investigate the benefit of learning a video representation in an unsupervised manner. Previous work on this topic has tended to focus only on high-level CNN “visual percepts”. In contrast, our approach proposes to leverage visual “percepts” extracted from different layers in the 2D CNN.
Recently, Shi et al. (2015) also proposed to leverage convolutional units inside an RNN. However, they focus on a different task (nowcasting) and a different RNN model based on an LSTM. In addition, they applied their model directly on pixels. Here, we use recurrent convolutional units on pre-trained CNN convolutional maps to extract temporal patterns from visual “percepts” with different spatial sizes.
This section presents an empirical evaluation of the proposed GRU-RCN and Stacked GRU-RCN architectures. We conduct experiments on two different tasks: human action recognition and video caption generation.
We evaluate our approach on the UCF101 dataset (Soomro et al., 2012). This dataset has 101 action classes spanning 13,320 YouTube video clips. Videos composing the dataset are subject to large camera motion, viewpoint change and cluttered backgrounds. We report results on the first split of UCF101, as it is the most commonly used split in the literature. To perform proper hyperparameter search, we use the videos from the UCF-Thumos validation split (Jiang et al., 2014) as the validation set.
In this experiment, we consider the RGB and flow representations of videos as inputs. We extract visual “percepts” using VGG-16 CNNs that consider either RGB or flow inputs. The VGG-16 CNNs are pretrained on ImageNet (Simonyan & Zisserman, 2014b) and fine-tuned on the UCF-101 dataset, following the protocol in Wang et al. (2015b). We then extract the convolutional maps from the pool2, pool3, pool4 and pool5 layers and the fully-connected map from layer fc-7 (which can be viewed as a feature map with a $1 \times 1$ spatial dimension). These feature maps are given as inputs to our RCN models.
We design and evaluate three RCN architectures for action recognition. In the first RCN architecture, GRU-RCN, we apply 5 convolutional GRU-RNNs independently on each convolutional map. Each GRU-RCN uses zero-padded convolutions that preserve the spatial dimension of the inputs. The numbers of channels of the respective GRU-RNN hidden representations are , , , , . After the RCN operation we obtain hidden representations for each time step. We apply average pooling on the hidden representations of the last time step to reduce their spatial dimension to , and feed the representations to classifiers, each composed of a linear layer with a softmax nonlinearity. Each classifier therefore focuses on only one hidden representation extracted from the convolutional map of a specific layer. The classifier outputs are then averaged to get the final decision. A dropout ratio of is applied on the input of each classifier.
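The per-layer classification and averaging described above can be sketched as follows. This is a minimal illustration under assumptions: average pooling is taken over the full spatial extent of each map, dropout is omitted because the sketch covers inference only, and the function names and classifier weights are hypothetical.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()


def classify_layer(h_last, weight, bias):
    """Classify from one layer's last hidden map of shape (C, H, W)."""
    pooled = h_last.mean(axis=(1, 2))  # spatial average pooling -> (C,)
    return softmax(weight @ pooled + bias)


def predict(h_lasts, classifiers):
    """Average the class probabilities of the per-layer classifiers.

    h_lasts: list of (C_l, H_l, W_l) maps, one per convolutional GRU.
    classifiers: list of (weight, bias) pairs, weight of shape (n_classes, C_l).
    """
    probs = [classify_layer(h, w, b) for h, (w, b) in zip(h_lasts, classifiers)]
    return np.mean(probs, axis=0)
```

Since each classifier outputs a probability vector, their average is again a probability vector, and the predicted action is its argmax.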
In the second RCN architecture, Stacked GRU-RCN, we investigate the usefulness of bottom-up connections. Our stacked GRU-RCN uses the same base architecture as the GRU-RCN, consisting of 5 convolutional GRU-RNNs having channels respectively. However, each convolutional GRU-RNN is now preconditioned on the hidden representation that the GRU-RNN applied on the previous convolutional map outputs. We apply max-pooling on the hidden representations between the GRU-RNN layers to make their spatial dimensions compatible. As for the previous architecture, each GRU-RNN hidden representation at the last time step is pooled and then given as input to a classifier.
Finally, in our bi-directional GRU-RCN, we investigate the importance of reverse temporal information. Given convolutional maps extracted from one layer, we run the GRU-RCN twice, considering the inputs in both sequential and reverse temporal order. We then concatenate the last hidden representations of the forward GRU-RCN and backward GRU-RCN, and give the resulting vector to a classifier.
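The bi-directional scheme can be sketched generically for any recurrent step function. The names below are hypothetical, and two choices are assumptions not settled by the text: the forward and backward passes are allowed to use different step functions (and hence different parameters), and the final states are flattened before concatenation.

```python
import numpy as np


def run_rnn(step, xs, h0):
    """Apply step(x_t, h) over the sequence xs and return the last state."""
    h = h0
    for x_t in xs:
        h = step(x_t, h)
    return h


def bidirectional_last_states(step_fwd, step_bwd, xs, h0):
    """Run the sequence in both temporal orders and concatenate the final states."""
    h_fwd = run_rnn(step_fwd, xs, h0)
    h_bwd = run_rnn(step_bwd, xs[::-1], h0)  # same inputs, reversed in time
    return np.concatenate([h_fwd.ravel(), h_bwd.ravel()])
```

In the setting described above, `step_fwd` and `step_bwd` would be convolutional GRU steps and `xs` the convolutional maps of one layer over time; any function with the same signature works for experimentation.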
We follow the training procedure introduced by the two-stream framework (Simonyan & Zisserman, 2014a). At each iteration, a batch of 64 videos is sampled randomly from the training set. To perform scale augmentation, we randomly sample the cropping width and height from . The temporal cropping size is set to . We then resize the cropped volume to . We estimate the parameters of each model by maximizing the model log-likelihood:
where there are $N$ training video-action pairs $(\mathbf{x}^n, y^n)$ and $c$ is a function that takes a crop at random. We use Adam (Kingma & Ba, 2014).
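A log-likelihood objective consistent with this description can be written as follows. The symbols $N$, $\mathbf{x}^n$, $y^n$, $c$ and $\boldsymbol{\theta}$ (the model parameters) are notational assumptions, as is the $1/N$ normalization.

```latex
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \log p\big(y^n \mid c(\mathbf{x}^n), \boldsymbol{\theta}\big)
```

Because $c$ draws a new random crop each time a video is visited, the objective is a stochastic estimate of the likelihood over all possible crops.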
We also follow the evaluation protocol of the two-stream framework (Simonyan & Zisserman, 2014a). At test time, we sample 25 equally spaced video sub-volumes with a temporal size of 10 frames. From each of these selected sub-volumes, we obtain 10 inputs for our model, i.e. 4 corners, 1 center, and their horizontal flips. The final prediction score is obtained by averaging across the sampled sub-volumes and their cropped regions.
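The test-time protocol above can be sketched as follows. The function names, the `(T, H, W, C)` frame layout, the square crop `size` parameter and the `model` callable are hypothetical; the sketch assumes the video has at least 10 frames and that `size` does not exceed the frame height or width.

```python
import numpy as np


def subvolume_starts(num_frames, num_clips=25, clip_len=10):
    """Start indices of equally spaced sub-volumes of clip_len frames."""
    last_start = max(num_frames - clip_len, 0)
    return np.linspace(0, last_start, num_clips).round().astype(int)


def ten_crops(frames, size):
    """Four corner crops, one center crop, and their horizontal flips.

    frames: (T, H, W, C) array; returns a list of 10 arrays (T, size, size, C).
    """
    _, h, w, _ = frames.shape
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [frames[:, top:top + size, left:left + size] for top, left in offsets]
    return crops + [c[:, :, ::-1] for c in crops]  # flip along the width axis


def video_score(frames, model, size, num_clips=25, clip_len=10):
    """Average model scores over all sub-volumes and their crops."""
    scores = []
    for start in subvolume_starts(len(frames), num_clips, clip_len):
        for crop in ten_crops(frames[start:start + clip_len], size):
            scores.append(model(crop))
    return np.mean(scores, axis=0)
```

With the default settings each video is scored from 25 x 10 = 250 model evaluations, so test-time cost scales linearly with both the number of sub-volumes and the number of crops.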
|Method||RGB||Flow|
|Two-Stream Simonyan & Zisserman (2014b)||72.8||81.2|
|Two-Stream + LSTM Donahue et al. (2014)||71.1||76.9|
|Two-Stream + LSTM + Unsupervised Srivastava et al. (2015)||77.7||83.7|
|Improved Two-Stream Wang et al. (2015b)||79.8||85.7|
|C3D one network Tran et al. (2014), 1 million videos as training||82.3||-|
|C3D ensemble Tran et al. (2014), 1 million videos as training||85.2||-|
|Deep networks Karpathy et al. (2014), 1 million videos as training||65.2||-|
We compare our approach with two different baselines, VGG-16 and VGG-16 RNN. VGG-16 is the 2D spatial stream that is described in Wang et al. (2015b). We take the VGG-16 model, pretrained on ImageNet, and fine-tune it on the UCF-101 dataset. The VGG-16 RNN baseline applies an RNN, using fully-connected gated recurrent units, on top of VGG-16. It takes as input the VGG-16 fully-connected representation fc-7. Following the GRU-RCN top layer, the VGG-16 RNN has a hidden-representation dimensionality of .
The first column of Table 1 focuses on RGB inputs. We first report results of the different GRU-RCN variants and compare them with the two baselines, VGG-16 and VGG-16 RNN. Our GRU-RCN variants all outperform the baselines, showing the benefit of delving deeper into a CNN in order to learn a video representation. We notice that VGG-16 RNN only slightly improves over the VGG-16 baseline, against . This result confirms that the CNN top layer tends to discard temporal variation over short temporal windows. Stacked GRU-RCN performs significantly worse than GRU-RCN and Bi-directional GRU-RCN. We argue that the bottom-up connections, which increase the depth of the model, combined with the lack of training data (the UCF-101 training set is composed of only 9,500 videos) make the Stacked GRU-RCN difficult to train. The bi-directional GRU-RCN performs best among the GRU-RCN variants with an accuracy of , showing the advantage of modeling temporal information in both sequential and reverse order. Bi-directional GRU-RCN obtains a gain in terms of performance relative to the baselines that focus only on the VGG-16 top layer.
Table 1 also reports results from other state-of-the-art approaches using RGB inputs. C3D (Tran et al., 2014) obtains the best performance on UCF-101 with 85.2%. However, it should be noted that C3D is trained on 1 million videos. Other approaches use only the 9,500 videos of the UCF101 training set for learning temporal patterns. Our Bi-directional GRU-RCN compares favorably with other Recurrent Convolution Networks (second block), confirming the benefit of using different CNN layers to model temporal variation.
Table 1 also evaluates the GRU-RCN model applied to flow inputs. The VGG-16 RNN baseline actually decreases the performance compared to the VGG-16 baseline. On the other hand, GRU-RCN outperforms the VGG-16 baseline, achieving against . While the improvement is smaller than for the RGB stream, it should be noted that the flow stream of VGG-16 is applied on 10 consecutive flow inputs to extract visual “percepts”, and therefore already captures some motion information.
Finally, we investigate the combination of the RGB and flow streams. Following Wang et al. (2015b), we use a weighted linear combination of their prediction scores, where the weight is set to for the flow stream net and for the temporal stream. Fusing the VGG-16 baseline models achieves an accuracy of . Combining the RGB Bi-directional GRU-RCN with the flow GRU-RCN achieves a performance gain of over the baseline, reaching . Our model is on par with Wang et al. (2015b), who obtain state-of-the-art results using both RGB and flow streams, reaching .
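A weighted linear combination of two streams' prediction scores can be sketched as below. The function name and the normalization by the sum of weights are assumptions (the normalization does not change the predicted class), and the weights used in the usage note are hypothetical placeholders, not the values used in the experiments.

```python
import numpy as np


def fuse_streams(rgb_scores, flow_scores, w_rgb, w_flow):
    """Weighted linear combination of per-class scores from two streams."""
    rgb_scores = np.asarray(rgb_scores, dtype=float)
    flow_scores = np.asarray(flow_scores, dtype=float)
    return (w_rgb * rgb_scores + w_flow * flow_scores) / (w_rgb + w_flow)
```

For instance, `fuse_streams([0.6, 0.4], [0.2, 0.8], w_rgb=1.0, w_flow=2.0)` favors the second class because the more heavily weighted stream does.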
We also evaluate our representation on the video captioning task using the YouTube2Text video corpus (Chen & Dolan, 2011). The dataset has 1,970 video clips with multiple natural language descriptions for each video clip. The dataset is open-domain and covers a wide range of topics such as sports, animals, music and movie clips. Following Yao et al. (2015b), we split the dataset into a training set of 1,200 video clips, a validation set of 100 clips and a test set consisting of the remaining clips.
To perform video captioning, we use the so-called encoder-decoder framework (Cho et al., 2014). In this framework the encoder maps input videos into abstract representations that precondition a caption-generating decoder.
As for the encoder, we compare both VGG-16 CNN and Bi-directional GRU-RCN. Both models have been fine-tuned on the UCF-101 dataset and therefore focus on detecting actions. To extract an abstract representation from a video, we sample equally spaced segments. When using the VGG-16 encoder, we provide the layer activations of each segment’s first frame as the input to the text decoder. For the GRU-RCN, we apply our model on the segment’s first 10 frames. We concatenate the GRU-RCN hidden representations from the last time step. The concatenated vector is given as the input to the text decoder. As it has been shown that characterizing entities in addition to actions is important for the caption-generation task (Yao et al., 2015a), we also use as encoder a CNN (Szegedy et al., 2014), pretrained on ImageNet, that focuses on detecting static visual object categories.
As for the decoder, we use an LSTM text generator with soft attention on the video temporal frames (Yao et al., 2015b).
For all video captioning models, we estimated the parameters of the decoder by maximizing the log-likelihood:
where there are $N$ training video-description pairs $(\mathbf{x}^n, \mathbf{y}^n)$, and each description $\mathbf{y}^n$ is $|\mathbf{y}^n|$ words long. We used Adadelta (Zeiler, 2012). We optimized the hyperparameters (e.g. the number of LSTM units, the word embedding dimensionality and the number of segments) using random search (Bergstra & Bengio, 2012) to maximize the log-probability of the validation set.
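A caption log-likelihood consistent with this description factorizes each description word by word, as is standard for encoder-decoder models. The symbols below ($N$, $\mathbf{x}^n$, $\mathbf{y}^n$, $y_i^n$ for the $i$-th word, $\boldsymbol{\theta}$ for the decoder parameters) and the $1/N$ normalization are notational assumptions.

```latex
\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{|\mathbf{y}^n|} \log p\big(y_i^n \mid y_{<i}^n, \mathbf{x}^n, \boldsymbol{\theta}\big)
```

Each word is predicted from the video representation $\mathbf{x}^n$ and the words that precede it, so longer descriptions contribute more terms to the sum.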
|Model||Model selection||BLEU||METEOR||CIDEr|
|Bi-directional GRU-RCN Encoder||BLEU||0.4100||0.2850||0.5010|
|GoogleNet + Bi-directional GRU-RCN Encoder||BLEU||0.4963||0.3075||0.5937|
|GoogleNet + Bi-directional GRU-RCN Encoder||NLL||0.4790||0.3114||0.6782|
|GoogleNet + Bi-directional GRU-RCN Encoder||METEOR||0.4842||0.3170||0.6538|
|GoogleNet + Bi-directional GRU-RCN Encoder||CIDEr||0.4326||0.3160||0.6801|
|GoogleNet + HRNE (Pan et al., 2015)||-||0.436||0.321||-|
|VGG + p-RNN (Yu et al., 2015)||-||0.443||0.311||-|
|VGG + C3D + p-RNN (Yu et al., 2015)||-||0.499||0.326||-|
|Soft-attention Yao et al. (2015b)||-||0.4192||0.2960||0.5167|
|Venugopalan et al. (2015)||-||0.3119||0.2687||-|
|+ Extra Data (Flickr30k, COCO)||-||0.3329||0.2907||-|
|Thomason et al. (2014)||-||0.1368||0.2390||-|
Table 2 reports the performance of our proposed method using three automatic evaluation metrics. These are BLEU (Papineni et al., 2002), METEOR (Denkowski & Lavie, 2014) and CIDEr (Vedantam et al., 2014). We use the evaluation script prepared and introduced in Chen et al. (2015). All models are early-stopped based on the negative log-likelihood (NLL) of the validation set. We then select the model that performs best on the validation set according to the metric under consideration.
The first two lines of Table 2 compare the performance of the VGG-16 and Bi-directional GRU-RCN encoders. Results clearly show the superiority of the Bi-directional GRU-RCN Encoder as it outperforms the VGG-16 Encoder on all three metrics. In particular, the GRU-RCN Encoder obtains a performance gain of compared to the VGG-16 Encoder according to the BLEU metric. Combining our GRU-RCN Encoder that focuses on action with a GoogLeNet Encoder that captures visual entities further improves performance.
Our GoogLeNet + Bi-directional GRU-RCN approach significantly outperforms Soft-attention (Yao et al., 2015b), which relies on a GoogLeNet and cuboid-based 3D-CNN Encoder in conjunction with a similar soft-attention decoder. This result indicates that our approach is able to offer more effective representations. According to the BLEU metric, we also outperform other approaches using more complex decoder schemes such as a spatial and temporal attention decoder (Yu et al., 2015) or a hierarchical RNN decoder (Pan et al., 2015). Our approach is on par with Yu et al. (2015), without needing a C3D encoder that requires training on a large-scale video dataset.
In this work, we address the challenging problem of learning discriminative and abstract representations from videos. We identify and underscore the importance of modeling temporal variation from “visual percepts” at different spatial resolutions. While high-level percepts contain highly discriminative information, they tend to have a low spatial resolution. Low-level percepts, on the other hand, preserve a higher spatial resolution from which we can model finer motion patterns. We introduce a novel recurrent convolutional network architecture that leverages convolutional maps from all levels of a deep convolutional network trained on the ImageNet dataset to take advantage of “percepts” at different spatial resolutions.
We have empirically validated our approach on the Human Action Recognition and Video Captioning tasks using the UCF-101 and YouTube2Text datasets. Experiments demonstrate that leveraging “percepts” at multiple resolutions to model temporal variation improves over our baseline model, with respective gains of and for the action recognition and video captioning tasks using RGB inputs. In particular, we achieve results comparable to the state of the art on YouTube2Text using a simpler text-decoder model and without extra 3D CNN features.
The authors would like to acknowledge the support of the following agencies for research funding and computing support: NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. We would also like to thank the developers of Theano (Bergstra et al., 2010; Bastien et al., 2012) for developing such a powerful tool for scientific computing.