Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding frames of a clip with different intervals. This learning process makes the learned model more capable of dealing with motion speed variance. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., the present→past transition and the present→future transition, reflecting the temporal information in different views. The proposed method exploits the two transitions simultaneously by incorporating a bidirectional reconstruction which consists of a backward reconstruction and a forward reconstruction. We apply the proposed method to two challenging video tasks, i.e., complex event detection and video captioning, in which it achieves state-of-the-art performance. Notably, our method generates the best single feature for event detection with a relative improvement of 10.4%, and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
Temporal information plays a key role in video representation modeling. In earlier years, hand-crafted features, e.g., Dense Trajectories (DT) and improved Dense Trajectories (iDT) [45, 46], used local descriptors along trajectories to model video motion structures. Despite achieving promising performance, DT and iDT are very expensive to extract due to the heavy computational cost of optical flows; it takes about a week to extract iDT features for 8,000 hours of web videos using 1,000 CPU cores. Deep visual features have recently achieved significantly better performance than hand-crafted features in image classification and detection tasks, at an efficient processing speed [22, 13, 11]. However, learning a video representation on top of deep Convolutional Neural Networks (ConvNets) remains a challenging problem. The two-stream ConvNet is groundbreaking in learning video motion structures over short video clips. Although it achieves comparable performance to iDT on temporally trimmed videos, the two-stream ConvNet still needs to extract optical flows. This heavy cost severely limits the utility of methods based on optical flows, especially in the case of large-scale video data.
Extending 2D ConvNets to 3D, the C3D ConvNet has been demonstrated to be effective for spatio-temporal modeling, and it avoids extracting optical flows. However, it can only model temporal information in short videos, usually of 16 frames. Recurrent Neural Networks (RNNs), e.g., LSTM [15, 26] and a modified Hierarchical Recurrent Neural Encoder (HRNE), have been used to model temporal information in videos. A major limitation of these approaches is that the input frames are encoded with a fixed sampling rate when training the RNNs. On the other hand, the motion speed varies even within the same video. As shown in Figure 1, there is almost no apparent motion in the first four frames, but fast motion is observed in the last three frames. The encoding rate should be correspondingly low for the first four frames but high for the last three, as indicated by the solid arrow. The fixed-rate strategy, however, is redundant for the first four frames, while important information in the last three frames is lost. The gap between the fixed encoding rate and the motion speed variance in real-world videos may degrade performance, especially when the variance is extensive.
Notwithstanding the appealing ability of end-to-end approaches to learn a discriminative feature, such approaches require a large amount of labeled data to achieve good performance with plausible generalization capabilities. Compared to images, videos are much more expensive for humans to label. For example, the largest public human-labeled video dataset (ActivityNet)
only has 20,000 labeled videos, while the ImageNet dataset has over one million labeled instances. A temporal ConvNet trained on the UCF-101 dataset with about 10,000 temporally trimmed videos did not generalize well to a temporally untrimmed dataset. Targeting short video clips, Srivastava et al.
proposed training a composite autoencoder in an unsupervised manner to learn video temporal structures, essentially by predicting future frames and reconstructing present frames. Inspired by a recent study in neuroscience which shows that a common brain network underlies the capacity both to remember the past and to imagine the future, we consider reconstructing two temporal transitions, i.e., the present→past transition and the present→future transition. Importantly, video motion speed changes constantly in untrimmed videos, and Srivastava et al. directly used an LSTM with a single fixed sampling rate, making it vulnerable to motion speed variance.
In this paper, we propose an unsupervised method to learn from untrimmed videos for temporal information modeling without the heavy cost of computing optical flows. It makes the following two major contributions. First, our Multirate Visual Recurrent Model adopts multiple encoding rates; together with the reset gate and the update gate in the Gated Recurrent Unit, it enables communication between different encoding rates and collaboratively learns a multirate representation which is robust to motion speed variance in videos. Second, we leverage the mutual benefit of two learning processes by reconstructing the temporal context in two directions. The two learning directions regularize each other, thereby reducing overfitting. The two contributions yield a new video representation, which achieves the best performance in two different tasks. Note that the method we compare against has been demonstrated to be the best single feature for event detection, and our method outperforms it with relative improvements of 10.4% and 4.5% on two challenging datasets, i.e., MEDTest-13 and MEDTest-14 respectively. In the video captioning task, our single feature outperforms other state-of-the-art methods across all evaluation metrics, most of which use multiple features. It is worth mentioning that only in very rare cases can one method outperform all others for video captioning over all evaluation metrics. These results demonstrate the effectiveness of the proposed method.
Research efforts to improve visual representations for videos have been ongoing. Local features such as HOF  and MBH  extracted along spatio-temporal tracklets have been used as motion descriptors in the Dense Trajectories feature  and its variants . However, it is notoriously inefficient to extract hand-crafted features like improved Dense Trajectories (iDT) [46, 48], mostly due to the dense sampling nature of local descriptors and the time-consuming extraction of optical flows. On the other hand, the classification performance of state-of-the-art hand-crafted features has been surpassed by many methods based on neural networks in web video classification and action recognition tasks [48, 47].
Convolutional Networks for video classification. One way to use ConvNets for video classification is to perform temporal pooling over convolutional activations. Ng et al. 
proposed learning a global video representation by using max pooling over the last convolutional layer across video frames. Wang et al. aggregated ConvNet features along the tracklets obtained from iDT. Xu et al. applied VLAD encoding over ConvNet activations and found that encoding methods are superior to mean pooling. The other common solution is to feed multiple frames as input to ConvNets. Karpathy et al. proposed a convolutional temporal fusion network, but it is only marginally better than the single-frame baseline. Tran et al. avoided the extraction of optical flows by utilizing 3D ConvNets to model motion information. Simonyan and Zisserman fed optical flows as the flow image input to a ConvNet, and this two-stream network has much better performance than the previous networks on action recognition.
Recurrent Networks for video classification. Ng et al. and Donahue et al. investigated the modeling of temporal structures in videos with Long Short-Term Memory (LSTM). However, even with five-layer LSTMs trained on millions of videos, they do not show promising performance compared to ConvNets. Patraucean et al. used a spatio-temporal autoencoder to model video sequences through optical flow prediction and reconstruction of the next frame. Ballas et al. used a Convolutional Gated Recurrent Unit (ConvGRU) which leverages information from different spatial levels of the activations. Srivastava et al. used LSTM to model video sequences in an unsupervised way. In this work, we utilize RNNs for video representation learning, improving the representation by making it aware of the multirate nature of video content. Moreover, the temporal consistency between neighboring frames is incorporated into the networks in an unsupervised way, providing richer training information and creating opportunities to learn from abundant untrimmed videos.
Video captioning. Video captioning has emerged as a popular task in recent years, since it bridges visual understanding and natural language description. Conditioned on the visual context, RNNs produce one word per step to generate captions for videos. Venugopalan et al. used a stacked sequence-to-sequence (seq2seq) model, in which one LSTM is used as a video sequence encoder and another LSTM serves as a caption decoder. Yao et al. incorporated the temporal attention mechanism in the description decoding stage. Pan et al. proposed using a hierarchical LSTM to model video sequences, while Yu et al. used a hierarchical GRU network to model the structure of captions. In this work, we demonstrate that the strong video representation learned by our model improves the video captioning task, confirming the generalization ability of our features.
In this section, we introduce our approach for video sequence modeling. We first review the structure of Gated Recurrent Unit (GRU) and extend the GRU to a multirate version. The model architecture for unsupervised representation learning is then introduced, which is followed by task specific models for event detection and video captioning. In the model description, we omit all bias terms in order to increase readability.
Gated Recurrent Unit. At each step $t$, a GRU cell takes a frame representation $x_t$ and the previous state $h_{t-1}$ as inputs and generates a hidden state $h_t$ and an output $o_t$, which are calculated by

$r_t = \sigma(W_r x_t + U_r h_{t-1}),$
$z_t = \sigma(W_z x_t + U_z h_{t-1}),$
$\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1})),$
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t,$   (1)

where $x_t$ is the input, $r_t$ is the reset gate, $z_t$ is the update gate, $\tilde{h}_t$ is the proposed state, $h_t$ is the internal state, $\sigma$ is the sigmoid activation function, $W_*$ and $U_*$ are weight matrices, and $\odot$ is element-wise multiplication. The output $o_t$ is calculated by a linear transformation from the state, $o_t = V h_t$. We denote the whole process as $h_t = \mathrm{GRU}(x_t, h_{t-1})$, and when it has iterated $T$ steps, we obtain the state of the last step, $h_T$.
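As a concrete reference, the gating equations above can be sketched in a few lines of NumPy (an illustrative sketch with random toy weights; bias terms are omitted as in the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU step following Eq. 1 (bias terms omitted)."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # proposed state
    h = z * h_prev + (1.0 - z) * h_tilde           # internal state
    return h

# Encode a short toy sequence and keep the last state h_T.
rng = np.random.default_rng(0)
d_in, d_h = 4, 8
params = [rng.normal(scale=0.1, size=s)
          for s in [(d_h, d_in), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for t in range(5):
    h = gru_step(rng.normal(size=d_in), h, *params)
```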
Multirate Gated Recurrent Unit (mGRU). Inspired by the clockwork RNN, we extend the GRU cell to a multirate version. The clockwork RNN uses delayed connections for inputs and inter-connections between steps to capture longer dependencies. Unlike traditional RNNs, where all units in the state follow the protocol in Eq. 1, the states and weights in the clockwork RNN are divided into groups to model information at different rates. We divide the state $h$ into $g$ groups, and each group $i$ has a clock period $T_i$, where $i \in \{1, \dots, g\}$. The periods $T_i$ can be arbitrary numbers; we empirically use $g = 3$ groups and set $T_i = 2^{i-1}$. Faster groups (with smaller $T_i$) take inputs more frequently than slower groups, and the slower groups skip more inputs. Formally, at each step $t$, the matrices of each group $i$ satisfying $(t \bmod T_i) = 0$ are activated and used to calculate the next state. Here the state weight matrices $U$ are divided into $g$ block-rows and each block-row is partitioned into $g$ block-columns; $U_{ij}$ denotes the sub-matrix in block-row $i$ and block-column $j$. The input weight matrices $W$ are divided into $g$ block-rows, and $W_i$ denotes the weights in block-row $i$.
Two modes can be used for state transition. In the slow to fast mode, states of faster groups consider previous slower states, thus the faster states incorporate information not only at the current speed but also information that is slower and more coarse. The intuition for the fast to slow mode is that when the slow mode is activated, it can take advantage of the information already encoded in the faster states. The two modes are illustrated in Figure 2. Empirically, we use the fast to slow mode in our model as it performed better in the preliminary experiments.
If $(t \bmod T_i) \neq 0$, the previous state of group $i$ is directly passed over to the next state, $h_t^{(i)} = h_{t-1}^{(i)}$.
Figure 3 illustrates the state iteration process. Note that not all previous modules are considered to calculate the next state at each step, thus fewer parameters will be used and the training will be more efficient.
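The multirate update can be illustrated with a small sketch. Here the exact gated computation is abstracted into a placeholder `update_fn`, and the clock periods (1, 2, 4) follow the $T_i = 2^{i-1}$ series; treat this as a minimal illustration rather than the paper's implementation:

```python
import numpy as np

def mgru_like_step(t, x, h_prev, update_fn, periods=(1, 2, 4)):
    """One clockwork-style multirate step (simplified sketch).

    The state is split into len(periods) equal groups; group i is
    recomputed only when t % T_i == 0, otherwise its previous value is
    passed over unchanged. In the fast-to-slow mode, a slow group reads
    the already-updated faster groups in addition to its own state.
    `update_fn` stands in for the gated GRU computation of one group."""
    groups = np.split(h_prev, len(periods))
    new_groups = []
    for i, T in enumerate(periods):
        if t % T == 0:
            # fast-to-slow: entries in new_groups are already updated
            context = np.concatenate(new_groups + [groups[i]])
            new_groups.append(update_fn(x, context, groups[i].shape[0]))
        else:
            new_groups.append(groups[i])  # state copied through unchanged
    return np.concatenate(new_groups)
```

Because slower groups skip most steps, fewer parameters are active per step, which is why training is more efficient than a monolithic GRU of the same size.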
Video sequences are highly correlated with their neighboring context clips. We use the idea of context reconstruction for video sequence modeling; similar methods have been successfully applied to language modeling and other language tasks [25, 20]. In the unsupervised training process, we follow the classic sequence-to-sequence (seq2seq) model
where an encoder encodes a sequence of inputs and passes its last state to the decoder for target sequence generation. In our scenario, the mGRU encoder takes frame-level features extracted from pre-trained convolutional models as inputs and generates a state at each step, which will be attended to by the decoders. The state of the last step of the encoder is passed to the decoder, i.e., the decoder state is initialized with $h_T$. Two decoders are used to predict the context sequences of the inputs, i.e., to reconstruct the frame-level representations of the previous sequence and the next sequence.
Decoder. We use the seq2seq model with the attention mechanism to model video temporal structures via context reconstruction. We denote by $X^{past}$ the previous sequence of the input sequence $X$, and by $X^{future}$ the next sequence. The decoder is a GRU conditioned on the encoder outputs and the last-step state of the encoder. We use the attention mechanism at each step to help the decoder decide which frames in the input sequence might be related to the next frame reconstruction. At step $t$, the decoder generates the prediction by calculating an attention-weighted context,

$\alpha_{t,i} = \mathrm{softmax}_i\!\left(w^\top \tanh(W_a s_{t-1} + U_a h_i)\right), \quad c_t = \sum_i \alpha_{t,i} h_i,$

where $\alpha_{t,i}$ is the normalized attention weight for encoder output $h_i$ and $c_t$ is the weighted average of the encoder outputs. We use two decoders that do not share parameters: one for the past sequence reconstruction and the other for the future sequence reconstruction (Figure 4). The decoders are trained to minimize the reconstruction loss of the two sequences,

$\mathcal{L} = \sum_t \ell\!\left(\hat{x}^{past}_t - x^{past}_t\right) + \sum_t \ell\!\left(\hat{x}^{future}_t - x^{future}_t\right).$
We choose the Huber loss $\ell$ for regression due to its robustness, following Girshick,

$\ell_\delta(a) = \begin{cases} \frac{1}{2} a^2, & |a| \le \delta, \\ \delta \left( |a| - \frac{1}{2}\delta \right), & \text{otherwise.} \end{cases}$

We fix $\delta$ to the same value in all experiments.
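The Huber loss itself is straightforward to write down (a sketch; the threshold `delta=1.0` here is an assumed default, the text only states that the threshold is fixed across experiments):

```python
import numpy as np

def huber(residual, delta=1.0):
    """Huber loss: quadratic for small residuals, linear for large ones,
    making the reconstruction objective robust to outlier frames.
    delta=1.0 is an assumed value, not taken from the paper."""
    a = np.abs(residual)
    return np.where(a <= delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))
```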
For the past reconstruction, we reverse the input order as well as the target order to minimize information lag. The two decoders are trained with the encoder via backpropagation, and we regularize the network by randomly dropping one decoder for each batch. As we have two decoders in our model, each decoder has a probability of 0.5 of being chosen for training (Figure 4).
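The per-step attention used by both decoders can be sketched as follows (a minimal additive-attention sketch; the exact score function, here $w^\top \tanh(W_a s_{t-1} + U_a h_i)$, is our assumption):

```python
import numpy as np

def attention_context(s_prev, enc_outputs, Wa, Ua, w):
    """Score each encoder output against the previous decoder state,
    normalize with a softmax, and return the weighted average c_t of
    the encoder outputs plus the attention weights."""
    scores = np.array([w @ np.tanh(Wa @ s_prev + Ua @ h) for h in enc_outputs])
    alpha = np.exp(scores - scores.max())          # numerically stable softmax
    alpha = alpha / alpha.sum()                    # normalized attention weights
    c = (alpha[:, None] * enc_outputs).sum(axis=0) # weighted average c_t
    return c, alpha
```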
During unsupervised training, we uniformly sample video frames and extract frame-level features from convolutional models. We set the sequence length to 30, i.e., the encoder takes 30 frames as inputs, while the decoders reconstruct the previous 30 frames and the next 30 frames. We randomly sample a temporal window of 90 consecutive frames (3 segments) during training. If the video length is less than 90 frames, we pad zeros for each segment.
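The segment sampling described above can be sketched as follows (`seq_len=30` matches the sequence length reported in the classification setup; the exact padding scheme is our assumption):

```python
import numpy as np

def sample_training_window(features, seq_len=30):
    """Sample one training example: 3 consecutive segments of seq_len
    frames (past / present / future) from a (num_frames, dim) feature
    matrix; frames beyond the video length are zero-padded."""
    n, d = features.shape
    window = 3 * seq_len
    start = np.random.randint(0, max(1, n - window + 1))
    clip = features[start:start + window]
    if clip.shape[0] < window:                      # video shorter than window
        clip = np.vstack([clip, np.zeros((window - clip.shape[0], d))])
    past, present, future = np.split(clip, 3)
    return past, present, future
```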
We validate the unsupervised learned features on the task of complex event detection. We choose the TRECVID Multimedia Event Detection (MED) task as it is more dynamic and complex than the action recognition task, in which the target action duration is short and usually lasts only seconds. As the features from unsupervised training are not discriminative, i.e., label information has not been used during training, we further train the encoder for video classification. We use the mGRU encoder to encode the video frames and take the last hidden state of the encoder for classification. We do not apply losses at each step, as is done in previous LSTM models, because the video data in our task is untrimmed and therefore noisier and more redundant. We use the network structure of FC(1024)-ReLU-Dropout(0.5)-FC(1024)-ReLU-Dropout(0.5)-FC(class_num+1)-Softmax. Since there are background videos which do not belong to any target event, we add another class for these videos.
During supervised training, we first initialize the weights of the encoder with the weights pre-trained via unsupervised context reconstruction. For each batch, instead of uniformly sampling videos from the training set, we keep a fixed ratio between the numbers of positive and background videos. We bias the mini-batch sampling because of the imbalance between positive and negative examples.
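The biased mini-batch sampling can be sketched as follows (the 50/50 positive fraction is an assumed value; the text only states that a fixed ratio is kept):

```python
import random

def biased_batches(positives, backgrounds, batch_size=32, pos_fraction=0.5):
    """Yield mini-batches that keep a fixed positive-to-background ratio
    instead of sampling uniformly from the imbalanced training set."""
    n_pos = int(batch_size * pos_fraction)
    while True:
        batch = (random.sample(positives, n_pos) +
                 random.sample(backgrounds, batch_size - n_pos))
        random.shuffle(batch)  # avoid ordering by class within the batch
        yield batch
```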
During inference, the encoder generates multirate states at each step, and there are several ways to pool the states to obtain a global video representation. One simple approach is to average the outputs; the obtained global video representation is then classified with a linear SVM. The other way is to encode the outputs with an encoding method. Xu et al.
found that Vector of Locally Aggregated Descriptors (VLAD) encoding outperforms average pooling and Fisher Vectors  over ConvNets activations by a large margin on the MED task. We thus apply the VLAD encoding method to encode the RNN representations.
Given inputs $\{x_1, \dots, x_N\}$ and centers $\{c_1, \dots, c_K\}$ calculated by the k-means algorithm on sampled inputs, for each center $c_k$ we have

$v_k = \sum_{x_i : \mathrm{NN}(x_i) = c_k} (x_i - c_k),$

where $x_i$ is assigned to center $c_k$ if $c_k$ is its nearest center. Concatenating $v_k$ over all centers, we obtain a feature vector of size $K \times d$, where $d$ is the dimension of $x_i$. Normalization methods are used to improve the encoding performance. Power normalization, often signed square rooting (SSR), converts each element $z$ into $\mathrm{sign}(z)\sqrt{|z|}$. The intra-normalization method $\ell_2$-normalizes the representation of each center, followed by $\ell_2$ normalization of the whole feature vector. The final normalized representation is classified with a linear SVM.
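The VLAD pipeline with SSR and intra-normalization can be sketched as follows (a minimal sketch; the PCA step and the linear SVM are omitted):

```python
import numpy as np

def vlad_encode(X, centers):
    """VLAD encoding: accumulate residuals to the nearest center, apply
    signed square rooting, l2-normalize per center, then l2-normalize
    the concatenated vector."""
    K, d = centers.shape
    # hard-assign each input to its nearest center
    assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    V = np.zeros((K, d))
    for k in range(K):
        if np.any(assign == k):
            V[k] = (X[assign == k] - centers[k]).sum(axis=0)  # residual sum
    V = np.sign(V) * np.sqrt(np.abs(V))              # signed square rooting
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    V = np.where(norms > 0, V / norms, V)            # intra-normalization
    v = V.ravel()                                    # K*d feature vector
    n = np.linalg.norm(v)
    return v / n if n > 0 else v                     # final l2 normalization
```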
Note that the states in mGRU are divided into three groups; we thus encode the state at each of the three scales independently and combine the three scores by average fusion.
We also demonstrate the generalization ability of our proposed video representation on the video captioning task. In video captioning, an encoder is used to encode video representations and a decoder is used to generate video descriptions. We follow the basic captioning decoding process. Given a video sequence $X$ and a description sequence $Y = (y_1, \dots, y_L)$, where each word is represented by a one-hot vector and a one-of-$V$ ($V$ is the vocabulary size) embedding is used at the decoder input to represent a discrete word with a continuous vector, the overall objective is to maximize the log-likelihood of the generated sequence,

$\max \sum_{t=1}^{L} \log p(y_t \mid y_{<t}, X).$

Softmax activation is used on the decoder output to obtain the probability of word $y_t$. The attention mechanism (Eq. 6) is used in both the input and output of the decoder.
We show the results of our experiments on complex event detection and video captioning tasks. We implement our model using the TensorFlow framework. Source code and trained models will be released upon acceptance.
We collect approximately 220,000 videos without label information from TRECVID MED data, which excludes videos in MEDTest-13 and MEDTest-14, for unsupervised training. The average length of the collected videos is 130 seconds with a total duration of more than 8,000 hours.
We use the challenging MED datasets with labels, namely TRECVID MEDTest-13 100Ex and TRECVID MEDTest-14 100Ex, for video classification (development data has not been updated for the TRECVID MED 15 and MED 16 competitions). There are 20 events in each dataset, 10 of which overlap. Each dataset consists of approximately 100 positive exemplars per event in the training set, plus 5,000 negative exemplars. The testing set contains about 23,000 videos, and the total duration of each collection is approximately 1,240 hours; the average video length is 120 seconds. These videos are temporally untrimmed YouTube videos of various resolutions and quality. We use mean Average Precision (mAP) as the performance metric, following the NIST standard [1, 2].
For both unsupervised training and classification, we uniformly sample video frames at a rate of 1 FPS and extract features for each frame from GoogLeNet with Batch Normalization, pre-trained on ImageNet. Following standard image preprocessing procedures, the shorter edges of frames are rescaled to 256 and we crop each frame to 224×224. We use activations after the last pooling layer, obtaining representations of length 1,024. There are 20 classes in the MEDTest-13 and MEDTest-14 datasets; with the background class, we thus have 21 classes in total. In the training stage, we set the sequence length to 30 and pad zeros if the video has fewer than 30 frames. During inference, we take the whole video as input and use 150 steps.
Training details. We use the following settings in all experiments unless otherwise stated. The model is optimized with ADAM with a fixed learning rate, and we clip the global gradients at norm 10. We use a single RNN layer for both the encoder and the decoder, and the cell size is set to 1,024. We set the attention size to 50 and regularize the network by applying Dropout to the input and output layers. We also add Dropout where the decoder copies the state from the encoder, and all dropout probabilities are set to 0.5. Weights are initialized with Glorot uniform initialization, and weight decay is applied for regularization.
In the supervised training, we initialize the weights of the encoder using the learned weights during unsupervised learning, and the same sequence length is used as in the unsupervised training stage.
Average pooling. For the GoogLeNet baseline, we average frame-level features and use a linear SVM for classification. For our model, we first train an unsupervised encoder-decoder model with mGRU and fine-tune the encoder with label information. To make a fair comparison with the GoogLeNet baseline, we extract the outputs of the mGRU encoder at each step and average them to obtain a global representation for classification. Note that both feature representations have the same dimension, and we empirically use the same regularization parameter for both linear classifiers. The result in Table 1 shows that our model with temporal structure learning is able to encode valuable temporal information for classification.
VLAD Encoding. We now show that VLAD encoding is useful for aggregating RNN representations. We compare our method with GoogLeNet features using VLAD encoding. Following prior work, we set the number of k-means centers to 256 and the PCA dimension to 256. Three scales are learned at each step in our mGRU model. We divide the state into three segments, and each sub-state is individually aggregated by VLAD. Note that each encoded representation has the same feature vector length as for the GoogLeNet model, and we use late fusion to combine the scores of the three scales. The results in Table 2 show that our mGRU model outperforms GoogLeNet features when encoded by VLAD, and that VLAD encoding outperforms average pooling for RNN representations. Our model also achieves state-of-the-art performance on the MEDTest-13 and MEDTest-14 100Ex datasets.
We compare several variants in the unsupervised training and show the contribution of the different components. The results are shown in Table 3. We obtain features from the unsupervised model by extracting states from the encoder at each step, which are then averaged to obtain a global video representation. The results show that the representation learned from unsupervised training, without any discriminative information, also achieves good results.
Attention. We compare our model with a model without the attention mechanism, where temporal attention is not used and the decoder is forced to perform reconstruction based only on the last encoder state, i.e., “mGRU w/o attention” in Table 3. The results show that the attention mechanism is important for learning good video representations and also helps the learning process of the encoder.
Context. In a model without context reconstruction, i.e., only one decoder is used (autoencoder), neither past nor future context information is considered, i.e., “mGRU w/o context” in Table 3. The results show that with context prediction, the encoder has to consider temporal information around the video clip, which models the temporal structures in a better way.
Multirate. We also show the benefit of using mGRU by comparing it with the basic GRU, i.e., “mGRU w/o multirate” in Table 3. Note that the mGRU model has fewer parameters but better performance. It shows that an mGRU that encodes multirate video information is capable of learning better representations from long, noisy sequences.
Pre-training. We now show the advantages of the unsupervised pre-training process by comparing an encoder with random initialization with the same encoder whose weights are initialized by the unsupervised model. The result is shown in Table 4 and demonstrates that the unsupervised training process is beneficial to video classification. It incorporates context information in the encoder, which is an important cue for the video classification task.
Table 3: Ablation results (mAP %, MEDTest-13 / MEDTest-14):
mGRU w/o attention: 32.7 / 27.5
mGRU w/o context: 37.1 / 30.1
mGRU w/o multirate: 36.5 / 29.3
We compare our model with other models; the results are shown in Table 5. Our single model achieves state-of-the-art performance in both the MEDTest-13 and MEDTest-14 100Ex settings compared with other single models. We report the C3D result using the pre-trained model with the length of the input short clip set to 16 frames; features are averaged across clips and classified with a linear SVM. Our model with VLAD encoding outperforms previous state-of-the-art results by 4.2% on MEDTest-13 100Ex and 1.6% on MEDTest-14 100Ex.
Table 5: Baselines (mAP %, MEDTest-13 / MEDTest-14):
iDT + FV: 34.0 / 27.6
iDT + skip + FV: 36.3 / 29.0
VGG + RBF: - / 35.0
VGG16 + VLAD: - / 33.2
We now validate our model on the video captioning task. Our single model outperforms previous state-of-the-art single models across all metrics.
We use the YouTube2Text video corpus to evaluate our model on the video captioning task. The dataset has 1,970 video clips with an average duration of 9 seconds. The original dataset contains multi-lingual descriptions covering various domains, e.g., sports, music, and animals. Following prior work, we use English descriptions only and split the dataset into training, validation, and testing sets containing 1,200, 100, and 670 video clips respectively. In this setting, there are 80,839 descriptions in total, with about 41 sentences per video clip. The vocabulary size we use is 12,596.
We evaluate the performance of our method on the test set using the publicly available evaluation script, and the results are returned by the evaluation server. We report BLEU, METEOR and CIDEr scores for comparison with other models. We stick to a single rule during model selection: we choose the model with the highest METEOR score on the validation set.
The videos in the YouTube2Text dataset are short, thus we uniformly sample frames at a higher frame rate of 15 FPS. The sequence length is set to 50, and we otherwise use the same hyper-parameters as in the previous experiment. We use two different convolutional features for the video captioning task, i.e., GoogLeNet features and ResNet-200 features. We use beam search during decoding by default and set the beam size to 5 in all experiments. The attention size is set to 100 empirically.
We first use GoogLeNet features; the result is shown in Table 6. We compare our mGRU with GRU, which shows that mGRU outperforms GRU on all metrics except BLEU@1, where the difference is only 0.04%. We initialize the mGRU encoder via unsupervised context learning, and the result shows that with good initialization, performance improves by more than 1.0% on the BLEU and CIDEr scores and by 0.6% on the METEOR score compared with random initialization. We also utilize the recent ResNet-200 network as the convolutional model, using the pre-trained model and following the same image preprocessing method. The result of using ResNet-200 is shown in Table 7 and demonstrates that our MVRM method not only works better than GRU on different tasks, but also works better with different convolutional models. Additionally, all metrics improve with the ResNet-200 network.
We compare our methods with other models on the YouTube2Text dataset; results are shown in Table 8. "S2VT" is the first model to use a general encoder-decoder model for video captioning. "Temporal Attention" uses the temporal attention mechanism on the video frames to obtain better results. "Bi-GRU-RCN" uses a ConvGRU to encode activations from different convolutional layers. "LSTM-E" uses embedding layers to jointly project visual and text features. Our MVRM method has performance similar to the strongest of these, but with the pre-training stage we outperform it in all metrics. Some methods fuse additional motion features such as C3D features; e.g., Pan et al. obtained 33.9% on METEOR after combining multiple features. With ResNet-200, we obtain 34.45% on METEOR.
Table 8: Comparison (BLEU@4 / METEOR / CIDEr):
Temporal attention: 41.92 / 29.60 / 51.67
GoogLeNet + Bi-GRU-RCN: 48.42 / 31.70 / 65.38
GoogLeNet + Bi-GRU-RCN: 43.26 / 31.60 / 68.01
GoogLeNet + HRNE + attention: 43.80 / 33.10 / -
In this paper, we propose a Multirate Visual Recurrent Model to learn multirate representations for videos. We model the video temporal structure via context reconstruction, and show that unsupervised training is important for learning good representations for both video classification and video captioning. The proposed method achieves state-of-the-art performance on two tasks. In the future, we will investigate the generality of the video representation in other challenging tasks, e.g., video temporal localization and video question answering [53, 40].
TensorFlow: A system for large-scale machine learning. In OSDI, 2016.
Efficient estimation of word representations in vector space. In ICLR, 2013.