Modeling Multimodal Clues in a Hybrid Deep Learning Framework for Video Classification

06/14/2017 ∙ by Yu-Gang Jiang, et al. ∙ Columbia University

Videos are inherently multimodal. This paper studies the problem of how to fully exploit the abundant multimodal clues for improved video categorization. We introduce a hybrid deep learning framework that integrates useful clues from multiple modalities, including static spatial appearance information, motion patterns within a short time window, audio information as well as long-range temporal dynamics. More specifically, we utilize three Convolutional Neural Networks (CNNs) operating on appearance, motion and audio signals to extract their corresponding features. We then employ a feature fusion network to derive a unified representation with an aim to capture the relationships among features. Furthermore, to exploit the long-range temporal dynamics in videos, we apply two Long Short Term Memory networks with extracted appearance and motion features as inputs. Finally, we also propose to refine the prediction scores by leveraging contextual relationships among video semantics. The hybrid deep learning framework is able to exploit a comprehensive set of multimodal features for video classification. Through an extensive set of experiments, we demonstrate that (1) LSTM networks which model sequences in an explicitly recurrent manner are highly complementary with CNN models; (2) the feature fusion network which produces a fused representation through modeling feature relationships outperforms alternative fusion strategies; (3) the semantic context of video classes can help further refine the predictions for improved performance. Experimental results on two challenging benchmarks, the UCF-101 and the Columbia Consumer Videos (CCV), provide strong quantitative evidence that our framework achieves promising results: 93.1% on the UCF-101 and 84.5% on the CCV, outperforming competing methods with clear margins.




I Introduction

Classifying videos based on content semantics has been a hot research topic in multimedia for over a decade. Related techniques can be deployed in a multitude of applications such as video indexing, retrieval, advertising, etc. The key enabling factors behind the significant technical progress in recent years are discriminative and robust feature representations that can not only withstand large intra-class variations but also effectively differentiate multiple classes. Some popular feature descriptors such as SIFT [32] and HOG [5] model spatial clues like texture, while others such as HOF [6] and trajectory features [53, 19, 40], focus on motion information, a fundamental nature of video depicting movements of objects among adjacent frames. Recently, deep neural networks, especially Convolutional Neural Networks (CNNs), have demonstrated great potentials for deriving robust features from raw data on a variety of tasks, including image classification [27], object detection [10], speech recognition [11], etc. Researchers have also attempted to apply deep learning techniques to the video domain. For instance, a straightforward extension is to stack multiple frames over time as inputs to CNNs for spatial-temporal feature learning [24, 16, 50]. Different from these works, Simonyan et al. [41] disentangled video feature learning with two independent CNNs operating on RGB frames and stacked optical flow images to capture spatial and motion information, respectively. Final predictions are derived by linear combination of scores from the two CNNs and the results are competitive with state-of-the-art trajectory features [53].

However, these works merely focus on appearance and motion information in videos, ignoring the abundant long-range temporal clues therein because the training of CNNs totally neglects the order of inputs (i.e., RGB frames or stacked optical flow images). In addition, the motion CNN can only account for object movements within very short time periods. We believe this is not satisfactory for understanding video contents since different segments of videos usually correspond to different states of actions/events and their temporal order can assist recognition. For example, a “celebrating birthday” event could start with “making a wish”, followed by “blowing out candles”, and finally ends with “eating cakes”. Moreover, audio signal is an indispensable component of video data, providing complementary clues to visual information. In the case of a “celebrating birthday” event, a birthday song is typically associated with the video.

Further, video semantics usually do not occur in isolation, and recognizing a class of interest could benefit from its semantic contextual relationships. For example, similar human motion patterns can be observed in “running” and “playing tennis”, and the likelihood of a video containing “running” could potentially help recognize “playing tennis”. These useful clues are either overlooked or modeled with complicated models that are infeasible to scale up in most existing works.

Fig. 1: The pipeline of the proposed hybrid deep learning framework. For a video clip, we first extract spatial, motion and audio features with three CNNs operating on video frames, stacked optical flow images and audio signals respectively. To capture long-range temporal dynamics in videos, we leverage two LSTM models with inputs of the extracted spatial and motion features. Further, we also utilize a feature fusion network to integrate multiple features into a unified representation to perform classification with carefully designed regularizations aiming to exploit feature relationships. Finally, we combine the outputs of the LSTM models with feature fusion network with refinement to generate final prediction scores. See texts for more discussions.

To mitigate these limitations, we propose a hybrid deep learning framework for video categorization that is designed to explore the abundant multimodal clues embedded in videos, including static spatial appearance, motion patterns, audio information and long-range temporal coherence, as well as the contextual relationships among video semantics. Motivated by the great success of Recurrent Neural Networks (RNNs) for sequence modeling tasks [11, 12], we leverage Long Short Term Memory (LSTM), a variant of RNNs with memory units and different functional gates, to account for temporal information. Furthermore, different from existing methods that integrate features in a straightforward and heuristic way by either feature concatenation or score averaging, we are interested in exploring the feature correlations. To this end, we apply a deep neural network with carefully designed regularizations [22] to integrate the extracted static appearance, short-term motion and audio features. Then we combine the predictions from this network with the outputs of the LSTMs. Finally, we refine the prediction scores in consideration of contextual relationships among video semantics in a simple yet effective manner.

The framework is illustrated in Figure 1. In particular, we first compute spatial appearance, short-term motion (based on stacked optical flow images) and audio features with CNN models. The spatial and motion features are further utilized as inputs of LSTMs to capture the long-range temporal clues. Then, a feature fusion network takes the video-level features (spatial, motion and audio) to derive a unified representation for predicting video semantics. The outputs of the feature fusion network are further combined with scores from the LSTMs and then refined by taking advantage of the contextual relationships of video semantics. The main contributions are summarized as follows:

  • We fully exploit a variety of multimodal clues in a hybrid deep learning framework for improved video categorization, including static spatial appearance, motion and audio information, long-range temporal coherence and contextual relationships among video semantics.

  • We demonstrate that the LSTMs, which model the long-range temporal information in video sequences in an explicitly recurrent manner, are highly complementary with CNNs.

  • We resort to the rich contextual relationships among video semantics in a simple yet effective way to further refine predictions for improved performance.

  • We conduct experiments on two challenging benchmarks, and the experimental results provide strong quantitative evidence that our framework achieves promising results, outperforming competing methods with clear margins.

This work extends from a conference paper [59] by incorporating audio and semantic contextual relationships in the hybrid framework. New experiments are conducted to verify the effectiveness of the technical extensions and extra amplified discussions are provided throughout the paper. The remaining sections are organized as follows. We first review related works in Section II and elaborate the proposed hybrid deep learning framework in Section III. We then present and discuss the experimental results and comparisons in Section IV. Finally, Section V concludes this paper.

II Related Works

We divide the discussions of related works into the following five subsections.

II-A Hand-crafted Features

There is a large body of literature on video classification in the multimedia community (see [18] for a survey). Among these works, designing powerful feature representations is an important topic due to the significant role of features in a typical video recognition pipeline. The success of image descriptors like SIFT and HOG has spurred the development of video representations that consider the temporal dimension of videos. For example, the Harris corner detector is extended into 3D volumes to identify space-time interest points [30]. Similarly, based on HOG features, 3D spatial-temporal gradients are derived as local descriptors for action recognition [25]. Wang et al. proposed to track densely sampled local patches over time in an optical flow field to compute dense trajectory features, which achieved superior performance on a variety of benchmarks when coupled with quantization techniques like Bag-of-Words and Fisher Vector [37, 13]. However, these video representations focus on modeling local motion patterns within short time periods, and the feature encoding methods, while powerful, totally discard the temporal information of videos.

II-B CNN Representations

Different from hand-crafted features, recent advances on CNNs in the image [27, 10] and speech [11] domains have encouraged works to learn features directly from raw video data. The most straightforward way to utilize CNNs on video data is to stack frames as inputs with the aim of learning spatial-temporal features using 3D convolutions [16, 24, 50]. However, these works demonstrate worse performance than state-of-the-art trajectory features [53]. This might result from the difficulty of learning 3D features with insufficient training data. To effectively model 3D signals, Simonyan et al. proposed to utilize two independent CNNs, operating on RGB frames and stacked optical flow images, to capture spatial and motion information separately. Based on this approach, Wang et al. proposed to learn the transformation between two states triggered by actions [56]. Feichtenhofer et al. experimented with different fusion approaches to combine spatial and temporal features [9]. During the training process of these CNNs, the temporal order of frames and stacked optical flow images is discarded and thus the temporal structures of videos are ignored.

II-C Temporal Information

Graphical models, including Conditional Random Fields (CRF), Hidden Markov Models (HMM), etc., have been widely adopted to capture long-term temporal structures [51, 57, 49]. For example, Tang et al. proposed a variable duration HMM to model state changes in videos [49]. Instead of using graphical models, Fernando et al. utilized a ranking machine to account for the temporal order of frames. Wang et al. proposed the temporal segment networks, which used a consensus function to combine segment scores generated by two-stream networks [55].

Many works resort to LSTM to capture temporal dynamics in videos due to its great success in sequential modeling tasks like speech recognition [11] and video captioning [47]. Srivastava et al. proposed to learn video features using an auto-encoder framework [45] based on LSTMs. Donahue et al. utilized two LSTM models using spatial and motion features extracted from CNN models [8]. Ng et al. further deepened LSTM to five layers and experimented with several pooling strategies [35]. Our work leverages LSTMs for temporal modeling to explicitly complement the limitation of the frame-based CNN models.

II-D Feature Fusion

Extensive works have been conducted on the fusion of multiple features, the complementarity of which is expected to promote classification accuracy. There are two popular fusion strategies, i.e., early fusion and late fusion, performed at the feature level and the classification score level, respectively [43, 61]. Typically, early fusion integrates features by direct concatenation [53] or linear combination of their kernels [65] before classification. In addition, Multiple Kernel Learning (MKL) can also be applied to combine feature kernels, where the weights are automatically learned. Late fusion, on the other hand, combines prediction scores from multiple classifiers, each of which is independently trained with a single feature [63, 31]. Both fusion methods are popular due to their simplicity. However, they assume the features or prediction scores are explicitly complementary to one another and fail to consider potential hidden correlations among features. Recently, Srivastava et al. utilized Deep Boltzmann Machines (DBM) to derive an embedding of images and texts [45] and Ngiam et al. used a deep auto-encoder to learn the relationships between different modalities [36]. Wu et al. proposed to explore feature and class relationships [58] by imposing trace norms. In this work, to alleviate computational complexity, we adopt a regularized neural network to automatically learn dimension-wise correlations of features extracted from state-of-the-art CNN models.

II-E Contextual Relationships

As aforementioned, the co-occurrence of video semantics, serving as context, can provide useful information. For example, Rabinovich et al. proposed to incorporate semantic context information with a CRF model [38]. Jiang et al. modeled the class relationships with a semantic diffusion algorithm [21]. Deng et al. leveraged a graphical model to encode label hierarchies for improved image classification performance [7]. Wu et al. proposed to capture the relationships of video semantics by regularizing the classification process [58]. Chen et al. utilized a confusion matrix to predict the context of a category when training CNNs [3]. In our paper, we propose to utilize the confusion matrix derived from trained models as contextual relationships to refine the prediction scores as a post-processing step. Therefore, the recognition of a class of interest can benefit from its related classes.

III Methodology

We now elaborate the proposed hybrid deep learning framework illustrated in Figure 1. We first introduce the multimodal features extracted by CNN models, and then present the modeling of temporal dynamics in videos with LSTM models. Then we describe the feature fusion framework, which is designed to model feature correlations. Finally, we introduce the contextual refinement.

III-A Spatial, Motion and Audio CNN Features

CNN models usually contain alternating convolutional and pooling layers to learn features from input images, followed by fully-connected (FC) layers for classification. In our framework, we first compute spatial and motion features based upon the two-stream approach [41], where two independent CNNs are trained with RGB frames and stacked optical flow images, respectively. More concretely, the spatial stream models static appearance information like texture from sampled video frames as in conventional CNNs for image classification. The motion CNN takes stacked optical flow images as inputs to capture object movements within a short time window. Optical flow is an explicit form of motion patterns derived by computing displacement vector fields between two adjacent frames, whose horizontal and vertical components are then used to generate two images. Multiple optical flow images are further stacked to represent motion information in a short period, upon which convolution is performed. Given a video at testing phase, each stream averages prediction scores produced by soft-max layer from 25 uniformly sampled frames (or stacked optical flow images) and the scores from the two streams are further linearly combined as the final prediction. In our work, we compute the outputs from the first FC layer of two CNNs, which are observed to be effective in many tasks [39], as the spatial and motion features to model long-term temporal structures and explore their correlations for improved performance.
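The test-time scoring of the two-stream approach described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy logits, and the combination weight `w_spatial` are assumptions introduced here, and only three sampled frames are used instead of 25.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def video_score(frame_logits):
    """Average the per-frame softmax scores into one video-level score vector."""
    probs = [softmax(row) for row in frame_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def fuse_streams(spatial, motion, w_spatial=0.5):
    """Linearly combine the two streams; w_spatial is a hypothetical weight."""
    return [w_spatial * s + (1.0 - w_spatial) * m for s, m in zip(spatial, motion)]

# Toy example: 3 sampled frames (or flow stacks) and 2 classes per stream.
spatial_logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
motion_logits = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
final = fuse_streams(video_score(spatial_logits), video_score(motion_logits))
predicted = max(range(len(final)), key=final.__getitem__)
```

Because each stream's video-level score is an average of probability vectors, the fused vector still sums to one when the two weights sum to one.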

In addition, we also utilize a CNN model to capture the acoustic information in videos as a complement to visual information. Particularly, we convert the 1D soundtrack extracted from a video clip into a 2D spectrogram image with the Short-Time Fourier Transform, which shows how the frequency content changes over time. Then, inspired by [52], we take the spectrograms as inputs to a CNN to capture the acoustic clues.
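To make the soundtrack-to-spectrogram conversion concrete, here is a small pure-Python sketch of a magnitude Short-Time Fourier Transform. The frame length, hop size and Hann window are assumptions chosen for a toy signal; the text does not state the parameters actually used, and a practical pipeline would rely on an FFT library rather than this naive DFT.

```python
import cmath
import math

def stft_magnitude(signal, frame_len=8, hop=4):
    """Return one list of magnitudes (bins 0..frame_len//2) per analysis frame."""
    # Periodic Hann window (an assumed choice).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = []
        for k in range(frame_len // 2 + 1):
            # Naive DFT coefficient for frequency bin k.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            bins.append(abs(coeff))
        spectrogram.append(bins)
    return spectrogram

# Toy soundtrack: a pure tone with a period of 4 samples.
tone = [math.sin(2 * math.pi * n / 4) for n in range(32)]
spec = stft_magnitude(tone)
# Index of the strongest frequency bin in each time frame.
peak_bins = [max(range(len(f)), key=f.__getitem__) for f in spec]
```

Each row of `spec` is one time step and each column one frequency bin, so the nested list can be treated as a 2D image and fed to a CNN.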

III-B Temporal Modeling with LSTM

As aforementioned, the two-stream approach focuses only on appearance and short-time motion information, which ignores the long-term temporal dynamics in videos. Therefore, we employ the LSTM model due to its great success in sequential modeling tasks [11, 8, 62]. Compared with conventional RNN models that map input data recursively to outputs through hidden states, an LSTM additionally incorporates a memory cell with multiple gates governing information into and out of the cell, enabling it to model long sequences without suffering from the “vanishing gradients” effect.

Fig. 2: An illustration of an LSTM unit.

Formally, an LSTM takes a sequence $(x_1, x_2, \ldots, x_T)$ as input and maps it to an output sequence by recursively computing the activations of its units from $t = 1$ to $t = T$.

Here, at the $t$-th time step, we denote the input features as $x_t$ and the hidden states as $h_t$, and $c_t$ represents the contents of the memory unit. The activations of the input, forget and output gates are represented as $i_t$, $f_t$ and $o_t$, respectively. $W_{ab}$ represents the transition weights from component $a$ to component $b$, and $b_b$ is the corresponding bias term. In addition, $\sigma$ is the non-linear sigmoid function. We present the structure of an LSTM unit in Figure 2.
The memory cell, regulated by different non-linear gates, enables the LSTM model to store information progressively. More concretely, at the $t$-th time step, the current feature representation together with information from the past are fed into all gates and the memory cell. Past information stored in the memory cell, regulated by the activations of the forget gate, is linearly combined with the squashed inputs multiplied by the activation of the input gate to generate the current “memory”. This allows the LSTM model to learn when to utilize current information and when to forget previous contents. Furthermore, the information that will be used for future states is regulated by the output gate $o_t$. The interactions between the memory units and these multiplicative gates allow LSTM to capture the temporal dynamics in long sequences, making it a natural fit for video classification.
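The gating behaviour described above can be sketched in a few lines of pure Python. This is a standard LSTM step without peephole connections, which is an assumption: the exact variant used in the framework is not recoverable from the text here. The parameter layout (`params` as a dict of `(W_x, W_h, b)` triples) and the toy weights are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(W_x, W_h, b, x, h):
    """Compute W_x @ x + W_h @ h + b for list-based matrices."""
    return [
        sum(wx * xi for wx, xi in zip(row_x, x))
        + sum(wh * hi for wh, hi in zip(row_h, h))
        + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One time step of a standard LSTM (no peepholes; an assumed variant)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(*params["c"], x, h_prev)]  # squashed input
    # Old memory scaled by the forget gate plus new input scaled by the input gate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def run_lstm(params, sequence, hidden_size):
    """Run over a feature sequence and return the hidden state of the last step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

# Toy parameters: input dimension 2, hidden size 1.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "c": ([[1.0, 0.0]], [[0.0]], [0.0])}
h_last = run_lstm(params, [[1.0, 0.0], [1.0, 0.0]], hidden_size=1)
```

In a classifier, a softmax layer would sit on top of the returned hidden state; here the sketch stops at the hidden state to keep the recurrence itself in focus.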

One can also stack hidden states to deepen the LSTM model, aiming to increase its discriminative power. A softmax layer can then be applied on top of the hidden states to obtain the prediction scores at each time step. The training of LSTM is usually conducted with stochastic gradient descent using the Back-Propagation Through Time (BPTT) algorithm.
In our framework (illustrated in Figure 1), we model the temporal information in videos with two LSTMs, operating on a spatial feature sequence and a motion feature sequence, respectively. Once trained, the two LSTM models produce two sets of predictions, one for the spatial stream and one for the motion stream. We take the prediction from the last time step of a sequence as the score for the entire video, because it contains information from all previous steps.

III-C Regularized Feature Fusion Network

The spatial, motion and audio features characterize the same video from different perspectives (i.e., person-related static appearance information, body motions and sound), and thus certain correlations between these features might exist. We posit that an ideal unified representation is expected to contain information shared by multiple features as well as the special aspect of each feature. This requires modeling feature relationships explicitly instead of using uniform fusion approaches. To this end, we utilize a regularized feature fusion network [22] to fully exploit feature relationships (see Figure 1). Given a video clip, we compute its video-level appearance and motion features by simply averaging descriptors from all frames. The spatial, motion and audio features are separately transformed into a higher space with one hidden layer. We then apply one hidden layer to absorb all the features to derive a unified representation, regularized by carefully designed norms to explore feature relationships.

We represent each training video as a 4-tuple consisting of its video-level spatial descriptor, its video-level motion descriptor, the audio feature derived from the audio CNN, and the corresponding ground-truth label. We first consider the training of a neural network with a single feature as input. To learn the optimal weights of the model, we minimize an objective function, Equation (1), with two terms. The first term is the empirical loss over all videos in the training set, and the second term is a penalty on the weight matrices that prevents over-fitting or enforces sparsity, depending on the choice of norm.

We now introduce the fusion of multiple features in a regularized framework. Given three types of features, we first perform feature transformation independently and then integrate them to derive a fused representation. In the fusion process, we impose a structural norm to explore the relations of the features; the resulting optimization problem is Equation (2). The structural norm is applied to the weights of the fusion layer, which stack the weights of the three features, and the size of this stacked matrix depends on the dimension of the unified feature representation.

Compared with Equation (1), an $\ell_{2,1}$ norm is appended to regularize the fusion layer, aiming to exploit feature relationships. For a matrix $W$ with entries $w_{ij}$, the $\ell_{2,1}$ norm is defined as $\|W\|_{2,1} = \sum_i \sqrt{\sum_j w_{ij}^2}$. It first computes the $\ell_2$ norm of each row (the weights of the three features) and then the $\ell_1$ norm of the resulting vector, which forces the matrix to be row sparse and produces similar zero/nonzero patterns for the columns. In other words, the norm is minimized when there are only a few non-zero rows in the weight matrix, which serve as the shared discriminative information of these features.
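A short numerical check helps show why this kind of norm favours row sparsity. The sketch below assumes the structural norm is the standard row-wise $\ell_{2,1}$ norm and compares it with the plain elementwise $\ell_1$ norm on two toy matrices invented for illustration.

```python
import math

def l21_norm(W):
    """Sum over rows of the Euclidean (l2) norm of each row."""
    return sum(math.sqrt(sum(w * w for w in row)) for row in W)

def l1_norm(W):
    """Sum of absolute values of all entries."""
    return sum(abs(w) for row in W for w in row)

# Same entries, arranged differently.
row_sparse = [[3.0, 4.0], [0.0, 0.0]]  # both columns share one active row
scattered = [[3.0, 0.0], [0.0, 4.0]]   # each column uses a different row

shared_cost = l21_norm(row_sparse)
scattered_cost = l21_norm(scattered)
```

The two matrices have the same elementwise $\ell_1$ norm, but the row-wise norm is smaller when the non-zero weights are concentrated in a shared row, which is the pattern the regularizer rewards.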

As aforementioned, we posit that a good unified representation should be derived without loss of information from the original features, which requires the fusion process not only to leverage feature correlations but also to preserve the special information of each feature. As such, we additionally regularize the fusion process with an $\ell_1$ norm, which extends Equation (2) into Equation (3). The $\ell_1$ regularizer complements the $\ell_{2,1}$ norm and makes it more robust by preventing it from sharing incorrect information, which enables different features to select different neurons (i.e., the unique information of these features).
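For orientation, one plausible written-out form of the fully regularized objective is sketched below. All symbols are assumptions introduced here for illustration and may differ from the authors' notation: $N$ training videos, a loss $\ell$, a network function $g$ applied to the spatial, motion and audio features $(x_n^s, x_n^m, x_n^a)$ with label $y_n$, per-layer weights $W_l$ for $L$ layers, the stacked fusion-layer weights $W_E$, and trade-off weights $\lambda_1, \lambda_2, \lambda_3$. The Frobenius penalty is one common choice for the over-fitting term and is likewise assumed.

```latex
\min_{W} \; \sum_{n=1}^{N} \ell\big(g(x_n^s, x_n^m, x_n^a),\, y_n\big)
  \;+\; \frac{\lambda_1}{2} \sum_{l=1}^{L} \lVert W_l \rVert_F^2
  \;+\; \lambda_2 \lVert W_E \rVert_{2,1}
  \;+\; \lambda_3 \lVert W_E \rVert_1
```

Under these assumptions, dropping the last term leaves the form with only the structural norm, and dropping the last two terms leaves the plain penalized loss for a single network.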

We now move on to discuss the optimization in Equation (3), which is nonconvex because of the multi-layer neural network. Therefore, we train the network using back-propagation with gradient descent method in two scenarios:

Input: the video-level spatial, motion and audio CNN features of each training video;
       the corresponding ground-truth labels;
       randomly initialized weights.
begin
      for each training iteration do
            Run a feed-forward pass through the network to obtain the prediction error;
            for each layer, from the last layer down to the first, do
                  Gradient descent with Eqn. (5);
                  if the layer is the fusion layer then
                        Update the weights with the proximal operation in Eqn. (4);
                  end if
            end for
      end for
end
Algorithm 1: Training the regularized feature fusion network.
  1. The fusion layer. Since the structural regularization is imposed only on this layer, we treat it differently when performing gradient descent. The difficulty of the optimization lies in the last two terms, which are non-smooth and non-differentiable, so we cannot directly apply gradient descent. Instead, we utilize a proximal operation to handle them. More specifically, we split the objective function into two components: a smooth function whose gradients are easy to obtain, and a non-smooth function that combines the $\ell_{2,1}$ and $\ell_1$ norms. At each iteration, we take a gradient step on the smooth component and then apply the proximal operator of the non-smooth component to update the weights, as given in Equation (4); the operator is evaluated on the rows of the weight matrix.

  2. Other layers. Since there are no non-smooth regularizations for the other layers, we compute their gradients directly and then update the weight matrix of each layer with gradient descent as in [1]; this update is Equation (5).
Although the two regularization norms in the objective function incur extra computation cost, it is worth noting that the proximal operator is fast to evaluate. The proposed method is also general for fusing more features at a linearly growing computational cost, rather than the cubic cost of [22]. The overall training process of the feature fusion framework is presented in Alg. 1.
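The fusion-layer update can be illustrated with a short sketch. It assumes the non-smooth term is a weighted sum of an $\ell_{2,1}$ norm and an $\ell_1$ norm, for which a well-known closed form exists: elementwise soft-thresholding followed by row-wise shrinkage. The function names, the step size and the weights `lam_l21` and `lam_l1` are hypothetical, and this is not claimed to be the exact form of Equation (4).

```python
import math

def soft_threshold(v, tau):
    """Elementwise shrinkage: proximal operator of tau * |v|."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def prox_l1_l21(W, step, lam_l21, lam_l1):
    """Proximal map of step * (lam_l21 * ||W||_{2,1} + lam_l1 * ||W||_1)."""
    out = []
    for row in W:
        # First the l1 part, entry by entry.
        shrunk = [soft_threshold(w, step * lam_l1) for w in row]
        # Then the l2,1 part, which rescales (or zeroes) the whole row.
        norm = math.sqrt(sum(s * s for s in shrunk))
        scale = max(1.0 - step * lam_l21 / norm, 0.0) if norm > 0.0 else 0.0
        out.append([scale * s for s in shrunk])
    return out

def proximal_gradient_step(W, grad, step, lam_l21, lam_l1):
    """Gradient step on the smooth part, followed by the proximal map."""
    moved = [[w - step * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
    return prox_l1_l21(moved, step, lam_l21, lam_l1)

# Toy weights: one strong row and one weak row; zero gradient isolates the prox.
W = [[3.0, -4.0], [0.3, 0.1]]
zero_grad = [[0.0, 0.0], [0.0, 0.0]]
W_next = proximal_gradient_step(W, zero_grad, step=1.0, lam_l21=1.0, lam_l1=0.5)
```

The weak row is driven exactly to zero while the strong row is only shrunk, and each row is visited once, so the cost grows linearly with the number of weights.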

III-D Contextual Relationships

Given the classification scores from the two LSTMs and the regularized feature fusion network, accounting for spatial, motion, audio and long-term temporal clues in videos, we are interested in incorporating contextual relationships to further refine the outputs for improved performance. More specifically, for each video sample, we first linearly average the probabilities to obtain a compact prediction. Then we utilize a simple approach to refine the prediction with contextual relationships, which provide useful information of semantics co-occurrence. For example, “baseball” is more related to “soccer” than “diving”, since “diving” contains totally different motion patterns. And if the likelihood for the video to be “soccer” is extremely low, then it is also unlikely to be “baseball”. Existing works often resort to external knowledge like WordNet or word vectors to obtain class relationships, which are either hand-crafted or trained on text corpus and hence fail to consider visual patterns. In our work, we simply rely on the trained models to produce class relationships by computing the confusion matrix, which is a good indicator on how classes are related.

Formally, we take the mapping from an input video to its linearly averaged prediction and define the confusion matrix over the validation set: its entry $(i, j)$ is the cardinality of the set of validation videos whose ground-truth class is $i$ and whose predicted class is $j$. When $i \neq j$, the entry measures the number of samples that originally belong to class $i$ but are misclassified into class $j$. If classes $i$ and $j$ are close, the value will be large since they are difficult to separate. Then, for each video sample, we refine its prediction score with the confusion matrix to obtain the final probability. The recognition of a class of interest can benefit from the contextual relationships in that information from its related classes is utilized to adjust its confidence based on semantic co-occurrence.
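The sketch below builds a confusion matrix from validation predictions and then uses it to adjust a score vector. The counting step follows the definition in the text. The refinement rule is only a hypothetical illustration (row-normalize the matrix, mix the scores through it, renormalize), since the exact refinement formula is not reproduced here; the function names and toy labels are likewise invented.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """C[i][j] counts validation samples of class i predicted as class j."""
    C = [[0] * num_classes for _ in range(num_classes)]
    for y, y_hat in zip(true_labels, predicted_labels):
        C[y][y_hat] += 1
    return C

def row_normalize(C):
    """Turn counts into per-class relation weights that sum to one per row."""
    out = []
    for row in C:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out

def refine(scores, relation):
    """Hypothetical rule: class j receives score mass from every class i,
    weighted by how often class i is predicted as class j on validation data."""
    k = len(scores)
    mixed = [sum(scores[i] * relation[i][j] for i in range(k)) for j in range(k)]
    total = sum(mixed)
    return [m / total for m in mixed]

# Toy validation set with 3 classes; classes 0 and 1 are confused with each other.
val_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
val_pred = [0, 0, 1, 1, 1, 0, 2, 2, 2]
C = confusion_matrix(val_true, val_pred, num_classes=3)
refined = refine([0.5, 0.1, 0.4], row_normalize(C))
```

In this toy case class 1 gains confidence from its related class 0, while the unrelated class 2 keeps its score, which mirrors the intuition that related classes should inform one another.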

Note that researchers also employ multi-label loss functions like hinge loss or ranking loss [2] to consider context in an explicit way, but they are not suitable for single-label recognition tasks. Our approach models contextual relationships among classes by analyzing their appearance and motion patterns, and thus it is general to both multi-label and single-label scenarios.

III-E Discussion

The proposed framework is able to model a comprehensive set of multimodal features, including static appearance, motion patterns in a short time window, long-range temporal dynamics and acoustic clues, which are all critical for understanding video contents since they describe videos from different perspectives. In our framework, we train different components independently rather than jointly in an end-to-end manner. Although training jointly is theoretically feasible, it would require extra training samples to prevent under-fitting in the complicated process and it is observed in [8] the performance gain of joint training is rather marginal. In addition, separate training ensures flexibility in the framework, since a component can be replaced easily without incurring the re-training of the whole complex framework. For example, one can easily update the framework with more powerful CNN models like GoogleNet [48] and ResNet [14] or better RNN models [4]. The main purpose of this paper is to demonstrate that a comprehensive set of features are demanded for improved video classification. In addition, in this work, we mainly demonstrate audio information captured by a CNN model can serve as an effective complement to visual information, and thus we do not investigate modeling temporal audio dynamics with LSTMs.

IV Experiments

In this section, we first introduce the experimental settings and then discuss the results of the proposed hybrid deep learning framework on two popular benchmarks.

IV-A Experimental Setup

IV-A1 Datasets

To investigate the effectiveness of the proposed hybrid deep learning framework, we utilize the following two benchmarks:

  • UCF-101 [44]. The UCF-101 benchmark is a widely adopted dataset for human action recognition, which contains 13,320 video clips manually annotated into 101 human actions, totaling 27 hours. We conduct experiments using three training and testing splits following the protocol defined in [20]. Performance is measured by the average classification accuracy of all three splits.

  • Columbia Consumer Videos (CCV) [23]. It consists of 9,317 videos collected from YouTube belonging to 20 categories, including “basketball”, “wedding dance”, “soccer”, etc. Following [23], we utilize a training set of 4,659 videos and a testing set of 4,658 videos. We compute average precision for each class and report the mean AP over all classes.
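The two evaluation measures mentioned in the list above can be sketched as follows. The average precision here is the non-interpolated variant, which is one common definition; which variant is used for CCV is an assumption, and the toy scores, labels and splits are invented for illustration.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_class_scores, per_class_labels):
    """Mean of the per-class AP values."""
    aps = [average_precision(s, l) for s, l in zip(per_class_scores, per_class_labels)]
    return sum(aps) / len(aps)

def accuracy(predicted, truth):
    """Fraction of samples whose predicted class matches the ground truth."""
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

# Toy multi-class ranking data: 2 classes, 4 test videos each.
class_scores = [[0.9, 0.8, 0.3, 0.1], [0.2, 0.7, 0.6, 0.4]]
class_labels = [[1, 0, 1, 0], [0, 1, 1, 0]]
map_value = mean_average_precision(class_scores, class_labels)

# Toy accuracy averaged over three hypothetical train/test splits.
split_accuracies = [accuracy([0, 1], [0, 1]), accuracy([0, 1], [0, 0]), accuracy([1, 1], [0, 0])]
mean_accuracy = sum(split_accuracies) / len(split_accuracies)
```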

IV-A2 Implementation Details

We utilize the VGG_19 network [42] to extract spatial features and the CNN_M model [41] to compute motion and audio features, due to their strong performance on the ImageNet ILSVRC-2012 validation set: top-5 error rates of 7.5% and 13.5%, respectively. We first pre-train the spatial and audio CNNs with ImageNet data and then fine-tune them on video frames and spectrograms, respectively. Note that for the audio CNN we observe better performance with pre-training even though its input images are spectrograms. Due to the lack of existing models trained on 20 channels (the input data format for the motion CNN), we train the motion CNN from scratch. To further promote the performance, we also employ simple data augmentation methods like cropping and flipping.
We apply stochastic gradient descent using back-propagation to train the CNN models. We adopt a batch size of 256 and fix the momentum to be 0.9. To fine-tune the spatial and audio CNN, we first set the initial learning rate to and decay it by a factor of 10 after every 14K iterations. Different from [41], we begin with a smaller rate rather than . To train the motion network, we set the initial learning rate to , and then decay it by a factor of 10 after every 100K iterations. We adopt the popular Caffe [17] toolbox with modifications to support parallel training on multiple GPUs for implementations.
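The step-wise decay and momentum update described above can be sketched as below. This is an illustration only: the function names are hypothetical, the update uses the classical momentum form, and `base_lr` is a free parameter because the initial learning rates are not recoverable from this text.

```python
def step_decay_lr(iteration, base_lr, step_size, gamma=0.1):
    """Learning rate after `iteration` updates under a step schedule.

    gamma=0.1 corresponds to "decay by a factor of 10"; step_size would be
    14_000 for fine-tuning and 100_000 for the motion network.
    base_lr is an assumption supplied by the caller.
    """
    return base_lr * gamma ** (iteration // step_size)


def sgd_momentum_update(weight, velocity, grad, lr, momentum=0.9):
    """One classical-momentum SGD step for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


if __name__ == "__main__":
    # Placeholder base rate, chosen only to make the schedule visible.
    for it in (0, 13_999, 14_000, 28_000):
        print(it, step_decay_lr(it, base_lr=1e-3, step_size=14_000))
```

The rate stays constant within each block of `step_size` iterations and shrinks tenfold at each boundary.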

To capture the long-range temporal dynamics, we utilize two two-layer LSTMs operating on spatial and motion CNN features respectively. Both LSTMs contain 1,024 hidden neurons for the first layer and 512 units for the second layer. We train the network with a parallel implementation of Back-Propagation Through Time (BPTT) algorithm. The mini-batch size is set to 10 and the maximal iterations to be 150K. We also fix the learning rate and momentum to and 0.9 respectively.
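To give a sense of the size of such a two-layer LSTM, the sketch below counts trainable parameters. It assumes a standard LSTM cell with four gates and no peephole connections, and a 4096-dimensional CNN feature as input; both are assumptions for illustration, not details stated in this section.

```python
def lstm_layer_params(n_input, n_hidden):
    """Parameters of one standard LSTM layer (no peepholes, assumed).

    Each of the four gate blocks (input, forget, output, candidate) has an
    input weight matrix, a recurrent weight matrix and a bias vector.
    """
    per_gate = n_input * n_hidden + n_hidden * n_hidden + n_hidden
    return 4 * per_gate


def stacked_lstm_params(n_input, hidden_sizes):
    """Total parameters of stacked LSTM layers; each layer feeds the next."""
    total, fan_in = 0, n_input
    for n_hidden in hidden_sizes:
        total += lstm_layer_params(fan_in, n_hidden)
        fan_in = n_hidden
    return total


if __name__ == "__main__":
    # 4096 is a hypothetical input dimension; 1024 and 512 are from the text.
    print(stacked_lstm_params(4096, [1024, 512]))
```

Under these assumptions the first layer dominates the count, since it connects the high-dimensional CNN feature to 1,024 hidden units.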

Finally, to learn the optimal weights for the feature fusion network, we follow the procedures described in Alg. 1. The network contains four hidden layers shown in Fig 1. More concretely, we first employ a layer with 200 neurons for each of the spatial, motion and audio feature for independent feature transformation, followed by one layer with 200 neurons to perform feature fusion. The derived unified feature representations are further trained to categorize videos into semantic classes. We utilize a learning rate of 0.7 and fix to to prevent over-fitting. and are selected using cross-validation.
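The layer layout described above (one 200-unit transformation layer per modality, then a 200-unit fusion layer, then a classifier) can be sketched as a forward pass. This is a simplified illustration under several assumptions: sigmoid activations, fusion by concatenating the transformed features, random initial weights, and no regularizers or training loop. All names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for a fully connected layer."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(x, weights, bias):
    """Plain affine map W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def dense(x, weights, bias):
    """Affine map followed by a sigmoid (the activation is an assumption)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in affine(x, weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_fusion_net(input_dims, n_classes, hidden=200, seed=0):
    rng = random.Random(seed)
    return {
        # One transformation layer per modality (spatial, motion, audio).
        "branches": [init_layer(d, hidden, rng) for d in input_dims],
        # Fusion layer over the concatenated transformed features.
        "fusion": init_layer(hidden * len(input_dims), hidden, rng),
        "classifier": init_layer(hidden, n_classes, rng),
    }


def forward(net, features):
    transformed = []
    for (w, b), x in zip(net["branches"], features):
        transformed.extend(dense(x, w, b))
    fused = dense(transformed, *net["fusion"])
    return softmax(affine(fused, *net["classifier"]))
```

Calling `forward(build_fusion_net([8, 6, 4], 5), [[0.1] * 8, [0.2] * 6, [0.3] * 4])` returns a probability vector over five toy classes; real inputs would be the CNN features of each modality.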

IV-A3 Compared Approaches

To evaluate the proposed framework, we compare with the following alternative competing methods: (1) Spatial CNN, Motion CNN and Audio CNN, which are independently trained with raw RGB frames, stacked optical flow images and audio spectrograms; (2) Spatial LSTM and Motion LSTM, which denote LSTM models operating on extracted spatial and motion CNN features respectively; (3) SVM-based Early Fusion (SVM-EF), which averages three -kernels derived from spatial, motion and audio features for classification with an SVM; (4) SVM-based Late Fusion (SVM-LF), which employs a separate SVM for each feature and then linearly average their prediction scores; (5) Multiple Kernel Learning (SVM-MKL), which integrates three features using the -norm MKL [26] with ; (6) Early Fusion with Neural Networks (NN-EF), which performs classification with a 4-layer neural network operating on the concatenated features; (7) Late Fusion with Neural Networks (NN-LF), which combines predictions from three individual neural networks trained on three types of features respectively; (8) Multimodal Deep Boltzmann Machines (M-DBM) [36, 46], which performs feature fusion in a DBM without regularizations; (9) RDNN [58], which utilizes a different regularization scheme with higher computational complexity.

Notice that the first two classes of methods are components of the proposed framework and we report their performance independently to better analyze their contribution in the overall framework. The remaining seven methods aim to integrate the spatial, motion and audio features to improve classification performance.

IV-B Results and Discussions

IV-B1 Multimodal Representations

Temporal Modeling

In this section, we investigate the effectiveness of LSTMs in modeling the long-range temporal dynamics in video sequences. Table I presents the results of different methods on UCF-101 and CCV. We first compare the performance of LSTM models with CNNs, as shown in the top two groups. Since CNN models fail to take the temporal order of frames into consideration, we expect CNN models to perform worse than LSTMs. On UCF-101, the Spatial LSTM slightly outperforms the Spatial CNN, but the Motion LSTM is marginally worse than the Motion CNN. Since the Motion LSTM takes stacked optical flow images as inputs, we posit this might result from the lack of training data to learn the optimal weights, unlike the training of the Spatial LSTM, where a large number of redundant frames can be utilized.

For CCV, CNN models perform consistently better than LSTM models on both spatial and motion streams. Compared to UCF-101, CCV contains more diverse and noisy videos without post-editing, whose durations are also significantly longer than those in UCF-101 (on average, 80 seconds vs. 8 seconds). Therefore, the noise in such videos could significantly degrade the performance of LSTM models. The noisy nature of CCV videos is also reflected by the relatively low performance of motion streams operating on optical flow images, which are sensitive to camera motion and cluttered backgrounds.

| Method | UCF-101 | CCV |
|---|---|---|
| Spatial CNN | 80.1% | 75.0% |
| Spatial LSTM | 83.3% | 43.3% |
| Motion CNN | 77.5% | 58.9% |
| Motion LSTM | 76.6% | 54.7% |
| Audio CNN | 16.2% | 21.5% |
| CNN + LSTM (Spatial) | 84.0% | 77.9% |
| CNN + LSTM (Motion) | 81.4% | 70.9% |
| CNN + LSTM (Spatial & Motion) | 90.1% | 81.7% |
| CNN + LSTM (Spatial & Motion) + Audio | 90.3% | 82.4% |

TABLE I: Performance of the LSTM and the CNN models on UCF-101 and CCV. “+” indicates model fusion, which simply uses the average prediction scores of different models.

Audio Modeling

The performance of the audio CNN is presented in the middle of Table I. The audio CNN operating on spectrograms achieves 16.2% and 21.5% on UCF-101 and CCV, respectively. Note that the performance on UCF-101 is measured by mean accuracy over 101 classes; however, only 51 categories contain soundtracks, and thus the actual accuracy on those categories is 32.1%. Audio signals are usually not as robust and discriminative as visual clues due to the noise in video backgrounds.
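The adjusted audio accuracy can be reproduced with a short calculation. The sketch assumes that classes without soundtracks contribute zero accuracy to the 101-class mean, which is our reading of the text rather than something it states explicitly; the function name is hypothetical.

```python
def accuracy_on_covered_classes(mean_acc_all, n_classes, n_covered):
    """Rescale a mean per-class accuracy to the classes that have data.

    Assumes the remaining (n_classes - n_covered) classes score zero.
    """
    return mean_acc_all * n_classes / n_covered


if __name__ == "__main__":
    # 16.2% over 101 classes, of which 51 contain soundtracks.
    print(round(accuracy_on_covered_classes(16.2, 101, 51), 1))  # 32.1
```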

Feature Complementarity

We now study whether the extracted multimodal representations are complementary by linearly averaging the outputs of the trained models. Here we only adopt simple late fusion; we experiment with different fusion strategies in Sec. IV-B2.
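Simple late fusion by score averaging can be sketched as follows. The toy scores and function names are illustrative assumptions; in practice each list would hold one model's prediction scores over all classes for the same video.

```python
def average_fusion(score_lists):
    """Average per-class prediction scores from several models."""
    n_models = len(score_lists)
    return [sum(column) / n_models for column in zip(*score_lists)]


def predict(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    cnn_scores = [0.5, 0.4, 0.1]   # toy CNN output over three classes
    lstm_scores = [0.2, 0.6, 0.2]  # toy LSTM output for the same video
    fused = average_fusion([cnn_scores, lstm_scores])
    print(predict(cnn_scores), predict(fused))
```

In this toy case the CNN alone picks class 0, while the averaged scores favor class 1, which shows how a second model can change the decision when the first is only weakly confident.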

Results are summarized in the bottom two groups of Table I. We first combine CNN and LSTM models for both spatial and motion streams, and the fusion offers significant performance gains on both benchmarks. The combination of CNN and LSTM on the spatial stream offers 0.7% and 2.9% improvements over the best single model on UCF-101 and CCV, respectively. On the motion stream, the performance gains of fusion are more noticeable, 3.9% and 12% on UCF-101 and CCV. The consistent trend when fusing CNN with LSTM models on both streams confirms the complementarity of these features. Further, we also combine all spatial and motion models, offering 90.1% and 81.7% on UCF-101 and CCV respectively. This clearly verifies that spatial and motion features are very complementary. In addition, we also incorporate audio clues to complement the visual information, and this entire set of features attains the highest performance on both datasets: 90.3% and 82.4%. Therefore, we believe a successful video classification system should integrate all these features.


| Method | UCF-101 | CCV |
|---|---|---|
| Spatial SVM | 78.6% | 74.4% |
| Motion SVM | 78.2% | 57.9% |
| Audio SVM | 16.7% | 22.1% |
| SVM-EF | 86.9% | 75.9% |
| SVM-LF | 85.4% | 75.1% |
| SVM-MKL | 87.1% | 75.6% |
| NN-EF | 86.6% | 76.1% |
| NN-LF | 85.4% | 75.4% |
| M-DBM | 87.0% | 76.0% |
| RDNN | 88.4% | 76.2% |
| Non-regularized Fusion Network | 87.2% | 75.8% |
| Regularized Fusion Network | 88.7% | 76.7% |

TABLE II: Performance comparison on UCF-101 and CCV, using various fusion approaches to combine the multimodal clues.

IV-B2 Feature Fusion

We now evaluate the proposed regularized feature fusion network and compare it with competing methods. Table II presents the results and comparisons. The first group reports the results of the spatial, motion and audio features using SVMs. These experiments serve as baselines to better understand the improvements of fusion using SVM classifiers (summarized in the second group of Table II). See Table I for results obtained directly from the CNN models. We also compare with alternative neural network based fusion methods, summarized in the third group of Table II. Finally, we report the results of our method in the last group.

From the table, we can make the following observations: (1) the fusion of multiple features offers performance gains on both UCF-101 and CCV, and the improvements on UCF-101 are more significant than those on CCV; (2) the proposed feature fusion approach outperforms other neural network based methods; (3) the performance gain over the regularizer-free M-DBM network confirms that modeling feature relationships is important during fusion; (4) our framework also slightly outperforms RDNN at a much lower cost, as mentioned above.

To evaluate the contribution of norms in the objective function, we perform an ablation study and report the performance of the same network without any regularizers. Compared with the full model, the performance of the regularizer-free network drops 1.5% on UCF-101 and 0.9% on CCV.

IV-B3 The Hybrid Framework

We now discuss the effectiveness of the entire hybrid deep learning framework. In particular, we linearly average the classification scores computed from the two LSTM models and the feature fusion network. This yields a mean accuracy of 92.1% on UCF-101 and an mAP of 84.0% on CCV (shown in Table III), outperforming alternative methods by clear margins. The entire hybrid framework improves by 3.4 and 7.3 percentage points over the regularized fusion network (in Table II) on UCF-101 and CCV respectively, which stems from the combination with temporal clues captured by the LSTM models. It is worth noting that our framework also achieves better performance than the simple late fusion method (last row in Table I), which performs fusion with the same set of features.

For categories like “graduation” and “birthday” party in CCV, it is easy to understand that fusion with temporal clues could assist recognition. We also examine other categories like “cat” and “dog” to see if there are certain temporal patterns. Interestingly, as illustrated in Fig. 3, we found many “cat” videos depicting a cat chasing objects or a laser on the floor. Though the temporal order is not explicit, it could be captured by the LSTM model for improved performance.

Fig. 3: Two example videos of class “cat” in the CCV dataset with similar temporal clues over time.

Finally, we refine the prediction scores from the hybrid framework using semantic context. The results are summarized in the last row of Table III. The contextual refinement is easy to perform but very effective, offering gains of 1.0 and 0.5 percentage points over the original prediction scores on UCF-101 and CCV, respectively. This confirms our assumption that related classes can assist the recognition of a class of interest. In addition, we also compare with DASD [21], which utilizes context in a graph diffusion framework. Our context modeling method outperforms DASD by 0.7 and 0.3 percentage points on UCF-101 and CCV, respectively, with much lower computational complexity.

We further demonstrate per-class average precision after contextual refinement on CCV in Figure 4. As can be seen from the figure, contextual refinement improves over the original model for nearly all classes. In addition, for classes with lower performance like “bird” and “wedding reception”, the performance gains are more significant, resulting from the useful information borrowed from related classes.

IV-B4 Speed Efficiency

To investigate the efficiency of our framework, we report the average time to classify a UCF-101 video clip using a single NVIDIA Tesla K40 GPU once the network is trained. Given a video clip, it takes around 4.5 seconds to compute RGB frames, optical flow images and audio spectrograms. The extraction of spatial, motion and audio CNN features takes 12 seconds. Finally, computing and refining the prediction scores from the LSTM and the feature fusion network can be finished in 4.3 seconds.

Fig. 4: Per-class average precision with and without contextual refinement on CCV.

| UCF-101 method | Accuracy | CCV method | mAP |
|---|---|---|---|
| Donahue et al. [8] | 82.9% | Lai et al. [28] | 43.6% |
| Srivastava et al. [45] | 84.3% | Jiang et al. [23] | 59.5% |
| Wang et al. [53] | 85.9% | Xu et al. [60] | 60.3% |
| Tran et al. [50] | 86.7% | Ma et al. [33] | 63.4% |
| Simonyan et al. [41] | 88.0% | Jhuo et al. [15] | 64.0% |
| Ng et al. [35] | 88.6% | Ye et al. [63] | 64.0% |
| Lan et al. [29] | 89.1% | Liu et al. [31] | 68.2% |
| Zha et al. [64] | 89.6% | Wu et al. [59] | 83.5% |
| Wang et al. [54] | 91.5% | Nagel et al. [34] | 71.7% |
| Wang et al. [56] | 92.4% | | |
| Hybrid Framework | 92.1% | Hybrid Framework | 84.0% |
| Hybrid Framework-DASD | 92.4% | Hybrid Framework-DASD | 84.2% |
| Contextual Refinement | 93.1% | Contextual Refinement | 84.5% |

TABLE III: Comparison with state-of-the-art results.

IV-B5 Comparison with State of the Arts

We also compare with several state-of-the-art results on both datasets. Results are summarized in Table III. We can see from the table that the proposed hybrid deep learning framework produces strong performance on both datasets. Different from works that obtain competitive results on UCF-101 using dense trajectory features [53, 64], our framework is built upon neural networks with an aim to learn feature representations. Our approach improves the original two-stream CNN by incorporating temporal and audio modeling as well as better fusion methods. A few recent approaches also leverage temporal information with LSTMs [8, 45]; they utilize different CNN models to compute features, and hence the results are not directly comparable. We expect further performance improvements on UCF-101 with more advanced neural networks like ResNet [55, 9]. On the CCV dataset, the proposed framework outperforms, by clear margins, all the recent approaches that are designed to perform fusion [60, 63, 15, 33, 31, 58].

V Conclusions

In this paper, we have proposed a novel hybrid deep learning framework to integrate a comprehensive set of multimodal clues for video categorization. More specifically, we utilize three independent CNN models operating on static frames, stacked optical flow images and audio spectrograms to compute spatial, motion and audio features, respectively. In order to utilize the long-range temporal clues in videos, we apply two LSTM models with the spatial and motion features as inputs. Since different features characterize the same video from different perspectives, we employ a regularized feature fusion network that derives a unified feature representation for recognizing video semantics. Finally, we also refine the classification scores, the linear combination of LSTM models and feature fusion network, with semantic contextual relationships.

Through an extensive set of experiments on two challenging benchmarks, we demonstrate that (1) the LSTMs, which model the long-range temporal information in video sequences in an explicitly recurrent manner, are highly complementary with CNNs; (2) the rich contextual relationships among video semantics can be exploited in a simple yet effective way to further refine predictions for improved performance. The experimental results provide strong quantitative evidence that our framework achieves promising results, outperforming competing methods by clear margins.


This work was supported in part by two grants from NSF China (#61622204, #61572134) and two grants from STCSM, Shanghai, China (#16QA1400500, #16JC1420401).


  • [1] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural Networks: Tricks of the Trade. Springer, 2012.
  • [2] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
  • [3] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. In ICCV, 2015.
  • [4] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio. Gated feedback recurrent neural networks. CoRR, 2015.
  • [5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [6] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
  • [7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [8] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, 2015.
  • [9] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • [11] A. Graves, A. Mohamed, and G. E. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
  • [12] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 2005.
  • [13] X. Han, B. Singh, V. Morariu, and L. S. Davis. Vrfp: On-the-fly video retrieval using web images and fast fisher vector products. IEEE TMM, 2017.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [15] I.-H. Jhuo, G. Ye, S. Gao, D. Liu, Y.-G. Jiang, D. T. Lee, and S.-F. Chang. Discovering joint audio-visual codewords for video event detection. Machine Vision and Applications, 2014.
  • [16] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. In ICML, 2010.
  • [17] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.
  • [18] Y.-G. Jiang, S. Bhattacharya, S.-F. Chang, and M. Shah. High-level event recognition in unconstrained videos. IJMIR, 2013.
  • [19] Y.-G. Jiang, Q. Dai, T. Mei, Y. Rui, and S.-F. Chang. Super fast event recognition in internet videos. IEEE TMM, 2015.
  • [20] Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.
  • [21] Y.-G. Jiang, J. Wang, S.-F. Chang, and C.-W. Ngo. Domain adaptive semantic diffusion for large scale context-based video annotation. In ICCV, 2009.
  • [22] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE TPAMI, 2017.
  • [23] Y.-G. Jiang, G. Ye, S.-F. Chang, D. Ellis, and A. C. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. In ICMR, 2011.
  • [24] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [25] A. Klaser, M. Marszałek, and C. Schmid. A spatio-temporal descriptor based on 3d-gradients. In BMVC, 2008.
  • [26] M. Kloft, U. Brefeld, S. Sonnenburg, and A. Zien. Lp-norm multiple kernel learning. JMLR, 2011.
  • [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • [28] K.-T. Lai, F. X. Yu, M.-S. Chen, and S.-F. Chang. Video event detection by inferring temporal instance labels. In CVPR, 2014.
  • [29] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond gaussian pyramid: Multi-skip feature stacking for action recognition. CoRR, 2014.
  • [30] I. Laptev. On space-time interest points. IJCV, 2007.
  • [31] D. Liu, K.-T. Lai, G. Ye, M.-S. Chen, and S.-F. Chang. Sample-specific late fusion for visual category recognition. In CVPR, 2013.
  • [32] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  • [33] A. J. Ma and P. C. Yuen. Reduced analytic dependency modeling: Robust fusion for visual recognition. IJCV, 2014.
  • [34] M. Nagel, T. Mensink, and C. G. M. Snoek. Event fisher vectors: Robust encoding visual diversity of visual streams. In BMVC, 2015.
  • [35] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [36] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng. Multimodal deep learning. In ICML, 2011.
  • [37] D. Oneata, J. Verbeek, C. Schmid, et al. Action and event recognition with fisher vectors on a compact feature set. In ICCV, 2013.
  • [38] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie. Objects in context. In ICCV, 2007.
  • [39] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPR Workshop, 2014.
  • [40] Y. Shi, Y. Tian, Y. Wang, and T. Huang. Sequential deep trajectory descriptor for action recognition with three-stream cnn. IEEE TMM, 2016.
  • [41] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [43] C. G. Snoek, M. Worring, and A. W. Smeulders. Early versus late fusion in semantic video analysis. In ACM Multimedia, 2005.
  • [44] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, 2012.
  • [45] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • [46] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
  • [47] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
  • [48] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [49] K. Tang, L. Fei-Fei, and D. Koller. Learning latent temporal structure for complex event detection. In CVPR, 2012.
  • [50] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. C3d: Generic features for video analysis. In ICCV, 2015.
  • [51] D. L. Vail, M. M. Veloso, and J. D. Lafferty. Conditional random fields for activity recognition. In AAMAS, 2007.
  • [52] A. Van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, 2013.
  • [53] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, 2013.
  • [54] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
  • [55] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [56] X. Wang, A. Farhadi, and A. Gupta. Actions ~ transformations. In CVPR, 2016.
  • [57] Y. Wang and G. Mori. Max-margin hidden conditional random fields for human action recognition. In CVPR, 2009.
  • [58] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue. Exploring inter-feature and inter-class relationships with deep neural networks for video classification. In ACM Multimedia, 2014.
  • [59] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue. Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In ACM Multimedia, 2015.
  • [60] Z. Xu, Y. Yang, I. Tsang, N. Sebe, and A. Hauptmann. Feature weighting via optimal thresholding for video analysis. In ICCV, 2013.
  • [61] Y. Yang, J. Song, Z. Huang, Z. Ma, N. Sebe, and A. G. Hauptmann. Multi-feature fusion via hierarchical regression for multimedia analysis. IEEE TMM, 2013.
  • [62] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
  • [63] G. Ye, D. Liu, I.-H. Jhuo, and S.-F. Chang. Robust late fusion with rank minimization. In CVPR, 2012.
  • [64] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained cnn architectures for unconstrained video classification. In BMVC, 2015.
  • [65] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 2007.