Kaggle Youtube 8M WILLOW approach
Common video representations often deploy an average or maximum pooling of pre-extracted frame features over time. Such an approach provides a simple means to encode feature distributions, but is likely to be suboptimal. As an alternative, we here explore combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features over time. We also introduce a learnable non-linear network unit, named Context Gating, aiming at modeling interdependencies between features. We evaluate the method on the multi-modal Youtube-8M Large-Scale Video Understanding dataset using pre-extracted visual and audio features. We demonstrate improvements provided by the Context Gating as well as by the combination of learnable pooling methods. We finally show how this leads to the best performance, out of more than 600 teams, in the Kaggle Youtube-8M Large-Scale Video Understanding challenge.
Understanding and recognizing video content is a major challenge for numerous applications including surveillance, personal assistance, smart homes, autonomous driving, stock footage search and sports video analysis. In this work, we address the problem of multi-label video classification for user-generated videos on the Internet. The analysis of such data involves several challenges. Internet videos have a great variability in terms of content and quality (see Figure 1). Moreover, user-generated labels are typically incomplete, ambiguous and may contain errors.
Current approaches for video analysis typically represent videos by features extracted from consecutive frames, followed by feature aggregation over time. Example methods for feature extraction include deep convolutional neural networks (CNNs) pre-trained on static images [1, 2, 3, 4]. Representations of motion and appearance can be obtained from CNNs pre-trained for video frames and short video clips [5, 6], as well as hand-crafted video features [7, 8, 9]. Other more advanced models employ hierarchical spatio-temporal convolutional architectures [10, 11, 12, 13, 5, 14] to both extract and temporally aggregate video features at the same time.
Common methods for temporal feature aggregation include simple averaging or maximum pooling as well as more sophisticated pooling techniques such as VLAD  or, more recently, recurrent models (LSTM  and GRU ). These techniques, however, may be suboptimal. Indeed, simple techniques such as average or maximum pooling may become inaccurate for long sequences. Recurrent models are frequently used for temporal aggregation of variable-length sequences [18, 19] and often outperform simpler aggregation methods; however, their training remains cumbersome. As we show in Section 5, training recurrent models requires a relatively large amount of data. Moreover, recurrent models can be suboptimal for processing long video sequences during GPU training. It is also not clear whether current models for sequential aggregation are well adapted for video representation. Indeed, our experiments with training recurrent models using temporally-ordered and randomly-ordered video frames show similar results.
Another research direction is to exploit traditional orderless aggregation techniques based on clustering approaches such as Bag-of-visual-words [20, 21], Vector of Locally aggregated Descriptors (VLAD)  or Fisher Vectors . It has been recently shown that integrating VLAD as a differentiable module in a neural network can significantly improve the aggregated representation for the task of place retrieval . This has motivated us to integrate and enhance such clustering-based aggregation techniques for the task of video representation and classification.
Contributions. In this work we make the following contributions: (i) we introduce a new state-of-the-art architecture aggregating video and audio features for video classification, (ii) we introduce the Context Gating layer, an efficient non-linear unit for modeling interdependencies among network activations, and (iii) we experimentally demonstrate benefits of clustering-based aggregation techniques over LSTM and GRU approaches for the task of video classification.
Results. We evaluate our method on the large-scale multi-modal Youtube-8M V2 dataset containing about 8M videos and 4716 unique tags. We use the pre-extracted visual and audio features provided with the dataset  and demonstrate improvements obtained with the Context Gating as well as with the combination of learnable pooling methods. Our method obtains top performance, out of more than 650 teams, in the Youtube-8M Large-Scale Video Understanding challenge (https://www.kaggle.com/c/youtube8m). Compared to common recurrent models, our models are faster to train and require less training data. Figure 1 illustrates qualitative results of our method.
This work is related to previous methods for video feature extraction, aggregation and gating reviewed below.
Successful hand-crafted representations [7, 8, 9] are based on local histograms of image and motion gradient orientations extracted along dense trajectories [24, 9]. More recent methods extract deep convolutional neural network activations computed from individual frames or blocks of frames using spatial [6, 25, 26, 27] or spatio-temporal [10, 11, 12, 13, 5, 14] convolutions. Convolutional neural networks can also be applied separately to the appearance channel and the pre-computed motion field channel, resulting in so-called two-stream representations [11, 6, 26, 28, 14]. As our work is motivated by the Youtube-8M large-scale video understanding challenge , we will assume for the rest of the paper that features are provided (more details are given in Section 5). This work mainly focuses on the temporal aggregation of given features.
Video features are typically extracted from individual frames or short video clips. The remaining question is: how to aggregate video features over the entire and potentially long video? One way to achieve this is to employ recurrent neural networks, such as long short-term memory (LSTM)  or gated recurrent units (GRU) , on top of the extracted frame-level features to capture the temporal structure of video in a single representation [29, 18, 30, 31, 32]. Hierarchical spatio-temporal convolution architectures [10, 11, 12, 13, 5, 14] can also be viewed as a way to both extract and aggregate temporal features at the same time. Other methods capture only the distribution of features in the video, not explicitly modeling their temporal ordering. The simplest form of this approach is the average or maximum pooling of video features  over time. Other commonly used methods include bag-of-visual-words [20, 21], Vector of Locally aggregated Descriptors (VLAD)  or Fisher Vector  encoding. Applications of these techniques to video include [7, 34, 8, 9, 35]. Typically, these methods [31, 36]
rely on an unsupervised learning of the codebook. However, the codebook can also be learned in a discriminative manner [34, 37, 38], or the entire encoding module can be included within the convolutional neural network architecture and trained in an end-to-end manner . This type of end-to-end trainable orderless aggregation has recently been applied to video frames in . Here we extend this work by aggregating visual and audio inputs, and also investigate multiple orderless aggregations.
Gating mechanisms allow multiplicative interaction between a given input feature and a gate vector with values between 0 and 1. They are commonly used in recurrent neural network models such as LSTM  and GRU  but have so far not been exploited in conjunction with other non-temporal aggregation strategies such as Fisher Vectors (FV), Vector of Locally Aggregated Descriptors (VLAD) or bag-of-visual-words (BoW). Our work aims to fill this gap and designs a video classification architecture combining non-temporal aggregation with gating mechanisms. One of the motivations for this choice is the recent Gated Linear Unit (GLU) , which has demonstrated significant improvements in natural language processing tasks.
Our architecture for video classification is illustrated in Figure 2 and contains three main modules. First, the input features are extracted from video and audio signals. Next, the pooling module aggregates the extracted features into a single compact (e.g. 1024-dimensional) representation for the entire video. This pooling module has a two-stream architecture treating visual and audio features separately. The aggregated representation is then enhanced by the Context Gating layer (Section 3.1). Finally, the classification module takes the resulting representation as input and outputs scores for a pre-defined set of labels. The classification module adopts the Mixture-of-Experts  classifier as described in , followed by another Context Gating layer.
The Context Gating (CG) module transforms the input feature representation X into a new representation Y as

Y = σ(WX + b) ∘ X, (1)

where X ∈ R^n is the input feature vector, σ is the element-wise sigmoid activation and ∘ is the element-wise multiplication. W ∈ R^{n×n} and b ∈ R^n are trainable parameters. The vector of weights σ(WX + b) ∈ [0, 1]^n represents a set of learned gates applied to the individual dimensions of the input feature X.
The motivation behind this transformation is two-fold. First, we wish to introduce non-linear interactions among activations of the input representation. Second, we wish to recalibrate the strengths of different activations of the input representation through a self-gating mechanism. The form of the Context Gating layer is inspired by the Gated Linear Unit (GLU) introduced recently for language modeling , which considers a more complex class of transformations given by Y = (W1 X + b1) ∘ σ(W2 X + b2), with two sets of learnable parameters W1, b1 and W2, b2. Compared to the Gated Linear Unit, our Context Gating in (1) (i) reduces the number of learned parameters as only one set of weights is learnt, and (ii) re-weights directly the input vector X (instead of its linear transformation) and hence is suitable for situations where X has a specific meaning, such as the score of a class label, that is preserved by the layer. As shown in Figure 2, we use Context Gating in the feature pooling and classification modules. First, we use CG to transform the feature vector before passing it to the classification module. Second, we use CG after the classification layer to capture the prior structure of the output label space. Details are provided below.
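The Context Gating transformation is straightforward to implement. Below is a minimal NumPy sketch (our own illustration, not the paper's TensorFlow implementation; the function name `context_gating` is ours):

```python
import numpy as np

def context_gating(x, W, b):
    """Context Gating: y = sigmoid(W x + b) * x (element-wise).

    x : (n,) input feature vector
    W : (n, n) trainable weight matrix
    b : (n,) trainable bias
    """
    gates = 1.0 / (1.0 + np.exp(-(W @ x + b)))  # element-wise sigmoid, values in (0, 1)
    return gates * x                            # re-weight each input dimension

# Toy usage: gates near 0 suppress a dimension, gates near 1 pass it through.
rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
W = rng.standard_normal((n, n)) * 0.1
b = np.zeros(n)
y = context_gating(x, W, b)
assert y.shape == x.shape
```

Since every gate lies strictly between 0 and 1, each output dimension is a damped copy of the corresponding input dimension, which is what makes the layer suitable for re-weighting quantities with a fixed meaning such as class scores.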
Residual connections have been introduced in . They demonstrate faster and better training of deep convolutional neural networks as well as better performance for a variety of tasks. Residual connections can be formulated as

Y = f(WX + b) + X, (2)

where X are the input features, W and b are the learnable parameters of the linear mapping (or it can be a convolution) and f is a non-linearity (typically a Rectified Linear Unit). One advantage of residual connections is the possibility of propagating gradients directly into X during training, avoiding the vanishing gradient problem. To show this, the gradient of the residual connection (2) can be written as

∂Y/∂X = ∂f(WX + b)/∂X + 1. (3)

One can notice that the gradient is the sum of the gradient of the previous layer ∂f(WX + b)/∂X and the constant term 1. The vanishing gradient problem is overcome thanks to this constant term, which allows the gradient to backpropagate directly from Y to X without decreasing in norm. A similar effect is observed with Context Gating, which has the following gradient:

∂Y/∂X = ∂σ(WX + b)/∂X ∘ X + σ(WX + b). (4)

In this case, the identity term is weighted by the activations σ(WX + b). Hence, for dimensions where the gates σ(WX + b) are close to 1, gradients are directly propagated from Y to X. In contrast, for gate values close to 0 the gradient propagation is suppressed. This property is valuable as it allows stacking several non-linear layers while avoiding vanishing gradient problems.
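This gradient behavior can be checked numerically. The sketch below (our own illustration, in NumPy) compares the analytic Jacobian of Context Gating, diag(g) + diag(x ∘ g ∘ (1 − g)) W with g = σ(Wx + b), against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cg(x, W, b):
    """Context Gating: y = sigmoid(W x + b) * x."""
    return sigmoid(W @ x + b) * x

def cg_jacobian(x, W, b):
    """Analytic Jacobian: dy/dx = diag(g) + diag(x * g * (1 - g)) @ W."""
    g = sigmoid(W @ x + b)
    return np.diag(g) + np.diag(x * g * (1.0 - g)) @ W

rng = np.random.default_rng(1)
n = 6
x, b = rng.standard_normal(n), rng.standard_normal(n)
W = rng.standard_normal((n, n))

# Central finite-difference estimate of the Jacobian.
eps = 1e-6
J_num = np.zeros((n, n))
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    J_num[:, j] = (cg(x + e, W, b) - cg(x - e, W, b)) / (2 * eps)

assert np.allclose(J_num, cg_jacobian(x, W, b), atol=1e-5)
# The diag(g) term is the direct path: gradients flow through dimensions
# whose gates are near 1 and are suppressed where gates are near 0.
```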
Our goal is to predict human-generated tags for a video. Such tags typically represent only a subset of objects and events which are most relevant to the context of the video. To mimic this behavior and to suppress irrelevant labels, we introduce the Context Gating module both to re-weight the features and the output labels of our architecture.
Capturing dependencies among features. Context Gating can help create dependencies between visual activations. Take the example of a skiing video showing a skiing person, snow and trees. While network activations for the Tree features might be high, trees might be less important in the context of skiing, where people are more likely to comment about the snow and skiing than about the forest. Context Gating can learn to down-weight visual activations for Tree when they co-occur with visual activations for Ski and Snow, as illustrated in Figure 3.
Capturing prior structure of the output space.
Context Gating can also create dependencies among output class scores when applied to the classification layer of the network. This makes it possible to learn a prior structure on the output probability space, which can be useful in modeling biases in label annotations.
Within our video classification architecture described above, we investigate several types of learnable pooling models, which we describe next. Previous successful approaches [18, 19] employed recurrent neural networks such as LSTM or GRU for the encoding of sequential features. We chose to focus on non-recurrent aggregation techniques. This is motivated by several factors: first, recurrent models are computationally demanding for long temporal sequences as their sequential computation cannot be parallelized. Moreover, it is not clear whether treating the aggregation problem as a sequence modeling problem is necessary. As we show in our experiments, there is almost no change in performance if we shuffle the frames into a random order, as almost all of the relevant signal relies on static visual cues. All we actually need to do is to find a way to efficiently remember all of the relevant visual cues. We will first review the NetVLAD  aggregation module and then explain how the same idea can be exploited to imitate Fisher Vector and bag-of-visual-words aggregation schemes.
The NetVLAD architecture reproduces the VLAD encoding, but in a differentiable manner, where the clusters are tuned via backpropagation instead of using k-means clustering. It was then extended to action recognition in video . The main idea behind NetVLAD is to replace the hard assignment of a descriptor to a cluster with a soft assignment:

a_k(x_i) = exp(w_k^T x_i + b_k) / Σ_{k'} exp(w_{k'}^T x_i + b_{k'}), (5)

where w_k and b_k are learnable parameters. In other words, the soft assignment a_k(x_i) of descriptor x_i to cluster k measures on a scale from 0 to 1 how close the descriptor x_i is to cluster k. In the hard assignment case, a_k(x_i) would be equal to 1 if the closest cluster to x_i is cluster k and 0 otherwise. For the rest of the paper, a_k(x_i) will denote the soft assignment of descriptor x_i to cluster k. If we write c_k for the k-th learnable cluster, the NetVLAD descriptor can be written as

VLAD(j, k) = Σ_{i=1}^{N} a_k(x_i) (x_i(j) − c_k(j)), (6)

which computes the weighted sum of residuals x_i − c_k of descriptors x_i from the learnable anchor point c_k of cluster k.
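As a concrete illustration, here is our own NumPy sketch of the soft assignment and the NetVLAD aggregation (the paper's actual implementation is in TensorFlow; names and shapes here are ours):

```python
import numpy as np

def netvlad(X, Wp, bp, C):
    """NetVLAD aggregation of N local descriptors.

    X  : (N, D) descriptors x_i
    Wp : (K, D), bp : (K,) parameters of the soft assignment
    C  : (K, D) learnable cluster centers c_k
    Returns a (K, D) matrix of aggregated residuals (flattened in practice).
    """
    logits = X @ Wp.T + bp                     # (N, K)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)          # soft assignments, rows sum to 1
    # V[k, j] = sum_i a_k(x_i) * (x_i(j) - c_k(j))
    return a.T @ X - a.sum(axis=0)[:, None] * C

rng = np.random.default_rng(0)
N, D, K = 32, 16, 4
X = rng.standard_normal((N, D))
Wp = rng.standard_normal((K, D))
bp = rng.standard_normal(K)
C = rng.standard_normal((K, D))
V = netvlad(X, Wp, bp, C)
assert V.shape == (K, D)
```

The vectorized form `a.T @ X - a.sum(axis=0)[:, None] * C` is exactly the weighted sum of residuals: the second term subtracts each center c_k scaled by the total soft mass assigned to cluster k.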
For bag-of-visual-words (BOW) encoding, we use soft assignment of descriptors to visual word clusters [23, 43] to obtain a differentiable representation. The differentiable BOW representation can be written as

BOW(k) = Σ_{i=1}^{N} a_k(x_i), (7)

which counts, in a soft manner, the number of descriptors assigned to cluster k.
Notice that the exact bag-of-visual-words formulation is recovered if we replace the soft assignment values by their hard-assignment equivalents. This formulation is closely related to the Neural BoF formulation , but differs in how the soft assignment is computed. In detail, the Neural BoF formulation performs a softmax operation over the computed L2 distances between the descriptors and the cluster centers, whereas we use the soft assignment given by eq. (5), whose parameters are learnable without an explicit relation to L2 distances to cluster centers. It also relates to , which uses a recurrent neural network to perform the aggregation. The advantage of BOW aggregation over NetVLAD is that it aggregates a list of feature descriptors into a much more compact representation, given a fixed number of clusters. The drawback is that significantly more clusters are needed to obtain a rich representation of the aggregated descriptors.
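A soft BOW histogram is a one-line reduction of the same soft assignment used by NetVLAD. The sketch below is our own NumPy illustration (not the paper's implementation):

```python
import numpy as np

def soft_bow(X, Wp, bp):
    """Soft bag-of-visual-words: BOW(k) = sum_i a_k(x_i).

    X  : (N, D) descriptors
    Wp : (K, D), bp : (K,) soft-assignment parameters
    Returns a (K,) vector of soft counts per cluster.
    """
    logits = X @ Wp.T + bp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # each row sums to 1
    return a.sum(axis=0)                         # soft counts per cluster

rng = np.random.default_rng(0)
N, D, K = 50, 8, 12
X = rng.standard_normal((N, D))
hist = soft_bow(X, rng.standard_normal((K, D)), rng.standard_normal(K))
# Soft counts always sum to N, exactly like a hard-assignment histogram.
assert np.isclose(hist.sum(), N)
```

Note the output size is K regardless of D, which is why BOW is more compact than NetVLAD (size K × D) for the same number of clusters, and why more clusters are needed for a comparably rich representation.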
Inspired by Fisher Vector  encoding, we also experimented with modifying the NetVLAD architecture to allow learning of second-order feature statistics within the clusters. We denote this representation NetFV (for Net Fisher Vectors) as it aims at imitating the standard Fisher Vector encoding . Reusing the previously established soft-assignment notation, we can write the NetFV representation as

FV1(j, k) = Σ_{i=1}^{N} a_k(x_i) (x_i(j) − c_k(j)) / σ_k(j), (8)
FV2(j, k) = Σ_{i=1}^{N} a_k(x_i) ((x_i(j) − c_k(j)) / σ_k(j))², (9)

where FV1 captures the first-order statistics, FV2 captures the second-order statistics, c_k are the learnable clusters and σ_k are the clusters' diagonal covariances. To keep σ_k positive, we first randomly initialize their values with Gaussian noise with unit mean and small variance and then square the values during training so that they stay positive. In the same manner as for NetVLAD, c_k and σ_k are learnt independently from the parameters w_k, b_k of the soft assignment a_k(x_i). This formulation differs from [46, 38] as we are not exactly reproducing the original Fisher Vectors. Indeed, the parameters c_k and σ_k are decoupled from each other. As opposed to [46, 38], these parameters are not related to a Gaussian Mixture Model but instead are trained in a discriminative manner.
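The first- and second-order statistics can be sketched in NumPy as below (our own illustration; the exact second-order form is one standard Fisher Vector instantiation with diagonal covariances, and all names are ours):

```python
import numpy as np

def netfv(X, Wp, bp, C, sigma):
    """NetFV sketch: per-cluster first- and second-order statistics.

    X     : (N, D) descriptors
    Wp    : (K, D), bp : (K,) soft-assignment parameters
    C     : (K, D) learnable cluster centers c_k
    sigma : (K, D) positive per-cluster diagonal "covariances"
    Returns the concatenation of FV1 and FV2, a (2*K*D,) vector.
    """
    logits = X @ Wp.T + bp
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)                     # (N, K)
    diff = (X[:, None, :] - C[None, :, :]) / sigma[None, :, :]  # (N, K, D)
    fv1 = np.einsum('nk,nkd->kd', a, diff)                # first-order statistics
    fv2 = np.einsum('nk,nkd->kd', a, diff ** 2)           # second-order statistics
    return np.concatenate([fv1.ravel(), fv2.ravel()])

rng = np.random.default_rng(0)
N, D, K = 20, 6, 3
X = rng.standard_normal((N, D))
# sigma kept positive by squaring, as described in the text
sigma = rng.normal(1.0, 0.1, size=(K, D)) ** 2
desc = netfv(X, rng.standard_normal((K, D)), rng.standard_normal(K),
             rng.standard_normal((K, D)), sigma)
assert desc.shape == (2 * K * D,)
```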
Finally, we have also investigated a simplification of the original NetVLAD architecture that averages the actual descriptors instead of residuals, as first proposed by . We call this variant NetRVLAD (for Residual-less VLAD). This simplification requires fewer parameters and computing operations (about half in both cases). The NetRVLAD descriptor can be written as

RVLAD(j, k) = Σ_{i=1}^{N} a_k(x_i) x_i(j). (10)
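In code, dropping the residuals reduces the aggregation to a single matrix product. A NumPy sketch (our own illustration):

```python
import numpy as np

def netrvlad(X, Wp, bp):
    """NetRVLAD sketch: like NetVLAD but aggregates the descriptors
    themselves instead of their residuals, so no cluster centers c_k.

    X  : (N, D) descriptors; Wp : (K, D), bp : (K,) assignment parameters.
    """
    logits = X @ Wp.T + bp
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)   # (N, K), rows sum to 1
    return a.T @ X                      # RVLAD[k, j] = sum_i a_k(x_i) * x_i(j)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 16))
R = netrvlad(X, rng.standard_normal((4, 16)), rng.standard_normal(4))
assert R.shape == (4, 16)
# Summing over clusters recovers the plain feature sum, since the soft
# assignments of each descriptor sum to 1 over the clusters.
assert np.allclose(R.sum(axis=0), X.sum(axis=0))
```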
More information about our Tensorflow implementation of these different aggregation models can be found at: https://github.com/antoine77340/LOUPE
This section evaluates alternative architectures for video aggregation and presents results on the Youtube-8M  dataset.
The Youtube-8M dataset  is composed of approximately 8 million videos. Because of the large scale of the dataset, visual and audio features are pre-extracted and provided with the dataset. Each video is labeled with one or multiple tags referring to the main topic of the video. Figure 5 illustrates examples of videos with their annotations. The original dataset is divided into training, validation and test subsets with 70%, 20% and 10% of the videos, respectively. In this work we keep around 20K videos for validation; the remaining samples from the original training and validation subsets are used for training. This choice was made to obtain a larger training set and to decrease the validation time. We have noticed that the performance on our validation set was comparable (slightly higher) to the test performance evaluated on the Kaggle platform. As we have no access to the test labels, most results in this section are reported for our validation set. We report evaluation using the Global Average Precision (GAP) metric at top 20 as used in the Youtube-8M Kaggle competition (more details about the metric can be found at: https://www.kaggle.com/c/youtube8m#evaluation).
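GAP at top 20 pools the 20 highest-scoring (video, label) predictions of every video into one global ranked list and computes average precision over that list. A minimal NumPy sketch of the metric, assuming dense score and binary label matrices (the function name `gap_at_k` is ours):

```python
import numpy as np

def gap_at_k(scores, labels, k=20):
    """Global Average Precision at top-k, as in the Youtube-8M challenge.

    scores : (V, C) predicted class scores per video
    labels : (V, C) binary ground-truth matrix
    """
    pooled_scores, pooled_hits = [], []
    for s, y in zip(scores, labels):
        top = np.argsort(s)[::-1][:k]     # top-k predictions of this video
        pooled_scores.extend(s[top])
        pooled_hits.extend(y[top])
    order = np.argsort(pooled_scores)[::-1]          # global ranking
    hits = np.asarray(pooled_hits)[order]
    precisions = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    total_pos = labels.sum()   # number of ground-truth (video, label) pairs
    return float((precisions * hits).sum() / total_pos)

# A perfect ranking yields GAP = 1.
perfect = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
assert np.isclose(gap_at_k(perfect, perfect), 1.0)
```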
In the Youtube-8M competition dataset , video and audio features are provided for every second of the input video. The visual features consist of ReLU activations of the last fully-connected layer of a publicly available Inception network trained on ImageNet (https://www.tensorflow.org/tutorials/image_recognition). The audio features are extracted from a CNN architecture trained for audio classification. PCA and whitening are then applied to reduce the dimension to 1024 for the visual features and 128 for the audio features. More details on feature extraction are available in .
All of our models are trained using the Adam algorithm  and mini-batches with data from around 100 videos. The learning rate is initially fixed and then decreased exponentially over the course of training. We also use batch normalization [51] before each non-linear layer.
For the clustering-based pooling models, i.e. BoW, NetVLAD, NetRVLAD and NetFV, we randomly sample N features with replacement from each video. N is fixed for all videos at training and testing. As opposed to the original version of NetVLAD , we did not pre-train the codebook with a k-means initialization as we did not notice any improvement from doing so. For the training of recurrent models, i.e. LSTM and GRU, we process features in temporal order. We have also experimented with random sampling of frames for LSTM and GRU, which performs surprisingly similarly.
All our models are trained with the cross entropy loss. Our implementation uses the TensorFlow framework . Each training is performed on a single NVIDIA TITAN X (12Gb) GPU.
|Method||GAP|
|Average pooling + Logistic Regression|| |
|Average pooling + MoE + CG|| |
|LSTM (2 Layers)|| |
|GRU (2 Layers)|| |
|BoW (4096 Clusters)|| |
|NetFV (128 Clusters)|| |
|NetRVLAD (256 Clusters)|| |
|NetVLAD (256 Clusters)|| |
|Gated BoW (4096 Clusters)|| |
|Gated NetFV (128 Clusters)|| |
|Gated NetRVLAD (256 Clusters)|| |
|Gated NetVLAD (256 Clusters)||83.2%|
We evaluate the performance of individual models in Table I. To enable a fair comparison, all pooled representations have the same size of 1024 dimensions. The “Gated” versions for the clustering-based pooling methods include CG layers as described in Section 3.1. Using CG layers together with GRU and LSTM has decreased the performance in our experiments.
From Table I we can observe a significant increase of performance provided by all learnt aggregation schemes compared to the average pooling baselines. Interestingly, the NetVLAD and NetFV representations based on temporally-shuffled feature pooling outperform the temporal models (GRU and LSTM). Finally, we can note a consistent increase in performance provided by the Context Gating for all clustering-based pooling methods.
Table II reports an ablation study evaluating the effect of Context Gating on the NetVLAD aggregation with 128 clusters. The addition of CG layers in the feature pooling and classification modules gives a significant increase in GAP. We have observed a similar behavior for NetVLAD with 256 clusters. We also experimented with replacing Context Gating by the GLU  after pooling. To make the comparison fair, we added a Context Gating layer just after the MoE. Despite being less complex than the GLU, CG performs better. We note that the improvement provided by CG is similar to the improvement of the best non-gated model (NetVLAD) over LSTM in Table I.
In addition to the late fusion of audio and video streams (Late Concat) described in Section 3, we have also experimented with a simple concatenation of original audio and video features into a single vector, followed by the pooling and classification modules in a “single stream manner” (Early Concat). Results in Table III illustrate the effect of the two fusion schemes for different pooling methods. The two-stream audio-visual architecture with the late fusion improves performance for the clustering-based pooling methods (NetVLAD and NetFV). On the other hand, the early fusion scheme seems to work better for GRU and LSTM aggregations. We have also experimented with replacing the concatenation fusion of audio-video features by their outer product. We found this did not work well compared to the concatenation mainly due to the high dimensionality of the resulting output. To alleviate this issue, we tried to reduce the output dimension using the multi-modal compact bilinear pooling approach  but found the resulting models underfitting the data.
|After pooling||After MoE||GAP|
|Gated Linear Unit||-|
|Gated Linear Unit||Context Gating|
|Context Gating||Context Gating||83.0%|
|Method||Early Concat||Late Concat|
One valuable feature of the Youtube-8M dataset is its large scale (almost 10 million annotated videos). More common annotated video datasets are several orders of magnitude smaller, ranging from 10k to 100k samples. With this large-scale dataset at hand, we evaluate the influence of the amount of training data on the performance of different models. To this end, we trained several models (Gated NetVLAD, NetVLAD, LSTM and an average-pooling-based model) on multiple randomly sampled subsets of the Youtube-8M dataset. We experimented with subsets of 70K, 150K, 380K and 1150K samples. For each subset size, we trained models on three non-overlapping training subsets and measured the variance in performance. Figure 4 illustrates the GAP performance of each model when varying the training size. The error bars represent the variance observed when training the models on the three different training subsets. We observed low and consistent GAP variance across models and training sizes. Although the LSTM model has fewer parameters (around 40M) compared to NetVLAD (around 160M) and Gated NetVLAD (around 180M), NetVLAD and Gated NetVLAD demonstrate better generalization than LSTM when trained from a smaller number of samples. The Context Gating module still helps the basic NetVLAD-based architecture generalize better given a sufficient number of samples (at least 100k). We do not show results for smaller dataset sizes as the results for all models were dropping drastically. This is mainly due to the fact that the task is a multi-label prediction problem with a large pool of roughly 5000 labels. As these labels have a long-tail distribution, decreasing the dataset size to fewer than 30k samples would leave many labels without a single training example. It would thus be unclear whether the drop in performance is due to the aggregation technique or to a lack of training samples for rare classes.
We explore the complementarity of different models and consider their combination through ensembling. Our ensemble consists of several independently trained models whose label prediction scores are averaged. We observed a larger benefit from ensembling when combining diverse models. To choose models, we follow a simple greedy approach: we start with the best-performing model and choose the next model by maximizing the GAP of the ensemble on the validation set. Our final ensemble used in the Youtube-8M challenge contains 25 models. An ensemble of only seven models is enough to reach the first place, with a GAP on the private test set of 84.688. These seven models are: Gated NetVLAD (256 clusters), Gated NetFV (128 clusters), Gated BoW (4096 clusters), BoW (8000 clusters), Gated NetRVLAD (256 clusters), GRU (2 layers, hidden size 1200) and LSTM (2 layers, hidden size 1024). Our code to reproduce this ensemble is available at: https://github.com/antoine77340/Youtube-8M-WILLOW. To obtain more diverse models for the final 25-model ensemble, we also added all the non-gated models and varied the number of clusters or the size of the pooled representation.
Table IV shows the ensemble sizes of the other top-ranked approaches, out of 655 teams, from the Youtube-8M Kaggle challenge. Besides achieving the best performance in the competition, our set of models also ensembles more efficiently: we need far fewer models in our ensemble than the other top-performing approaches. The full ranking can be found at: https://www.kaggle.com/c/youtube8m/leaderboard.
We have addressed the problem of large-scale video tagging and explored trainable variants of classical pooling methods (BoW, VLAD, FV) for the temporal aggregation of audio and visual features. In this context we have observed NetVLAD, NetFV and BoW to outperform more common temporal models such as LSTM and GRU. We have also introduced the Context Gating mechanism and have shown its benefit for the trainable versions of BoW, VLAD and FV. The ensemble of our individual models has been shown to improve the performance further, enabling our method to win the Youtube 8M Large-Scale Video Understanding challenge. Our TensorFlow toolbox LOUPE is available for download from  and includes implementations of the Context Gating as well as learnable pooling modules used in this work.
The authors would like to thank Jean-Baptiste Alayrac and Relja Arandjelović for valuable discussions as well as the Google team for providing the Youtube-8M Tensorflow Starter Code. This work has also been partly supported by ERC grants ACTIVIA (no. 307574) and LEAP (no. 336845), CIFAR Learning in Machines Brains program, ESIF, OP Research, development and education Project IMPACT No. CZ and a Google Research Award.
M. Baccouche, F. Mamalet, C. Wolf, C. Garcia, and A. Baskurt, “Sequential deep learning for human action recognition,”Human Behavior Understanding, pp. 29–39, 2011.
N. Passalis and A. Tefas, “Learning neural bag-of-features for large scale image retrieval,”IEEE Trans. Cybernetics, 2017.