## 1 Introduction

Many challenging problems have been studied in computer vision research toward video understanding, such as video classification [9, 13], video captioning [19, 20], video QA [21], and MovieQA [16], to name a few. YouTube-8M [1] is the largest video dataset for multi-label video classification. Its main problem is to predict the most relevant labels for a given video out of 4,716 predefined classes. Therefore, it requires jointly solving two important problems: video classification and multi-label classification.

From the view of video classification, YouTube-8M is challenging in that it covers more general classes like soccer, game, vehicle, and food, while existing video classification datasets focus on more specific class groups, such as sports in Sports-1M [9] and actions in UCF-101 [13]. Therefore, whereas modeling motion features is what matters in UCF-101 [13] or Sports-1M [9], in YouTube-8M it is important to capture more generic video information (*e.g*. temporal encoding method for video, audio feature modeling).

From the view of multi-label classification, the key issues to solve in YouTube-8M are label imbalances and correlations between labels. YouTube-8M involves 4,716 class labels, and the number of videos belonging to each class differs significantly, which causes a label imbalance issue in which the classifier fits to the biased data. At the same time, many classes are closely related to one another, such as {Football, Kick, Penalty kick, Indoor soccer} or {Super Mario Bros, Super Mario World, Super Mario bros 3, Mario Kart, Mario Kart 8}. It is also challenging to resolve the correlations between labels to decide the final prediction.

Based on the challenges of the multi-label video classification task on YouTube-8M described above, we focus on addressing i) temporal encoding for video, ii) relieving the label imbalance problem, and iii) utilizing the correlated label information. Our model consists of four components: i) video pooling layer, ii) classification layer, iii) label processing layer, and iv) loss function. The proposed components show significant performance improvement over the baseline models of YouTube-8M [1], and finally our ensemble model is ranked 8th in the Google Cloud & YouTube-8M Video Understanding Challenge (https://www.kaggle.com/c/youtube8m/leaderboard; team name: SNUVL X SKT).

## 2 The Model

Figure 1 shows the overall pipeline of our model. We first present the video features we use, and then explain the four key components of the model in the following sections.

### 2.1 Video Features

The inputs of the model are frame features and audio features of a video clip. The frame features are obtained by sampling a clip at 1-second intervals and extracting a 2,048-dimensional vector from every frame through the Inception-V3 network [15] pretrained on ImageNet [4]. The feature is then reduced to 1,024 dimensions via PCA (+ whitening), quantized, and finally L2-normalized. As a result, for a video that is $T$ seconds long, its frame feature is $x^{f} \in \mathbb{R}^{T \times 1024}$. The audio features are extracted using the VGG-inspired acoustic model [6] followed by L2-normalization, and are denoted by $x^{a} \in \mathbb{R}^{T \times 128}$.

As the input of our model, we concatenate the frame feature and the audio feature at every time step, denoted by $[x^{f}_t; x^{a}_t] \in \mathbb{R}^{1152}$. We test the compact bilinear pooling [5] with various output dimensions, but all of them have significantly lower performance than the simple concatenation. From now on, we use $X \in \mathbb{R}^{T \times 1152}$ to denote the input features of a video over all frames, and $X_t$ as the $t$-th frame vector.
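As a minimal sketch of this input construction (not the authors' code), the NumPy snippet below L2-normalizes and concatenates per-second frame and audio features. The function names `l2_normalize` and `build_input` and the random stand-in features are illustrative assumptions; PCA, whitening, and quantization are assumed to have been applied already.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector along `axis` to unit L2 norm."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

def build_input(frame_feats, audio_feats):
    """Concatenate per-second frame (T, 1024) and audio (T, 128) features."""
    assert frame_feats.shape[0] == audio_feats.shape[0], "one row per second"
    return np.concatenate(
        [l2_normalize(frame_feats), l2_normalize(audio_feats)], axis=1
    )

rng = np.random.default_rng(0)
T = 5  # toy video length in seconds
X = build_input(rng.normal(size=(T, 1024)), rng.normal(size=(T, 128)))
print(X.shape)  # (5, 1152)
```

Each row of `X` is one time step, so the pooling structures in the next section can treat it as a sequence of 1,152-dimensional vectors.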

### 2.2 Video Pooling Layer

The Video Pooling Layer $g_\theta$ is defined as a parametric function that encodes a sequence of $T$ feature vectors $X$ into a $d$-dimensional embedding vector. We test five different encoding structures as follows.

#### 2.2.1 A Variant of LSTM

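The Long Short Term Memory (LSTM) model [7] is one of the most popular frameworks for modeling sequence data. We use a variant of the LSTM as follows:

$$i_t = \sigma(W_i [X_t; h_{t-1}] + b_i) \tag{1}$$

$$f_t = \sigma(W_f [X_t; h_{t-1}] + b_f) \tag{2}$$

$$o_t = \sigma(W_o [X_t; h_{t-1}] + b_o) \tag{3}$$

$$g_t = \tanh(W_g [X_t; h_{t-1}] + b_g) \tag{4}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{5}$$

$$h_t = o_t \odot \tanh(c_t) \tag{6}$$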

where $t$ denotes each time step, $i_t, f_t, o_t$ are the input, forget, and output gates, and $c_t, h_t$ are the long-term and short-term memory, respectively.

The baseline model uses only the final hidden states of the LSTM (*i.e*. $c_T, h_T$), but we additionally exploit the following two states in order to extract as much information as possible from the LSTM: i) the state $M$: the summation of the input features over all time steps, and ii) the state $O$: the summation of the outputs of the LSTM over all time steps.
We concatenate $M$ and $O$ with $c_T$ and $h_T$.
That is, if the cell size of the LSTM is $d$, the output of the baseline that uses $c_T, h_T$ becomes a $2d$-dimensional vector,
whereas our model that additionally uses $M$ and $O$ outputs a $4d$-dimensional vector.
Experimentally, we choose the LSTM cell size $d$ = 1,152.
We also apply layer normalization [2] to each layer of the LSTM for fast convergence, and use dropout with dropout rate = 0.8 to increase the generalization capacity.
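The difference between the baseline encoding and the extended one can be sketched as follows. This is an illustrative NumPy example that assumes a standard LSTM cell with a single stacked weight matrix; the function name `lstm_pool`, the toy sizes, and the random weights are assumptions, and layer normalization and dropout are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pool(X, W, b, d):
    """Run a standard LSTM over X (T, n_in) and return both pooled encodings.

    W has shape (n_in + d, 4 * d) and b has shape (4 * d,).
    """
    h = np.zeros(d)          # short-term memory h_t
    c = np.zeros(d)          # long-term memory c_t
    O = np.zeros(d)          # running sum of the LSTM outputs
    for x_t in X:
        z = np.concatenate([x_t, h]) @ W + b
        i, f, o, g = np.split(z, 4)
        i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
        c = f * c + i * g
        h = o * np.tanh(c)
        O = O + h
    M = X.sum(axis=0)        # sum of the input features over time
    baseline = np.concatenate([c, h])        # 2d values
    ours = np.concatenate([c, h, M, O])      # 4d values when n_in == d
    return baseline, ours

rng = np.random.default_rng(0)
T, d = 6, 8                  # toy sizes; the text uses d = 1,152
X = rng.normal(size=(T, d))
W = rng.normal(scale=0.1, size=(2 * d, 4 * d))
b = np.zeros(4 * d)
baseline, ours = lstm_pool(X, W, b, d)
print(baseline.shape, ours.shape)
```

The extended vector is exactly $4d$ long only because the input dimension equals the cell size here, which matches the 1,152-dimensional input and cell size in the text.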

#### 2.2.2 CNNs

Convolutional neural networks (CNNs) are often used to jointly capture spatial information from images or video in many computer vision tasks.
That is, the convolution kernels generate output signals considering all the elements in the window together, and thus they effectively work with spatially or temporally sequential information (*e.g*. images, consecutive characters in NLP, and audio understanding).
As the second candidate of Video Pooling Layer, we use the CNN to capture temporal information of video as proposed in [10]:

$$c = \mathrm{ReLU}(\mathrm{conv}(X, w_{conv}, b_{conv})) \tag{7}$$

where conv(input, filter, bias) indicates a convolution layer with stride 1, and ReLU indicates the element-wise ReLU activation [12]. $w_{conv}$ is a two-dimensional convolution filter, $b_{conv}$ is a bias, and $c$ indicates the output of the convolution layer. Finally, we apply max-pooling over time to $c$, obtaining the $d$-dimensional encoding of $X$.

#### 2.2.3 Position Encoding

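We also test the Position Encoding scheme [14] that assigns different weights to each frame vector. That is, we define a matrix $L \in \mathbb{R}^{T \times d}$, and Position Encoding is simply defined as follows.

$$L_{ij} = (1 - i/T) - (j/d)(1 - 2i/T) \tag{8}$$

$$\mathrm{PE}(X)_i = L_i \odot X_i \tag{9}$$

where $L_i$ is the $i$-th row of $L$ and $\odot$ means element-wise multiplication. After applying Position Encoding, we use the summation of the resulting vectors as the output of $g_\theta$, which means the output is a $d$-dimensional vector where $d$ is 1,152.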
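A short sketch of this pooling, assuming the weighting takes the same form as the position encoding in End-to-end memory networks [14] with word positions replaced by frame indices. The function names are hypothetical and the tiny input is only for checking the arithmetic by hand.

```python
import numpy as np

def position_encoding_matrix(T, d):
    """Weights L[i, j] = (1 - i/T) - (j/d) * (1 - 2i/T) with 1-based i, j."""
    i = np.arange(1, T + 1)[:, None] / T   # frame position, shape (T, 1)
    j = np.arange(1, d + 1)[None, :] / d   # feature position, shape (1, d)
    return (1 - i) - j * (1 - 2 * i)

def position_encode(X):
    """Weight every frame vector element-wise, then sum over time."""
    T, d = X.shape
    return (position_encoding_matrix(T, d) * X).sum(axis=0)

L = position_encoding_matrix(2, 2)
encoding = position_encode(np.ones((2, 2)))
print(L)         # [[0.5 0.5] [0.5 1. ]]
print(encoding)  # [1.  1.5]
```

Because the weights depend on both the frame index and the feature index, two videos with the same frames in a different order produce different sums, which is what gives this otherwise order-free pooling some sensitivity to time.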

#### 2.2.4 Self-Attention: Indirect Clustering

The YouTube-8M dataset deals with general topics (*e.g*. soccer, game, car, animation) rather than relatively focused labels like Sports-1M or UCF-101.
We here test the following hypothesis: since the topic is highly general, it may be more advantageous to focus on the most dominant parts of the video rather than temporal/motion information of individual frames.
Therefore, we suggest an indirect clustering model using the self-attention mechanism as follows.

Suppose we perform clustering on the video features over all frames $X$
and find the cluster with the largest size (*i.e*. the largest number of elements in the cluster).
Then, the vectors in this cluster may represent the main scene of the video.
However, since it takes a very long time to perform clustering on each video, we propose a self-attention mechanism that acts like clustering as follows.

(10) |

where $a_t$ is a scalar value that indicates the soft attention to frame $t$, and the softmax is applied over all $T$ frames.
That is, the more similar the frame vector $X_t$ is to the other vectors, the higher its value $a_t$ is (*i.e*. it is more likely to be the main scene of the video).
Finally, the frame encoding is simply obtained by a weighted sum of frame vectors by the attention values: $I = \sum_{t=1}^{T} a_t X_t$.
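One way to realize such an attention is sketched below. The similarity measure used here (the mean dot product of a frame with all frames) is an assumption chosen to match the description, not necessarily the paper's exact scoring function, and `attention_pool` is a hypothetical name.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(X):
    """Return per-frame attention weights and the weighted sum of frames."""
    similarity = X @ X.T               # (T, T) pairwise dot products
    scores = similarity.mean(axis=1)   # how similar each frame is to the rest
    a = softmax(scores)                # soft attention over the T frames
    return a, a @ X

# Three identical "main scene" frames and one outlier frame.
X = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
a, encoding = attention_pool(X)
print(a)         # the three matching frames receive the larger weights
print(encoding)
```

Frames that belong to the dominant group reinforce each other's scores, so the weighted sum leans toward the largest cluster without running an explicit clustering step.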

#### 2.2.5 Adaptive Noise

Each of the 4,716 classes in the YouTube-8M dataset has a different number of video examples. For example, the Car class has about 800,000 examples, but the Air Gear class has only 101 examples. Let $n(x)$ be the number of labels associated with a video $x$; we introduce the adaptive noise structure to relieve the label imbalance problem as follows.

(11) | |||

(12) |

where the noise $Z$ is sampled from a normal distribution, $y_i$ denotes the $i$-th label of the video, and $S(y_i)$ is the number of video examples that have label $y_i$. It means that we increase the generalization for small classes by adding more noise to their frame vectors. We then sum the vectors of all frames as the output of $g_\theta$, as done in the position encoding.

### 2.3 Classification Layer

The Classification Layer $h_\theta$ is defined as follows.

$$h_\theta : \mathbb{R}^{d} \rightarrow \mathbb{R}^{4716} \tag{13}$$

That is, by default, $h_\theta$ takes the frame encoding $I$ as an input and outputs the score $O_h$ for the 4,716 classes. As an exception, as shown in Figure 1, the Many-to-Many model takes $X$ as its input, where $X$ is untouched by the video pooling layer $g_\theta$. We have conducted experiments on the following three structures in this component (see Figure 3(a)).

#### 2.3.1 Many-to-Many

Unlike the other models, the Many-to-Many model takes the frame vectors $X$, which are not touched by the video pooling layer $g_\theta$. It uses an LSTM similar to that of Section 2.2.1, but it computes a score by attaching a fully-connected layer to the output of each LSTM step and averages these scores, which gives the output $O_h$. Since this model averages the scores over frames, it has the temporal encoding ability of the RNN, and in addition the scores drawn from more frequently appearing frames are reflected more strongly. As a result, more effective video encoding could be performed.

#### 2.3.2 Variants of Mixture of Experts

The Mixture of Experts (MoE) [8] model is a binary classifier that adaptively takes into account the scores of several experts corresponding to a class. For one class, each expert $e_k$ has a probability value between 0 and 1, and the gate $g_k$ represents the weight for each expert. They are defined as follows.

$$e_k = \sigma(w_{e_k}^{\top} I + b_{e_k}) \tag{14}$$

$$g_k = \mathrm{softmax}(w_{g_k}^{\top} I + b_{g_k}) \tag{15}$$

where $\sigma$ is the sigmoid function, $w_{e_k}$, $w_{g_k}$ are $d$-dimensional vectors, the scalars $b_{e_k}$, $b_{g_k}$ are biases, and the softmax is performed over $\{g_1, \dots, g_E\}$ ($E$ is the number of experts). We extend this MoE model to multiple layers by stacking fully-connected layers that define the probability and the gate distribution of each expert. For example, the gate distribution and the expert distribution of the 2-layer MoE model are defined as follows.

(16) | ||||

(17) |

where each weight matrix , is denoted by , and each , means -dimensional vector. are biases. Finally, the score for i-th class is determined by the weighted sum of each expert distribution; .
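For the single-layer case, the per-class score (a gate-weighted sum of expert probabilities) can be sketched in NumPy as follows. The tensor layout `(classes, experts, d)`, the name `moe_scores`, and the random weights are illustrative assumptions; the multi-layer variant would insert additional fully-connected layers before these projections.

```python
import numpy as np

def moe_scores(I, W_g, b_g, W_e, b_e):
    """Single-layer MoE scores for all classes.

    I: (d,) frame encoding. W_g, W_e: (C, E, d). b_g, b_e: (C, E).
    Returns a (C,) vector with one score per class.
    """
    gate_logits = W_g @ I + b_g                                  # (C, E)
    gate_logits = gate_logits - gate_logits.max(axis=1, keepdims=True)
    gates = np.exp(gate_logits)
    gates = gates / gates.sum(axis=1, keepdims=True)             # softmax over experts
    experts = 1.0 / (1.0 + np.exp(-(W_e @ I + b_e)))             # sigmoid per expert
    return (gates * experts).sum(axis=1)

rng = np.random.default_rng(0)
d, C, E = 16, 5, 2           # toy sizes; the task has 4,716 classes
I = rng.normal(size=d)
scores = moe_scores(I, rng.normal(size=(C, E, d)), np.zeros((C, E)),
                    rng.normal(size=(C, E, d)), np.zeros((C, E)))
# With all-zero weights every gate is 1/E and every expert outputs 0.5.
uniform = moe_scores(I, np.zeros((C, E, d)), np.zeros((C, E)),
                     np.zeros((C, E, d)), np.zeros((C, E)))
print(scores.shape, uniform)
```

Since the gates of one class sum to one, each score is a convex combination of sigmoid outputs and therefore stays between 0 and 1.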

#### 2.3.3 Multi Layer Perceptron

The Multi Layer Perceptron model is a stack of fully-connected layers, one of the most basic neural network structures. We experimentally set the number of layers to 3 and apply Layer Normalization [2] to each layer.

### 2.4 Label Processing Layer

The Label Processing Layer $C_\theta$ is defined as follows.

$$C_\theta : \mathbb{R}^{4716} \rightarrow \mathbb{R}^{4716} \tag{18}$$

This component is designed to reflect the correlation between the labels in the model. For example, YouTube-8M includes groups of closely related labels such as {Soccer, Football, Kick, Indoor soccer} and {Super Mario Bros, Super Mario World, Super Mario bros 3, Mario Kart, Mario Kart 8}. In order to take advantage of this property, we set up the label correlation matrix $M_c$ as follows by counting all the videos in the training set (see Figure 3(b)).

(19) |

where $M_c$ is the correlation matrix, and the correlation value between the $i$-th label and the $j$-th label is higher the more often the two labels appear together in the same video. Then, a new score is defined as follows to better reflect the correlation between labels through a simple linear combination of matrix-vector multiplications.

$$O_c = \alpha \cdot O_h + \beta \cdot M_c O_h + \gamma \cdot M_c' O_h \tag{20}$$

Here, $M_c$ is used as a fixed value, and $M_c'$ is a trainable parameter initialized to the same value as $M_c$. The scalar values $\alpha, \beta, \gamma$ are hyperparameters of the model.
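The score update can be illustrated with a toy example. The co-occurrence counting below uses row normalization with a zeroed diagonal, which is an assumption (the exact normalization is not specified here), and the names `cooccurrence_matrix` and `label_processing` are hypothetical. The trainable matrix is simply a copy of the fixed one, as it would be at initialization.

```python
import numpy as np

def cooccurrence_matrix(labels):
    """Row-normalized label co-occurrence from multi-hot labels (N, C)."""
    counts = labels.T @ labels                 # (C, C) joint counts
    np.fill_diagonal(counts, 0.0)              # ignore self co-occurrence
    row_sum = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sum, 1.0)   # rows with no partner stay zero

def label_processing(O_h, M_fixed, M_train, alpha, beta, gamma):
    """Linear combination of the raw scores and correlation-propagated scores."""
    return alpha * O_h + beta * (M_fixed @ O_h) + gamma * (M_train @ O_h)

# Labels 0 and 1 always appear together; label 2 appears alone.
labels = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
M_c = cooccurrence_matrix(labels)
O_h = np.array([0.9, 0.1, 0.2])
O_c = label_processing(O_h, M_c, M_c.copy(), alpha=1.0, beta=0.1, gamma=0.0)
print(O_c)  # approximately [0.91 0.19 0.2]
```

In this example the weak score of label 1 is raised by its frequent partner label 0, while the unrelated label 2 is unchanged.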

### 2.5 Loss Function

#### 2.5.1 Center Loss

Center Loss was first proposed for the face recognition task [18] and has been extended to other fields because of its effectiveness in producing discriminative embedding features [17]. The purpose of the center loss is to minimize intra-class variations while maximizing inter-class variations using the joint supervision of cross-entropy loss and center loss. The original center loss was used for the single-label classification problem, and it is hard to exploit in a multi-label classification problem. If we convert the multi-label problem into a single-label problem as in [3], assigning one center to each combination of labels is the simplest extension of center loss to multi-label classification. However, this simple extension is not suitable for YouTube-8M, because the number of label combinations is too large to handle. Therefore, we modify the center loss to suit the multi-label classification problem as follows.

$$\mathcal{L}_{c} = \frac{1}{N} \sum_{k=1}^{N} \lVert e - c_k \rVert_2^2 \tag{21}$$

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{c} \tag{22}$$

where $N$ denotes the number of labels in one video, $e$ denotes the embedding vector from the penultimate layer, and $c_k$ denotes the class center corresponding to the $k$-th label of the video. $\mathcal{L}_{CE}$ is the cross-entropy loss and $\mathcal{L}_{c}$ is the center loss. The scalar $\lambda$ is a hyperparameter for balancing the two loss functions.
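A minimal sketch of a multi-label center loss along these lines: the embedding of a video is pulled toward the center of every label it carries. Averaging the squared distances over the video's labels is an assumption about the normalization, and the function name, the toy centers, and the values of `lam` and `ce_loss` are hypothetical.

```python
import numpy as np

def multilabel_center_loss(e, centers, label_ids):
    """Mean squared distance from embedding e (d,) to the centers of its labels."""
    diffs = centers[label_ids] - e             # (N, d), one row per label
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

e = np.array([1.0, 0.0])                        # penultimate-layer embedding
centers = np.array([[1.0, 0.0], [0.0, 0.0], [3.0, 0.0]])  # one center per class
label_ids = np.array([0, 2])                    # this video has labels 0 and 2
center_loss = multilabel_center_loss(e, centers, label_ids)

lam = 0.1                                       # balancing hyperparameter
ce_loss = 0.5                                   # stand-in cross-entropy value
total_loss = ce_loss + lam * center_loss
print(center_loss, total_loss)
```

Keeping one center per class, rather than one per label combination, means the number of centers grows with the 4,716 classes and not with the number of label subsets.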

#### 2.5.2 Pseudo-Huber Loss

The Huber Loss is a combination of L2 Loss and L1 Loss, which allows the model to be trained more robustly on noisy instances. In the case of YouTube-8M, there are classes with very few instances due to the label imbalance problem, and we use the Huber Loss to better learn instances belonging to these classes. For a simple, differentiable form, we use the Pseudo-Huber Loss function, a smooth approximation of the Huber Loss, as follows.

$$\mathcal{L}_{huber} = \delta^2 \left( \sqrt{1 + (\mathcal{L}_{CE} / \delta)^2} - 1 \right) \tag{23}$$

where $\mathcal{L}_{CE}$ is the cross-entropy loss between our prediction $\hat{y}$ and the ground-truth label $y$, and $\delta$ is a hyperparameter of the model.
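The behavior of the Pseudo-Huber transform is easy to check numerically. The sketch below uses the standard Pseudo-Huber form applied to a scalar loss value; the name `pseudo_huber` and the sample values are illustrative.

```python
import numpy as np

def pseudo_huber(loss, delta):
    """Smooth approximation of the Huber loss applied to a loss value."""
    return delta ** 2 * (np.sqrt(1.0 + (loss / delta) ** 2) - 1.0)

small = pseudo_huber(0.01, delta=1.0)    # close to 0.01 ** 2 / 2 (quadratic regime)
large = pseudo_huber(100.0, delta=1.0)   # close to 100 - 1 (linear regime)
print(small, large)
```

For small losses the transform is roughly quadratic, and for large losses it grows only linearly, so a few very badly fit (possibly mislabeled) examples contribute less to the gradient than they would under a squared penalty.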

### 2.6 Training

For training our model, we choose the Adam [11] optimizer using a batch size of 128, with learning rate = 0.0006, , , . We also apply learning rate decay with the rate =

for every 1.5M iterations. We train the model for 5 epochs with no early stopping. We use both official training and validation data that YouTube-8M public dataset provides for training our models.

## 3 Experiments

We use the test data from the Kaggle competition Google Cloud & YouTube-8M Video Understanding Challenge to measure the performance of the model.
The source code of our model is publicly available (https://github.com/seilna/youtube-8m).

### 3.1 Experimental Setting

We identify three issues in the YouTube-8M classification task: i) temporal encoding for video, ii) the label imbalance problem, and iii) correlation between labels. To address them, we divide the model pipeline into four components, i) Frame Encoding (Video Pooling Layer), ii) Classification Layer, iii) Label Processing Layer, and iv) Loss Function, and try several variations of each.
In addition, except for the Many-to-Many model, each component in our pipeline is completely independent of the others, so the structure is well suited to experimenting with many combinations of variants.
However, testing all combinations by brute force is infeasible, because the number of combinations of the variations across components is too large.
Thus, we take a greedy approach: for each component, we fix the remaining three components, experiment with several structures only for that component, and choose the best-performing structure.
We use Global Average Precision (GAP)@20 (https://www.kaggle.com/c/youtube8m#evaluation) as the metric to measure the performance of the model.
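For reference, GAP@20 can be sketched as follows: the top 20 predictions of every video are pooled into one list, sorted by confidence, and scored with average precision. This is an illustrative reimplementation, not the official evaluation code; the name `gap_at_k` is hypothetical, and using the total number of ground-truth labels as the recall denominator is an assumption.

```python
import numpy as np

def gap_at_k(scores, labels, k=20):
    """Global Average Precision over the top-k predictions of each video.

    scores: (N, C) confidences. labels: (N, C) binary ground truth.
    """
    n_videos, n_classes = scores.shape
    k = min(k, n_classes)
    top = np.argsort(-scores, axis=1)[:, :k]          # top-k classes per video
    rows = np.arange(n_videos)[:, None]
    conf = scores[rows, top].ravel()                  # pooled confidences
    hit = labels[rows, top].ravel().astype(float)     # 1.0 where prediction is correct
    hit = hit[np.argsort(-conf, kind="stable")]       # global sort by confidence
    total_positives = labels.sum()
    if total_positives == 0:
        return 0.0
    precision = np.cumsum(hit) / np.arange(1, hit.size + 1)
    return float((precision * hit).sum() / total_positives)

scores = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])
labels = np.array([[1, 0, 1], [0, 1, 0]])
perfect = gap_at_k(scores, labels, k=2)                       # 1.0
half = gap_at_k(np.array([[0.1, 0.9]]), np.array([[1, 0]]), k=2)  # 0.5
print(perfect, half)
```

Because all videos share one ranked list, the metric rewards confidences that are calibrated across videos, not just a correct ordering within each video.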

Method | GAP@20 |
---|---|
`LSTM` | 0.811 |
`LSTM-M` | 0.815 |
`LSTM-M-O` | 0.820 |
`LSTM-M-O-LN` | 0.815 |
`CNN-64` | 0.704 |
`CNN-256` | 0.753 |
`CNN-1024` | - |
`Position Encoding` | 0.782 |
`Indirect Clustering` | 0.801 |
`Adaptive Noise` | 0.782 |

### 3.2 Quantitative Results

#### 3.2.1 Video Pooling Layer

There are five structures for the Video Pooling Layer $g_\theta$: i) variants of LSTM, ii) CNN, iii) Position Encoding, iv) Indirect Clustering, and v) Adaptive Noise. To select the most suitable structure for $g_\theta$, we fix the classification layer $h_\theta$ to the MoE-2 model, do not use the label processing layer $C_\theta$, and use the cross entropy loss. Based on these settings, the results for each of the five structures are shown in table 1.

`LSTM` simply uses the last hidden state, `LSTM-M` additionally concatenates $M$, `LSTM-M-O` concatenates both $M$ and $O$, and `LSTM-M-O-LN` is a model that additionally applies Layer Normalization to each layer of the LSTM.
`CNN-64`, `CNN-256`, and `CNN-1024` refer to models with 64, 256, and 1024 output channels of the CNN, respectively.

The results show that the LSTM method has the best performance for encoding frames. Within the LSTM variants, the more internal information is used, the higher the performance, which suggests that further gains can be expected if more information is extracted by other means, such as skip connections in the LSTM. On the other hand, contrary to our expectation, the LSTM with Layer Normalization performs worse than the one without it. Layer Normalization did give stable and fast convergence even when a 20-30 times larger learning rate was applied, but its final performance was lower.

CNN showed unexpectedly poor performance. Performance increased with the number of output channels, but we could not experiment with more than 256 channels because of GPU memory limitations. While we can expect CNN to perform better with more channels, another problem with CNN is the dramatic increase in computation cost as channels increase. However, if CNNs with different hyperparameters reach performance similar to the LSTM, the CNN is likely to be a good pooling method because the convolution operation is fully parallelizable.

The Position Encoding model showed lower performance than LSTM, and one of the possible reasons is that the sequence modeling power of the model is weaker than that of LSTM.

The Indirect Clustering model has lower performance than LSTM, but it performs better than the Position Encoding model. This suggests that our assumption (the importance of the main scene in video classification) is not entirely wrong, and that a model that handles this issue more delicately is needed. Also, the temporal encoding of the LSTM is as important as considering the main scene in video classification.

Adaptive Noise does not lead to a significant performance improvement, indicating that a more sophisticated approach to the label imbalance problem is required.

Method | GAP@20 |
---|---|
`Many-to-Many` | 0.791 |
`2 Layer MoE-2` | 0.424 |
`2 Layer MoE-16` | 0.421 |
`3 Layer MLP-4096` | 0.802 |
`3 Layer MLP-4096-LN` | 0.809 |

#### 3.2.2 Classification Layer

The Classification Layer $h_\theta$ has three variants: i) Many-to-Many, ii) Multi-Layer MoE, and iii) MLP. To select the best method for $h_\theta$, we fix the video pooling layer $g_\theta$ to Indirect Clustering, do not use the label processing layer $C_\theta$, and use the cross entropy loss. (However, in the Many-to-Many model, $g_\theta$ is not used by definition.) Based on these settings, the results for each of the three structures are shown in Table 3.

`2 Layer MoE-2` refers to a 2-layer MoE model with 2 experts, and `2 Layer MoE-16` refers to a 2-layer MoE model with 16 experts.
`3 Layer MLP-4096` is an MLP structure with 3 hidden layers of 4096 dimensions each, and `3 Layer MLP-4096-LN` applies layer normalization to each layer.

In constructing the classification layer, the MLP structure showed the best performance among the three methods. The Many-to-Many model is an LSTM-based model, but has a lower performance than the basic LSTM structure in Table 2, indicating that the Many-to-Many framework is not always a good choice in video classification.

On the other hand, disappointingly, the Multi Layer MoE model showed severe overfitting with only two layers. A likely reason is that creating an intermediate embedding requires many parameters, and these are needed for every expert, so the large parameter count leads to overfitting. We experimented with different numbers of layers and hidden layer dimensions for the MLP structure, and 3 layers of 4096 dimensions showed the highest performance. Notably, when layer normalization was applied, the MLP model showed improved performance, unlike the LSTM, and did not require more delicate hyperparameter tuning.

#### 3.2.3 Label Processing Layer

Method | GAP@20 |
---|---|
`MoE - (1.0, 0.3, 0.0)` | 0.784 |
`MoE - (1.0, 0.1, 0.0)` | 0.787 |
`MoE - (1.0, 0.0, 0.1)` | 0.788 |
`MoE - (1.0, 0.01, 0.0)` | 0.790 |
`MoE - (1.0, 0.0, 0.01)` | 0.790 |
`MoE - (1.0, 0.01, 0.01)` | 0.788 |

To select the best method for $C_\theta$, we fix $g_\theta$ to Indirect Clustering, use the MoE-16 model for $h_\theta$, and use the cross entropy loss. As defined in Equation 20, the Label Processing Layer uses the pre-computed correlation matrix $M_c$. The $\alpha$, $\beta$, and $\gamma$ values control the degree to which the correlation matrix affects the class score. The results are shown in Table 4, where each method is labeled by its ($\alpha$, $\beta$, $\gamma$) setting.

As a result, performance dropped for all $\beta$ and $\gamma$ values greater than zero (the same configuration without the label processing layer reaches 0.798 in Table 5). In other words, using the label correlation matrix to update the scores lowered the classification performance.

A possible reason for this result is that our model is too naive to reflect the label correlation prior and that label correlation in the YouTube-8M dataset is not strong enough to improve the performance of the classification task.

#### 3.2.4 Loss Function

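Method | GAP@20 |
---|---|
$\mathcal{L}_{CE}$ | 0.798 |
$\mathcal{L}_{CE}$ + $\mathcal{L}_{c}$ () | 0.799 |
$\mathcal{L}_{huber}$ () | 0.803 |
$\mathcal{L}_{huber}$ () | 0.801 |
$\mathcal{L}_{huber}$ () | 0.798 |
$\mathcal{L}_{huber}$ () | 0.794 |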

To select the most suitable loss $\mathcal{L}$, we fix $g_\theta$ to Indirect Clustering, use the MoE-16 model for $h_\theta$, and do not use $C_\theta$.

Table 5 shows the performance depending on whether or not the center loss and the Pseudo-Huber loss are used. $\mathcal{L}_{CE}$ is the cross entropy loss, $\mathcal{L}_{c}$ is the center loss, and $\mathcal{L}_{huber}$ is the Pseudo-Huber loss applied to the cross entropy term.

The results show that, first, using the center loss term gives little performance gain.
Since we apply the center loss term in a multi-label setting, performance should improve if the labels are strongly correlated with each other.
This gives another insight into the label correlation prior: the correlation is not strong enough to yield a performance improvement.
Second, the Pseudo-Huber loss gave a relatively clear performance improvement, which indicates that it can cope to some extent with the noisy data produced by the YouTube video annotation system (https://www.youtube.com/watch?v=wf_77z1H-vQ).

### 3.3 Ensemble Model

Based on the experiments in Section 3.2, we recorded a test performance of 0.820 with a single model combining LSTM-M-O, MoE-2, and Huber loss. (Due to GPU memory limitations, we could not apply the MoE-16 or multi-layer MLP model to the LSTM-based model.) In addition, our ensemble model, which is a simple average of the scores of several models, showed a test performance of 0.839 and ranked 8th in the Kaggle challenge. Interestingly, when we ensemble several models, combining only the better single models does not give a significant performance increase. Rather, to increase the ensemble's performance, the incorporated models should be as diverse as possible.
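The ensembling step itself is a plain average of per-class scores. The sketch below shows it for two hypothetical models; the name `ensemble_average` and the score values are illustrative.

```python
import numpy as np

def ensemble_average(score_list):
    """Average the (N, C) score matrices of several models element-wise."""
    return np.mean(np.stack(score_list, axis=0), axis=0)

model_a = np.array([[0.9, 0.2], [0.1, 0.6]])   # scores from one model
model_b = np.array([[0.7, 0.4], [0.3, 0.8]])   # scores from a different model
averaged = ensemble_average([model_a, model_b])
print(averaged)  # approximately [[0.8 0.3] [0.2 0.7]]
```

Averaging helps most when the models make different errors, which is consistent with the observation above that diversity mattered more than the strength of the individual models.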

## 4 Conclusion

We defined three issues that need to be covered in order to solve the YouTube-8M video classification task, divided the model pipeline into four components, and experimented with various structures to address the issues in each component. As a result, almost all of the modified structures we tried performed better than the baseline, and their ensemble model recorded a test performance of 0.839 and ranked 8th in the Kaggle challenge. We also provided insights into the structures we tried for each component, the role each structure plays, and why it works well or poorly. Based on these insights, we will explore better frame encoding methods and more elegant label correlation priors as future work.

## References

- [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
- [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [3] A. C. de Carvalho and A. A. Freitas. A tutorial on multi-label classification techniques. In Foundations of Computational Intelligence. Springer, 2009.
- [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
- [5] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
- [6] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. Cnn architectures for large-scale audio classification. arXiv preprint arXiv:1609.09430, 2016.
- [7] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.
- [8] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the em algorithm. Neural computation, 1994.
- [9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
- [10] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
- [11] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [12] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.
- [13] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [14] S. Sukhbaatar, J. Weston, R. Fergus, et al. End-to-end memory networks. In NIPS, 2015.
- [15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- [16] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. Movieqa: Understanding stories in movies through question-answering. In CVPR, 2016.
- [17] H. Wang, Z. Li, X. Ji, and Y. Wang. Face r-cnn. arXiv preprint arXiv:1706.01061, 2017.
- [18] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
- [19] J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, 2016.
- [20] Y. Yu, H. Ko, J. Choi, and G. Kim. Video captioning and retrieval models with semantic attention. arXiv preprint arXiv:1610.02947, 2016.
- [21] L. Zhu, Z. Xu, Y. Yang, and A. G. Hauptmann. Uncovering temporal context for video question and answering. arXiv preprint arXiv:1511.04670, 2015.
