1 Introduction
An enormous amount of video content is generated around the world every day. As an important research topic in computer vision, video analysis has many applications such as recommendation, search, and ranking. Recently, the video classification problem has gained interest, with a broad range of applications such as emotion recognition [2], human activity understanding [3], and event detection [4].
The YouTube-8M dataset [5] released by Google AI consists of over 6 million YouTube videos with 2.6 billion audio and visual features, annotated with 3,700+ visual entities at an average of 3.0 labels per video. Each video was decoded at 1 frame per second for up to the first 360 seconds, after which features were extracted via a pretrained model. PCA and quantization were further applied to reduce dimensionality and data size. Visual features of 1024 dimensions and audio features of 128 dimensions were extracted from each frame as input for downstream classifiers.
Following the first YouTube-8M Kaggle competition, the second one focuses on developing compact models, no greater than 1 GB uncompressed, so that they can run on users' personal mobile phones for personalized and privacy-preserving computation. Challenges in the competition include modeling correlations between labels, handling multiple sequential frame-level features, and efficient model compression.
In the competition, Global Average Precision (GAP) at 20 is used as the metric. For each video, the model predicts its 20 most confident labels with their associated confidences (probabilities). The list of (label, confidence) tuples from all videos is sorted by confidence in descending order. GAP is then computed as:

$$\mathrm{GAP} = \sum_{i=1}^{N} p(i)\,\Delta r(i), \qquad (1)$$

where $N$ is the total number of tuples, $p(i)$ is the precision at rank $i$, and $\Delta r(i)$ is the change in recall at rank $i$.
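As a concrete reference, the metric can be computed from per-video predictions as in the sketch below (the function name and toy inputs are ours, not from the competition starter code):

```python
import numpy as np

def gap_at_k(predictions, labels, k=20):
    """Global Average Precision: pool the top-k (confidence, hit)
    tuples from every video, sort by confidence, then accumulate
    precision * recall-increment down the ranked list."""
    tuples = []           # (confidence, is_correct) across all videos
    total_positives = 0   # total number of ground-truth labels
    for scores, positives in zip(predictions, labels):
        total_positives += len(positives)
        top = np.argsort(scores)[::-1][:k]
        tuples += [(scores[i], i in positives) for i in top]
    tuples.sort(key=lambda t: -t[0])
    gap, correct = 0.0, 0
    for rank, (_, hit) in enumerate(tuples, start=1):
        if hit:
            correct += 1
            gap += (correct / rank) * (1.0 / total_positives)  # p(i) * dr(i)
    return gap

# toy check: one video with two true labels among four classes,
# both ranked at the top, giving a perfect GAP
gap = gap_at_k([np.array([0.9, 0.1, 0.8, 0.2])], [{0, 2}])
```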
A common approach to video analysis is to extract features from consecutive frames, followed by feature aggregation. Frame-level feature extraction can be achieved by applying pretrained Convolutional Neural Networks (CNNs) [6, 7, 8, 9]. Common methods for temporal frame feature aggregation include Bag-of-visual-words [10, 11], Fisher Vectors [12], CNNs [13], Gated Recurrent Units (GRU) [14], Long Short-Term Memory (LSTM) [15], and the Generalized Vector of Locally Aggregated Descriptors (NetVLAD) [16].

It is well known that neural networks are memory intensive, and deploying such models on mobile devices is difficult due to limited hardware resources. Several approaches have been proposed to tackle this difficulty. A straightforward way is to apply tensor decomposition techniques [17, 18, 19] to a pretrained CNN model [20]. Network pruning removes low-weight connections in pretrained models [21], or gradually trains binary masks and weights until a target sparsity is reached [22]. Network quantization compresses a network by reducing the number of bits required to represent each weight, via weight sharing [23] or vector quantization [24]. Another way to get better performance from a limited-size model is knowledge distillation [25, 26], in which a small student network is trained to imitate the soft output of a larger teacher network or an ensemble of networks.

In the first Google Cloud & YouTube-8M Video Understanding Challenge Kaggle competition, top participants [1, 27, 28, 29, 30] trained models such as Gated NetVLAD, GRU, and LSTM with attention. To leverage the predictability of single models, they averaged checkpoints at different training steps and ensembled predicted probability scores by weighted average, bagging, or boosting.
Our contribution in this paper is threefold. First, we explore the size and performance of Gated NetVLAD under different sets of hyperparameters (cluster size and hidden size). Second, we develop an ensemble approach that combines multiple models in one TensorFlow graph, which avoids in-place modification of the graph. Third, we cast trained weight tensors from float32 to float16 at the evaluation and inference stage, which reduces model size by roughly half without sacrificing performance.
The rest of the paper is organized as follows. Section 2 presents our model architecture with compression and ensemble approaches. Section 3 reports experimental results, followed by conclusions in Section 4.
2 Approach
In this section, we first introduce the Gated NetVLAD model architecture, then describe the model compression approaches we tried, followed by the ensemble approaches we developed.
2.1 Frame Level Models
Our architecture for video classification is illustrated in Fig. 1.
2.1.1 NetVLAD
NetVLAD [16] is a trainable, generalized VLAD layer that captures statistics of local descriptors over the image, i.e., the sum of residuals for each visual word. More specifically, let the $i$-th descriptor (video or audio feature) of a video be $x_i$, which can be assigned to one of $K$ clusters with centroids $c_k$ for $k = 1, \dots, K$. NetVLAD can be written as a summation of residuals with soft assignment:

$$V(j,k) = \sum_{i=1}^{N} \frac{e^{w_k^{T} x_i + b_k}}{\sum_{k'} e^{w_{k'}^{T} x_i + b_{k'}}}\left( x_i(j) - c_k(j) \right), \qquad (2)$$

where $\{w_k\}$, $\{b_k\}$, and $\{c_k\}$ are sets of trainable parameters for each cluster $k$. The number of clusters $K$ (referred to as the cluster size) is a hyperparameter we vary across different models.
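The aggregation in Eq. (2) can be sketched in a few lines of NumPy. This is a plain illustration of the math only; in the model it is a trainable TensorFlow layer, and the shapes and names here are ours:

```python
import numpy as np

def netvlad(X, centroids, W, b):
    """NetVLAD aggregation of N local descriptors X (N x D) into a
    K x D matrix of soft-assigned residuals, following Eq. (2).
    centroids: K x D, W: D x K, b: K -- all trainable in practice."""
    logits = X @ W + b                        # N x K
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)         # soft assignment a_k(x_i)
    # V[k, j] = sum_i a[i, k] * (X[i, j] - centroids[k, j])
    V = a.T @ X - a.sum(axis=0)[:, None] * centroids
    return V  # typically intra-normalized and flattened before the FC layer

rng = np.random.default_rng(0)
N, D, K = 300, 1024, 24   # e.g. 300 frames, 1024-d visual features, 24 clusters
V = netvlad(rng.normal(size=(N, D)), rng.normal(size=(K, D)),
            rng.normal(size=(D, K)), np.zeros(K))
```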
2.1.2 Fully Connected Layer
Our FC block consists of two layers. The first layer takes the concatenated video and audio VLAD descriptors as input and multiplies them by a weight matrix to produce a hidden representation of dimension equal to the hidden size, followed by batch normalization and ReLU activation. The second layer takes the output of the first layer as input, multiplies it by a second weight matrix, and adds a bias term.

2.1.3 Context Gating
Context Gating (CG) [1] is a learnable nonlinear unit aiming to model interdependencies among network activations by gating. Concretely, CG transforms an input vector $X$ into an output vector $Y$ of the same dimension via

$$Y = \sigma(W X + b) \circ X, \qquad (3)$$

where $W$ and $b$ are trainable parameters, $\sigma$ is the element-wise sigmoid activation, and $\circ$ is the element-wise multiplication. CG is known for its ability to capture dependencies among features.
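A minimal sketch of Eq. (3) (the function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def context_gating(x, W, b):
    """Context Gating, Eq. (3): y = sigmoid(W x + b) * x (element-wise).
    The learned gates rescale each activation in [0, 1] based on the
    full input vector, modeling dependencies among features."""
    return sigmoid(W @ x + b) * x

# with zero weights every gate is sigmoid(0) = 0.5, halving the input
y = context_gating(np.array([2.0, 4.0]), np.zeros((2, 2)), np.zeros(2))
```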
2.1.4 Mixture of Experts
Mixture of Experts (MoE) [31] consists of two parts: gating and experts. The final prediction is the sum over experts of each expert's output multiplied by its gate activation. We use a mixture of 5 one-hidden-layer experts. The MoE is further followed by CG to model dependencies among the video vocabulary labels.
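A rough NumPy sketch of this MoE head, assuming per-class gates softmaxed over the 5 experts and sigmoid expert outputs (a simplification of the actual implementation; the names and shapes are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def moe_predict(x, gate_W, expert_W):
    """Per-class mixture of experts: softmax gates weight the sigmoid
    outputs of each expert. gate_W, expert_W: (num_mixtures, C, D)."""
    gates = np.einsum('mcd,d->mc', gate_W, x)        # (M, C) gate logits
    gates = np.exp(gates - gates.max(axis=0))
    gates /= gates.sum(axis=0)                       # softmax over experts
    experts = sigmoid(np.einsum('mcd,d->mc', expert_W, x))
    return (gates * experts).sum(axis=0)             # (C,) label probabilities

rng = np.random.default_rng(1)
M, C, D = 5, 10, 32   # 5 experts, toy vocabulary of 10 classes, 32-d input
p = moe_predict(rng.normal(size=D), rng.normal(size=(M, C, D)),
                rng.normal(size=(M, C, D)))
```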
2.1.5 Training
We use two local GPUs, an Nvidia 1080 Ti and an Nvidia 1060, and two Google Cloud Platform accounts with Tesla K80 GPUs to train single models. Training takes two to three days per model for 200k steps with batch sizes between 80 and 160.
The YouTube-8M dataset is partitioned into three parts: Train, Validate, and Test. Both Train and Validate data come with labels, so they are effectively interchangeable. To maximize the number of training samples and to speed up the evaluation step, we randomly chose 60 of the 3,844 validate files as our validation set, and combined the remaining validate files with the official train data as our training set. We observed a constant delta between GAP on our validation set and on the public leaderboard.
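The split can be sketched as follows (the file-name pattern and seed are hypothetical; only the 60-of-3,844 split is from the paper):

```python
import random

# Hold out 60 of the 3,844 validate shards for fast local validation
# and fold the remaining shards into the training set.
validate_files = [f"validate{i:04d}.tfrecord" for i in range(3844)]
random.seed(2018)
local_validate = set(random.sample(validate_files, 60))
extra_train = [f for f in validate_files if f not in local_validate]
```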
2.2 Compression
2.2.1 Float16 Compression
Of the compression methods we attempted, this is the only one that worked. The idea is to train all models in the default float32 precision to achieve the maximum score, then, at the evaluation/inference stage, cast all inference weight tensors to float16 precision, cutting model size in half while preserving prediction accuracy.
In the actual implementation, we only cast 4 of the 32 weight tensors in our model architecture, because the 4 largest tensors (‘tower/experts/weights’, ‘tower/gates/weights’, ‘tower/gating_prob_weights’, ‘tower/hidden1_weights/hidden1_weights’) make up about 98% of the total model size. Modifying the other 28 small weight tensors is not worth the effort in our opinion: float32 is the default data type in many TensorFlow modules and functions, and we already had to extend TensorFlow's core.Dense class and layers.fully_connected function to support float16 precision in order to cast those 4 tensors.
The average compression rate is 48.5% across models, as shown in Table 1. Compared to the original float32 models, the float16-compressed versions perform equally well, with GAPs differing by less than 0.0001, which is the level of randomness between different evaluation runs of the same float32 model.
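The effect is easy to verify on a toy weight tensor (a NumPy illustration, not the actual TensorFlow casting code; the weight scale is an assumption):

```python
import numpy as np

# Toy weight tensor at a typical small weight scale: storage halves,
# and the float16 round-trip error is tiny relative to the weights.
w32 = np.random.default_rng(0).normal(scale=0.05,
                                      size=(1024, 256)).astype(np.float32)
w16 = w32.astype(np.float16)             # the "compression" step
err = np.abs(w16.astype(np.float32) - w32).max()
```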
2.2.2 Gradual Binary Mask Pruning
We tried the method introduced in [22], as a TensorFlow implementation is available in TensorFlow's GitHub repository. For every layer chosen to be pruned, the authors add a binary mask variable of the same shape that determines which elements participate in the forward execution of the graph. They introduce a gradual pruning algorithm that updates the binary weight masks along with the weights during training, and report high compression rates without significant loss of accuracy.
However, we found two main difficulties with the method in our experiments:

1. Sparsity is not identical to compression rate. In the article, sparsity refers to the fraction of the original network's elements that are zeroed out in the pruned network, but after compression the sparse tensor must also store indices, which take additional space. Although the authors consider bitmask and CSR(C) sparse matrix representations, two problems remained that we could not solve easily. First, a TensorFlow boolean variable takes one byte (not one bit), so in a real implementation the compression rate is much lower. Second, for large tensors with a huge index range, row and column indices must each be stored as 32- or 64-bit integers, which takes substantial storage.
For example, suppose we store a 1024×100000 float32 tensor with sparsity $s$. To use the SparseTensor object in TensorFlow, each non-zero element must be paired with two 32-bit integers (its row and column indices), tripling its storage cost, so the compressed size is $3(1-s)$ of the original. Matching the 50% rate of float16 compression would therefore require discarding more than 80% of the elements, sacrificing accuracy. This is not as appealing as the float16 compression approach.
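The storage accounting above can be made explicit (a sketch with our own helper name; the defaults assume float32 values and two 32-bit COO-style indices per surviving element):

```python
def sparse_compression_rate(sparsity, value_bytes=4, index_bytes=8):
    """Size of a value-plus-indices sparse representation relative to
    the dense tensor: each surviving element stores its 4-byte value
    plus two 32-bit indices, i.e. 3x its dense cost at the defaults."""
    density = 1.0 - sparsity
    return density * (value_bytes + index_bytes) / value_bytes

# Even at 80% sparsity the "compressed" tensor is still 60% of the
# dense size, and at 50% sparsity it is actually *larger* than dense.
rate_80 = sparse_compression_rate(0.8)
rate_50 = sparse_compression_rate(0.5)
```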
2. At the time of the competition deadline, the GitHub repository for model pruning (https://github.com/tensorflow/tensorflow/tree/aa15692e54390cf3967d51bc60acf5f783df9c08/tensorflow/contrib/model_pruning) only contained the training part, which associates each tensor to be pruned with a binary mask. The sparse tensor representation and bitmask export were not implemented at the time. We implemented them ourselves and found the compression rate unsatisfactory.
After some trial and error and comparison with the float16 compression procedure, we decided not to pursue pruning for model compression.
2.2.3 Quantization
Quantization [23] can achieve a 4× compression rate by representing each element of a tensor as an 8-bit unsigned integer rather than a 32-bit float value. TensorFlow provides useful tools to quantize a pretrained graph; however, the output is a frozen graph in pb format. We were able to convert meta, index, and checkpoint files to a pb file, but not vice versa, while the competition requires that the submitted model be loadable as a TensorFlow MetaGraph. We did not overcome this technical difficulty and went with our float16 compression method for its easy and elegant usage.
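For illustration, a minimal uniform affine quantizer of the kind [23] builds on can be sketched as follows (our own simplified version, not TensorFlow's quantization tooling):

```python
import numpy as np

def quantize_uint8(w):
    """Uniform affine quantization: map the range [min, max] of the
    float32 tensor onto the 256 levels of uint8."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0 or 1.0   # guard against constant tensors
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

w = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
q, lo, scale = quantize_uint8(w)       # 4x smaller, error bounded by scale/2
```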
2.3 Ensemble
2.3.1 Checkpoint Average
For each single model, we first evaluate the validation set using single checkpoints at some interval (say, every 10k steps) to find the approximate range of steps corresponding to higher scores for that model. We then select a few ranges with varying start and end steps, and average all checkpoints in each range. Finally, we evaluate these averaged checkpoints and pick the best one to represent the single model in the ensemble. On average, the averaged checkpoint gives a 0.0039 GAP boost over the best single checkpoint we evaluated, as shown in Table 1.
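Checkpoint averaging itself is just a per-tensor mean; a sketch with checkpoints modeled as name-to-array dicts (the real code reads TensorFlow checkpoint files):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Average each named weight tensor across a list of checkpoints,
    here modeled as dicts mapping tensor name -> np.ndarray."""
    names = checkpoints[0].keys()
    return {n: np.mean([c[n] for c in checkpoints], axis=0) for n in names}

# three toy checkpoints whose single tensor holds 1.0, 2.0, 3.0
ckpts = [{"w": np.full((2, 2), float(s))} for s in (1, 2, 3)]
avg = average_checkpoints(ckpts)
```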
2.3.2 Ensemble Single Models into One Graph
As the competition requires the final model to be a single TensorFlow graph, it is not viable to ensemble the single models' outputs. Instead, we build the ensemble graph at the evaluation/inference stage and overwrite its (untrained) float16 weight tensors with the previously trained float32 single-model weight tensors.
Code snippet to build the ensemble graph:

```python
with tf.variable_scope("tower"):
    result = model[0].create_model(
        model_input, ...,
        cluster_size=FLAGS.netvlad_cluster_size
            if type(FLAGS.netvlad_cluster_size) is int
            else FLAGS.netvlad_cluster_size[0],
        hidden_size=FLAGS.netvlad_hidden_size
            if type(FLAGS.netvlad_hidden_size) is int
            else FLAGS.netvlad_hidden_size[0])
    if FLAGS.ensemble_num > 1:
        predictions_lst = [result["predictions"]]
        for ensemble_idx in range(1, FLAGS.ensemble_num):
            with tf.variable_scope("model" + str(ensemble_idx)):
                result2 = model[ensemble_idx].create_model(
                    model_input, ...,
                    cluster_size=FLAGS.netvlad_cluster_size[ensemble_idx],
                    hidden_size=FLAGS.netvlad_hidden_size[ensemble_idx])
                predictions_lst.append(result2["predictions"])
if FLAGS.ensemble_num == 1:
    predictions = result["predictions"]
else:
    predictions = 0
    for ensemble_idx in range(FLAGS.ensemble_num):
        predictions += (predictions_lst[ensemble_idx]
                        * FLAGS.ensemble_wts[ensemble_idx])
```
To sum up our steps:

1. Train all single models in the default float32 precision and average checkpoints.
2. For each potential ensemble combination, build the ensemble graph as above in float16 precision, without training.
3. Populate the ensemble graph's float16 weight tensors with the single models' float32 trained weights.
4. Tune ensemble combinations and coefficients based on validation GAP.
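Step 3 amounts to a dtype-converting copy from each single model's trained tensors into the per-model variable scopes of the ensemble graph. A toy dictionary-based illustration (the real code assigns into TensorFlow variables, and the scope and tensor names here are only examples):

```python
import numpy as np

# Trained float32 tensors for two single models, keyed by the variable
# scope each model occupies in the ensemble graph ("tower", "model1", ...).
rng = np.random.default_rng(0)
single_models = {
    "tower":  {"hidden1_weights": rng.normal(size=(100, 10)).astype(np.float32)},
    "model1": {"hidden1_weights": rng.normal(size=(100, 10)).astype(np.float32)},
}

# Populate the ensemble graph's float16 slots from the float32 weights.
ensemble_weights = {
    scope: {name: w.astype(np.float16) for name, w in tensors.items()}
    for scope, tensors in single_models.items()
}
```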
3 Experiment
We trained and evaluated 28 models sharing the architecture of Section 2.1 but with different cluster sizes and hidden sizes. We use the convention K⟨c⟩H⟨h⟩ to denote the model with cluster size c and hidden size h. If a model has a larger size but the same or a worse score than another model, we mark it as inferior and eliminate it from the candidate list for the later ensemble. This leaves 18 models on the list, detailed in Table 1. The 10 eliminated models are K20H1600, K64H1024, K150H600, K128H512, K16H1024, K32H400, K128H1024, K8H1024, K32H800, and K8H800.
Table 1: Single-model sizes, float16 compression rates, and GAP scores.

| Model | Float32 model size (MB) | Float16 model size (MB) | Compression rate | Best avg ckpt GAP | Best single ckpt GAP |
|---|---|---|---|---|---|
| K24H1440 (Y) | 714.93 | 381.40 | 46.7% | 0.8744 | 0.8682 |
| K32H1280 (G) | 679.73 | 358.85 | 47.2% | 0.8731 | 0.8669 |
| K100H800 (F) | 663.87 | 339.76 | 48.8% | 0.8721 | 0.8677 |
| K16H1280 (X) | 594.59 | 316.21 | 46.8% | 0.8718 | 0.8665 |
| K32H1024 (H) | 549.24 | 286.85 | 47.8% | 0.8717 | 0.8665 |
| K20H960 (K) | 469.18 | 245.31 | 47.7% | 0.8717 | 0.8661 |
| K16H800 (L) | 384.27 | 199.61 | 48.1% | 0.8712 | 0.8666 |
| K64H512 (J) | 365.53 | 186.11 | 49.1% | 0.8702 | 0.8664 |
| K100H400 (A) | 357.21 | 180.93 | 49.3% | 0.8685 | 0.8647 |
| K32H600 (N) | 339.72 | 174.20 | 48.7% | 0.8684 | 0.8648 |
| K32H512 (S) | 297.27 | 151.85 | 48.9% | 0.8683 | 0.8647 |
| K16H512 (O) | 263.13 | 134.71 | 48.8% | 0.8665 | 0.8641 |
| K8H512 (T) | 246.07 | 126.15 | 48.7% | 0.8660 | 0.8627 |
| K16H400 (Q) | 217.05 | 110.50 | 49.1% | 0.8658 | 0.8619 |
| K10H400 (E) | 207.04 | 105.47 | 49.1% | 0.8644 | 0.8626 |
| K32H256 (R) | 175.78 | 88.85 | 49.5% | 0.8616 | 0.8593 |
| K10H300 (M) | 168.87 | 85.58 | 49.3% | 0.8605 | 0.8589 |
| K16H256 (P) | 158.65 | 80.21 | 49.4% | 0.8594 | 0.8576 |
We then tried different combinations of single models, subject to the 1 GB uncompressed model size constraint. Details of the ensemble models are shown in Table 2. Our final selected 7th-place model is the YHLS ensemble with tuned ensemble weights; see Fig. 2.
Table 2: Ensemble model sizes and GAP scores.

| Ensemble | Model size (MB) | Local GAP | Public LB GAP | Private LB GAP | Public delta | Private delta |
|---|---|---|---|---|---|---|
| YHLS | 1019.7 | 0.8821 | 0.8839 | 0.8832 | 0.0018 | 0.0011 |
| YFH | 1008.0 | 0.8819 | | | | |
| GHLS | 997.2 | 0.8818 | 0.8838 | 0.8832 | 0.0020 | 0.0014 |
| YGLP | 1020.1 | 0.8814 | | | | |
| YLSQRP | 1012.4 | 0.8811 | | | | |
| GHLRP | 1015.0 | 0.8811 | 0.8832 | 0.8825 | 0.0021 | 0.0014 |
| GFX | 1014.8 | 0.8810 | | | | |
| FJSOQR | 1012.5 | 0.8810 | 0.8828 | 0.8822 | 0.0018 | 0.0012 |
| GFAB | 1004.0 | 0.8808 | 0.8826 | 0.8821 | 0.0018 | 0.0013 |
| FJSO | 812.4 | 0.8808 | | | | |
| LNSOTQP | 977.2 | 0.8803 | | | | |
| GF | 698.9 | 0.8785 | | | | |
4 Conclusions
In this paper we summarized our 7th-place solution to the 2nd YouTube-8M Video Understanding Challenge. We chose the Gated NetVLAD model architecture and trained 28 models with different hyperparameters and varying sizes. We applied three techniques to ensemble single models into a constrained-size TensorFlow graph: averaging checkpoints, float16 compression, and building an ensemble graph.
In our experiments, we found that in the Gated NetVLAD model the hidden size can create a bottleneck for information representation; sacrificing cluster size in exchange for hidden size can achieve better results for constrained-size models. In the ensemble, we noticed that many smaller single models do not always beat fewer larger single models: the optimal number of models in the ensemble is 3 to 4 in our framework. We believe this is because the boosting effect of ensembling diminishes as more models are added. For example, even if an ensemble of two smaller models, A and B, shows a better GAP score than a single model C of the same total size, after further ensembling with another model D, the ensemble C+D may beat A+B+D: the boosting effect between A+B and D is not as good as between C and D, since some of the model-variety boost has already been "used" between A and B.
Given more time, we would explore other models and techniques, since we had already done a fairly thorough search of the Gated NetVLAD hyperparameter space and tried nearly all the ensemble combinations. We would first experiment with 8-bit quantization; should it work out, we would train models with different architectures, such as LSTM, Bag-of-visual-words, and Fisher Vectors, and fit them into the extra space gained in the model. In parallel, we could try the distillation technique, since it boosts performance without requiring extra model size.
References
 [1] Miech, A., Laptev, I., Sivic, J.: Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905 (2017)
 [2] Ebrahimi Kahou, S., Michalski, V., Konda, K., Memisevic, R., Pal, C.: Recurrent neural networks for emotion recognition in video. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ACM (2015) 467–474
 [3] Caba Heilbron, F., Escorcia, V., Ghanem, B., Carlos Niebles, J.: Activitynet: A large-scale video benchmark for human activity understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 961–970
 [4] Xu, Z., Yang, Y., Hauptmann, A.G.: A discriminative cnn video representation for event detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1798–1807
 [5] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
 [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 770–778
 [7] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. (2012) 1097–1105
 [8] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
 [9] Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: AAAI. Volume 4. (2017) 12
 [10] Sivic, J., Zisserman, A.: Video google: A text retrieval approach to object matching in videos. In: Proceedings of the IEEE International Conference on Computer Vision, IEEE (2003) 1470
 [11] Csurka, G., Dance, C., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on statistical learning in computer vision, ECCV. Volume 1., Prague (2004) 1–2
 [12] Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: 2007 IEEE conference on computer vision and pattern recognition, IEEE (2007) 1–8
 [13] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2014)
 [14] Ballas, N., Yao, L., Pal, C., Courville, A.: Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432 (2015)
 [15] Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., Toderici, G.: Beyond short snippets: Deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 4694–4702
 [16] Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 5297–5307
 [17] Zhang, T., Golub, G.H.: Rank-one approximation to high order tensors. SIAM Journal on Matrix Analysis and Applications 23(2) (2001) 534–550
 [18] Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM review 51(3) (2009) 455–500
 [19] Liu, T., Yuan, M., Zhao, H.: Characterizing spatiotemporal transcriptome of human brain via low rank tensor decomposition. arXiv preprint arXiv:1702.07449 (2017)
 [20] Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in neural information processing systems. (2014) 1269–1277
 [21] Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in neural information processing systems. (2015) 1135–1143
 [22] Zhu, M., Gupta, S.: To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878 (2017)
 [23] Han, S., Mao, H., Dally, W.J.: Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015)
 [24] Gong, Y., Liu, L., Yang, M., Bourdev, L.: Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115 (2014)
 [25] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
 [26] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)
 [27] Wang, H.D., Zhang, T., Wu, J.: The monkeytyping solution to the youtube-8m video understanding challenge. arXiv preprint arXiv:1706.05150 (2017)
 [28] Li, F., Gan, C., Liu, X., Bian, Y., Long, X., Li, Y., Li, Z., Zhou, J., Wen, S.: Temporal modeling approaches for large-scale youtube-8m video understanding. arXiv preprint arXiv:1707.04555 (2017)
 [29] Chen, S., Wang, X., Tang, Y., Chen, X., Wu, Z., Jiang, Y.G.: Aggregating frame-level features for large-scale video classification. arXiv preprint arXiv:1707.00803 (2017)
 [30] Skalic, M., Pekalski, M., Pan, X.E.: Deep learning methods for efficient large scale video labeling. arXiv preprint arXiv:1706.04572 (2017)
 [31] Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation 3(1) (1991) 79–87