Despite considerable advances in the ability to estimate position and pose for people and objects, the computer vision community lacks models that can describe what people are doing at even short time scales. This has been highlighted by new datasets such as Charades sigurdsson2016hollywood and AVA gu2017ava , where the goal is to recognize the set of actions people are performing in each frame of example videos – e.g. one person may be standing and talking while holding an object in one moment, then put the object back and sit down on a chair. The winning system of the 2017 Charades challenge obtained just around 21% accuracy on this per-frame classification task. On AVA the task is even harder, as there may be multiple people, and the task additionally requires localizing each person and describing their actions individually – a strong baseline gets just under 15% on this task gu2017ava . The top approaches in both cases used I3D models trained on ImageNet russakovsky2015imagenet and the Kinetics-400 dataset kay2017kinetics .
In this work, we focus on diagnosing and improving that system by carefully examining the various design decisions that go into building a video action localization model. Specifically, we find that data augmentation, class-agnostic bounding box regression and pre-training lead to strong performance gains on AVA. Our resulting model outperforms all previous approaches, including all submissions to the AVA challenge at CVPR 2018. This includes various highly sophisticated solutions involving multiple input modalities, such as optical flow and audio, as well as ensembles of multiple network architectures.
2 Model and Approach
Our model is inspired by I3D carreira2017quo and Faster R-CNN ren2015faster , similar to girdhar2018detecttrack ; hou2017tube . We start from labeled frames in the AVA dataset and extract a short video clip, typically 64 frames, around each keyframe. We pass this clip through the I3D blocks up to Mixed_4f, which are pre-trained on the Kinetics dataset for action classification. The feature map is then sliced to get the representation corresponding to the center frame (the keyframe where the action labels are defined). This is passed through a standard region proposal network (RPN) ren2015faster to extract box proposals for persons in the image. We keep the top 300 region proposals for the next step: extracting features for each region, which feed into a classifier.
Since the RPN-detected regions correspond to just the center frame and are 2D, we extend them in time by replicating them to match the temporal dimension of the intermediate feature map, following the procedure of the original AVA algorithm gu2017ava . We then extract an intermediate feature map for each proposal using the RoIPool operation, applied independently at each time step; the results are concatenated along the time dimension to give a 4-D feature map for each region. This feature map is then passed through the last two blocks of the I3D model (up to Mixed_5c), and classified into each of the 80 action classes. The box classification is treated as a non-exclusive problem, so probabilities are obtained through an independent sigmoid for each class. We also apply bounding box regression to each selected box following Faster R-CNN ren2015faster , except that our regression is independent of category (since the bounding box should capture the person regardless of the action). Finally, we post-process the predictions from the network using non-maximum suppression (NMS), applied independently for each class. We keep the top-scoring 300 class-specific boxes (note that the same box may be repeated with multiple different classes in this final list) and drop the rest.
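The temporal extension of the keyframe boxes and the per-time-step RoI pooling can be sketched as below. This is a minimal NumPy illustration under our own assumptions, not the actual implementation: the helper names (`roi_pool_frame`, `temporal_roi_pool`) and the crude max-pool-over-grid RoI pooling are illustrative, and the sketch assumes the cropped region is at least `out_size` pixels on each side.

```python
import numpy as np

def roi_pool_frame(feat, box, out_size=7):
    """Crude RoI pooling on one frame: crop the box region from a
    (C, H, W) feature map and max-pool it to a fixed out_size grid.
    Assumes the crop is at least out_size pixels on each side."""
    x0, y0, x1, y1 = box
    crop = feat[:, y0:y1, x0:x1]
    # Partition crop rows/cols into out_size nearly-equal bins.
    rows = np.array_split(np.arange(crop.shape[1]), out_size)
    cols = np.array_split(np.arange(crop.shape[2]), out_size)
    pooled = np.empty((feat.shape[0], out_size, out_size), dtype=feat.dtype)
    for i, r in enumerate(rows):
        for j, s in enumerate(cols):
            pooled[:, i, j] = crop[:, r][:, :, s].max(axis=(1, 2))
    return pooled

def temporal_roi_pool(feat_map, box2d, out_size=7):
    """Replicate a single 2D keyframe box across time and RoI-pool each
    time step independently, stacking the results along the time axis.
    feat_map is (T, C, H, W); returns (T, C, out_size, out_size)."""
    return np.stack([roi_pool_frame(feat_map[t], box2d, out_size)
                     for t in range(feat_map.shape[0])])
```

The resulting 4-D per-region feature map is what gets fed through the remaining I3D blocks.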
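The multi-label classification and per-class NMS step can be sketched as follows. This is a simplified NumPy version under our own assumptions: the function names, the 0.5 IoU threshold, and the greedy NMS loop are illustrative, not the paper's exact code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iou(a, b):
    """Intersection-over-union of two [x0, y0, x1, y1] boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def per_class_nms(boxes, logits, iou_thresh=0.5, top_k=300):
    """Independent sigmoid per class (non-exclusive, multi-label), then
    greedy NMS applied separately within each class, so the same box may
    survive under several classes. Returns a score-sorted list of
    (box_idx, class_idx, score), keeping the top_k class-specific boxes."""
    scores = sigmoid(logits)  # (num_boxes, num_classes)
    detections = []
    for c in range(scores.shape[1]):
        order = np.argsort(-scores[:, c])
        kept = []
        for i in order:
            if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
                kept.append(i)
        detections += [(int(i), c, float(scores[i, c])) for i in kept]
    detections.sort(key=lambda d: -d[2])
    return detections[:top_k]
```

Note that, as in the text, a single physical box can appear several times in the output, once per class for which it survives NMS.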
|Model|val mAP|
|ResNet-based model ava_baseline|11.3|
|RGB only gu2017ava|14.5|
|RGB + Flow gu2017ava|15.6|
|Ours + JFT|22.8|
We trained the model on the training set in a synchronized distributed setting with 11 V100 GPUs. We used batches of 3 videos with 64 frames each, and augmented the data with left-right flipping and spatial cropping. We trained the model for 500k steps using SGD with momentum and cosine learning rate annealing. Before submitting to the challenge evaluation server, we finetuned the model further on the union of the train and validation sets. We tried both freezing the batch norm layers and finetuning them, with little difference in performance. For the experiments training the model from scratch, we train the batch norm layers as well. Since that leads to higher memory usage, we use a batch size of 2 and train over 32 GPUs (for an effectively similar total batch size).
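The cosine learning rate annealing used above can be sketched as follows. The function name, the base learning rate, and the optional linear warmup are our own illustrative assumptions (warmup is a common companion to cosine schedules, not a detail stated here).

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, warmup_steps=0):
    """Cosine learning-rate annealing: decay from base_lr to 0 over
    total_steps following half a cosine period, with optional linear
    warmup for the first warmup_steps steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

With 500k total steps, the rate starts at `base_lr`, passes half its value at the midpoint, and reaches zero at the end of training.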
Results of our model on the validation set are compared with results from the models in the AVA paper in Table 1. The RGB-only baseline gu2017ava used an I3D feature extractor similar to ours, but pretrained on ImageNet and then Kinetics-400, whereas our model was pretrained only on Kinetics-400 or the larger Kinetics-600. This baseline differs from ours in a few other ways: 1) it used a ResNet-50 for computing proposals and I3D for computing features for the classification stage, whereas we use the same I3D features for both; 2) our model preserves the spatiotemporal nature of the I3D features all the way to the final classification layer, whereas theirs performs global average pooling in time of the I3D features right after RoI-pooling; 3) we opted for action-independent bounding box regression, whereas theirs learns 80 different regressors, one for each class. The RGB+Flow baseline is similar but also uses flow inputs and a Flow-I3D model, also pretrained on Kinetics-400. The ResNet-101 baseline corresponds to a traditional Faster R-CNN object detector applied to human action classes instead of objects, using just a single frame as input to the model.
Our model achieves a significant improvement of nearly 40% over the best baseline (RGB+Flow), while using just RGB and just one pretrained model instead of 3 separate ones. The results suggest that simplicity, coupled with a large pre-training dataset for action recognition, is helpful for action detection. This is reasonable considering that many AVA categories have very few examples, so over-fitting is a serious problem.
We also formally ablate some of the important design decisions of our model in Figure 3. First, we evaluate the effect of initialization by comparing a model trained from scratch with the Kinetics-initialized model, and find pretraining gives about a 2% improvement. Next, we evaluate the importance of class-agnostic bounding box regression compared to class-specific regression, and observe an almost 4% gap in performance. This makes sense, as our object is always a human: it is a good idea to share the parameters for localizing a human across classes, since some of the smaller classes may not have enough data to learn an effective representation. Finally, we compare our model with and without data augmentation (in our case, random flips and crops). This gives an almost 5% improvement, again signifying the importance of maximizing the amount of training data we can use.
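The flip-and-crop augmentation can be sketched as below. This is a minimal NumPy illustration under our own assumptions: `augment_clip` is our naming, and the key detail it demonstrates is that the random transform is sampled once per clip so every frame is transformed consistently, which is required for a temporal model.

```python
import numpy as np

def augment_clip(clip, rng, crop_size=224):
    """Random left-right flip and random spatial crop, sampled once per
    clip so all frames receive the same transform. clip is (T, H, W, C)."""
    t, h, w, c = clip.shape
    if rng.random() < 0.5:
        clip = clip[:, :, ::-1, :]  # flip every frame together
    y0 = rng.integers(0, h - crop_size + 1)
    x0 = rng.integers(0, w - crop_size + 1)
    return clip[:, y0:y0 + crop_size, x0:x0 + crop_size, :]
```

At training time this would be applied to each sampled 64-frame clip before it enters the network; the corresponding person boxes would need the same flip and crop applied, which is omitted here for brevity.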
Scene context: To further incorporate context for recognizing actions, we experimented with adding full-image features for the keyframe when classifying each box in the video clip. We use the last-layer features from a ResNet-101 pre-trained on the JFT dataset sun2017revisiting . We found that concatenating the 512-D global_pool features with each box's features before classification gave a 0.9% improvement on the val set, as shown in Table 1. Hence we incorporate this into our final model. Finally, we show the per-class performance of our model in Figure 2, and our final test performance and comparison with other submissions in Table 2.
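The scene-context concatenation amounts to tiling one global feature vector across all boxes. A minimal sketch, assuming flattened per-box features and using our own function name; the 512-D context dimension matches the global_pool features mentioned above, while the per-box dimension in the usage example is illustrative:

```python
import numpy as np

def add_scene_context(box_feats, context_feat):
    """Tile a single global (scene) feature vector and concatenate it to
    every box's feature vector before the classifier. box_feats is
    (num_boxes, d_box); context_feat is (d_ctx,), e.g. the 512-D
    global_pool output of a context CNN run on the full keyframe."""
    tiled = np.broadcast_to(context_feat,
                            (box_feats.shape[0], context_feat.shape[0]))
    return np.concatenate([box_feats, tiled], axis=1)
```

Every box then sees the same scene representation, letting the classifier exploit context (e.g. a kitchen vs. a street) at negligible extra cost per box.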
|Team|Modality|Architecture|test mAP|
|Ours + JFT|RGB only|I3D + FRCNN|21.91|
|Ours|RGB only|I3D + FRCNN|21.03|
| |RGB + Flow| | |
|YH Technologies yh_ava_submit|RGB + Flow|P3D + FRCNN|19.60|
We have presented an action localization model that aims to densely classify the actions of multiple people in video, using the Faster R-CNN framework with spatiotemporal features from an I3D model pretrained on the Kinetics dataset. We show a large improvement over the state-of-the-art on the AVA dataset, but at 21.91% AP, performance is still far from what would be practical for applications. More work remains to be done to understand the problems in the model and dataset, such as handling classes with a very small number of training examples. In the meantime, continuing to grow datasets such as Kinetics should help.
Many thanks to the AVA team for creating and sharing their dataset, models and code.
-  AVA Team. AVA v2.1 Faster R-CNN ResNet-101 baseline. https://research.google.com/ava/download.html. Accessed: 2018-06-10.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  R. Girdhar, G. Gkioxari, L. Torresani, M. Paluri, and D. Tran. Detect-and-Track: Efficient Pose Estimation in Videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, C. Schmid, and J. Malik. Ava: A video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  R. Hou, C. Chen, and M. Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
-  T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  Z. Qiu, T. Yao, and T. Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
-  G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
-  C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
-  D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  T. Yao and X. Li. Yh technologies at activitynet challenge 2018. arXiv preprint arXiv:1807.00686, 2018.