Our approach features the Actor-Context-Actor Relation Network (ACAR-Net), details of which can be found in the ACAR-Net paper (arXiv preprint arXiv:2006.07976). ACAR-Net provides an efficient yet effective way to explicitly model and utilize higher-order relations, built upon the basic first-order actor-context relations, to assist action localization.
1.1 Overall Framework
We first introduce our overall framework for action localization, in which the proposed ACAR-Net is the key module for high-order relation modeling. The framework is designed to detect all persons in an input video clip and predict their action labels.
We combine an off-the-shelf person detector (e.g. Faster R-CNN) with a video backbone network (e.g. I3D). In detail, the detector operates on the center frame (i.e. key frame) of the clip and obtains detected actors. The detected boxes are duplicated to the other frames of the clip. Meanwhile, the backbone network extracts a spatio-temporal feature volume from the input video clip. For computational efficiency, we perform average pooling along the temporal dimension, which results in a feature map of shape $C \times H \times W$, where $C$, $H$ and $W$ correspond to channel, height and width respectively. We apply RoIAlign (with a fixed spatial output size) followed by spatial max pooling to the feature map and the actor boxes, producing a series of actor features, one per detected actor. Each actor feature describes the spatio-temporal appearance and motion of one Region of Interest (RoI).
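The pooling steps above can be sketched in NumPy as follows. This is only an illustrative sketch under simplifying assumptions: the function name `actor_features` is hypothetical, boxes are assumed to be given in feature-map coordinates, and a plain integer crop stands in for RoIAlign's bilinear sampling.

```python
import numpy as np

def actor_features(volume, boxes):
    """volume: (C, T, H, W) backbone output; boxes: list of (x1, y1, x2, y2)
    in feature-map coordinates. Returns an (N, C) array of actor features."""
    # Temporal average pooling: (C, T, H, W) -> (C, H, W).
    feature_map = volume.mean(axis=1)
    _, height, width = feature_map.shape
    feats = []
    for x1, y1, x2, y2 in boxes:
        # Simplification (assumption): integer crop instead of RoIAlign.
        xa = min(max(int(np.floor(x1)), 0), width - 1)
        ya = min(max(int(np.floor(y1)), 0), height - 1)
        xb = min(max(int(np.ceil(x2)), xa + 1), width)
        yb = min(max(int(np.ceil(y2)), ya + 1), height)
        # Spatial max pooling over the cropped region -> one C-dim vector.
        feats.append(feature_map[:, ya:yb, xa:xb].max(axis=(1, 2)))
    return np.stack(feats)
```

A real implementation would typically rely on a library RoIAlign operator and scale the boxes from image to feature-map resolution.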
The final classification head takes the aforementioned video feature map and RoI features as inputs, and outputs the final action predictions, possibly after relation reasoning. The simplest baseline is a “Linear” head, which directly applies a linear classifier to the RoI features.
1.2 Actor-Context-Actor Relation Network
We first encode the first-order actor-context relations between each actor and each spatial location of the spatio-temporal context. More specifically, we concatenate each actor feature to all spatial locations of the context feature to form a concatenated feature map. The actor-context relation feature for each actor can then be computed by applying convolutions to this concatenated feature map.
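As a rough illustration of this tile-concatenate-convolve step, the NumPy sketch below uses a single 1×1 convolution (written as a per-location linear map) followed by a ReLU. The kernel size, the non-linearity, the output dimension `D` and the function name are assumptions made for the example and are not specified in the text.

```python
import numpy as np

def first_order_relations(actors, context, weight, bias):
    """actors: (N, C); context: (C, H, W); weight: (D, 2C); bias: (D,).
    Returns first-order relation maps of shape (N, D, H, W)."""
    n, c = actors.shape
    _, h, w = context.shape
    # Tile each actor vector over all H x W locations: (N, C, H, W).
    tiled = np.broadcast_to(actors[:, :, None, None], (n, c, h, w))
    ctx = np.broadcast_to(context[None], (n, c, h, w))
    # Channel-wise concatenation of actor and context: (N, 2C, H, W).
    concat = np.concatenate([tiled, ctx], axis=1)
    # A 1x1 convolution is a linear map over channels at each location.
    out = np.einsum("dc,nchw->ndhw", weight, concat)
    out = out + bias[None, :, None, None]
    return np.maximum(out, 0.0)  # ReLU (assumed non-linearity)
```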
Based on the actor-context relations, we further add a High-order Relation Reasoning Operator (HRO) for modeling the connections established on first-order relations, which are indirect relations mostly ignored by previous methods. Let $F^i_{x,y}$ record the first-order feature between the $i$-th actor and the scene context at the spatial location $(x, y)$. We introduce High-order Relation Reasoning in order to explicitly model the relations between first-order actor-context relations, which encode more informative scene semantics. However, since there is a large number of actor-context relation features (one for every actor and every spatial location), the number of their possible pairwise combinations is generally overwhelming. We therefore propose to focus on learning the high-order relations between different actor-context relations at the same spatial location $(x, y)$, i.e. $F^i_{x,y}$ and $F^j_{x,y}$. In this way, the proposed relational reasoning operator limits the relation learning to second-order actor-context-actor relations, i.e. two actors $i$ and $j$ can be associated via the same spatial context to help the prediction of their action labels.
In general, our high-order relation reasoning block in ACAR-Net is weakly supervised: it requires only action labels as supervision.
We implement the High-order Relation Reasoning Operator as a location-wise attention operator, which is natural for modeling the connections between multiple first-order relations at the same spatial location. The operator consists of one or two modified non-local blocks. Since we are operating on a spatial grid of features, we replace the fully-connected layers in the non-local block with convolutional layers, and the attention vector is computed separately at every spatial location. Following prior work, we also add layer normalization and dropout to our modified non-local block to improve regularization.
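A minimal NumPy sketch of one such location-wise attention block is given below. It is a simplified illustration rather than the actual implementation: the projections are 1×1 convolutions written as matrices, scaled dot-product attention with a residual connection is assumed, and layer normalization and dropout are omitted. The function and weight names are hypothetical.

```python
import numpy as np

def location_wise_attention(relations, w_q, w_k, w_v):
    """relations: (N, D, H, W) first-order maps for N actors;
    w_q, w_k, w_v: (D, D) weights standing in for 1x1 convolutions.
    Returns updated relation maps of shape (N, D, H, W)."""
    d = relations.shape[1]
    q = np.einsum("ed,ndhw->nehw", w_q, relations)
    k = np.einsum("ed,ndhw->nehw", w_k, relations)
    v = np.einsum("ed,ndhw->nehw", w_v, relations)
    # Scores between actors i and j, computed separately per location (h, w).
    scores = np.einsum("iehw,jehw->ijhw", q, k) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over actors j
    # Aggregate values of all actors at the same spatial location.
    update = np.einsum("ijhw,jehw->iehw", attn, v)
    return relations + update  # residual connection (assumed)
```

Because the softmax runs only over the actor axis, features at different spatial locations never attend to each other, which keeps the cost linear in the number of locations.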
To save memory, spatial max pooling is applied by default to the first-order relation maps before feeding them into our operator. The high-order relation map is spatially average-pooled and then channel-wise concatenated to the basic actor RoI feature vector for final classification. All relation vectors have the same dimension in our implementation.
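The final pooling and concatenation can be sketched as below. The function name and the array layout are assumptions for illustration only.

```python
import numpy as np

def classification_input(actor_feats, high_order_maps):
    """actor_feats: (N, C) RoI features; high_order_maps: (N, D, H, W).
    Returns the (N, C + D) input of the final classifier."""
    # Spatial average pooling of the high-order relation maps: (N, D).
    pooled = high_order_maps.mean(axis=(2, 3))
    # Channel-wise concatenation with the basic actor features.
    return np.concatenate([actor_feats, pooled], axis=1)
```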
1.3 Actor-Context Feature Bank
Inspired by the Long-term Feature Bank (LFB), which creates a feature bank over a large time span to facilitate first-order actor-actor relation reasoning, we create an Actor-Context Feature Bank (ACFB) built upon the first-order relation features computed in our ACAR-Net. Formally, the bank is the sequence $[F_0, F_1, \dots, F_{T-1}]$, where $F_t$ is the first-order actor-context relation map extracted from a short video clip around time $t$. This bank of features can be obtained by running a trained ACAR-Net over the entire video at evenly spaced intervals (by default 1 second) and saving the intermediate first-order relation maps. Different from the original LFB, our relational feature preserves spatial context information. Equipped with such a relational feature bank, our ACAR-Net can leverage its High-order Relation Reasoning Operator for reasoning about actor-context-actor relations over a much longer time span, and thus better capture what is happening in the entire video to achieve more accurate action localization at the current time stamp.
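The construction of such a bank can be illustrated with the short Python sketch below. Everything here is hypothetical scaffolding: `extract_relations` stands for a call into a trained model that returns the first-order relation maps for a clip around a timestamp, and the window is assumed to be centered on the current time, which the text does not state.

```python
def build_bank(extract_relations, duration, interval=1.0):
    """Store first-order relation maps at evenly spaced timestamps.
    extract_relations is a hypothetical callable: timestamp -> relation maps."""
    num = int(duration // interval) + 1
    return {i * interval: extract_relations(i * interval) for i in range(num)}

def bank_window(bank, t, span):
    """Return bank entries within span / 2 seconds of time t, in temporal
    order (a centered window is an assumption of this sketch)."""
    half = span / 2.0
    return [bank[s] for s in sorted(bank) if abs(s - t) <= half]
```

For a video shorter than the span, `bank_window` simply returns every stored entry.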
We experiment on ACFB with the aforementioned implementation of high-order relation reasoning in ACAR-Net. We stack two modified non-local blocks, and replace the self-attention mechanism with an attention between current and long-term actor-context relations. For AVA videos, we set the long-term time span to 20 seconds; for a Kinetics video, the bank simply spans the entire video, whose length is at most 10 seconds. For faster convergence, we do not apply spatial max pooling before the High-order Relation Reasoning Operator.
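The change from self-attention to attention between current and long-term relations can be sketched as a single block in NumPy. As before, this is an assumption-laden illustration: queries come from the current clip, keys and values from the bank, scaled dot-product attention with a residual connection is assumed, and normalization, dropout and the stacking of two blocks are left out.

```python
import numpy as np

def long_term_attention(current, bank, w_q, w_k, w_v):
    """current: (N, D, H, W) relation maps of the current clip;
    bank: (M, D, H, W) relation maps gathered from the feature bank;
    w_q, w_k, w_v: (D, D) weights standing in for 1x1 convolutions."""
    d = current.shape[1]
    q = np.einsum("ed,ndhw->nehw", w_q, current)  # queries: current clip
    k = np.einsum("ed,mdhw->mehw", w_k, bank)     # keys: long-term bank
    v = np.einsum("ed,mdhw->mehw", w_v, bank)     # values: long-term bank
    # Attention is still computed separately at every location (h, w).
    scores = np.einsum("nehw,mehw->nmhw", q, k) / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over bank entries
    return current + np.einsum("nmhw,mehw->nehw", attn, v)
```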
2.1 Implementation Details
The AVA-Kinetics dataset of spatio-temporally localized atomic visual actions contains over 238k unique videos and more than 624k annotated frames. For AVA, box annotations and their corresponding action labels are provided on key frames of 430 15-minute movie clips with a temporal stride of 1 second, while for Kinetics only a single frame is annotated for each video. Following the guidelines of the challenge, we evaluate on 60 action classes, and the performance metric is mean Average Precision (mAP) using a frame-level IoU threshold of 0.5.
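For reference, the frame-level IoU test underlying this metric can be written as the small Python sketch below. The helper names are hypothetical, and treating the 0.5 threshold as inclusive is an assumption; the official evaluation code should be consulted for the exact matching rules.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 <= x2, y1 <= y2."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_match(pred_box, gt_box, threshold=0.5):
    """A detection can match a ground-truth box only if IoU >= threshold."""
    return box_iou(pred_box, gt_box) >= threshold
```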
For person detection on key frames, we adopt an existing detection model, which is a Faster R-CNN with an SENet-154-FPN-TSD [9, 12, 16] backbone. The model is pre-trained on Open Images, and then fine-tuned on AVA and Kinetics respectively. The final models obtain 95.8 AP@50 on the AVA validation set, and 84.4 AP@50 on the Kinetics validation set.
| model | head | 3S+F | val mAP | test mAP |

| model | head | dataset | AVA val mAP | AVA test mAP |
| SlowFast, ensemble | Linear | AVA | - | 34.25 |
We use SlowFast networks  as the backbone in our localization framework, and we also increase the spatial resolution of res by . We use SlowFast R101 and R152 instantiations with input sampling and (without non-local) pre-trained on the Kinetics-700 dataset .
We use per-class binary cross entropy loss as the training loss function. Since one person should only have one pose label, following prior work, we apply a softmax function instead of a sigmoid to the logits corresponding to pose classes.
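The mixed softmax/sigmoid treatment can be illustrated with the NumPy sketch below. The function names and the `pose_idx` argument (the indices of the mutually exclusive pose classes) are hypothetical, and averaging the loss over all entries is an assumption.

```python
import numpy as np

def action_probabilities(logits, pose_idx):
    """logits: (N, K) float array; pose_idx: list of pose-class indices."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for every class
    pose = logits[:, pose_idx]
    pose = np.exp(pose - pose.max(axis=1, keepdims=True))
    # Softmax over pose classes only, so pose probabilities sum to one.
    probs[:, pose_idx] = pose / pose.sum(axis=1, keepdims=True)
    return probs

def bce_loss(probs, targets, eps=1e-7):
    """Per-class binary cross entropy, averaged over all entries."""
    probs = np.clip(probs, eps, 1.0 - eps)
    loss = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return float(loss.mean())
```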
We train all models in an end-to-end fashion (except the feature bank part) using synchronous SGD with a minibatch size of 32 clips. We freeze batch normalization layers in the backbone network. We train most models for 55k steps (6 epochs on the training set) with a base learning rate of 0.064, which is decreased by a factor of 10 at iterations 51k and 53k. A few models are trained with an extended 8-epoch schedule for the final ensemble. We perform linear warm-up during the first 9k iterations. For models submitted to the test server, we train on both training and validation data for the same number of epochs. We use weight decay together with a Nesterov momentum of 0.9.
The input frames for each model are centered at the key frame and sampled with a temporal stride of 2. Note that in some Kinetics data, the annotated timestamps are too close to the end of the videos, and in these cases we simply sample the last frames of the video. In order to better preserve spatial structure, we do not use spatial random cropping augmentation. Instead, we only scale the shorter side of the input frames to 256 pixels, and zero-pad the longer side to the same size in order to simplify mini-batch training. We use bounding box jittering augmentation, which randomly perturbs box coordinates by at most 7.5% relative to the original size of the bounding box during training.
For AVA, we use both ground-truth boxes and predicted human boxes for training, and only ground-truth boxes for generating feature banks. For Kinetics, we only use ground-truth boxes for training, and our detection boxes for generating feature banks. The bank of features is extracted from short clips sampled with a temporal stride of 1 second from both AVA and Kinetics videos.
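The learning rate schedule described in the training setup (linear warm-up over the first 9k iterations, a base rate of 0.064, and tenfold decays at iterations 51k and 53k) can be written as the small function below. The starting value of the warm-up is not given in the text, so `warmup_start=0.0` is an assumption, as is the function name.

```python
def learning_rate(step, base_lr=0.064, warmup_steps=9000,
                  milestones=(51000, 53000), gamma=0.1, warmup_start=0.0):
    """Piecewise schedule: linear warm-up, then step decay at the milestones."""
    if step < warmup_steps:
        # Linear interpolation from warmup_start (assumed) to base_lr.
        frac = step / warmup_steps
        return warmup_start + (base_lr - warmup_start) * frac
    # Multiply by gamma once for every milestone already reached.
    drops = sum(step >= m for m in milestones)
    return base_lr * gamma ** drops
```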
At test time, we use only the AVA and Kinetics detections that pass a dataset-specific confidence threshold. We scale the shorter side of input frames to 256 pixels, and apply the backbone feature extractor fully-convolutionally. We also report results tested with three spatial scales and horizontal flips.
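Horizontal-flip testing requires mirroring the actor boxes along with the frames. The NumPy sketch below shows one way to do this. It is illustrative only: `predict` is a hypothetical callable wrapping the model, and combining the two passes by simple averaging is an assumption, since the text does not say how predictions are merged. Multi-scale testing would add further passes on resized frames with correspondingly rescaled boxes.

```python
import numpy as np

def flip_boxes(boxes, width):
    """Mirror (x1, y1, x2, y2) pixel boxes for a horizontally flipped frame."""
    boxes = np.asarray(boxes, dtype=float)
    flipped = boxes.copy()
    flipped[:, 0] = width - boxes[:, 2]
    flipped[:, 2] = width - boxes[:, 0]
    return flipped

def predict_with_flip(predict, frames, boxes):
    """frames: (T, H, W, 3); boxes: (N, 4) in pixels.
    predict(frames, boxes) -> (N, K) scores is a hypothetical model call."""
    width = frames.shape[2]
    scores = predict(frames, np.asarray(boxes, dtype=float))
    # Second pass on horizontally flipped frames with mirrored boxes.
    scores_flip = predict(frames[:, :, ::-1], flip_boxes(boxes, width))
    return (scores + scores_flip) / 2.0  # averaging is an assumption
```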
2.2 Main Results
We present our results on AVA-Kinetics v1.0 in Table 1. The default backbone instantiation is SlowFast R101. The simplest baseline, a linear classifier head, already reaches nearly 33 mAP. Switching to our ACAR-Net still brings a significant 1.6 mAP increase, which highlights the importance of modeling high-order relations. Further adding long-term support (ACFB head) gives a total boost of 2.86 mAP. We also experiment with a more advanced backbone (SlowFast R152), which brings some extra improvement in performance (+0.54 mAP). We re-trained this SlowFast R152 model on training and validation data, and submitted it to the test server. Its performance barely dropped (-0.13 mAP).
2.3 Ablation Experiments
Effect of Adding Kinetics Data.
For each of two SlowFast R101 models, we train the same model on two different datasets, AVA-Kinetics and AVA only, and evaluate on the AVA v2.2 validation set. As shown in Table 2, adding Kinetics data brings consistent mAP increases (roughly +2 mAP) for both models. Moreover, with the help of both high-order relation reasoning and Kinetics data, our ensemble achieves a significant improvement of +4.05 mAP on the AVA v2.2 test set compared to last year’s winner.
We investigate the effect of person detection AP@50 on action detection mAP. We perform the comparison on SlowFast R101 with the ACAR head. As presented in Tables 3 and 4, person detection AP@50 on Kinetics is much lower than that on AVA. In addition, even though our detector has reached 95.8 AP@50 on AVA person detection, there is still a large gap (8.1 mAP) in final mAP between our detections and ground-truth (GT) boxes. These results suggest that how to improve person detection for action localization remains to be explored.
2.4 Pre-training on Kinetics
We used 4 SlowFast models trained from scratch on the Kinetics-700 classification task. We show their single center crop accuracy on the Kinetics-700 validation set in Table 5. Note that our models might not have reached full convergence due to time constraints.
-  (2018) Pfdet: 2nd place solution to open images challenge 2018 object detection track. arXiv preprint arXiv:1809.00778. Cited by: §2.2.
-  (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: §2.1, §2.1.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In , pp. 6299–6308. Cited by: §1.1.
-  (2019)(Website) Cited by: §2.3, Table 2.
-  (2019) Slowfast networks for video recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6202–6211. Cited by: §2.1.
-  (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §2.1.
-  (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §2.1.
-  (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.1.
-  (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: §2.1.
-  (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982. Cited by: §2.1.
-  (2020) The ava-kinetics localized human actions video dataset. arXiv preprint arXiv:2005.00214. Cited by: §2.1.
-  (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2.1.
-  (2020) 1st place solutions for openimage2019–object detection and instance segmentation. arXiv preprint arXiv:2003.07557. Cited by: §2.1.
-  (2020) Actor-context-actor relation network for spatio-temporal action localization. arXiv preprint arXiv:2006.07976. Cited by: Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization 1st place solution for AVA-Kinetics Crossover in AcitivityNet Challenge 2020, §1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.1, §2.1.
-  (2020) Revisiting the sibling head in object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11563–11572. Cited by: §2.1.
-  (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §1.2.
-  (2019) Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 284–293. Cited by: §1.2, §1.3, §2.1, §2.2, Table 3.
-  (2019) Three branches: detecting actions with richer features. arXiv preprint arXiv:1908.04519. Cited by: §2.1.