Action-Localization, Atomic Visual Actions (AVA) Dataset
We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial to discriminate an action, all without explicit supervision other than boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state-of-the-art by a significant margin: more than 7.5% absolute (over 40% relative) improvement in mAP, using only raw RGB frames as input.
In this paper, our objective is to both localize and recognize human actions in video clips. One reason that human actions remain so difficult to recognize is that inferring a person’s actions often requires understanding the people and objects around them. For instance, recognizing whether a person is ‘listening to someone’ is predicated on the existence of another person in the scene saying something. Similarly, recognizing whether a person is ‘pointing to an object’, or ‘holding an object’, or ‘shaking hands’; all require reasoning jointly about the person and the animate and inanimate elements of their surroundings. Note that this is not limited to the context at a given point in time: recognizing the action of ‘watching a person’, after the watched person has walked out of frame, requires reasoning over time to understand that our person of interest is actually looking at someone and not just staring into the distance.
Thus we seek a model that can determine and utilize such contextual information (other people, other objects) when determining the action of a person of interest. The Transformer architecture from Vaswani et al.  is one suitable model for this, since it explicitly builds contextual support for its representations using self-attention. This architecture has been hugely successful for sequence modelling tasks compared to traditional recurrent models. The question, however, is: how does one build a similar model for human action recognition?
Our answer is a new video action recognition network, the Action Transformer, that uses a modified Transformer architecture as a ‘head’ to classify the action of a person of interest. It brings together two other ideas: (i) a spatio-temporal I3D model that has been successful in previous approaches for action recognition in video  – this provides the base features; and (ii) a region proposal network (RPN)  – this provides a sampling mechanism for localizing people performing actions. Together the I3D features and RPN generate the query that is the input for the Transformer head that aggregates contextual information from other people and objects in the surrounding video. We describe this architecture in detail in section 3. We show in section 4 that the trained network is able to learn both to track individual people and to contextualize their actions in terms of the actions of other people in the video. In addition, the transformer attends to hand and face regions, which is reassuring because we know they have some of the most relevant features when discriminating an action. All of this is obtained without explicit supervision, but is instead learned during action classification.
We train and test our model on the newly introduced Atomic Visual Actions (AVA) dataset. It is an interesting and suitable testbed for this kind of contextual reasoning. It requires detecting multiple people in videos semi-densely in time and recognizing multiple basic actions. Many of these actions often cannot be determined from the person bounding box alone, but instead require inferring relations to other people and objects. Unlike previous works, our model learns to do so without needing explicit object detections. We set a new record on the AVA dataset, improving performance from 17.4% to 25.0% mAP. The network only uses raw RGB frames, yet it outperforms all previous work, including large ensembles that use additional optical flow and sound inputs. At the time of submission, ours was the top performing approach on the ActivityNet leaderboard.
However, we note that at 25% mAP, this problem, or even this dataset, is far from solved. Hence, we rigorously analyze the failure cases of our model in Section 5. We describe some common failure modes and analyze the performance broken down by semantic and spatial labels. Interestingly, we find many classes with relatively large train sets are still hard to recognize. We investigate such tail cases to flag potential avenues for future work.
Video Understanding: Video activity recognition has evolved rapidly in recent years. Datasets have become progressively larger and harder: from actors performing simple actions [12, 33], to short sports and movie clips [38, 24], and finally to diverse YouTube videos [23, 1]. Models have followed suit, from hand-crafted features [25, 42] to deep end-to-end trainable models [22, 43, 7, 45, 44]. However, much of this work has focused on trimmed action recognition, i.e., classifying a short clip into action classes. While useful, this is a rather limited view of action understanding, as most videos involve multiple people performing multiple different actions at any given time. Some recent work has looked at such fine-grained video understanding [37, 17, 8, 21], but has largely been limited to small datasets like UCF-24 [38, 37] or JHMDB. Another thread of work has focused on temporal action detection [36, 35, 46]; however, it does not tackle the tasks of person detection or person-action attribution.
AVA dataset and methods: The recently introduced AVA dataset has attempted to remedy this by introducing 15-minute long clips labeled with all people and their actions at one second intervals. Although fairly new, various models [14, 39, 20, 48] have already been proposed for this task. Most models have attempted to extend object detection frameworks [15, 31, 18] to operate on videos [17, 21, 9]. Perhaps the closest to our approach is the concurrent work on person-centric relation networks, which learns to relate person features with the video clip akin to relation networks. In contrast, we propose to use person detections as queries to seek out regions to aggregate in order to recognize their actions, and outperform other prior works by a large margin.
Attention for action recognition: There has been a large body of work on incorporating attention in neural networks, primarily focused on language related tasks [41, 47]. Attention for videos has been pursued in various forms, including gating or second order pooling [45, 10, 27, 28] guided by human pose or other primitives [5, 11, 4, 10], recurrent models and self-attention. Our model can be thought of as a form of self-attention complementary to these approaches. Instead of comparing all pairs of pixels, it reduces one side of the comparison to human regions, and can be applied on top of a variety of base architectures, including the previously mentioned attentional architectures.
In this section we describe the overall design of our new Action Transformer model. The model is designed to detect all persons, and classify all the actions they are doing, at a given time point (‘keyframe’). It ingests a short video clip centered on the keyframe, and generates a set of human bounding boxes for all the people in the central frame, with each box labelled with all the predicted actions for the person.
The model consists of distinct base and head networks, similar to the Faster R-CNN object detection framework. The base, which we also refer to as the trunk, uses a 3D convolutional architecture to generate features and region proposals (RP) for the people present. The head then uses the features associated with each proposal to predict actions and regress a tighter bounding box. Note that, importantly, both the RPN and bounding box regression are action agnostic. In detail, the head uses the feature map generated by the trunk, along with the RPN proposals, to generate a feature representation corresponding to each RP using RoIPool operations. This feature is then used to classify the box into $C$ action classes or background ($C + 1$ in total), and to regress a 4D vector of offsets that converts the RPN proposal into a tight bounding box around the person. The base is described in Section 3.1, and the transformer head in Section 3.2. We also describe an alternative I3D Head in Section 3.3, which is a more direct analogue of the Faster R-CNN head. It is used in the ablation study. Implementation details are given in Section 3.4.
We start by extracting a $T$-frame (typically 64) clip from the original video, encoding about 3 seconds of context around a given keyframe. We encode this input using a set of convolutional layers, and refer to this network as the trunk. In practice, we use the initial layers of an I3D network pre-trained on Kinetics-400. We extract the feature map from the Mixed_4f layer, by which point the $T \times H \times W$ input has been downsampled to $\frac{T}{4} \times \frac{H}{16} \times \frac{W}{16}$. We slice out the temporally-central frame from this feature map and pass it through a region proposal network (RPN). The RPN generates multiple potential person bounding boxes along with objectness scores. We then select the $R$ boxes (we use $R = 300$) with the highest objectness scores to be further regressed into a tight bounding box and classified into the action classes using a ‘head’ network, as we describe next. The trunk and RPN portions of Figure 2 illustrate the network described so far.
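The proposal-selection step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `select_top_proposals`, the box format and the toy scores are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in normalized image coordinates.
Box = Tuple[float, float, float, float]


def select_top_proposals(boxes: List[Box], scores: List[float], r: int) -> List[Box]:
    """Return the r proposals with the highest objectness scores."""
    if len(boxes) != len(scores):
        raise ValueError("boxes and scores must have the same length")
    # Sort proposal indices by descending objectness score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    return [boxes[i] for i in order[:r]]


# Toy example: three proposals, keep the best two.
proposals = [(0.1, 0.1, 0.4, 0.9), (0.5, 0.2, 0.8, 0.95), (0.0, 0.0, 0.2, 0.2)]
objectness = [0.92, 0.35, 0.71]
top2 = select_top_proposals(proposals, objectness, r=2)
```

Each selected box is then handed to the head network for classification and box refinement.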
As outlined in the Introduction, our head architecture is inspired and re-purposed from the Transformer architecture . It uses the person box from the RPN as a ‘query’ to locate regions to attend to, and aggregates the information over the clip to classify their actions. We first briefly review the Transformer architecture, and then describe our Action Transformer head framework.
Transformer: This architecture was proposed by Vaswani et al. for seq2seq tasks like language translation, to replace traditional recurrent models. The main idea of the original architecture is to compute self-attention by comparing a feature to all other features in the sequence. This is carried out efficiently by not using the original features directly. Instead, features are first mapped to a query ($Q$) and memory (key and value, $K$ and $V$) embedding using linear projections, where typically the query and keys are lower dimensional. The output for the query is computed as an attention weighted sum of values $V$, with the attention weights obtained from the product of the query $Q$ with keys $K$. In practice, the query here was the word being translated, and the keys and values are linear projections of the input sequence and the output sequence generated so far. A location embedding is also added to these representations in order to incorporate positional information, which is lost in this non-convolutional setup. We refer readers to Vaswani et al. for a more detailed description of the original architecture.
Action Transformer: We now describe our re-purposed Transformer architecture for the task of video understanding. Our transformer unit takes as input the video feature representation and the box proposal from the RPN and maps them into query and memory features. Our problem setup has a natural choice for the query ($Q$), key ($K$) and value ($V$) tensors: the person being classified is the query, and the clip around the person is the memory, projected into keys and values. The unit then processes the query and memory to output an updated query vector. The intuition is that the self-attention will add context from other people and objects in the clip to the query vector, to aid with the subsequent classification. This unit can be stacked in multiple heads and layers similar to the original architecture, by concatenating the output from the multiple heads at a given layer, and using the concatenated feature as the next query. This updated query is then used to again attend to context features in the following layer. We show this high-level setup and how it fits into our base network highlighted in green in Figure 2, with each Action Transformer unit denoted as ‘Tx’. We now explain this unit in detail.
The key and value features are simply computed as linear projections of the original feature map from the trunk, hence each is of shape $T' \times H' \times W' \times D$, where $T' \times H' \times W'$ is the size of the trunk feature map and $D$ is the embedding dimension. In practice, we extract the RoIPool-ed feature for the person box from the center clip, and pass it through a query preprocessor (QPr) and a linear layer to get the query feature of size $1 \times 1 \times D$. The QPr could directly average the RoIPool feature across space, but this would lose all spatial layout of the person. Instead, we first reduce the dimensionality by a convolution, and then concatenate the cells of the resulting feature map into a vector. Finally, we reduce the dimensionality of this vector using a linear layer to 128D (the same as the query and key feature maps). We refer to this procedure as HighRes query preprocessing. We compare this to a QPr that simply averages the feature spatially, or LowRes preprocessing, in Section 4.3.
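To make the difference between the two query preprocessors concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the helper names (`linear`, `lowres_query`, `highres_query`), the tiny 2x2 RoI features and the weight matrices are all hypothetical, and a per-cell linear map stands in for the convolution mentioned in the text.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # rows = output units, columns = input units


def linear(x: Vector, w: Matrix) -> Vector:
    """Plain matrix-vector product (no bias), used for all projections here."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def lowres_query(roi: List[List[Vector]]) -> Vector:
    """LowRes: average the RoI feature over all cells; spatial layout is lost."""
    cells = [cell for row in roi for cell in row]
    channels = len(cells[0])
    return [sum(c[k] for c in cells) / len(cells) for k in range(channels)]


def highres_query(roi: List[List[Vector]], reduce_w: Matrix, out_w: Matrix) -> Vector:
    """HighRes: reduce channels per cell, concatenate cells, then project."""
    flat: Vector = []
    for row in roi:
        for cell in row:
            flat.extend(linear(cell, reduce_w))  # per-cell channel reduction
    return linear(flat, out_w)  # final linear layer to the query dimension


# Two 2x2 RoI features with the same content in different spatial positions.
roi_a = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
roi_b = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 0.0]]]

reduce_w = [[1.0, 1.0]]  # 2 channels -> 1 channel per cell (toy weights)
out_w = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

low_a, low_b = lowres_query(roi_a), lowres_query(roi_b)
high_a = highres_query(roi_a, reduce_w, out_w)
high_b = highres_query(roi_b, reduce_w, out_w)
```

In this toy case the LowRes queries for `roi_a` and `roi_b` are identical, while the HighRes queries differ, which is the spatial-layout information the text says averaging throws away.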
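The remaining architecture essentially follows the Transformer. We use the feature $Q^{(r)}$ corresponding to the RPN proposal $r$ for dot-product attention over the $K$ features, normalized by $\sqrt{D}$ (same as in the original Transformer), and use the result for weighted averaging ($A^{(r)}$) of the $V$ features. This operation can be succinctly represented as

$$a^{(r)}_{xyt} = \frac{Q^{(r)} K_{xyt}^{T}}{\sqrt{D}}, \qquad A^{(r)} = \sum_{x,y,t} \left[\mathrm{Softmax}\left(a^{(r)}\right)\right]_{xyt} V_{xyt}.$$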
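As a minimal sketch of this attention step, the snippet below computes scaled dot-product attention of one person query over a flattened memory. It assumes the usual $\sqrt{D}$ scaling of the original Transformer; the function name `attend` and the toy keys and values are hypothetical, and the memory is a flat list of cells rather than a spatiotemporal tensor.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, List[float]]:
    """Scaled dot-product attention of a single query over a set of memory cells."""
    d = len(query)
    # Dot product of the query with every key, scaled by sqrt(d).
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    # Numerically stable softmax over all memory cells.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Attention-weighted average of the value vectors.
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
    return out, weights


# Toy memory of three cells; the first key is most aligned with the query.
person_query = [1.0, 0.0]
memory_keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
memory_values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = attend(person_query, memory_keys, memory_values)
```

The weights sum to one, so the output is a convex combination of the value vectors dominated by the cells whose keys best match the person query.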
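We apply dropout to the attention output $A^{(r)}$ and add it to the original query feature $Q^{(r)}$. The resulting query is passed through a residual branch consisting of a LayerNorm operation, followed by a 2-layer MLP (denoted FFN) and dropout. The final feature is passed through one more LayerNorm to get the updated query ($Q^{(r)\prime\prime}$). Figure 2 (Tx unit) illustrates the unit architecture described above, which can be represented as

$$Q^{(r)\prime} = \mathrm{LayerNorm}\left(Q^{(r)} + \mathrm{Dropout}\left(A^{(r)}\right)\right), \qquad Q^{(r)\prime\prime} = \mathrm{LayerNorm}\left(Q^{(r)\prime} + \mathrm{Dropout}\left(\mathrm{FFN}\left(Q^{(r)\prime}\right)\right)\right).$$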
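The query update described above can be sketched as follows. This is an inference-time illustration only, so dropout is treated as the identity. The LayerNorm here has no learned gain or bias, the MLP uses a ReLU non-linearity, and all names and weights (`layer_norm`, `mlp`, `tx_update`, the identity matrices) are assumptions made for the example rather than details given in the text.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def mlp(x: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """2-layer MLP; the ReLU between the layers is an assumption."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def tx_update(query: Vector, attended: Vector, w1: Matrix, w2: Matrix) -> Vector:
    """Residual add of the attended feature, LayerNorm, residual MLP, LayerNorm."""
    q1 = layer_norm([q + a for q, a in zip(query, attended)])
    q2 = layer_norm([q + f for q, f in zip(q1, mlp(q1, w1, w2))])
    return q2


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
old_query = [1.0, 2.0, 3.0, 4.0]
attended_feature = [0.5, -0.5, 0.5, -0.5]
updated_query = tx_update(old_query, attended_feature, identity, identity)
```

The updated query has the same dimensionality as the input query, which is what allows the unit to be stacked across layers.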
To measure the importance of the context gathered by our Action Transformer head, we also built a simpler head architecture that does not extract context. For this, we extract a feature representation corresponding to the RPN proposal from the feature map using a Spatio-Temporal RoIPool (ST-RoIPool) operation. It is implemented by first stretching the RP in time by replicating the box to form a tube. Then, we extract a feature representation from the feature map at each time point using the corresponding box from the tube and the standard RoIPool operation, similar to previous works. The resulting features across time are stacked to get a spatio-temporal feature map corresponding to the tube. It is then passed through the layers of the I3D network that were dropped from the trunk (i.e., Mixed_5a to Mixed_5c). The resulting feature map is then passed through linear layers for classification and bounding box regression. Figure 3 illustrates this architecture.
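The ST-RoIPool idea (replicate the box in time, pool each frame, stack the results) can be illustrated with a deliberately simplified sketch. It assumes single-channel feature maps, integer box coordinates in feature-map cells, and max pooling into a fixed grid; the names `roi_max_pool` and `st_roi_pool` are hypothetical and this is not the TF object detection API implementation.

```python
from typing import List, Tuple

Frame = List[List[float]]          # one feature-map slice: frame[y][x]
Box = Tuple[int, int, int, int]    # (x1, y1, x2, y2) in cells, end-exclusive


def roi_max_pool(frame: Frame, box: Box, out_size: int) -> List[List[float]]:
    """Max-pool the cells inside `box` into an out_size x out_size grid."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_size):
        ys = y1 + (i * h) // out_size
        ye = max(ys + 1, y1 + ((i + 1) * h + out_size - 1) // out_size)
        row = []
        for j in range(out_size):
            xs = x1 + (j * w) // out_size
            xe = max(xs + 1, x1 + ((j + 1) * w + out_size - 1) // out_size)
            row.append(max(frame[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled


def st_roi_pool(clip: List[Frame], box: Box, out_size: int) -> List[List[List[float]]]:
    """Replicate the keyframe box over time (a tube) and RoI-pool every frame."""
    tube = [box] * len(clip)
    return [roi_max_pool(frame, b, out_size) for frame, b in zip(clip, tube)]


# Toy clip: 2 time steps of a 4x4 feature map with value t*100 + y*4 + x.
clip = [[[float(t * 100 + y * 4 + x) for x in range(4)] for y in range(4)] for t in range(2)]
tube_feature = st_roi_pool(clip, box=(0, 0, 4, 4), out_size=2)
```

The stacked output has one pooled grid per time step, which is the spatio-temporal tube feature that the remaining I3D layers would then consume.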
We develop our models in TensorFlow, on top of the TF object detection API. We use an input spatial resolution of $400 \times 400$px and temporal resolution ($T$) of 64. The RoIPool used for both the I3D and Action Transformer heads generates a $14 \times 14$ output, followed by a max pool to get a $7 \times 7$ feature map. Hence, the I3D head input ends up being $16 \times 7 \times 7$ in size, while for the Action Transformer we use the $7 \times 7$ feature as query and the full $16 \times 25 \times 25$ trunk feature as the context. As also observed in prior work [41, 29], adding a location embedding in such architectures is very beneficial. It allows our model to encode spatiotemporal proximity in addition to visual similarity, a property lost when moving away from traditional convolutional or memory-based (e.g., LSTM) architectures. For each cell in the trunk feature map, we add explicit location information by constructing vectors $[h, w]$ and $[t]$ denoting the spatial and temporal location of that feature, relative to the center. We pass each through a 2-layer MLP, and concatenate the outputs. We then attach the resulting vector to the trunk feature map along the channel dimension. Since $K$ and $V$ are projections of the trunk feature map, and $Q$ is extracted from that feature via RoIPool, all of these will implicitly contain the location embedding. Finally, for the classification loss, we use separate logistic losses for each action class, implemented using sigmoid cross-entropy, since multiple actions can be active for a given person. For regression, we use the standard smooth L1 loss. For the Action Transformer heads, we use a feature dimensionality of $D = 128$ and dropout of 0.3. We use a 2-head, 3-layer setup for the Action Transformer units by default, though we ablate other choices in the supplementary.
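The raw location vectors that feed the location embedding can be sketched as below. The text only says the spatial and temporal locations are expressed relative to the center; the specific normalization (offset from the grid center divided by the grid size) and the function name `relative_locations` are assumptions for illustration, and the 2-layer MLPs applied afterwards are omitted.

```python
from typing import Dict, List, Tuple

Location = Tuple[List[float], List[float]]  # (spatial [h, w], temporal [t])


def relative_locations(t_size: int, h_size: int, w_size: int) -> Dict[Tuple[int, int, int], Location]:
    """Build center-relative spatial and temporal location vectors for every cell."""

    def centered(index: int, size: int) -> float:
        # Hypothetical normalization: signed offset from the center, scaled by size.
        return (index - (size - 1) / 2.0) / size

    locations = {}
    for t in range(t_size):
        for y in range(h_size):
            for x in range(w_size):
                spatial = [centered(y, h_size), centered(x, w_size)]
                temporal = [centered(t, t_size)]
                locations[(t, y, x)] = (spatial, temporal)
    return locations


# Toy 3x3x3 feature map: the central cell sits at the origin.
loc = relative_locations(3, 3, 3)
center_cell = loc[(1, 1, 1)]
corner_cell = loc[(0, 0, 0)]
```

In the full model each of the two vectors would go through its own small MLP before being concatenated onto the trunk feature channels.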
We initialize the remaining layers of our model (e.g., the RPN and Action Transformer heads) from scratch, fix the running mean and variance statistics of batch norm layers to the initialization from the pre-trained model, and then finetune the full model end-to-end. Note that the only batch norm layers in our model are in the I3D base and head networks; hence, no new batch statistics need to be estimated when finetuning from the pretrained models.
Data Augmentation: We augment our training data using random flips and crops. We find this to be critical, as removing augmentation led to severe overfitting and a significant drop in performance. We evaluate the importance of pre-training and data augmentation in Section 4.6.
SGD Parameters: The training is done using synchronized SGD over V100 GPUs with an effective batch size of 30 clips per gradient step. This is typically realized by a per-GPU batch of 3 clips and a total of 10 replicas. However, since we keep batch norm fixed for all experiments except for from-scratch experiments, this batch size can be realized by splitting the batch over 10, 15 or even 30 replicas for our heavier models. Most of our models are trained for 500K iterations, which takes about a week on 10 GPUs. We use a learning rate of 0.1 with cosine learning rate annealing over the 500K iterations, though with a linear warmup from 0.01 to 0.1 for the first 1000 iterations. For some cases, like models with the Action Transformer head and using ground truth boxes (Section 4.2), we stop training early at 300K iterations as they learn much faster. The models are trained using standard loss functions used for object detection, except for sigmoid cross-entropy for the multi-label classification loss.
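The learning rate schedule described above can be written down as a small function. How the warmup and the cosine decay are composed is not spelled out in the text, so this sketch makes an assumption: the rate follows the linear warmup for the first 1000 steps and the cosine curve (computed over the full 500K steps) afterwards. The function name `learning_rate` and its parameter names are hypothetical.

```python
import math


def learning_rate(
    step: int,
    base_lr: float = 0.1,
    warmup_start: float = 0.01,
    warmup_steps: int = 1000,
    total_steps: int = 500_000,
) -> float:
    """Linear warmup from warmup_start to base_lr, then cosine annealing to zero."""
    if step < warmup_steps:
        # Linear interpolation between warmup_start and base_lr.
        return warmup_start + (base_lr - warmup_start) * step / warmup_steps
    # Cosine annealing over the whole training run (assumed composition).
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule_samples = {s: learning_rate(s) for s in (0, 500, 1000, 250_000, 500_000)}
```

Under this assumption the rate at the end of warmup is within a tiny fraction of 0.1 of the cosine curve, so the hand-off between the two phases is practically continuous.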
In this section we experimentally evaluate the model on the AVA benchmark. We start with introducing the dataset and evaluation protocol in Section 4.1. Note that the model is required to carry out two distinct tasks: action localization and action classification. To better understand the challenge of each independently, we evaluate each task given perfect information for the other. In Section 4.2, we replace the RPN proposals with the groundtruth (GT) boxes, and keep the remaining architecture as is. Then in Section 4.3, we assume perfect classification by converting all class labels into a single ‘active’ class label, reducing the problem into a pure ‘active person’ vs background detection problem, and evaluate the person localization performance. Finally we put the lessons from the two together in Section 4.4. We perform all these ablative comparisons on the AVA validation set, and compare with the state of the art on the test set in Section 4.5.
The Atomic Visual Actions (AVA) v2.1 dataset contains 211K training, 57K validation and 117K testing clips, taken at 1 FPS from 430 15-minute movie clips. The center frame in each clip is exhaustively labeled with all the person bounding boxes, along with one or more of the 80 action classes active for each instance. Following previous works [14, 39], we report our performance on the subset of 60 classes that have at least 25 validation examples. For comparison with other challenge submissions, we also report the performance of our final model on the test set, as reported from the challenge server. Unless otherwise specified, the evaluation is performed using frame-level mean average precision (frame-AP) at an IoU threshold of 0.5.
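Since the evaluation hinges on the IoU threshold, a short sketch of the overlap test may help. It shows only the box-matching criterion, not the full frame-AP computation; the function names `iou` and `is_match` and the example boxes are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """A detection can count as correct only if it overlaps the GT box enough."""
    return iou(pred, gt) >= threshold


ground_truth = (0.0, 0.0, 2.0, 2.0)
shifted_pred = (1.0, 0.0, 3.0, 2.0)   # overlaps half of the GT box
tight_pred = (0.0, 0.0, 2.0, 1.5)     # covers three quarters of the GT box
```

Here `shifted_pred` has an IoU of 1/3 and is rejected at the 0.5 threshold, while `tight_pred` has an IoU of 0.75 and is accepted.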
|Trunk||Head||QPr||GT Boxes||Params (M)||Val mAP|
In this section we assess how well the head can classify the actions, given the ground truth bounding boxes provided with the AVA dataset. This will give an upper bound on the action classification performance of the entire network, as the RPN is likely to be less perfect than ground truth. We start by comparing the I3D head with and without GT boxes in Table 1. We use a lower value of $R = 64$ for the RPN, in order to reduce the computational expense of these experiments. It is interesting to note that we only get a small improvement by using groundtruth (GT) boxes, indicating that our model is already capable of learning a good representation for person detection. Next, we replace the I3D head architecture with the Action Transformer, which leads to a significant 5% boost for the GT boxes case. It is also worth noting that our Action Transformer head implementation actually has 2.3M fewer parameters than the I3D head in the LowRes QPr case, dispelling any concerns that this improvement is simply from additional model capacity. The significant drop in performance between the settings with and without GT boxes for the Action Transformer is due to only using $R = 64$ proposals. As will be seen in subsequent results, this drop is eliminated when the full model with $R = 300$ proposals is used.
|RoI source||QPr||Head||Val mAP|
Given the strong performance of the Action Transformer for the classification task, we look now in detail to the localization task. As described previously, we isolate the localization performance by merging all classes into a single trivial one. We report performance in Table 2, both with the standard 0.5 IOU threshold, and also with a stricter 0.75 IOU threshold.
The I3D head with RPN boxes excels on this task, achieving almost 93% mAP at 0.5 IOU. The naive implementation of the transformer using a low-resolution query does quite poorly at 77.5%, but by adopting the high-resolution query, the gap in performance is considerably reduced (92.9% to 87.7%, for the IOU-0.5 metric). The transformer is less accurate for localization and this can be understood by its more global nature; additional research on this problem is warranted. However as we will show next, using the HighRes query we can already achieve a positive trade-off in performance and can leverage the classification gains to obtain a significant overall improvement.
Now we put the transformer head together with the RPN base, and apply the entire network to the tasks of detection and classification. We report our findings in Table 3. It can be seen that the Action Transformer head is far superior to the I3D head (24.4 compared to 20.5). An additional boost can be obtained (to 24.9) by using the I3D head for regression and the Action Transformer head for classification – reflecting their strengths identified in the previous sections.
| Method | Modalities | Architecture | Val mAP | Test mAP |
|---|---|---|---|---|
| Single frame | RGB, Flow | R-50, FRCNN | 14.7 | - |
| AVA baseline | RGB, Flow | I3D, FRCNN, R-50 | 15.6 | - |
| ARCN | RGB, Flow | S3D-G, RN | 17.4 | - |
| YH Technologies | RGB, Flow | P3D, FRCNN | - | 19.60 |
| Ours (Tx-only head) | RGB | I3D, Tx | 24.4 | 24.30 |
| Ours (Tx+I3D head) | RGB | I3D, Tx | 24.9 | 24.60 |
| Ours (Tx+I3D+96f) | RGB | I3D, Tx | 25.0 | 24.93 |
Finally, we compare our models to the previous state of the art on the test set in Table 4. We find the Tx+I3D head obtains the best performance, and simply adding temporal context at test time (96 frames compared to 64 frames at training) leads to a further improvement. We outperform the previous state of the art by more than 7.5% absolute points on the validation set, and the CVPR 2018 challenge winner by more than 3.5%. It is also worth noting that our approach is much simpler than most previously proposed approaches, especially the challenge submissions that are ensembles of multiple complex models. Moreover, we obtain this performance only using raw RGB frames as input, while prior works use RGB, Flow, and in some cases audio as well.
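The headline validation numbers can be checked with a couple of lines of arithmetic. The two mAP values are the ones quoted in the text (17.4 for the previous best validation result and 25.0 for the best Action Transformer model); the variable names are only for illustration.

```python
# Validation mAP values quoted in the text (percent).
previous_best_val_map = 17.4
action_transformer_val_map = 25.0

absolute_gain = action_transformer_val_map - previous_best_val_map
relative_gain = absolute_gain / previous_best_val_map

print(f"absolute gain: {absolute_gain:.1f} mAP points")   # absolute gain: 7.6 mAP points
print(f"relative gain: {100 * relative_gain:.1f}%")       # relative gain: 43.7%
```

This is consistent with the claim of more than 7.5 absolute points on the validation set.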
All our models so far have used class agnostic regression, data augmentation and Kinetics pre-training, techniques we observed early on to be critical for good performance on this task. We now validate the importance of those design choices. We compare the performance using the I3D head network as the baseline in Table 5. As evident from the table, all three are crucial in getting strong performance. In particular, class agnostic regression is an important contribution. While typical object detection frameworks [15, 18] learn separate regression layers for each object category, this does not make sense in our case as the ‘object’ is always a human. Sharing those parameters helps classes with few examples to also learn a good person regressor, leading to an overall boost. Finally, we note the importance of using a sufficient number of proposals in the RPN. As can be seen in Table 3, reducing the number from 300 to 64 decreases performance significantly for the Action Transformer model. The I3D head is less affected. This is interesting because, even with 64, we are using far more proposals than the actual number of people in the frame.
We now analyze the Action Transformer model. Apart from obtaining superior performance, this model is also more interpretable by explicitly encoding bottom up attention. We start by visualizing the key/value embeddings and attention maps learned by the model. Next we analyze the performance vis-a-vis specific classes, person sizes and counts; and finally visualize common failure modes.
Learned embeddings and attention: We visualize the 128D ‘key’ embeddings and attention maps in Figure 4. We visualize the embeddings by color-coding a 3D PCA projection. We show two heads out of the six in our 2-head 3-layer Action Transformer model. For attention maps, we visualize the average softmax attention over the 2 heads in the last layer of our Tx head. It is interesting to note that our model learns to track the people over the clips, as shown by the embeddings where all ‘person’ pixels are the same color. Moreover, for the first head all humans have the same color, suggesting a semantic embedding, while for the other they have different colors, suggesting an instance-level embedding. Similarly, the softmax attention maps learn to attend to and track faces, hands and other parts of the person of interest as well as the other people in the scene. The model also tends to attend to objects the person interacts with, like the vacuum cleaner and coffee mugs. This makes sense, as many actions in AVA, such as talking, listening and holding an object, require focusing on the faces and hands of people, and on objects, to deduce.
Breaking down the performance: We now break down the performance of our model into certain bins. We start by evaluating the performance per class in Figure 5 (a). We sort the performance according to increasing amounts of training data, shown in green. While there is some correlation between the training data size and performance, we note that there exist many classes with enough data but poor performance, like smoking. We note that we get some of the largest improvements in classes such as sailing boat and watching TV, which would benefit from our Action Transformer model attending to the context of the person. Next, we evaluate the performance with respect to the size of the person in the clip, defined by the percentage area occupied by the GT box, in Figure 5 (b). For this, we split the validation set into bins, keeping predictions and GT within certain size limits. We find the size thresholds by sorting all the GT boxes and splitting them into similar sized bins, hence ensuring similar ‘random’ performance for each bin. We find performance generally increases with bigger boxes, presumably because it becomes progressively easier to see what the person is doing up close. Finally, we evaluate the performance with respect to the number of GT boxes labeled in a clip in Figure 5 (c). We find decreasing performance as we add more people in a scene.
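The size-binning procedure (sort the GT box areas, then split them into bins with similar counts) can be sketched as follows. The function names `size_bin_edges` and `assign_bin`, the number of bins and the example areas are hypothetical; the text does not state how many bins were used.

```python
from typing import List


def size_bin_edges(areas: List[float], num_bins: int) -> List[float]:
    """Thresholds that split the sorted areas into bins of (nearly) equal count."""
    ordered = sorted(areas)
    n = len(ordered)
    return [ordered[(i * n) // num_bins] for i in range(1, num_bins)]


def assign_bin(area: float, edges: List[float]) -> int:
    """Index of the bin an area falls into (0 = smallest boxes)."""
    return sum(area >= edge for edge in edges)


# Toy GT box areas as a fraction of the frame.
gt_areas = [0.02, 0.5, 0.1, 0.3, 0.05, 0.8]
edges = size_bin_edges(gt_areas, num_bins=3)
bins = [assign_bin(a, edges) for a in gt_areas]
counts = [bins.count(b) for b in range(3)]
```

With these toy areas each of the three bins receives two boxes, which mirrors the goal of keeping chance-level performance comparable across bins.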
Qualitative Results: We visualize some successes of our model in Figure 6. Our model is able to exploit the context to recognize actions such as ‘watching a person’, which are inherently hard when just looking at the actor. Finally, we analyze some common failure modes of our best model in Figure 7. The columns show some common failure modes like (a) similar action/interaction, (b) identity and (c) temporal position. A similar visualization for all classes is provided in the supplementary material.
We have shown that the Action Transformer network is able to learn spatio-temporal context from other human actions and objects in a video clip, and localize and classify human actions. The resulting embeddings and attention maps (that are learned indirectly as part of the supervised action training) have a semantic meaning. The network exceeds the state-of-the-art on the AVA dataset by a significant margin. It is worth noting that previous state-of-the-art networks have used a motion/flow stream in addition to the RGB [7, 45], so adding flow as input is likely to boost performance also for the Action Transformer network. Nevertheless, as discussed in the previous section, performance is far from perfect, and we have suggested several avenues for improvement and investigation.
The authors would like to thank Viorica Patraucean, Relja Arandjelović, Jean-Baptiste Alayrac, Anurag Arnab, Mateusz Malinowski and Claire McCoy for helpful discussions and encouragement.
ActivityNet leaderboard. Spatio-temporal action localization (AVA-1, computer vision only). http://activity-net.org/challenges/2018/evaluation.html.
Tube convolutional neural network (T-CNN) for action detection in videos. In ICCV, 2017.