
STAGE: Spatio-Temporal Attention on Graph Entities for Video Action Detection

12/09/2019
by   Matteo Tomei, et al.

Spatio-temporal action localization is a challenging yet fascinating task that aims to detect and classify human actions in video clips. In this paper, we develop a high-level video understanding module which can encode interactions between actors and objects both in space and time. In our formulation, spatio-temporal relationships are learned by performing self-attention operations on a graph structure connecting entities from consecutive clips. Notably, the use of graph learning is unprecedented for this task. From a computational point of view, the proposed module is backbone independent by design and does not need end-to-end training. When tested on the AVA dataset, it demonstrates a 10-16% relative improvement over the baseline. Further, it can outperform or bring performances comparable to state-of-the-art models which require heavy end-to-end and synchronized training on multiple GPUs. Code is publicly available at: https://github.com/aimagelab/STAGE_action_detection.


1 Introduction

Video action detection is a challenging task which requires an algorithm to detect and classify human actions in a video clip [39, 42, 53]. As such, it borrows from object detection, because of its localization requirements, and from video classification, as an understanding of the temporal evolution across frames is needed. Tackling the problem requires addressing specific challenges that lie at the intersection between low-level and high-level video understanding.

Figure 1: We propose a graph-based module for video action detection which encodes relationships between multiple actors and objects in a spatio-temporal neighborhood.

Firstly, fine-grained and discriminative spatio-temporal features are needed to represent video chunks in a compact and manageable form. This has motivated recent efforts to design novel architectures for video feature extraction [40, 4, 41], which are often conceived for both video classification and action detection [7, 49]. Notably, good spatio-temporal features can also promote an accurate localization of actions, which is a mandatory requirement for the task. On the other hand, detecting and understanding human actions is not just a matter of extracting mid-level features, and demands more high-level reasoning.

Interestingly, indeed, the performance of video action detection networks that take inspiration from object detection architectures is still far from satisfactory. This can be partly explained by their lack of proper context understanding, as they cannot model the relationships between actors and elements from the context [42]. Most importantly, the presence of other people in the scene, together with their behaviors, influences the understanding of the actor at hand. For example, it would be difficult to recognize whether a person is listening to someone just by looking at a bounding box around the first actor. Also, the understanding of actions is often linked to the presence of objects in the scene and their relationships with actors, as many human actions involve object handling.

Finally, a long-term understanding of the video is required to properly recognize actions and interactions between people. Actions can indeed have different temporal granularities, and this demands appropriate solutions to represent both small temporal variations and long-range temporal dependencies. While most feature extraction networks can properly handle the first case, managing long temporal extents must be tackled at a higher level of the pipeline.

Following these premises, in this paper we devise a novel approach for video action detection which can jointly detect and classify actions. Our approach focuses on high-level video modeling, and considers both high-level interactions between different people in the scene and interactions between actors and objects. Further, it also takes into account long-range temporal dependencies by connecting consecutive clips during learning and inference. Our solution is general enough to exploit existing backbones for feature extraction, and – most noticeably – it does not need end-to-end training on the backbone to obtain state-of-the-art results.

As shown in Fig. 1, our model builds upon a graph-based video representation, which is both spatial and temporal. Relationships inside the graph are learned by performing self-attention operations over nodes. To the best of our knowledge, the use of graph learning is unprecedented for video action detection. We train our model on the Atomic Visual Actions (AVA) dataset [13], which represents a challenging test-bed for recognizing human actions and exploiting the role of context. Experimentally, we demonstrate that our approach achieves results on par with the state of the art, without end-to-end training of the feature extraction backbone, and therefore with greatly reduced computational requirements.

Contributions. To sum up, our contributions are as follows:


  • We propose a novel model for video action detection, which considers spatio-temporal relationships between actors and objects in consecutive clips to detect and classify human actions.

  • Our model is based on a spatio-temporal graph representation of the video, which is learned through self-attention operations. Further, it is independent of the feature extraction stage and does not need costly end-to-end training.

  • We perform extensive experiments on the challenging AVA dataset [13] to validate our approach and its components. Finally, we demonstrate that our model achieves state-of-the-art results without the need for an end-to-end training of the feature extraction backbone.

2 Related work

Deep networks for video understanding. CNNs are currently the state-of-the-art approach to extract spatio-temporal features for video processing and understanding [40, 4, 43, 51]. Most proposals integrate either full 3D convolutional kernels over time, or a combination of 2D spatial kernels and 1D temporal filters [8, 41, 32]. Despite the rise of convolution-based approaches, it is still a common practice to integrate both RGB and optical flow in two separate streams [36, 46, 9] to capture appearance and motion respectively. Li et al. [27] have recently proposed a 2D shared convolution which slides over the three 2D projections of a spatio-temporal tensor, while Hussein et al. [17] developed multi-scale temporal convolutions using different kernel sizes and dilation rates. Feichtenhofer et al. [7] presented a two-pathway network, consisting of a low frame rate path capturing semantic information and a high frame rate path encoding motion. Differently from other proposals for video action detection, our approach has the additional benefit of being independent of the feature extraction stage and can be easily integrated with any video understanding backbone.

Spatio-temporal action localization. Video understanding networks are usually trained for action classification, i.e. predicting a single action occurring in a video clip, and several datasets have been proposed for this purpose [22, 31, 21, 26, 1]. Since widespread video clips usually involve multiple actors and different actions, two new tasks have recently been gaining attention to understand videos in a more detailed way: temporal action detection [3, 35, 52] and spatio-temporal action localization [13, 38, 37, 19]. While temporal action detection aims to segment the short temporal interval in which the action takes place and to classify it, action localization is intended to detect people in space and time and to classify their (possibly multiple) actions. For the second task, many approaches [49, 7] exploit human proposals coming from pre-trained image detectors [33] and replicate them in time to build straight spatio-temporal tubes; others extend image detection architectures to infer more precise tubelets [16, 20, 34, 12] before classifying actions. Gu et al. [13], besides the AVA dataset, proposed a baseline exploiting an I3D network encoding RGB and flow data separately, along with a Faster R-CNN [33], to jointly learn action proposals and labels. Ulutan et al. [42] suggested combining actor features with every spatio-temporal region in the scene to produce attention maps between the actor and the context. Girdhar et al. [11], instead, recently proposed a Transformer-style architecture [44] to weight actor features with features from the context around the person.

Graph-based representations. Graph convolutional networks were first proposed in [24] and later used in many applications. Graph attention networks [45] have extended this approach, enabling the specification of different weights for different nodes through masked self-attention. Graph-based representations have been used in action recognition [2, 48, 54, 18] to model spatio-temporal relationships, although their application to video action detection is almost unprecedented. Wang et al. [48] proposed to model a video clip as a combination of the whole clip features and the average proposal features, computed by a graph convolutional network based on similarities and spatio-temporal distances between RoIs. Zhang et al. [54] defined the strength of a relation between two nodes as the inverse of the Euclidean distance between their features, after a feature transformation. Our approach employs self-attention to encode people and object relationships in a graph structure, and the spatio-temporal distance between proposals is used to solve spatio-temporal action localization.

3 Proposed approach

The goal of our approach is to localize each actor in a video clip and to classify their actions at a given temporal granularity. As the model outputs a set of detections along with predicted actions at specific keyframes, sampled at a fixed frequency, the common way of handling this granularity is to temporally segment long videos into short clips (typically, 2-3 seconds long) centered on the keyframes, and to process them individually. As the actions performed in a clip depend on actor and object relationships through both space and time, we devise a graph representation in which actor and object detections are nodes, and edges hold the relationships between them. Further, we link graphs from subsequent clips in time, to find relations between clips belonging to the same longer video.

3.1 Graph-based clip representation

We propose a graph-based representation of each clip, where nodes consist of actor and object features. Denoting the number of actors and objects belonging to the clip centered at a given keyframe as N_a and N_o respectively, the total number of entities of the graph is N = N_a + N_o. Under this configuration, a clip can be represented as an N × d matrix, where d is the feature size.

Since actors can have meaningful relations both among themselves and with objects in the scene, we employ a fully-connected graph representation, in which every node is connected to all the others, as the input of our network. Following the assumption that the closer an entity is to a person, the higher the probability that it affects that person's actions, a link between two entities in the graph is made stronger if they are spatially close. The graph configuration is therefore given by a dense adjacency matrix A, in which each entry is defined as the proximity between the corresponding entities i and j, computed as follows:

(1)

where d_ij is the Euclidean distance between the bounding box centers of entities i and j:

d_ij = √( (x_i − x_j)² + (y_i − y_j)² )     (2)

We propose an adjacency matrix representation which allows us to easily link graphs coming from subsequent clips of the same video, as will be explained in Sec. 3.3.
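For concreteness, the following is a minimal PyTorch sketch of a dense proximity-based adjacency matrix for the entities of a single clip. The exact proximity function of Eq. 1 is not reproduced here; the inverse-distance form below is an assumption used purely for illustration, not the released implementation.

```python
import torch

def proximity_adjacency(centers):
    """Dense adjacency for one clip. `centers` is an (N, 2) tensor holding the
    (x, y) bounding-box centers of the N entities (actors and objects).
    The inverse-distance proximity below is an assumption, not Eq. 1 itself."""
    d = torch.cdist(centers, centers, p=2)   # pairwise Euclidean distances (Eq. 2)
    return 1.0 / (1.0 + d)                   # closer entities get stronger links
```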

Figure 2: Architecture of our model with Temporal Graph Attention. Given consecutive clips, actors and objects are jointly encoded into a temporal and spatial-aware graph structure. The yellow box depicts a single graph-attention head, while the red box depicts the complete graph-attention layer.

3.2 Spatial-aware graph attention

Graph self-attention. We adopt a graph attention module, inspired by [45]. The input of the module consists of the actor and object features h_i of a clip. First of all, the module applies a linear transformation W to these features, in order to obtain a new representation W h_i of each entity. Then a self-attention operator att(·, ·) is applied to the nodes. In particular, the operator computes attention coefficients

e_ij = att( W h_i , W h_j )     (3)

with the scalar e_ij representing the importance of the features of entity j to those of entity i. Since we propose to represent a clip as a fully-connected graph, e_ij is computed for each pair of entities belonging to the same clip, avoiding the need for masking disconnected couples. Based on the original graph attention implementation [45], att is a feedforward layer with parameters a, followed by a LeakyReLU nonlinearity:

e_ij = LeakyReLU( aᵀ [ W h_i ‖ W h_j ] )     (4)

where ‖ indicates concatenation on the channel axis and a is the weight vector of a linear layer. The resulting matrix E of attention coefficients is a square matrix with the same shape as the adjacency matrix A. Separating it into its components, it can be rewritten as:

(5)

where the four blocks contain, respectively, the weights of actors to actors, of objects to actors, of actors to objects, and of objects to objects.

Introducing spatial proximity. The proposed self-attention module, when applied to a clip graph, computes the mutual influence of two entities in feature space, i.e. the influence of an entity on another based on their features. However, it does not consider the adjacency matrix A and the mutual distances between entities.

To introduce the prior given by the spatial proximity inside the clip, we condition the previously computed self-attention matrix E with the adjacency matrix A, which contains the proximity between detections, by taking their Hadamard product, i.e.:

Ẽ = E ⊙ A     (6)

This operation allows us to strengthen the importance of the features of an entity with respect to its neighbors and to weaken relations between entities which lie far from each other. A row-wise softmax normalization is then applied to obtain an importance distribution over entities:

α_ij = exp( ẽ_ij ) / Σ_k exp( ẽ_ik )     (7)

The updated features computed by the module are a linear combination of the starting features using the α_ij as coefficients. In particular, the self-attention module updates the initial features as follows:

h′_i = σ( Σ_j α_ij W h_j )     (8)

where σ is an ELU nonlinearity [5], following the original GAT [45] implementation.
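To make the flow of Eqs. 3-8 concrete, the sketch below implements a single spatial-aware graph attention head in PyTorch: linear transformation, GAT-style feedforward scoring with LeakyReLU, Hadamard product with the proximity matrix, row-wise softmax and ELU. Layer sizes and the absence of dropout are assumptions of this sketch, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGraphAttentionHead(nn.Module):
    """Minimal sketch of one spatial-aware graph attention head (Eqs. 3-8).
    Sizes and initialization are illustrative assumptions."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)    # feature transformation W
        self.att = nn.Linear(2 * out_dim, 1, bias=False)   # a^T [W h_i || W h_j]

    def forward(self, h, adj):
        # h: (N, in_dim) entity features; adj: (N, N) proximity matrix A
        z = self.W(h)                                      # (N, out_dim)
        n = z.size(0)
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.att(pairs).squeeze(-1),
                         negative_slope=0.2)               # Eq. 4 (slope 0.2, see Sec. A.1)
        e = e * adj                                        # Hadamard with proximity (Eq. 6)
        alpha = F.softmax(e, dim=-1)                       # row-wise softmax (Eq. 7)
        return F.elu(alpha @ z)                            # updated features (Eq. 8)
```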

3.3 Temporal graph attention

In this section, we extend the proposed attention-based approach to encode consecutive clips. Since different clips are not required to have the same number of entities (actors and objects), we propose a single adjacency matrix A with as many rows and columns as the total number of entities in all the batch's clips. Besides allowing us to manage clips with a variable number of entities, this solution is suitable for linking multiple graphs together and avoids padding. An example is shown in Fig. 3, where we have three clips per batch: dark red elements contain the proximity between actors of the same clip, dark blue elements contain the proximity between objects of the same clip, and dark violet elements contain the proximity between actors and objects of the same clip.

We can easily link entities belonging to subsequent clips by computing the proximity of their boxes (Eq. 1), assuming that the temporal granularity is small enough to ensure the consistency of the scene between two adjacent clips. The light-colored elements of A in Fig. 3 contain the proximity between actors (light red), objects (light blue) and actors/objects (light violet) belonging to two consecutive clips. White elements are zeros.

One could increase the temporal extent by simply replacing zeros with the proximity between temporally distant entities. The insight behind this solution is that we can consider a batch of successive clips as a single graph, where the presence of a direct edge between two entities is given by their temporal distance, while the strength of this edge is given by their spatial distance.
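The block structure of Fig. 3 can be sketched in code as follows: entities are concatenated over the clips of the batch, and pairs whose temporal distance exceeds the chosen temporal extent receive a zero entry. The proximity function is the same inverse-distance assumption used in the Sec. 3.1 sketch, and the helper is illustrative only.

```python
import torch

def batch_adjacency(centers_per_clip, temporal_extent=1):
    """`centers_per_clip` is a list of (N_t, 2) tensors of box centers, one per
    consecutive clip. Entities of clips farther apart than `temporal_extent`
    get zero entries (the white blocks of Fig. 3); all other pairs get their
    spatial proximity (inverse distance assumed, as above)."""
    centers = torch.cat(centers_per_clip, dim=0)
    clip_id = torch.cat([torch.full((c.size(0),), t, dtype=torch.long)
                         for t, c in enumerate(centers_per_clip)])
    proximity = 1.0 / (1.0 + torch.cdist(centers, centers, p=2))
    t_dist = (clip_id[:, None] - clip_id[None, :]).abs()
    return torch.where(t_dist <= temporal_extent, proximity,
                       torch.zeros_like(proximity))
```

With temporal_extent=1 this reproduces the banded layout of Fig. 3; increasing it fills in the zero blocks, as discussed above.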

Self-attention over time. Since the size of A is now N × N, where N is the total number of entities in the consecutive clips forming the batch, we need the matrix E computed in Eq. 4 to be N × N too, in order to make the Hadamard product of Eq. 6 feasible. For this reason, as shown in the yellow background portion of Fig. 2, we compute the importance of an entity to all the other entities belonging to clips of the batch. Therefore, in our implementation, the self-attention module computes attention weights for each pair of entity features, without any masking. For a three clips per batch setting, the complete weights matrix looks like the following:

(9)

where each block is the weights matrix from one group of entities to another: one block contains the weights of the actors of a given clip to the actors of the same clip, another contains the weights of the objects of that clip to its actors, and so on for every pair of clips in the batch.

The Hadamard product

Ẽ = E ⊙ A     (10)

is in charge of strengthening or weakening the weights between entities belonging to the same timestamp or sufficiently close in time, and of zeroing the weights between temporally distant entities. Finally, the linear combination of Eq. 8 replaces the features of an entity with a weighted sum of the features directly connected to it in the graph: these features now come from entities belonging to the same clip and to temporally close clips.

Multi-head multi-layer approach. As will be analyzed in Sec. 4.2, stacking the output of different graph attention heads and using a cascade of layers leads to improved performance. The graph attention head and the graph attention layer are illustrated in the yellow and red background sections of Fig. 2, respectively.

Figure 3: Adjacency matrix in a three clips per batch configuration, containing the spatial proximity between entities belonging to the same clip (dark-colored sub-matrices) and to consecutive clips (light-colored sub-matrices). Red stands for actor-actor, blue stands for object-object and violet stands for actor-object. White elements are zeros. Box center coordinates are indicated as x and y.

It is worth noting that the number of graph attention layers affects the temporal receptive field in the graph. Considering a temporal extent of 1 (corresponding to a graph where entities are directly connected only with other entities of the same clip and with entities from the previous and following clips), each layer after the first one increases the temporal receptive field by one clip in each direction. In a two-layer setting, for instance, the second graph attention layer will compute the features of the first clip as a weighted sum of the features belonging to the first and the second clip, but features from the second clip have already been affected by features from the third clip in the first graph attention layer.

4 Experimental results

All the experiments we performed to validate the effectiveness of our method start from a pre-trained backbone, which is kept fixed during training. The pre-trained backbones take raw clips as input and output features for each entity. Person boxes come from a stand-alone person detector (and also from ground-truth boxes during training) applied to the keyframe and are replicated in time to let RoIAlign [14] compute RoI features. The backbone is always trained on the Kinetics-400 dataset [22] and fine-tuned on the AVA dataset [13].

Datasets and Metrics. We evaluate our model on version 2.1 of the challenging AVA dataset [13]. AVA aims to localize people both in space and time and to predict their actions. AVA consists of 235 training and 64 validation movie videos, each 15 minutes long. The temporal granularity in AVA is 1 second, which means that ground-truth boxes and labels are available for one frame per second, leading to 211K training and 57K validation clips centered on these keyframes. Each actor is involved in one or more of the 80 atomic action classes. AVA's main challenge concerns its long-tail property: tens of thousands of samples are available for some classes, while only a few dozen are available for others. The performance of a model on AVA is measured by a keyframe-level mean average precision (mAP) with a 50% IoU threshold. Following the authors' suggestion [13] and prior works, we train our architecture on all the 80 AVA classes, but we evaluate its performance only on the 60 classes with at least 25 validation examples.

Detection architecture and performance. We use a Faster R-CNN [33] with a ResNeXt-101-FPN [15, 29, 50] backbone as the person detector, applied to keyframes. The average precision of this model for the person class is reported in Table 1. In all our experiments, we use a Faster R-CNN pre-trained on COCO [30] boxes and fine-tuned on AVA [13] boxes to detect people. Actor features come from a 3D CNN backbone (discussed in the next section), replicating the previously computed boxes in time to obtain a 3D RoI and applying RoIAlign [14], following previous works [13, 7]. Object features, instead, come from a Faster R-CNN pre-trained on the Visual Genome dataset [25].

Boxes pre-train Train AP@50 Val AP@50
COCO [30] 92.3 90.9
COCO [30]+AVA [13] 97.0 94.9
Table 1: Detection architecture average precision on the keyframes, for the person class. Action recognition is much more challenging than actor localization in AVA.

Backbones setup. We evaluate our method using two different fixed backbones for the feature extraction process. The I3D [4] backbone is trained on ImageNet [6] before being 'inflated' and then on the Kinetics-400 dataset [22]. RoIAlign [14] is applied after the Mixed_4f layer and we fine-tune only the last layers (from the Mixed_5a layer to the final linear classifier) on the AVA dataset [13] for 10 epochs. We also extract features from the R101-I3D-NL backbone adopted in [49], which is likewise pre-trained on the ImageNet [6] and Kinetics-400 [22] datasets, but fine-tuned end-to-end on the AVA dataset by the authors. For the I3D backbone, we use ground-truth boxes and predicted boxes with any score during feature extraction, assigning the labels of a ground-truth box to a predicted box if their IoU is 0.5 or more; we use predicted boxes with a score of at least 0.7 for the evaluation. Following the authors' implementation [49], for the R101-I3D-NL backbone we use ground-truth boxes and predicted boxes with a score of at least 0.9 during feature extraction, assigning the labels of a ground-truth box to a predicted box if their IoU is 0.9 or more; we use predicted boxes with a score of at least 0.85 for the evaluation. Features always come from the last layer of the backbone before classification, after averaging over the space and time dimensions: the feature size is 1024 and 2048 for I3D and R101-I3D-NL respectively.
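As an illustration of the label-assignment rule just described, a possible implementation is sketched below. The box format and the use of torchvision's box_iou are assumptions of the sketch, not details from the paper.

```python
import torch
from torchvision.ops import box_iou

def assign_labels(pred_boxes, gt_boxes, gt_labels, iou_thresh=0.5):
    """A predicted box inherits the labels of its best-matching ground-truth box
    when their IoU is at least `iou_thresh` (0.5 for I3D features, 0.9 for
    R101-I3D-NL features). Boxes are (N, 4) tensors in (x1, y1, x2, y2) format;
    gt_labels is an (M, num_classes) multi-hot tensor, since AVA actors can
    have several simultaneous actions."""
    iou = box_iou(pred_boxes, gt_boxes)          # (N, M) pairwise IoU
    best_iou, best_gt = iou.max(dim=1)
    labels = torch.zeros(pred_boxes.size(0), gt_labels.size(1))
    matched = best_iou >= iou_thresh
    labels[matched] = gt_labels[best_gt[matched]].float()
    return labels
```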

Implementation and training details. Our model takes the pre-computed features as input and computes 81 independent class probabilities (80 AVA actions plus a background class) for each actor (objects are removed before the last linear layer in Fig. 2). We also explicitly add the bounding box height, width and center coordinates to the features, as we found this beneficial.

Each graph attention head consists of two fully-connected layers. The first one reduces the feature size depending on the number of heads used in that layer: using K heads means an output feature size of d/K, where d is the input feature size. This makes possible the residual connection in the red background section of Fig. 2, after the concatenation of the output of every head. The second linear layer computes the attention weights (Eq. 4). A graph attention layer consists of a dropout, a number of graph attention heads whose outputs are concatenated, and a fully-connected layer followed by a residual block and a layer normalization block. Finally, one last linear layer gives the independent per-class probabilities and a sigmoid cross-entropy loss is computed. During training, batches are built from temporally consecutive clips. In our experiments, we adopt a temporal extent of 1, directly connecting entities of the same clip and of consecutive clips in the graph.
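The layer composition described above (dropout, concatenated heads, a fully-connected layer, a residual connection and layer normalization) can be sketched as follows, reusing the SpatialGraphAttentionHead sketch from Sec. 3.2. Dimensions and the final classifier size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GraphAttentionLayer(nn.Module):
    """Sketch of one STAGE graph-attention layer: dropout, K heads whose outputs
    are concatenated back to the input size, a linear layer, a residual
    connection and LayerNorm. Reuses the SpatialGraphAttentionHead sketch."""
    def __init__(self, dim, num_heads=4, dropout=0.5):
        super().__init__()
        assert dim % num_heads == 0
        self.drop = nn.Dropout(dropout)
        self.heads = nn.ModuleList(
            [SpatialGraphAttentionHead(dim, dim // num_heads) for _ in range(num_heads)])
        self.fc = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, adj):
        x = self.drop(h)
        x = torch.cat([head(x, adj) for head in self.heads], dim=-1)  # size dim again
        return self.norm(h + self.fc(x))                              # residual + LayerNorm

# Final per-actor classification: 80 AVA actions plus a background class,
# trained with independent sigmoid cross-entropies (the feature size of 1024
# is illustrative and corresponds to the I3D case).
classifier = nn.Linear(1024, 81)
criterion = nn.BCEWithLogitsLoss()
```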

We used a batch size of 6 clips for I3D features and a batch size of 8 clips for R101-I3D-NL features. The Adam optimizer [23] is adopted in all our experiments, with a different learning rate for I3D and R101-I3D-NL features. The learning rate is decreased by a factor of 10 when the validation mAP does not increase for ten consecutive epochs. Training is stopped when the mAP does not increase for five consecutive epochs after reducing the learning rate. All the experiments are performed on a single V100 GPU: on average, a single experiment takes about 20-30 epochs to converge, and lasts less than a day.
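The learning-rate schedule described above maps naturally onto PyTorch's ReduceLROnPlateau scheduler. The sketch below is illustrative only: the stand-in model, the learning rate value and the evaluation helper are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

# `model` stands in for the STAGE module; the learning rate value is illustrative.
model = nn.Linear(1024, 81)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Divide the learning rate by 10 once the validation mAP has not improved
# for ten consecutive epochs, as described above.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=10)

# At the end of every epoch:
#     val_map = evaluate_map(model, val_loader)   # assumed evaluation helper
#     scheduler.step(val_map)
```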

Model fixed backbone mAP@50
AVA [13] 15.6
ACRN [39] 17.4
STEP [53] 18.6
Better baseline [10] 21.9
SMAD [54] 22.2
RTPR [28] 22.3
ACAM [42] 22.7
VATX [11] 24.9
SlowFast [7] 26.3
FBO (R101-I3D-NL) [49] 26.8
FC (I3D [4]) 19.7
STAGE (I3D [4]) 23.0
FC (R101-I3D-NL [49]) 23.9
STAGE (R101-I3D-NL [49]) 26.3
Table 2: Mean average precision comparison between our method and previous approaches on the AVA validation set. Despite the lack of an end-to-end fine-tuning, our module outperforms many previous works, reaching performances comparable to state-of-the-art architectures. “FC” indicates a fully connected classifier.
Model fixed backbone Person Pose Person-Person Interaction Person-Object Interaction
I3D backbone [4] + FC 37.4 20.4 12.2
SMAD [54] 41.9 22.0 14.3
ACAM [42] 42.5 23.5 13.3
I3D backbone [4] + Ours 40.4 23.5 15.7
R101-I3D-NL [49] + FC 41.4 26.5 15.5
R101-I3D-NL [49] + Ours 43.4 29.0 18.1
Table 3: Performances reported as the mean average precision for AVA person pose classes, person-person interaction classes and person-object interaction classes. Our module improves performances especially for classes involving interactions, which are the majority in AVA.

4.1 Main results

Since fine-tuning the whole 3D convolutional backbone would require a large number of GPU hours, we opt for a more computationally efficient solution: we train only our attention-based module, starting from pre-computed features. Experiments reveal that we are able to improve the backbone performance by 10%-16%, reaching results comparable to those of state-of-the-art methods which fine-tune the whole network using synchronized training on 8-128 GPUs. It is also important to note that our graph-attention block is trained without any data augmentation (except for considering both ground-truth and predicted boxes during feature extraction), while other end-to-end approaches often adopt random flipping, random scaling and random cropping.

Table 2 shows the mean average precision with a 50% IoU threshold for our method, considering both backbones, and for a number of competitors. We observe a relative improvement of more than 16% for the I3D backbone (19.7 → 23.0) and of about 10% for the R101-I3D-NL backbone (23.9 → 26.3), when replacing the linear classifier with our graph-attention block. The slightly lower gap in the second case is probably due to the presence of non-local operations [47] in the backbone, which can capture some relations between entities. As can be observed, our results using the I3D backbone are superior to many approaches which employ the same backbone and train end-to-end. Further, the result obtained with the R101-I3D-NL backbone is on par with the performance of the recent SlowFast [7] network trained end-to-end, and close to the best published results on AVA, which also adopt end-to-end training. This underlines that the importance of modelling high-level entities in the video is at least on par with the importance of extracting better spatio-temporal features.

Table 3 shows performances grouped by action type in AVA: person pose (13 classes), person-person interaction (15 classes) and person-object interaction (32 classes). As can be seen, our model improves the mAP especially for actions consisting of interactions between entities, thus confirming the benefit of modelling such interactions. At the same time, we still observe a slight improvement in performance on pose classes, which are not explicitly modelled by our approach.

Finally, per-class performances are reported in Fig. 4, where classes are sorted by decreasing number of instances in the training set. Although performance tends to increase with the number of training samples, there are some cases which do not follow this trend: some classes with many examples are still difficult to recognize (e.g. touch (an object), which is a very generic class, or smoke, which requires recognizing small objects, like cigarettes), while others with fewer instances are easier to classify (e.g. drive, swim).

Figure 4: Per-class performances of our module compared to a fully-connected classifier, both trained starting from pre-computed I3D [4] features. Classes with the highest absolute gain are watch (e.g., TV) (+14.0 AP), listen to (a person) (+10.7 AP), play musical instrument (+ 10.7 AP), all involving interactions with other objects and/or actors.

4.2 Ablation study

To validate the importance of all the choices made in the graph-attention layer, we run several ablation experiments. Table 4 shows the effect of varying the number of graph-attention heads and layers, using I3D features. We observe a bigger gap in the validation mAP when changing the number of layers than when changing the number of heads. Overall, the best configuration obtained after a grid search is 4 heads and 2 layers for I3D features, and 2 heads and 2 layers for R101-I3D-NL features. Further, in Table 5 we report the validation mAP obtained by training our block while removing some key components, starting from pre-computed I3D features. The mean average precision drops when removing the spatial prior between detections, when removing the temporal links between consecutive clips, and when considering only actor nodes and not objects. We also investigate the use of dot-product attention, by replacing the weights of Eq. 4 with weights computed through the Transformer self-attention [44], as follows:

Attention(Q, K, V) = softmax( Q Kᵀ / √d_k ) V     (11)

where Q, K and V come from three separate linear transformations of the input features. In this setting, however, we observe a significant drop in performance. Finally, increasing the temporal extent, by reducing the number of zero blocks in Fig. 3, does not seem to further increase performance.
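For reference, the dot-product variant used in this ablation can be sketched as follows. The head size, and whether the spatial prior is kept in this setting, are assumptions of the sketch rather than details taken from the paper.

```python
import math
import torch
import torch.nn as nn

class DotProductAttention(nn.Module):
    """Sketch of the Transformer-style alternative of Eq. 11: attention weights
    come from scaled query-key products instead of the GAT-style scoring."""
    def __init__(self, dim, d_k=64):
        super().__init__()
        self.q = nn.Linear(dim, d_k)
        self.k = nn.Linear(dim, d_k)
        self.v = nn.Linear(dim, d_k)
        self.scale = math.sqrt(d_k)

    def forward(self, h, adj):
        e = self.q(h) @ self.k(h).t() / self.scale   # scaled dot-product scores
        alpha = torch.softmax(e * adj, dim=-1)       # spatial prior kept (assumption)
        return alpha @ self.v(h)
```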

Heads \ Layers   1     2     3
2 21.2 22.7 22.0
4 21.7 23.0 22.8
8 21.7 22.7 21.9
Table 4: Validation mAP obtained considering different numbers of graph-attention heads and layers. The I3D backbone features are used.
I3D backbone + mAP@50
STAGE 23.0
STAGE w/o boxes proximity 21.5
STAGE w/o temporal extent 22.1
STAGE w/o objects features 22.2
STAGE w/ Transformer [44] attention 21.6
Table 5: Validation mAP obtained by removing or changing some key components in our module. We replace the proximity values with ones in the adjacency matrix, we remove temporal connections by replacing the light-colored sub-matrices of Fig. 3 with zeros, we remove object features to consider only actors, and finally we use the Transformer's self-attention [44] instead of the GAT self-attention [45].
Figure 5: Qualitative results showing the keyframes of the evaluated clips, with red boxes denoting actors performing actions, and blue boxes denoting objects. The depicted actions are: Answer phone, Drive, Sail boat, Eat, Ride, Read, Play musical instrument, Smoke, Carry/hold (an object), Hug (a person), Kiss (a person), Run/jog.

4.3 Qualitative analysis

We present some qualitative results obtained on clips of the AVA validation set in Fig. 5. Here, we only show the central keyframe of the clip; red and blue boxes represent predicted actors and objects respectively. For simplicity, we highlight only the actor involved in the action (although other actors may be present in the scene), except for the Hug and Kiss classes, where two actors perform the same action. Only predicted objects with a score above a fixed threshold are shown, although we used all of them during training.

As can be seen, our approach is capable of detecting actions which involve relationships with objects and other people. On average, we qualitatively observe that our spatio-temporal graph-based module is able to improve the recognition of human actions. This further confirms the role of modelling high-level relationships between entities.

5 Conclusion

In this work we presented STAGE, a novel graph-attention module which can be easily integrated into any video understanding backbone. The module computes updated actor features based on neighboring entities in both space and time. This is done by considering consecutive clips as a single learnable graph, where actors and objects are the nodes, while edges hold their relationships. The temporal distance between entities determines the presence of a direct edge between them in the graph, while the spatial distance, along with attention weights, defines its strength. Through a number of experiments on the Atomic Visual Actions (AVA) dataset, we demonstrate that our module can bring performance better than or comparable to that of state-of-the-art methods, especially for interaction-based classes, even without end-to-end training.

References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §2.
  • [2] W. Brendel and S. Todorovic (2011) Learning spatiotemporal graphs of human activities. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, Figure 4, Table 2, Table 3, §4.
  • [5] D. Clevert, T. Unterthiner, and S. Hochreiter (2016) Fast and accurate deep network learning by exponential linear units (elus). In Proceedings of the International Conference on Learning Representations, Cited by: §3.2.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  • [7] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the International Conference on Computer Vision, Cited by: Table 7, §1, §2, §2, §4.1, Table 2, §4.
  • [8] C. Feichtenhofer, A. Pinz, and R. Wildes (2016) Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [9] C. Feichtenhofer, A. Pinz, and A. Zisserman (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [10] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2018) A better baseline for ava. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: Table 7, Table 2.
  • [11] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 7, §2, Table 2.
  • [12] G. Gkioxari and J. Malik (2015) Finding action tubes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [13] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.3, 3rd item, §1, §2, Table 1, Table 2, §4, §4, §4, §4.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the International Conference on Computer Vision, Cited by: §4, §4, §4.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  • [16] R. Hou, C. Chen, and M. Shah (2017) Tube convolutional neural network (t-cnn) for action detection in videos. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [17] N. Hussein, E. Gavves, and A. W. Smeulders (2019) Timeception for complex action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [18] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-rnn: deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [19] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black (2013) Towards understanding action recognition. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [20] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid (2017) Action tubelet detector for spatio-temporal action localization. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [21] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §2, §4, §4.
  • [23] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, Cited by: §4.
  • [24] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. Proceedings of the International Conference on Learning Representations. Cited by: §2.
  • [25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §4.
  • [26] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [27] C. Li, Q. Zhong, D. Xie, and S. Pu (2019) Collaborative spatiotemporal feature learning for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [28] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei (2018) Recurrent tubelet proposal and recognition networks for action detection. In Proceedings of the European Conference on Computer Vision, Cited by: Table 2.
  • [29] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  • [30] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of the European Conference on Computer Vision, Cited by: Table 1, §4.
  • [31] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, Y. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. (2019) Moments in time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • [32] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [33] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §2, §4.
  • [34] S. Saha, G. Singh, and F. Cuzzolin (2017) Amtnet: action-micro-tube regression by end-to-end trainable deep architecture. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [35] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [36] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [37] K. Soomro and A. R. Zamir (2014) Action recognition in realistic sports videos. In Computer vision in sports, pp. 181–208. Cited by: §2.
  • [38] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §2.
  • [39] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid (2018) Actor-centric relation network. In Proceedings of the European Conference on Computer Vision, Cited by: Table 7, §1, Table 2.
  • [40] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the International Conference on Computer Vision, Cited by: §1, §2.
  • [41] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [42] O. Ulutan, S. Rallapalli, M. Srivatsa, and B. Manjunath (2018) Actor conditioned attention maps for video action detection. arXiv preprint arXiv:1812.11631. Cited by: Table 7, §1, §1, §2, Table 2, Table 3.
  • [43] G. Varol, I. Laptev, and C. Schmid (2017) Long-term temporal convolutions for action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1510–1517. Cited by: §2.
  • [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, Cited by: §2, §4.2, Table 5.
  • [45] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio (2018) Graph attention networks. In Proceedings of the International Conference on Learning Representations, Cited by: §2, §3.2, §3.2, Table 5.
  • [46] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [47] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.2, §4.1.
  • [48] X. Wang and A. Gupta (2018) Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [49] C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick (2019) Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Figure 6, §A.2, Table 7, §1, §2, Table 2, Table 3, §4.
  • [50] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.
  • [51] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision, Cited by: §2.
  • [52] H. Xu, A. Das, and K. Saenko (2017) R-c3d: region convolutional 3d network for temporal activity detection. In Proceedings of the International Conference on Computer Vision, Cited by: §2.
  • [53] X. Yang, X. Yang, M. Liu, F. Xiao, L. S. Davis, and J. Kautz (2019) STEP: spatio-temporal progressive learning for video action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, Table 2.
  • [54] Y. Zhang, P. Tokmakov, M. Hebert, and C. Schmid (2019) A structured model for action detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Table 7, §2, Table 2, Table 3.

Appendix A Supplementary material

In the following sections we provide additional specifics about our STAGE module, its computational requirements, additional implementation details and further quantitative and qualitative results.

Figure 6: Per-class performances of our module compared to a fully-connected classifier, both trained starting from pre-computed R101-I3D-NL [49] features. Classes with the highest absolute gain are play musical instrument (+8.5 AP), sing to (e.g., self, a person, a group) (+8.2 AP), work on a computer (+ 7.9 AP).

A.1 Implementation details

To complement Sec. 4, Table 6 lists the learnable blocks of our architecture, together with their input and output shapes, in a 2-layer, 4-head setting. The keep probability is 0.5 for both the dropout layers in Fig. 2. The alpha parameter of the LeakyReLU in Eq. 4 is 0.2. Since our batches consist of consecutive clips coming from the same video, during training we use a shuffle strategy which preserves contiguity inside the mini-batch. Clips are sorted with respect to their timestamp, and mini-batches are built by randomly chunking the sorted list.
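The contiguity-preserving shuffle can be sketched as follows; the clip representation and the fixed-size chunking policy are assumptions of this illustration.

```python
import random

def contiguous_minibatches(clips, batch_size):
    """`clips` is assumed to be a list of (timestamp, clip) pairs. Clips are
    sorted by timestamp, split into contiguous chunks, and the chunks (not the
    single clips) are shuffled, so every mini-batch still contains temporally
    consecutive clips from the same video."""
    clips = sorted(clips, key=lambda c: c[0])
    chunks = [clips[i:i + batch_size] for i in range(0, len(clips), batch_size)]
    random.shuffle(chunks)
    return chunks
```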

A.2 Additional results

Fig. 6 complements Fig. 4 by showing the per-class average precision obtained using the R101-I3D-NL [49] backbone. As can be seen, our module is still able to improve performance by a significant margin with respect to a fully-connected classifier, despite the presence of non-local operations [47] in the backbone. Fig. 7 shows additional qualitative results, highlighting both actors and objects, as explained in Sec. 4.3, while Fig. 8 shows sample failure cases.

A.3 Computational analysis

Our module reaches performances comparable to state-of-the-art architectures without requiring an end-to-end training of the backbone. This significantly reduces computational requirements, since the convolutional backbone incorporates most of the model complexity. Table 7 shows a comparison between our module and a number of competitors which employ existing or novel backbones with end-to-end training. For each approach, we report the number of GPUs used during training, the batch size per GPU, the number of epochs and training time. The comparison is based on the implementation details reported in the original papers when training on the AVA [13] dataset. Our module requires a single GPU for training when pre-extracting backbone features, and less than a day to converge.

Stage Module Input size Output size
Table 6: Architecture of STAGE, showing the learnable building blocks in a 2-layer, 4-head configuration. Each listed graph-attention layer consists of 4 graph-attention heads (each with 2 fully-connected layers), a linear layer and a LayerNorm. Shapes are expressed in terms of N, the number of entities in the batch, and N_a, the number of actors.
Model fixed backbone # GPUs for training Batch size per GPU Training epochs Training time
ACRN [39] 11 1 63 -
Better baseline [10] 11 3 78 -
SMAD [54] 8 2 11 -
ACAM [42] 4 2 70 -
VATX [11] 10 3 71 7 days
SlowFast [7] 128 - 68 -
FBO (R101-I3D-NL) [49] 2 days
STAGE (R101-I3D-NL [49]) 1 8 20 1 day
Table 7: Comparison of computational requirements between our proposal and other approaches. The FBO (R101-I3D-NL) model uses two instances of the backbone, thus requiring twice the complexity of its base model.
Figure 7: Qualitative results showing keyframes of evaluated clips, with red boxes denoting actors performing actions, and blue boxes denoting objects. The depicted actions are: Take a photo, Write, Work on a computer, Watch (e.g., TV), Drink, Open, Hand shake, Give/serve, Martial art, Dance, Swim, Climb, Fight/hit (a person), Lie/Sleep, Watch (a person), Grab (a person), Crawl, Walk.
Figure 8: Sample failure cases (Smoke, Drive, Read, Write, Answer phone, Sail boat), where the spatio-temporal graph attention does not seem to help. Actions are sometimes assigned to the wrong actor when they are very close to the one actually performing the action (Smoke, Drive, Read); some actions are misclassified when they are very similar to others (Write instead of Cut, Answer phone instead of Eat); some object-interaction classes are wrongly assigned to people very close to the objects (Sail boat).