
Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Dynamic scene graph generation aims at generating a scene graph of a given video. Compared to scene graph generation from images, it is more challenging because the dynamic relationships between objects and the temporal dependencies between frames allow for a richer semantic interpretation. In this paper, we propose the Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input in order to capture the temporal dependencies between frames and infer the dynamic relationships. Furthermore, STTran can flexibly take videos of varying lengths as input without clipping, which is especially important for long videos. Our method is validated on the benchmark dataset Action Genome (AG). The experimental results demonstrate the superior performance of our method in terms of dynamic scene graphs. Moreover, a set of ablation studies is conducted and the effect of each proposed module is examined.



1 Introduction

A scene graph is a structural representation that summarizes objects of interest as nodes and their relationships as edges [24, 26]. Recently, scene graphs have been successfully applied to different vision tasks, such as image retrieval [24, 40], object detection, semantic segmentation, human-object interaction [15], image synthesis [22, 3], and high-level vision-language tasks like image captioning [13, 55] or visual question answering (VQA) [23]. Scene graphs are regarded as a promising route towards holistic scene understanding and as a bridge across the large gap between the vision and natural-language domains. Therefore, the task of scene graph generation has attracted increasing attention in the community.

Figure 1: The difference between scene graph generation from an image and from a video. In the video, the person is watching TV and drinking water from the bottle. Dynamic scene graph generation (3rd row) can utilize both spatial context and temporal dependencies, in contrast to image-based scene graph generation (2nd row). Nodes in different colors denote the objects (person, bottle, tv) in the frames.

While great progress has been made in scene graph generation from a single image (static scene graph generation), scene graph generation from a video (dynamic scene graph generation) is a newer and more challenging task. The most popular approach to static scene graph generation builds upon an object detector that generates object proposals and then infers their object classes as well as the relationship types between them. However, objects are not guaranteed to appear consistently in every frame of a video sequence, and the relationship between any two objects may vary over time because of their motion, which is what makes the task dynamic. In this case temporal dependencies come into play, and static scene graph generation methods are therefore not directly applicable to dynamic scene graph generation, as has been discussed thoroughly in [20] and is verified by the experimental results analyzed in Sec. 4. Fig. 1 showcases the difference between scene graph generation from an image and from a video.

Action recognition is an alternative way to detect the dynamic relationships between objects. However, in action recognition, actions and activities are typically treated as monolithic events occurring in videos [4, 25, 41, 60]. Studies in Cognitive Science and Neuroscience show that people perceive an ongoing activity by segmenting it into consistent groups and encoding it into a hierarchical part structure [27]. Take the activity "drinking water" in Fig. 1 as an example. The person starts the activity by holding the bottle in front of her, then holds it up and drinks from it; to make things more complex, she is looking at the television at the same time. Decomposing the activity in this way is useful for understanding how it happens and what is going on. Combined with the scene graph, it even becomes possible to anticipate what will happen next: after the person picks up the bottle in front of her, she is likely to drink water from it. Representing temporal events with a structured representation, the dynamic scene graph, could thus lead to more accurate and grounded action understanding. However, most existing action recognition methods are unable to decompose an activity in this way.

In this paper, we explore how to generate dynamic scene graphs from video sequences effectively. The main contributions of this work are summarized as follows: (1) We propose a novel framework, termed Spatial-Temporal Transformer (STTran), which encodes the spatial context within single frames and decodes visual relationship representations with temporal dependencies across frames. (2) In contrast to the majority of related works, relationship prediction is cast as multi-label classification, and a new strategy for generating dynamic scene graphs from confident predictions is introduced. (3) With several experiments, we verify that temporal dependencies have a positive effect on relationship prediction and that our model improves performance by exploiting them. STTran achieves state-of-the-art results on Action Genome [20]. The source code will be made publicly available after publication.

2 Related Work

Scene Graph Generation

Scene graphs were first proposed in [24] for image retrieval and have attracted increasing attention in the computer vision community [36, 31, 9, 33, 45, 49, 55, 57]. A scene graph is a graph-based representation describing the interactions between objects in an image: nodes indicate the objects, while edges denote their relationships. Applications include image retrieval [40], image captioning [1, 39], VQA [45, 23] and image generation [22]. In order to generate high-quality scene graphs from images, a series of works explores different directions such as utilizing spatial context [58, 34], graph structure [54, 52, 30], optimization [8], reinforcement learning [32, 45], semi-supervised training [7] or a contrastive loss [59]. These works have achieved excellent results on image datasets [26, 36, 28]. Although it is common for multiple relationships to co-occur between a subject-object pair in the real world, the majority of previous works defaults to treating edge prediction as single-label classification. Despite the progress made in this field, all these methods are designed for static images. In order to extend the gains brought by scene graphs from images to video, Ji et al. [20] collected a large dataset of dynamic scene graphs by decomposing the activities in videos, and improved state-of-the-art results for video action recognition with dynamic scene graphs.

Transformer for Computer Vision

The vanilla Transformer architecture was proposed by Vaswani et al. [48] for neural machine translation. Many transformer variants have since been developed and achieve great performance on language modeling tasks, especially the large-scale pre-trained language models such as GPT [38] and BERT [10]. Transformers have also been widely and successfully applied in many vision-language tasks, such as image captioning [53, 18] and VQA [2, 56]. To further bridge the vision and language domains, various BERT-like large-scale pre-trained models have been developed, e.g. for caption-based image retrieval and Visual Commonsense Reasoning (VCR) [37, 29, 44]. Most recently, transformers have been attracting increasing attention in the vision community. DETR was introduced by Carion et al. [5] for object detection and panoptic segmentation. Moreover, transformers have been explored as a way to learn visual features from a given image in place of traditional CNN backbones, achieving promising performance [12, 46]. The core mechanism of the transformer is its self-attention building block, which makes predictions by selectively attending to the input points (each point can be a word representation in a sentence or a local feature from an image), so that context between different input points is captured and the representation of each point is refined. Nonetheless, the above methods focus on learning spatial context with a transformer from a single image, while temporal dependencies play a central role in video understanding. The Action Transformer proposed by Girdhar et al. [14] utilizes a transformer to refine spatio-temporal representations, which are learned by an I3D model [6] and then pooled from the RoIs given by an RPN [39], for recognizing human actions in video clips; in effect, the transformer module is still used to learn spatial context. VisTR was introduced in [51] for video segmentation: the features of each frame, extracted by a CNN backbone, are fed to a transformer encoder to learn the temporal information of the video sequence.

Spatial-Temporal Networks

Spatial-temporal information is the key to video understanding and has been long and well studied. To date, the most popular approaches are RNN/LSTM-based [19] or 3D-ConvNet-based [21, 47] structures. The former take features from each frame sequentially and learn the temporal information [43, 11]; the latter extend traditional 2D convolutions (over the height and width dimensions) with a time dimension for sequential inputs. Simonyan and Zisserman [42] introduce a two-stream CNN structure in which spatial and temporal information is learned on separate streams; residual connections are inserted between the two information streams to allow information fusion. Later, the 2D convolutions in the two-stream structure were inflated into their 3D counterparts, dubbed the I3D model [6]. Non-local Neural Networks [50] introduce another kind of generic self-attention mechanism, the non-local operation: it computes the relatedness between different locations in the input signal and refines the inputs with a weighted sum over the inputs based on this relatedness. The method is easily applied to video input by extending the non-local operation along the time dimension. However, these works target activity recognition and cannot decompose an activity into consistent groups. In this work, we utilize a transformer not only to learn the spatial context between objects within a frame, but also the temporal dependencies between frames, in order to infer the dynamic relationships varying along the time axis.

3 Method

A dynamic scene graph can be modeled as a sequence of static scene graphs with an extra temporal index that captures how the relations evolve over time. Inspired by two characteristics of the transformer, namely (1) the architecture is permutation-invariant and (2) sequences are compatible with positional encodings, we introduce a novel model, the Spatial-Temporal Transformer (STTran), in order to exploit the spatial-temporal context of videos (see Fig. 2).

Figure 2: Overview of our method: the object detection backbone proposes object regions in the RGB video frames and the relationship feature vectors are pre-processed (Sec. 3.2). The encoder of the proposed Spatial-Temporal Transformer (Sec. 3.3) first extracts the spatial context within single frames. The relation representations refined by the encoder stacks of the different frames are combined and added to learned frame encodings. The decoder layers capture temporal dependencies, and relationships are predicted with linear classifiers for the different relation types (such as attention, spatial and contacting). ⊕ indicates element-wise addition, while FFN stands for feed-forward network.

3.1 Transformer

First, we give a brief review of the transformer structure. The transformer was proposed by Vaswani et al. [48] and consists of a stack of multi-head dot-product-attention-based refining layers. In each layer, the input X, which has N entries of dimension d_model, is transformed into queries Q, keys K and values V through linear projections. Note that Q, K and V are normally computed from the same input in the implementation. Then, each entry is refined with the other entries through scaled dot-product attention, defined as:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (1)

To improve the performance of the attention layer, multi-head attention is applied, which is defined as:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \quad (2)

A complete self-attention layer contains the above attention module followed by a normalization layer with a residual connection and a feed-forward layer, which is again followed by a normalization layer with a residual connection. For simplicity, we denote such a self-attention layer as Att. In this work, we design a Spatio-Temporal Transformer based on Att to explore the spatial context, which operates within a single frame, and the temporal dependencies, which operate across the sequence, respectively.
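The scaled dot-product attention of Eq. 1 and the multi-head variant of Eq. 2 can be sketched in a few lines of NumPy. This is a framework-free illustration, not the authors' implementation; the shapes and random weight matrices are chosen only for the example:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V  (Eq. 1)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    # Project the same input into Q, K, V, split into heads, attend per
    # head, concatenate and project with W_o (Eq. 2).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = [attention(q, k, v)
             for q, k, v in zip(np.split(Q, num_heads, axis=-1),
                                np.split(K, num_heads, axis=-1),
                                np.split(V, num_heads, axis=-1))]
    return np.concatenate(heads, axis=-1) @ W_o
```

A full Att layer would additionally wrap this module with residual connections, layer normalization and a feed-forward sublayer.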

3.2 Relationship Representation

We employ Faster R-CNN [39] as our backbone. For the frame I_t at time step t in a given video V = {I_1, ..., I_T} with T frames, the detector provides the visual features, bounding boxes and object category distributions of the object proposals, where N(t) denotes the number of object proposals in the frame. Between the object proposals there is a set of relationships. The representation vector x_{ij} of the relation between the i-th and j-th object proposals contains visual appearance, spatial information and semantic embeddings, and can be formulated as:

x_{ij} = \big\langle\, W_s v_i,\; W_o v_j,\; W_u\,\phi(u_{ij} \oplus f_{box}(b_i, b_j)),\; s_i,\; s_j \,\big\rangle \quad (3)

where ⟨·⟩ is the concatenation operation, φ is the flattening operation and ⊕ is element-wise addition. W_s, W_o and W_u represent the linear matrices for dimension compression. u_{ij} indicates the feature map of the union box computed by RoIAlign [16], while f_{box} is the function transforming the bounding boxes b_i, b_j of subject and object into an entire feature map with the same shape as u_{ij}. The semantic embedding vectors s_i, s_j are determined by the object categories of subject and object. The relationship representations exchange spatial and temporal information in the Spatial-Temporal Transformer.
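The concatenation in Eq. 3 can be illustrated with the following sketch. The dimensions (2048-d detector features, 512-d compressed features, 200-d semantic embeddings, a 256×7×7 union feature map) are assumptions chosen so that the result matches the 1936-d input mentioned in Sec. 4.2; the actual matrices are learned, not random:

```python
import numpy as np

def relation_representation(v_subj, v_obj, union_map, box_mask,
                            W_s, W_o, W_u, s_subj, s_obj):
    # Hypothetical sketch of Eq. 3: compress subject/object appearance,
    # flatten the (union feature map ⊕ box-mask feature) for spatial
    # information, then concatenate with the semantic embeddings.
    x_s = W_s @ v_subj                           # compressed subject feature
    x_o = W_o @ v_obj                            # compressed object feature
    x_u = W_u @ (union_map + box_mask).ravel()   # compressed spatial feature
    return np.concatenate([x_s, x_o, x_u, s_subj, s_obj])
```

With 512 + 512 + 512 + 200 + 200 dimensions, the resulting vector is 1936-d, consistent with the feed-forward network described in the implementation details.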

3.3 Spatio-Temporal Transformer

The Spatio-Temporal Transformer retains the original encoder-decoder architecture [48]. The difference is that the encoder and the decoder are delegated more concrete tasks.

Spatial Encoder

concentrates on the spatial context within a frame; its input is the set X_t of relationship representations of a single frame. The queries, keys and values share the same input, and the output of the n-th encoder layer is given as:

X_t^{(n)} = \mathrm{Att}\big(Q = K = V = X_t^{(n-1)}\big) \quad (4)

The encoder consists of several identical layers that are stacked sequentially; the input of the n-th layer is the output of the (n-1)-th layer, with X_t^{(0)} = X_t. For simplicity, we drop the superscript in the following discussion. Unlike the majority of transformer methods, no additional position encoding is integrated into the inputs, since the relationships within a frame are intuitively parallel. Having said that, the spatial information hidden in the relation representations (see Eq. 3) plays a crucial role in the self-attention mechanism. The final output of the encoder stack is sent to the Temporal Decoder.

Frame Encoding

is introduced for the temporal decoder. Without convolution or recurrence, knowledge of the sequence order, such as a positional encoding, must be embedded into the transformer input. In contrast to the word positions in [48] or the pixel positions in [5], we customize frame encodings to inject the temporal position into the relationship representations. The frame encodings are constructed with learned embedding parameters, since the number of embedding vectors, which depends on the window size of the Temporal Decoder, is fixed and relatively small: E_f = {e_1, ..., e_η}, where the e_i are learned vectors with the same length as the relationship representations.

The widely used sinusoidal encoding method is also analyzed (see Table 5); we adopt the learned encoding because of its overall better performance. Since the window size is fixed, the video length does not affect the length of the frame encodings.
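For comparison with the learned encoding, the sinusoidal variant analyzed in Table 5 can be sketched as follows. This is the standard formulation from [48], indexed by the frame position inside the sliding window; the window size and dimension below are illustrative:

```python
import numpy as np

def sinusoidal_frame_encoding(window_size, dim):
    # Classic sinusoidal positional encoding: even channels get sin,
    # odd channels get cos, with geometrically spaced frequencies.
    pos = np.arange(window_size)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / dim)
    enc = np.zeros((window_size, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

# The learned alternative is simply a trainable (window_size, dim)
# embedding matrix updated by backpropagation.
```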

Temporal Decoder

captures the temporal dependencies between frames. Processing all relationship representations of a long video at once is impractical: not only do the required computation and memory consumption increase greatly, but useful information is also easily overwhelmed by a large number of irrelevant representations. In this work, we therefore adopt a sliding window to batch the frames, so that messages are passed between adjacent frames while interference from distant frames is avoided.

Different from [48], the self-attention layers of our temporal decoder are identical to those of the spatial encoder Att; the masked multi-head self-attention layers are removed. A sliding window of size η runs over the sequence of spatially contextualized representations, and the i-th generated input batch is:

Z_i = [\,X_i, X_{i+1}, \dots, X_{i+\eta-1}\,], \quad 1 \le i \le T - \eta + 1 \quad (5)

where η is the window size and T is the video length. The decoder consists of several stacked identical self-attention layers, similar to the encoder structure. Considering the first layer:

D_i^{(1)} = \mathrm{Att}\big(Q = K = Z_i \oplus E_f,\; V = Z_i\big) \quad (6)

where ⊕ adds the frame encoding e_k to every relation representation of the k-th frame in the window. As the first line of Eq. 6 shows, the same encoding is added to the relation representations of the same frame for the queries and keys. The output of the last decoder layer is used for the final prediction. Because of the sliding window, the relationships in a frame have different representations in different batches; in this work, we choose the earliest representation appearing in the windows.
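The sliding-window batching of Eq. 5 and the "earliest representation" rule can be sketched as follows. This is a simplified illustration; the indexing convention is an assumption, not the authors' code:

```python
def sliding_windows(frames, eta):
    # All batches of eta consecutive frames (Eq. 5): with T frames this
    # yields T - eta + 1 windows, so adjacent frames exchange messages
    # while distant frames never share a window.
    T = len(frames)
    return [frames[i:i + eta] for i in range(T - eta + 1)]

def earliest_output(decoder_outputs, t, eta):
    # Frame t appears in up to eta windows; keep its representation from
    # the earliest window that contains it.
    i = max(0, t - eta + 1)            # index of that earliest window
    return decoder_outputs[i][t - i]   # position of frame t inside it
```

For example, with frames `[0, 1, 2, 3, 4]` and η = 2, the windows are `[0,1], [1,2], [2,3], [3,4]`, and frame 3 is taken from the window `[2,3]`.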

3.4 Loss Function

We employ multiple linear transformations to infer the different kinds of relationships (attention, spatial, contacting) from the refined representations. In reality, the same type of relationship between two objects is not unique in semantics; synonymous actions such as person-holding-broom and person-touching-broom can hold simultaneously. We therefore introduce a multi-label margin loss for predicate classification:

\mathcal{L}_{pred} = \sum_{p \in \mathcal{P}^{+}} \sum_{q \in \mathcal{P}^{-}} \max\big(0,\; 1 - \varphi(p) + \varphi(q)\big) \quad (7)

For a subject-object pair, 𝒫⁺ denotes the set of annotated predicates, while 𝒫⁻ is the set of predicates not in the annotation. φ(p) indicates the computed confidence score of predicate p.
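A minimal sketch of the margin loss in Eq. 7 follows. This is the unnormalized form; practical implementations (e.g. PyTorch's `MultiLabelMarginLoss`) additionally divide by the number of classes:

```python
def multilabel_margin_loss(scores, positive_idx):
    # scores: confidence phi(p) per predicate; positive_idx: annotated
    # predicates P+. For every (positive, negative) pair, penalize
    # negatives scored within a margin of 1 below the positive:
    #   L = sum_{p in P+} sum_{q in P-} max(0, 1 - phi(p) + phi(q))
    negative_idx = [q for q in range(len(scores)) if q not in positive_idx]
    loss = 0.0
    for p in positive_idx:
        for q in negative_idx:
            loss += max(0.0, 1.0 - scores[p] + scores[q])
    return loss
```

The loss is zero exactly when every annotated predicate outscores every unannotated one by at least the margin of 1.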

During training, the object distribution is computed by two fully-connected layers with a ReLU activation and a batch normalization in between, and the standard cross-entropy loss \mathcal{L}_{obj} is utilized for object classification. The total objective is formulated as:

\mathcal{L} = \mathcal{L}_{pred} + \mathcal{L}_{obj} \quad (8)
3.5 Graph Generation Strategies

There are two typical strategies in previous works for generating a scene graph from the inferred relation distribution: (a) With Constraint allows each subject-object pair at most one predicate, while (b) No Constraint allows a subject-object pair to have multiple edges in the output graph via multiple guesses. With Constraint is more rigorous and indicates a model's ability to predict the most important relationship, but it is incompetent for the multi-label task. Although No Constraint reflects the ability of multi-label prediction, its tolerance of multiple guesses introduces wrong information into the generated scene graph.

In order to make the generated scene graph closer to the ground truth, we propose a new strategy named Semi Constraint, which allows a subject-object pair to have multiple predicates, such as person-holding-food and person-eating-food. A predicate is regarded as positive iff its relation confidence is higher than a threshold. Examples generated with the different strategies are shown in Fig. 5.
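The three strategies can be contrasted with a small sketch. The data structure, threshold and top-k values are hypothetical placeholders, not the paper's implementation:

```python
def generate_graph(pair_predicate_scores, strategy, threshold=0.9, k=1):
    # pair_predicate_scores: {(subject, object): {predicate: confidence}}
    # "with": at most one predicate per pair (With Constraint)
    # "semi": every predicate above the threshold (Semi Constraint)
    # "no":   the top-k guesses per pair (No Constraint)
    graph = []
    for (s, o), preds in pair_predicate_scores.items():
        ranked = sorted(preds.items(), key=lambda kv: -kv[1])
        if strategy == "with":
            graph.append((s, ranked[0][0], o))
        elif strategy == "semi":
            graph += [(s, p, o) for p, c in ranked if c >= threshold]
        else:
            graph += [(s, p, o) for p, _ in ranked[:k]]
    return graph
```

With scores {holding: 0.95, eating: 0.92, not_contacting: 0.10} for a person-food pair, With Constraint keeps only holding, Semi Constraint (threshold 0.9) keeps holding and eating, and No Constraint with k = 3 keeps all three.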

At test time, the score of each relationship triplet subject-predicate-object is computed as:

S_{triplet} = s_{subj} \cdot s_{pred} \cdot s_{obj} \quad (9)

where s_{subj}, s_{pred} and s_{obj} are the confidence scores of the subject, predicate and object, respectively.

4 Experiments

4.1 Dataset and Evaluation Metrics


We train and validate our model on the Action Genome (AG) dataset [20], which provides frame-level scene graph labels and is built upon the Charades dataset [41]. Bounding boxes of 35 object classes (without person) and instances of 25 relationship classes are annotated on the sampled frames. These 25 relationships are subdivided into three types: (1) attention relationships, denoting whether a person is looking at an object, (2) spatial relationships, and (3) contact relationships, which indicate the different ways an object is contacted. In AG, subject-object pairs can be labeled with multiple spatial relationships (door-in front of-person and door-on the side of-person) or contact relationships (person-eating-food and person-holding-food).

Evaluation Metrics

We follow three standard tasks from image-based scene graph generation [36] for evaluation: (1) predicate classification (PREDCLS): given the ground-truth labels and bounding boxes of objects, predict the predicate labels of object pairs; (2) scene graph classification (SGCLS): classify the ground-truth bounding boxes and predict the relationship labels; (3) scene graph detection (SGDET): detect the objects and predict the relationship labels of object pairs. An object detection is regarded as successful if the predicted box overlaps the ground-truth box with at least 0.5 IoU. All tasks are evaluated with the widely used Recall@K metrics (R@10, R@20, R@50) under the With Constraint, Semi Constraint and No Constraint strategies. The confidence threshold used by Semi Constraint is fixed for all experiments.
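The Recall@K metric used in all tables can be sketched as follows, assuming the predicted triplets are already sorted by the triplet score of Eq. 9 and that ground-truth matching is exact (the real evaluation also matches boxes by IoU):

```python
def recall_at_k(predicted, ground_truth, k):
    # predicted: list of (subject, predicate, object) triplets sorted by
    # descending confidence; ground_truth: annotated triplets of a frame.
    # R@K = |top-K predictions that hit the ground truth| / |ground truth|
    hits = set(predicted[:k]) & set(ground_truth)
    return len(hits) / len(ground_truth)
```

For instance, if two ground-truth triplets exist and one of them appears among the top-2 predictions, R@2 is 0.5.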

4.2 Technical Details

In this work, a Faster R-CNN [39] based on ResNet-101 [17] is adopted as the object detection backbone. We first train the detector on the training set of Action Genome [20] and obtain 24.6 mAP at 0.5 IoU with COCO metrics. The same detector is applied to all baselines for a fair comparison. The parameters of the object detector, including the RPN, are fixed while training the scene graph generation models. Per-class non-maximal suppression at 0.4 IoU is applied to reduce the region proposals provided by the RPN.
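Per-class NMS at a 0.4 IoU threshold can be sketched for a single class as follows; this is a greedy reference implementation, not the backbone's internal code:

```python
import numpy as np

def iou(a, b):
    # Boxes as (x1, y1, x2, y2) corner coordinates.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh=0.4):
    # Greedy NMS for one class: keep the highest-scoring box, drop every
    # remaining box overlapping it above iou_thresh, then repeat.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        ious = np.array([iou(boxes[i], boxes[j]) for j in order[1:]])
        order = order[1:][ious <= iou_thresh]
    return keep
```

"Per-class" simply means this routine is run independently on the proposals of each object class, so boxes of different classes never suppress each other.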

We use an AdamW [35] optimizer with batch size 1 to train our model. Moreover, gradient clipping is applied with a maximal norm of 5. The sliding-window size of our STTran is kept fixed for all experiments on Action Genome. The spatial encoder contains 1 layer while the temporal decoder contains 3 iterative layers. The self-attention modules in both the encoder and the decoder have 8 heads. The 1936-d input is projected to 2048-d by the feed-forward network, then projected back to 1936-d after a ReLU activation.

With Constraint:

| Method | PredCLS R@10 / R@20 / R@50 | SGCLS R@10 / R@20 / R@50 | SGDET R@10 / R@20 / R@50 |
|---|---|---|---|
| VRD [36] | 51.7 / 54.7 / 54.7 | 32.4 / 33.3 / 33.3 | 19.2 / 24.5 / 26.0 |
| Motif Freq [58] | 62.4 / 65.1 / 65.1 | 40.8 / 41.9 / 41.9 | 23.7 / 31.4 / 33.3 |
| MSDN [31] | 65.5 / 68.5 / 68.5 | 43.9 / 45.1 / 45.1 | 24.1 / 32.4 / 34.5 |
| VCTREE [45] | 66.0 / 69.3 / 69.3 | 44.1 / 45.3 / 45.3 | 24.4 / 32.6 / 34.7 |
| RelDN [59] | 66.3 / 69.5 / 69.5 | 44.3 / 45.4 / 45.4 | 24.5 / 32.8 / 34.9 |
| GPS-Net [34] | 66.8 / 69.9 / 69.9 | 45.3 / 46.5 / 46.5 | 24.7 / 33.1 / 35.1 |
| STTran | 68.6 / 71.8 / 71.8 | 46.4 / 47.5 / 47.5 | 25.2 / 34.1 / 37.0 |

No Constraint:

| Method | PredCLS R@10 / R@20 / R@50 | SGCLS R@10 / R@20 / R@50 | SGDET R@10 / R@20 / R@50 |
|---|---|---|---|
| VRD [36] | 59.6 / 78.5 / 99.2 | 39.2 / 49.8 / 52.6 | 19.1 / 28.8 / 40.5 |
| Motif Freq [58] | 73.4 / 92.4 / 99.6 | 50.4 / 60.6 / 64.2 | 22.8 / 34.3 / 46.4 |
| MSDN [31] | 74.9 / 92.7 / 99.0 | 51.2 / 61.8 / 65.0 | 23.1 / 34.7 / 46.5 |
| VCTREE [45] | 75.5 / 92.9 / 99.3 | 52.4 / 62.0 / 65.1 | 23.9 / 35.3 / 46.8 |
| RelDN [59] | 75.7 / 93.0 / 99.0 | 52.9 / 62.4 / 65.1 | 24.1 / 35.4 / 46.8 |
| GPS-Net [34] | 76.0 / 93.6 / 99.5 | 53.6 / 63.3 / 66.0 | 24.4 / 35.7 / 47.3 |
| STTran | 77.9 / 94.2 / 99.1 | 54.0 / 63.7 / 66.4 | 24.6 / 36.2 / 48.8 |

Table 1: Comparison with state-of-the-art image-based scene graph generation methods on Action Genome [20]. The same object detector is used for all baselines for a fair comparison. STTran achieves the best performance on all metrics. Note that the evaluation results of the baselines differ from [20], since we adopt a more reasonable relationship output method; more details are provided in the supplementary material.

Semi Constraint:

| Method | PredCLS R@10 / R@20 / R@50 | SGCLS R@10 / R@20 / R@50 | SGDET R@10 / R@20 / R@50 |
|---|---|---|---|
| VRD [36] | 55.5 / 64.9 / 65.2 | 36.2 / 39.7 / 40.1 | 19.0 / 27.1 / 32.4 |
| Motif Freq [58] | 65.7 / 74.1 / 74.5 | 45.5 / 49.3 / 49.5 | 22.9 / 33.7 / 39.0 |
| MSDN [31] | 69.6 / 78.9 / 79.9 | 48.3 / 54.1 / 54.5 | 23.2 / 34.2 / 41.5 |
| VCTREE [45] | 70.1 / 78.2 / 79.6 | 49.0 / 53.7 / 54.0 | 23.7 / 34.8 / 40.4 |
| RelDN [59] | 70.7 / 78.8 / 80.3 | 49.4 / 53.9 / 54.1 | 24.1 / 35.0 / 40.7 |
| GPS-Net [34] | 71.3 / 81.2 / 82.0 | 50.2 / 55.0 / 55.2 | 24.5 / 35.3 / 41.9 |
| STTran | 73.2 / 83.1 / 84.0 | 51.2 / 56.5 / 56.8 | 24.6 / 35.9 / 44.0 |

Table 2: Evaluation results under Semi Constraint, where a relationship between an object pair is regarded as positive if its confidence score is higher than the threshold.

4.3 Quantitative Results and Comparison

Table 1 shows that our model outperforms state-of-the-art image-based methods on all metrics under With Constraint, Semi Constraint and No Constraint. For a fair comparison, all methods share an identical object detector, which provides feature maps and region proposals of the same quality.

The bold numbers denote the best result in each column. With the help of temporal dependencies, our model improves over the previous state of the art (GPS-Net [34]) on PredCLS, SGCLS and SGDET under the With Constraint strategy, which shows that STTran predicts the most important relationship between an object pair better than the image-based baselines. Our model also performs excellently on PredCLS, SGCLS and SGDET under Semi Constraint, which allows multiple relationships between a subject-object pair (see Table 2). Under No Constraint, STTran outperforms the other methods in all settings except PredCLS-R@50. Due to the small number of object pairs and the large number (50) of chances to guess, the results in this column are unstable and unconvincing; Motif Freq [58], which depends heavily on statistics, achieves the highest score there. However, the results become reliable for the smaller prediction numbers R@10 and R@20.

Note that there is no difference between PredCLS-R@20 and PredCLS-R@50 under With Constraint because of the limited number of object pairs and the edge restriction; the same holds for SGCLS. Compared with PredCLS and SGCLS, the gap on SGDET between STTran and the other methods narrows, since the increased number of false object proposals causes interference, especially for Semi Constraint and No Constraint with small K. Furthermore, the reproduced results of some methods differ from [20], since a more reasonable relationship output method is adopted and the object detectors are different (more details are provided in the supplementary material).

(a) spatial encoder only
(b) complete STTran
Figure 3: Two relationship instances respectively generated by the spatial encoder and STTran. (a) Spatial encoder predicts the wrong relationship only with the spatial context in the second frame while (b) STTran can infer more accurate results with the help of temporal dependencies.

4.4 Temporal Dependency Analysis

Compared to previous image-based scene graph generation, dynamic scene graph generation offers additional temporal dependencies that can be utilized. We discuss whether temporal dependencies can improve relationship inference and validate that our proposed method utilizes them. In this subsection, we measure PredCLS-R@20 (With Constraint) as the performance indicator, since it strictly reflects the ability to classify single relationships.

Is temporal dependence easy to use?

Spatial context plays an important role in scene graph generation, as validated by several image-based methods [58, 34]. To explore the effectiveness of temporal dependencies, we graft the widely used recurrent network, the LSTM, onto the baselines in Table 3 as follows: before the feature vectors are forwarded into the final classifiers, all vectors representing relationships in the video are organized as a sequence and processed by an LSTM.

Table 3 shows that all baselines gain more or less from the temporal dependencies. For Motif Freq [58], PredCLS-R@20 increases only slightly from 65.1% to 65.2%, probably due to its relatively simple feature representation, while the score of GPS-Net [34] improves significantly from 69.9% to 70.4%. The experiment shows that temporal dependencies are helpful for scene graph generation. However, the previous methods were designed for static images, which is why we propose the Spatial-Temporal Transformer (STTran) to make better use of temporal dependencies.

| Method | PredCLS-R@20 (original) | PredCLS-R@20 (+LSTM) |
|---|---|---|
| Motif Freq [58] | 65.1 | 65.2 |
| MSDN [31] | 68.5 | 68.8 |
| RelDN [59] | 69.5 | 69.7 |
| GPS-Net [34] | 69.9 | 70.4 |

Table 3: We integrate LSTMs into some representative baselines to process the relationship features before forwarding them into the classifier. All baselines improve with temporal dependencies but remain worse than our STTran.
Figure 4: Qualitative results for dynamic scene graph generation. The scene graphs from STTran are generated with the top-10 confident relationship predictions under the different strategies. The green box is an undetected ground-truth object. The melon and gray colors indicate true positives and false positives, respectively. Correct relationships are colored light blue for clarity, and the person node is omitted for brevity. The figure shows that poor object detection results reduce performance and that the result of Semi Constraint is closest to the ground truth.

Can STTran really understand temporal dependencies?

In order to verify that STTran really improves performance through temporal dependencies in the video, rather than merely through richer feature representations or the powerful multi-head attention module, we train our model on modified training sets and show the results in Table 4.

We randomly sample one-third of the videos in the training set and shuffle or reverse their frames, while the test set remains unchanged. As shown in Table 4, PredCLS-R@20 (With Constraint) drops significantly from 71.8% to 71.0% when one-third of the training videos are reversed, which is equivalent to adding noise to the temporal information. Shuffled videos break the temporal information completely and amplify the noise further; the experimental result (first row) is in line with expectations: PredCLS-R@20 drops to 70.6%. These experiments demonstrate where the improvement comes from and validate that the temporal dependencies are learned by STTran.

| Processing (1/3 of training videos) | PredCLS-R@20 |
|---|---|
| shuffle | 70.6 |
| reverse | 71.0 |
| none | 71.8 |

Table 4: We shuffle/reverse one-third of the videos in the training set to explore the model's sensitivity to the frame order. When the temporal information is disturbed by shuffling or reversing the video sequences, the performance of the model degrades as expected.

4.5 Ablation Study

In our Spatial-Temporal Transformer, two modules are proposed: a Spatial Encoder and a Temporal Decoder. Furthermore, we integrate the temporal position into the relationship representations with the frame encoding in the Temporal Decoder. In order to clarify how these modules contribute to the performance, we ablate the different components and present the results in Table 5. We adopt PredCLS-R@20 and SGDET-R@20 as metrics under both With Constraint and Semi Constraint. PredCLS intuitively shows the ability of relationship prediction, while SGDET indicates the performance of full scene graph generation.

When only the spatial encoder is enabled, the model works like an image-based method and indeed performs similarly to RelDN [59]. The isolated temporal decoder (second row) boosts performance significantly thanks to the additional information from other frames. PredCLS-R@20 improves slightly further when encoder and decoder work together, whereas the improvement on SGDET-R@20 is limited by the object detection backbone. The learned frame encoding helps STTran fully exploit the temporal dependencies and has a strong, positive effect on both PredCLS-R@20 and SGDET-R@20, while the fixed sinusoidal encoding performs unsatisfactorily. Two instances, predicted by the spatial encoder alone and by the complete STTran respectively, are shown in Fig. 3. Without temporal dependencies, the spatial encoder mistakenly predicts person-eating-food as person-touching-food in the second frame, whereas STTran infers the relationship correctly. This explicitly shows that STTran can utilize temporal context to improve scene graph generation.

| Spatial Encoder | Temporal Decoder | Frame Encoding | PredCLS-R@20 (With) | PredCLS-R@20 (Semi) | SGDET-R@20 (With) | SGDET-R@20 (Semi) |
|:---:|:---:|:---:|---:|---:|---:|---:|
| ✓ | – | – | 69.6 | 78.7 | 32.9 | 35.1 |
| – | ✓ | – | 71.0 | 82.2 | 33.7 | 35.5 |
| ✓ | ✓ | – | 71.3 | 82.7 | 33.8 | 35.6 |
| ✓ | ✓ | sinusoidal | 71.3 | 82.8 | 33.9 | 35.7 |
| ✓ | ✓ | learned | 71.8 | 83.1 | 34.1 | 35.9 |

Table 5: Ablation study on our STTran. ✓ indicates that the corresponding module is enabled, while – indicates that it is disabled. We also compare the effectiveness of sinusoidal and learned positional encoding.
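For reference, the fixed sinusoidal encoding compared against the learned encoding in Table 5 is the standard formulation of Vaswani et al. [48] applied over frame indices. The sketch below is an illustrative, dependency-free implementation; a learned frame encoding would instead be a trainable `num_frames × dim` embedding table updated by backpropagation.

```python
import math

def sinusoidal_frame_encoding(num_frames, dim):
    """Fixed sinusoidal positional encoding over frame indices.

    Returns a num_frames x dim table: even dimensions use sine,
    odd dimensions use cosine, with geometrically spaced frequencies.
    """
    enc = []
    for t in range(num_frames):
        row = []
        for i in range(dim):
            # frequency depends on the dimension pair index i // 2
            angle = t / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        enc.append(row)
    return enc
```

Because these values are fixed, the model must adapt its relationship representations to them, which is one plausible reason the learned encoding, free to co-adapt with the decoder, performs better in Table 5.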
Figure 5: Qualitative instances in the PredCLS setting under different strategies. Light blue relationships are true positives predicted by STTran in this setting, while gray ones are false positives. The graph from Semi Constraint is identical to the ground truth, whereas the graph from No Constraint contains several false positives due to the absence of any restriction.

4.6 Qualitative Results

Fig. 4 shows qualitative results for dynamic scene graph generation. The five columns from left to right show the RGB frame, the ground-truth scene graph, and the scene graphs built from the top-10 most confident relationship predictions under the With Constraint, Semi Constraint, and No Constraint strategies. Melon color indicates a true positive, whereas gray indicates a false positive. A green box marks a ground-truth object missed by the detector. In the first row, two false positives with high object detection confidence (medicine and notebook) cause wrong predictions among the top-10 relationships. In the second row, where object detection succeeds, all top-10 confident relationships under the three strategies are of high quality. person-drinking from-bottle in the third column is lost because With Constraint allows at most one relationship between each subject-object pair for each relationship type, while person-not contacting-bottle replaces the attention relationship between person and bottle in the top-10 confident list under No Constraint. The two frames in Fig. 4 are not adjacent, since in the frames between them the detected bounding boxes do not sufficiently overlap with the ground truth in terms of IoU.

Another example is shown in Fig. 5, which illustrates the different behavior of the three generation strategies. Under With Constraint, one contact relationship is abandoned, since only one contact relationship is allowed between each object pair. Although No Constraint allows multi-label prediction, its result contains considerable noise even when there are only few pairs in the frame, especially when ground-truth bounding boxes are given, as in PredCLS and SGCLS.
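The three strategies can be sketched as a post-processing step over per-pair predicate confidences. The snippet below is a simplified, hypothetical illustration (it applies With Constraint per pair rather than per relationship type, and the `pair_scores` structure is invented for the example), not the authors' exact implementation:

```python
def generate_graph(pair_scores, strategy="with", threshold=0.9, top_k=10):
    """Turn predicate confidences into a ranked list of triplets.

    `pair_scores` maps (subject, object) -> {predicate: confidence}.
    "with": at most one predicate per pair;
    "semi": every predicate above `threshold`, but at least the best one;
    "no":   every predicted predicate (multi-label, no restriction).
    """
    triplets = []
    for (subj, obj), preds in pair_scores.items():
        ranked = sorted(preds.items(), key=lambda kv: -kv[1])
        if strategy == "with":
            ranked = ranked[:1]
        elif strategy == "semi":
            kept = [kv for kv in ranked if kv[1] >= threshold]
            ranked = kept if kept else ranked[:1]
        # strategy == "no": keep everything
        for pred, conf in ranked:
            triplets.append((subj, pred, obj, conf))
    triplets.sort(key=lambda t: -t[3])
    return triplets[:top_k]
```

This makes the failure modes above concrete: "with" drops a correct second predicate for a pair, while "no" floods the top-k list with low-confidence predicates whenever a frame contains only a handful of pairs.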

5 Conclusion

In this paper, we propose the Spatial-Temporal Transformer (STTran) for dynamic scene graph generation, whose encoder extracts the spatial context within a frame and whose decoder captures the temporal dependencies between frames. In contrast to the single-label losses used in previous works, we employ a multi-label margin loss and introduce a new strategy for generating scene graphs. Several experiments demonstrate that temporal context has a positive effect on relationship prediction, and we obtain state-of-the-art results for the dynamic scene graph generation task on the Action Genome dataset.


  • [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European conference on computer vision, pp. 382–398. Cited by: §2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pp. 6077–6086. Cited by: §2.
  • [3] O. Ashual and L. Wolf (2019) Specifying object attributes and relations in interactive scene generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4561–4569. Cited by: §1.
  • [4] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles (2015) Activitynet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970. Cited by: §1.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §2, §3.3.
  • [6] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: §2, §2.
  • [7] V. S. Chen, P. Varma, R. Krishna, M. Bernstein, C. Re, and L. Fei-Fei (2019) Scene graph prediction with limited labels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2580–2590. Cited by: §2.
  • [8] Y. Cong, H. Ackermann, W. Liao, M. Y. Yang, and B. Rosenhahn (2020) NODIS: neural ordinary differential scene understanding. In ECCV, pp. 636–653. Cited by: §2.
  • [9] B. Dai, Y. Zhang, and D. Lin (2017) Detecting visual relationships with deep relational networks. In CVPR, pp. 3076–3086. Cited by: §2.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. Cited by: §2.
  • [11] C. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In CVPR, pp. 2625–2634. Cited by: §2.
  • [12] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2021) An image is worth 16x16 words: transformers for image recognition at scale. In ICLR, Cited by: §2.
  • [13] L. Gao, B. Wang, and W. Wang (2018) Image captioning with scene-graph based semantic concepts. In Proceedings of the 2018 10th International Conference on Machine Learning and Computing, pp. 225–229. Cited by: §1.
  • [14] R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In CVPR, pp. 244–253. Cited by: §2.
  • [15] G. Gkioxari, R. Girshick, P. Dollár, and K. He (2018) Detecting and recognizing human-object interactions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8359–8367. Cited by: §1.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.2.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
  • [18] S. He, W. Liao, H. R. Tavakoli, M. Yang, B. Rosenhahn, and N. Pugeault (2020) Image captioning through image transformer. In ACCV, Cited by: §2.
  • [19] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [20] J. Ji, R. Krishna, L. Fei-Fei, and J. C. Niebles (2020) Action genome: actions as compositions of spatio-temporal scene graphs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10236–10247. Cited by: §1, §1, §2, §4.1, §4.2, §4.3, Table 1.
  • [21] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. TPAMI 35 (1), pp. 221–231. Cited by: §2.
  • [22] J. Johnson, A. Gupta, and L. Fei-Fei (2018) Image generation from scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1219–1228. Cited by: §1, §2.
  • [23] J. Johnson, B. Hariharan, L. Van Der Maaten, J. Hoffman, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2989–2998. Cited by: §1, §2.
  • [24] J. Johnson, R. Krishna, M. Stark, L. Li, D. Shamma, M. Bernstein, and L. Fei-Fei (2015) Image retrieval using scene graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3668–3678. Cited by: §1, §2.
  • [25] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §1.
  • [26] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123 (1), pp. 32–73. Cited by: §1, §2.
  • [27] C. A. Kurby and J. M. Zacks (2008) Segmentation in the perception and memory of events. Trends in cognitive sciences 12 (2), pp. 72–79. Cited by: §1.
  • [28] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. (2020) The open images dataset v4. IJCV, pp. 1–26. Cited by: §2.
  • [29] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2020) What does bert with vision look at?. In ACL, pp. 5265–5275. Cited by: §2.
  • [30] Y. Li, W. Ouyang, B. Zhou, J. Shi, C. Zhang, and X. Wang (2018) Factorizable net: an efficient subgraph-based framework for scene graph generation. In ECCV, pp. 335–351. Cited by: §2.
  • [31] Y. Li, W. Ouyang, B. Zhou, K. Wang, and X. Wang (2017) Scene graph generation from objects, phrases and region captions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1261–1270. Cited by: §2, Table 1, Table 2, Table 3.
  • [32] X. Liang, L. Lee, and E. P. Xing (2017) Deep variation-structured reinforcement learning for visual relationship and attribute detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 848–857. Cited by: §2.
  • [33] W. Liao, B. Rosenhahn, L. Shuai, and M. Ying Yang (2019) Natural language guided visual relationship detection. In CVPRW, pp. 0–0. Cited by: §2.
  • [34] X. Lin, C. Ding, J. Zeng, and D. Tao (2020) Gps-net: graph property sensing network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3746–3753. Cited by: §2, §4.3, §4.4, §4.4, Table 1, Table 2, Table 3.
  • [35] I. Loshchilov and F. Hutter (2017) Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Cited by: §4.2.
  • [36] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European conference on computer vision, pp. 852–869. Cited by: §2, §4.1, Table 1, Table 2.
  • [37] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Vol. 32. Cited by: §2.
  • [38] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §2.
  • [39] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497. Cited by: §2, §2, §3.2, §4.2.
  • [40] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, and C. D. Manning (2015) Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: §1, §2.
  • [41] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pp. 510–526. Cited by: §1, §4.1.
  • [42] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §2.
  • [43] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.
  • [44] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-bert: pre-training of generic visual-linguistic representations. In ICLR, Cited by: §2.
  • [45] K. Tang, H. Zhang, B. Wu, W. Luo, and W. Liu (2019) Learning to compose dynamic tree structures for visual contexts. In CVPR, pp. 6619–6628. Cited by: §2, Table 1, Table 2.
  • [46] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2020) Training data-efficient image transformers & distillation through attention. arXiv preprint arXiv:2012.12877. Cited by: §2.
  • [47] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In ICCV, pp. 4489–4497. Cited by: §2.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §2, §3.1, §3.3, §3.3, §3.3.
  • [49] W. Wang, R. Wang, S. Shan, and X. Chen (2019) Exploring context and visual pattern of relationship for scene graph generation. In CVPR, pp. 8188–8197. Cited by: §2.
  • [50] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR, pp. 7794–7803. Cited by: §2.
  • [51] Y. Wang, Z. Xu, X. Wang, C. Shen, B. Cheng, H. Shen, and H. Xia (2020) End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503. Cited by: §2.
  • [52] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In CVPR, pp. 5410–5419. Cited by: §2.
  • [53] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.
  • [54] J. Yang, J. Lu, S. Lee, D. Batra, and D. Parikh (2018) Graph r-cnn for scene graph generation. In ECCV, pp. 670–685. Cited by: §2.
  • [55] X. Yang, K. Tang, H. Zhang, and J. Cai (2019) Auto-encoding scene graphs for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10685–10694. Cited by: §1, §2.
  • [56] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura (2020) Bert representations for video question answering. In WACV, pp. 1556–1565. Cited by: §2.
  • [57] R. Yu, A. Li, V. I. Morariu, and L. S. Davis (2017) Visual relationship detection with internal and external linguistic knowledge distillation. In ICCV, pp. 1974–1982. Cited by: §2.
  • [58] R. Zellers, M. Yatskar, S. Thomson, and Y. Choi (2018) Neural motifs: scene graph parsing with global context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5831–5840. Cited by: §2, §4.3, §4.4, §4.4, Table 1, Table 2, Table 3.
  • [59] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro (2019) Graphical contrastive losses for scene graph parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11535–11543. Cited by: §2, §4.5, Table 1, Table 2, Table 3.
  • [60] H. Zhao, A. Torralba, L. Torresani, and Z. Yan (2019) Hacs: human action clips and segments dataset for recognition and temporal localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8668–8678. Cited by: §1.