Grounded Objects and Interactions for Video Captioning

11/16/2017 ∙ by Chih-Yao Ma, et al. ∙ 0

We address the problem of video captioning by grounding language generation on object interactions in the video. Existing work mostly focuses on overall scene understanding, with limited or no emphasis on object interactions. In this paper, we propose SINet-Caption, which learns to generate captions grounded over higher-order interactions between arbitrary groups of objects for fine-grained video understanding. We discuss the challenges and benefits of such an approach. We further demonstrate state-of-the-art results on the ActivityNet Captions dataset using our model, SINet-Caption, based on this approach.




1 Introduction

Video understanding using natural language processing to generate captions for videos has been regarded as a crucial step towards machine intelligence. However, like other video understanding tasks addressed using deep learning, initial work on video captioning has focused on learning compact representations combined over time that are used as input to a decoder, e.g. an LSTM, at the beginning or at each word generation stage to generate a caption for the target video Pan et al. (2016a); Venugopalan et al. (2015, 2014). This has been improved by using spatial and temporal attention mechanisms Song et al. (2017); Yao et al. (2015); Yu et al. (2016) and/or semantic attribute methods Gan et al. (2016); Pan et al. (2016b); Shen et al. (2017); Yu et al. (2017). These methods do not ground their predictions on object relationships and interactions, i.e. they largely focus on overall scene understanding or certain aspects of the video at best. However, modeling visual relationships and object interactions in a scene is a crucial form of video understanding, as shown in Figure 1.

There has been considerable work that detects visual relationships in images using separate branches in a ConvNet to explicitly model objects, humans, and their interactions Chao et al. (2017); Gkioxari et al. (2017), using scene graphs Johnson et al. (2015); Li et al. (2017); Liang et al. (2017); Xu et al. (2017), and by pairing different objects in the scene Dai et al. (2017a); Hu et al. (2016); Santoro et al. (2017); Zhang et al. (2017a). While these models can successfully detect visual relationships for images, they are intractable in the video domain, making it inefficient, if not impossible, to detect all relationships across all individual object pairs Zhang et al. (2017b). As a result, past work has at best focused on pairwise relationships on limited datasets Lea et al. (2016); Ni et al. (2014).

In this paper, we first hypothesize using these interactions as the basis for caption generation. We then describe our method to achieve this goal by using a Region Proposal Network (RPN) to extract ROIs from videos and learning their interactions efficiently using dot-product attention. Our model, SINet-Caption, efficiently explores and grounds caption generation over interactions between arbitrary subgroups of objects, the members of which are determined by a learned attention mechanism. In our results, we demonstrate how to exploit both overall image context (coarse-grained) and higher-order object interactions (fine-grained) in the spatiotemporal domain for video captioning, as illustrated in Figure 2. We obtain state-of-the-art results on video captioning over the challenging ActivityNet Captions dataset.

Figure 1: Video captions are composed of multiple visual relationships and interactions. We detect higher-order object interactions and use them as basis for video captioning.
Figure 2: We exploit both coarse- (overall image) and fine-grained (object) visual representations for each video frame.

2 Video captioning model

The proposed SINet-Caption first attentively models object inter-relationships and discovers the higher-order interactions for a video. The detected higher-order object interactions (fine-grained) and overall image representation (coarse-grained) are then temporally attended as the visual cue for each word generation.

2.1 Higher-order object interactions

Problem statement: We define an object to be a certain region in the scene that might be used to determine the visual relationships and interactions. Each object representation can be directly obtained from an RPN and further encoded into an object feature, as shown in Figure 2. Note that we do not encode class information from the object detector, since there exists a cross-domain problem and we may miss some objects that are not detected by the pre-trained object detector. Also, we do not know the corresponding objects across time, since linking objects through time is computationally expensive and may not be suitable if the video sequence is long. Our objective is to efficiently detect higher-order interactions (interactions beyond pairwise objects) from these rich yet unordered object representations, which reside in a high-dimensional space that spans across time.

In the simplest setting, an interaction between objects in the scene can be represented as combining individual object information. For example, one method is to add the learnable representations and project these representations into a high-dimensional space where the object interactions can be exploited by simply summing up the object representations Santoro et al. (2017). Another approach, which has been widely used with images, is pairing all possible object candidates (or subject-object pairs) Chao et al. (2017); Dai et al. (2017a); Hu et al. (2016); Zhang et al. (2017a). However, this is infeasible for video, since a video typically contains hundreds of frames and the set of object-object pairs is too large to fully represent. Detecting object relationships frame by frame is computationally expensive, and it does not use temporal reasoning over object interactions.
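To give a rough sense of scale, the sketch below counts candidate object pairs for a single clip. It assumes 15 objects per frame and 30 sampled frames, values borrowed from the supplementary settings; the function name `pair_counts` is illustrative and not part of the original work.

```python
from math import comb

def pair_counts(num_frames: int, objects_per_frame: int) -> tuple[int, int]:
    """Count unordered object pairs within frames and across the whole clip."""
    # Pairs formed inside a single frame, repeated for every frame.
    within_frames = num_frames * comb(objects_per_frame, 2)
    # Pairs formed when any object may be paired with any other object in the clip.
    across_clip = comb(num_frames * objects_per_frame, 2)
    return within_frames, across_clip

within, across = pair_counts(num_frames=30, objects_per_frame=15)
print(within, across)  # 3150 101025
```

Even at this modest sampling rate the pair count across time exceeds one hundred thousand, and it grows quadratically with the number of frames.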

Recurrent Higher-Order Interaction Module: To overcome these issues, we propose a generic recurrent module for detecting higher-order object interactions for fine-grained video understanding problems, as illustrated in Figure 3 (right). The proposed recurrent module dynamically selects the object candidates that are important for describing the video content. The combinations of these selected objects are then concatenated to model higher-order interactions using group-to-group or triplet groups of objects.

Formally, the current image representation $v_{c,t}$ and the previous object interaction representation $h_{t-1}$ are first projected to introduce learnable weights. The projected $v_{c,t}$ and $h_{t-1}$ are then repeated and expanded $N$ times (the number of objects at time $t$). We directly combine this information with the projected objects via matrix addition and use it as input to dot-product attention. In short, the attention is computed using inputs from current (projected) object features, overall image visual representation, and previously discovered object interactions, which provide the attention mechanism with maximum context. Specifically, the input to the attention can be defined as

$$X_k = \mathrm{repeat}(W_{h_k} h_{t-1} + W_{c_k} v_{c,t}) + g_{\theta_k}(O_t),$$

and the attention weights over all objects can be defined as

$$\alpha_k = \mathrm{softmax}\left(\frac{X_k^\top X_k}{\sqrt{d_\theta}}\right),$$

where $W_{c_k}$ and $W_{h_k}$ are learned weights for $v_{c,t}$ and $h_{t-1}$, $O_t$ is the set of object features at time $t$, $g_{\theta_k}$ is a Multi-Layer Perceptron (MLP) with parameter $\theta_k$, $d_\theta$ is the dimension of the last fully-connected layer of $g_{\theta_k}$, $X_k \in \mathbb{R}^{d_\theta \times N}$ is the input to the $k$th attention module, and $\sqrt{d_\theta}$ is a scaling factor. We omit the bias term for simplicity.

The attended object feature at time $t$ is then calculated as mean-pooling on weighted objects: $v^k_{o,t} = \overline{\alpha_k \, (g_{\theta_k}(O_t))^\top}$. The output $v^k_{o,t}$ is a single feature vector representation which encodes the $k$th object inter-relationships of a video frame at time $t$, and $k$ ranges over $\{1, 2, \ldots, K\}$, where $K$ is the number of groups for inter-relationships.

Finally, the selected object candidates $v^k_{o,t}$ are concatenated and used as the input to an LSTM cell: $h_t = \mathrm{LSTMCell}(v^1_{o,t} \,\|\, v^2_{o,t} \,\|\, \cdots \,\|\, v^K_{o,t})$. The output $h_t$ is then defined as the higher-order object interaction representation at the current time $t$. The sequence of hidden states of the LSTM cell forms the representations of higher-order object interactions for each timestep.
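The attention step of the recurrent module can be sketched for a single timestep as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the MLP is replaced by a single ReLU layer, all weights are random, the dimensions are arbitrary, and the names (`attend_objects`, `W_g`, `W_c`, `W_h`) are hypothetical. The LSTM cell that consumes the concatenated output is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attend_objects(objects, image_feat, h_prev, W_g, W_c, W_h):
    """Return one attended object feature of shape (d,) for one attention group."""
    projected = np.maximum(objects @ W_g, 0.0)   # stand-in for the MLP, shape (N, d)
    context = W_c @ image_feat + W_h @ h_prev    # projected image + previous interaction, (d,)
    x = projected + context                      # broadcasting repeats the context N times
    scores = x @ x.T / np.sqrt(x.shape[1])       # scaled dot-product scores, (N, N)
    alpha = softmax(scores, axis=-1)             # attention weights over objects
    return (alpha @ projected).mean(axis=0)      # mean-pool the weighted objects, (d,)

rng = np.random.default_rng(0)
N, d_obj, d_img, d_h, d, K = 15, 8, 8, 6, 4, 2  # illustrative sizes only
objects = rng.normal(size=(N, d_obj))
image_feat = rng.normal(size=d_img)
h_prev = np.zeros(d_h)

group_features = []
for _ in range(K):                               # each group has its own weights
    W_g = rng.normal(size=(d_obj, d))
    W_c = rng.normal(size=(d, d_img))
    W_h = rng.normal(size=(d, d_h))
    group_features.append(attend_objects(objects, image_feat, h_prev, W_g, W_c, W_h))

lstm_input = np.concatenate(group_features)      # would be fed to the LSTM cell
print(lstm_input.shape)  # (8,)
```

Because each group has separate weights, the groups can attend to different subsets of objects, and concatenating their outputs is what lets the LSTM cell see several object groups at once.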

Figure 3: Overview of the proposed SINet-Caption for video captioning.

2.2 Video captioning with coarse- and fine-grained

We now describe how our SINet-Caption exploits both coarse-grained (overall image representation) and fine-grained (higher-order object interactions) representations for video captioning.

The SINet-Caption is inspired by prior work using hierarchical LSTMs for captioning tasks Anderson et al. (2017); Song et al. (2017), and we extend it with the proposed recurrent higher-order interaction module so that the model can leverage the detected higher-order object interactions, as shown in Figure 3. The model consists of two LSTM layers: Attention LSTM and Language LSTM.

Attention LSTM: The Attention LSTM fuses the previous hidden state output of the Language LSTM $h^2_{t_w-1}$, the overall representation of the video, and the input word at time $t_w - 1$ to generate the hidden representation for the following attention module. Formally, the input to the Attention LSTM can be defined as $x^1_{t_w} = h^2_{t_w-1} \,\|\, \overline{g_\phi(v_{c,t})} \,\|\, W_e \Pi_{t_w-1}$, where $\overline{g_\phi(v_{c,t})}$ is the projected and mean-pooled image features, $g_\phi$ is an MLP with parameter $\phi$, $W_e \in \mathbb{R}^{E \times \Sigma}$ is a word embedding matrix for a vocabulary of size $\Sigma$, and $\Pi_{t_w-1}$ is the one-hot encoding of the input word at time $t_w - 1$. Note that $t$ is the video time, and $t_w$ is the timestep for caption generation.
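As a small illustration of how the Attention LSTM input is assembled, the sketch below concatenates the three parts described above. It is a hedged example only: the projection is a single ReLU layer standing in for the MLP, the weights are random, and the function and variable names (`attention_lstm_input`, `W_proj`, `W_embed`) are assumptions made for this example.

```python
import numpy as np

def attention_lstm_input(h_lang_prev, image_feats, W_proj, W_embed, word_id):
    """Concatenate the Language LSTM state, pooled video feature, and word embedding."""
    projected = np.maximum(image_feats @ W_proj, 0.0)  # stand-in MLP, shape (T, d)
    pooled = projected.mean(axis=0)                    # mean-pool over video time, (d,)
    one_hot = np.zeros(W_embed.shape[1])               # one-hot over the vocabulary
    one_hot[word_id] = 1.0
    word_vec = W_embed @ one_hot                       # selects column word_id, (E,)
    return np.concatenate([h_lang_prev, pooled, word_vec])

rng = np.random.default_rng(0)
T, d_img, d, d_h, E, vocab = 10, 8, 4, 6, 5, 20       # illustrative sizes only
W_embed = rng.normal(size=(E, vocab))
x1 = attention_lstm_input(
    h_lang_prev=np.zeros(d_h),
    image_feats=rng.normal(size=(T, d_img)),
    W_proj=rng.normal(size=(d_img, d)),
    W_embed=W_embed,
    word_id=3,
)
print(x1.shape)  # (15,)
```

Multiplying the embedding matrix by a one-hot vector is equivalent to a column lookup, which is how embedding layers are usually implemented in practice.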

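Temporal attention: The input for computing the temporal attention is the combination of the output of the Attention LSTM $h^1_{t_w}$ and the projected image features $g_\phi(v_{c,t})$, and the attention is computed using a simple tanh function and a fully-connected layer to attend to the projected image features $g_\phi(v_{c,t})$, as illustrated in Figure 3. Specifically, the input can be defined as $x_a = \mathrm{repeat}(W_h h^1_{t_w}) + W_c \, g_\phi(v_{c,t})$, and the temporal attention can be obtained by $\alpha_{temp} = \mathrm{softmax}(w_a^\top \tanh(x_a))$, where $W_h \in \mathbb{R}^{d_\phi \times d_h}$ and $W_c \in \mathbb{R}^{d_\phi \times d_\phi}$ are learned weights for $h^1_{t_w}$ and $g_\phi(v_{c,t})$, and $d_\phi$ is the dimension of the last fully-connected layer of $g_\phi$.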

Co-attention: We directly apply the temporal attention obtained from image features on object interaction representations (see Sec 2.1 for details).

Language LSTM: Finally, the Language LSTM takes the concatenation of the output of the Attention LSTM $h^1_{t_w}$, the attended video representation $\hat{v}_{c,t_w}$, and the co-attended object interactions $\hat{h}_{t_w}$ as input: $x^2_{t_w} = h^1_{t_w} \,\|\, \hat{v}_{c,t_w} \,\|\, \hat{h}_{t_w}$. The output of the Language LSTM is then used to generate words via a fully-connected layer and a softmax layer.
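The temporal attention, co-attention, and Language LSTM input described above can be sketched together as follows. This is an illustrative NumPy example under stated assumptions (random weights, arbitrary dimensions, hypothetical names such as `temporal_attention` and `w_a`), not a reproduction of the authors' code.

```python
import numpy as np

def temporal_attention(h_att, proj_feats, W_h, W_c, w_a):
    """Additive (tanh) attention over T video timesteps; returns weights of shape (T,)."""
    x_a = proj_feats @ W_c.T + W_h @ h_att   # (T, d); the state term broadcasts over time
    scores = np.tanh(x_a) @ w_a              # one scalar score per timestep, (T,)
    scores = scores - scores.max()           # stabilise the softmax
    weights = np.exp(scores)
    return weights / weights.sum()

rng = np.random.default_rng(0)
T, d, d_h, d_int = 10, 4, 6, 5               # illustrative sizes only
proj_feats = rng.normal(size=(T, d))         # projected image features per timestep
interactions = rng.normal(size=(T, d_int))   # object interaction representations per timestep
h_att = rng.normal(size=d_h)                 # Attention LSTM output
W_h = rng.normal(size=(d, d_h))
W_c = rng.normal(size=(d, d))
w_a = rng.normal(size=d)

alpha = temporal_attention(h_att, proj_feats, W_h, W_c, w_a)
attended_video = alpha @ proj_feats          # weighted sum of image features, (d,)
co_attended = alpha @ interactions           # same weights reused: co-attention, (d_int,)
language_lstm_input = np.concatenate([h_att, attended_video, co_attended])
print(language_lstm_input.shape)  # (15,)
```

Reusing the image-derived weights for the interaction sequence keeps the two streams aligned in time without learning a second attention module.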

| Method | B@1 | B@2 | B@3 | B@4 | ROUGE-L | METEOR | CIDEr-D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Test set | | | | | | | |
| LSTM-YT Venugopalan et al. (2014) (C3D) | 18.22 | 7.43 | 3.24 | 1.24 | - | 6.56 | 14.86 |
| S2VT Venugopalan et al. (2015) (C3D) | 20.35 | 8.99 | 4.60 | 2.62 | - | 7.85 | 20.97 |
| H-RNN Yu et al. (2016) (C3D) | 19.46 | 8.78 | 4.34 | 2.53 | - | 8.02 | 20.18 |
| S2VT + full context Krishna et al. (2017) (C3D) | 26.45 | 13.48 | 7.21 | 3.98 | - | 9.46 | 24.56 |
| LSTM-A + policy gradient + retrieval Yao et al. (2017) (ResNet + P3D ResNet Qiu et al. (2017)) | - | - | - | - | - | 12.84 | - |
| Validation set (Avg. 1st and 2nd) | | | | | | | |
| LSTM-A (ResNet + P3D ResNet) Yao et al. (2017) | - | - | - | 3.38 | 13.27 | 7.71 | 16.08 |
| LSTM-A + policy gradient + retrieval Yao et al. (2017) (ResNet + P3D ResNet Qiu et al. (2017)) | - | - | - | 3.13 | 14.29 | 8.73 | 14.75 |
| SINet-Caption \| img (C3D) | 17.18 | 7.99 | 3.53 | 1.47 | 18.78 | 8.44 | 38.22 |
| SINet-Caption \| img (ResNeXt) | 18.81 | 9.31 | 4.27 | 1.84 | 20.46 | 9.56 | 43.12 |
| SINet-Caption \| obj (ResNeXt) | 19.07 | 9.48 | 4.38 | 1.92 | 20.67 | 9.56 | 44.02 |
| SINet-Caption \| img+obj \| no co-attn (ResNeXt) | 19.93 | 9.82 | 4.52 | 2.03 | 21.08 | 9.79 | 44.81 |
| SINet-Caption \| img+obj (ResNeXt) | 19.78 | 9.89 | 4.52 | 1.98 | 21.25 | 9.84 | 44.84 |

Table 1: METEOR Banerjee and Lavie (2005), ROUGE-L Lin (2004), CIDEr-D Vedantam et al. (2015), and BLEU@N Papineni et al. (2002) scores on the ActivityNet Captions test and validation sets. All methods use ground truth temporal proposals except LSTM-A Yao et al. (2017). Our results with ResNeXt spatial features use videos sampled at maximum 1 FPS only.

3 Evaluation on ActivityNet Captions

We use the ActivityNet Captions dataset for evaluating SINet-Caption. The ground truth temporal proposals are used to segment videos, i.e. we treat each video segment independently, since our focus in this work is modeling object interactions for video captioning rather than temporal proposals (please refer to the supplementary material for details on the dataset and implementation). All methods in Table 1 use ground truth temporal proposals, except LSTM-A Yao et al. (2017).

For a fair comparison with prior methods that use C3D features, we report results with both C3D and ResNeXt spatial features. Since there is no prior result reported on the validation set, we compare our method via LSTM-A Yao et al. (2017), which reports results on both the validation and test sets. This allows us to indirectly compare with methods reported on the test set. As shown in Table 1, while LSTM-A clearly outperforms other methods on the test set by a large margin, our method shows better results on the validation sets across nearly all language metrics. Note that we do not claim our method to be superior to LSTM-A because of two fundamental differences. First, they do not rely on ground truth temporal proposals. Second, they use features extracted from a ResNet fine-tuned on Kinetics and another P3D ResNet Qiu et al. (2017) fine-tuned on Sports-1M, whereas we only use a ResNeXt-101 fine-tuned on Kinetics sampled at maximum 1 FPS. Utilizing more powerful feature representations can improve the prediction accuracy by a large margin on video captioning tasks. This also corresponds to our experiments with C3D and ResNeXt features, where the proposed method with ResNeXt features performs significantly better than with C3D features. We observed that using only the detected object interactions shows slightly better performance than using only the overall image representation. This demonstrates that even though the SINet-Caption is not aware of the overall scene representation, it achieves similar performance relying on only the detected object interactions. By combining both, the performance further improved across all evaluation metrics, with or without co-attention.

In conclusion, we introduce a generic recurrent module to detect higher-order object interactions for video understanding tasks. We demonstrate on ActivityNet Captions that the proposed SINet-Caption, exploiting both higher-order object interactions (fine-grained) and overall image representation (coarse-grained), outperforms prior work by a substantial margin.

Supplementary Materials

Dataset and Implementation

ActivityNet Captions dataset: We use the new ActivityNet Captions dataset for evaluating SINet-Caption. The ActivityNet Captions dataset contains 20k videos and has a total of 849 video hours with 100k total descriptions. We focus on providing a fine-grained understanding of the video to describe video events with natural language, as opposed to identifying the temporal proposals. We thus use the ground truth temporal segments and treat each temporal segment independently. We use this dataset over other captioning datasets because ActivityNet Captions is action-centric, as opposed to object-centric Krishna et al. (2017). This fits our goal of detecting higher-order object interactions for understanding human actions. Following the same procedure as in Krishna et al. (2017), all sentences are capped at a maximum length of 30 words. We sample predictions using beam search of size 5 for captioning. While the previous work samples C3D features every 8 frames Krishna et al. (2017), we only sample video at a maximum of 1 FPS. Video segments longer than 30 seconds are evenly sampled at a maximum of 30 samples.
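One possible reading of the frame sampling rule is sketched below: at most one sample per second, capped at 30 evenly spaced samples per segment. The exact sample positions are an assumption, and the function name `sample_timestamps` is hypothetical.

```python
def sample_timestamps(duration_s: float, max_fps: float = 1.0, max_samples: int = 30) -> list[float]:
    """Return evenly spaced sample times (in seconds) for one video segment."""
    if duration_s <= 0:
        return []
    # At most max_fps samples per second, never more than max_samples in total.
    n = min(max_samples, max(1, int(duration_s * max_fps)))
    step = duration_s / n
    return [i * step for i in range(n)]

print(len(sample_timestamps(12.0)))  # 12
print(len(sample_timestamps(90.0)))  # 30
```

Under this reading a 12-second segment yields one sample per second, while a 90-second segment is sampled every 3 seconds so that the cap of 30 is respected.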

Image feature: We pre-trained a ResNeXt-101 on the Kinetics dataset Kay et al. (2017) sampled at 1 FPS (approximately 2.5 million images), and use it to extract image features. We use SGD with Nesterov momentum as the optimizer. The learning rate automatically drops by 10x from its initial value when the validation loss has saturated for 5 epochs. We apply weight decay, the momentum is 0.9, and the batch size is 128. We follow the standard data augmentation procedure by randomly cropping and horizontally flipping the video frames during training. When extracting image features, the smaller edge of the image is scaled to 256 pixels and we crop the center of the image as input to the fine-tuned ResNeXt-101. Each image feature is encoded to a 2048-d feature vector.
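The learning rate schedule can be illustrated with a small plateau-based scheduler. This sketch assumes that "saturated for 5 epochs" means five consecutive epochs without a new best validation loss; the class name `PlateauDecay` is hypothetical, and the starting rate of 1.0 is a placeholder rather than the value used in the paper.

```python
class PlateauDecay:
    """Divide the learning rate by `factor` when validation loss stops improving."""

    def __init__(self, lr: float, factor: float = 10.0, patience: int = 5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, val_loss: float) -> float:
        """Record one epoch's validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr /= self.factor      # drop by 10x by default
                self.stale_epochs = 0
        return self.lr

scheduler = PlateauDecay(lr=1.0)            # placeholder starting rate
losses = [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
rates = [scheduler.step(loss) for loss in losses]
print(rates[-2], rates[-1])  # 1.0 0.1
```

Deep learning frameworks ship comparable schedulers, so in practice a hand-written class like this would rarely be needed.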

Object feature: We generate the object features by first obtaining the coordinates of ROIs from a Deformable R-FCN Dai et al. (2017b) with ResNet-101 He et al. (2015) as the backbone architecture. The Deformable R-FCN was trained on the MS-COCO train and validation datasets Lin et al. (2014). We set the IoU threshold for NMS to be 0.2. For each of the ROIs, we extract features using the ROI coordinates and adaptive max-pooling from the same ResNeXt-101 model pre-trained on Kinetics. The resulting object feature for each ROI is a 2048-d feature vector. ROIs are ranked according to their ROI scores, and we select the top 15 objects only. We do not use object class information since there is a cross-domain problem and we may miss some objects that were not detected. For the same reason, the bounding-box regression process is not performed here since we do not have the ground-truth bounding boxes.
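The ROI selection can be sketched with a plain greedy non-maximum suppression followed by a top-15 cut. This is a generic textbook NMS written for illustration, using the IoU threshold of 0.2 and the limit of 15 objects stated above; it is an assumption that the detector's built-in NMS behaves the same way, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_rois(boxes, scores, iou_threshold=0.2, top_k=15):
    """Greedy NMS over score-ranked boxes; returns indices of at most top_k kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
        if len(kept) == top_k:
            break
    return kept

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(select_rois(boxes, scores))  # [0, 2]
```

A low threshold such as 0.2 suppresses heavily overlapping proposals aggressively, which favours a small set of spatially distinct regions.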

Training: We train the proposed SINet-Caption with the ADAM optimizer. The learning rate automatically drops by 10x from its initial value when the validation loss is saturated.

ActivityNet Captions on 1st and 2nd validation set

We report the performance of SINet-Caption on both the 1st and 2nd validation sets in Table 2.

| Method | B@1 | B@2 | B@3 | B@4 | ROUGE-L | METEOR | CIDEr-D |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1st Validation set | | | | | | | |
| SINet-Caption \| img (C3D) | 16.93 | 7.91 | 3.53 | 1.58 | 18.81 | 8.46 | 36.37 |
| SINet-Caption \| img (ResNeXt) | 18.71 | 9.21 | 4.25 | 2.00 | 20.42 | 9.55 | 41.18 |
| SINet-Caption \| obj (ResNeXt) | 19.00 | 9.42 | 4.29 | 2.03 | 20.61 | 9.50 | 42.20 |
| SINet-Caption \| img+obj \| no co-attn (ResNeXt) | 19.89 | 9.76 | 4.48 | 2.15 | 21.00 | 9.62 | 43.24 |
| SINet-Caption \| img+obj (ResNeXt) | 19.63 | 9.87 | 4.52 | 2.17 | 21.22 | 9.73 | 44.14 |
| 2nd Validation set | | | | | | | |
| SINet-Caption \| img (C3D) | 17.42 | 8.07 | 3.53 | 1.35 | 18.75 | 8.41 | 40.06 |
| SINet-Caption \| img (ResNeXt) | 18.91 | 9.41 | 4.28 | 1.68 | 20.49 | 9.56 | 45.05 |
| SINet-Caption \| obj (ResNeXt) | 19.14 | 9.53 | 4.47 | 1.81 | 20.73 | 9.61 | 45.84 |
| SINet-Caption \| img+obj \| no co-attn (ResNeXt) | 19.97 | 9.88 | 4.55 | 1.90 | 21.15 | 9.96 | 46.37 |
| SINet-Caption \| img+obj (ResNeXt) | 19.92 | 9.90 | 4.52 | 1.79 | 21.28 | 9.95 | 45.54 |

Table 2: METEOR, ROUGE-L, CIDEr-D, and BLEU@N scores on the ActivityNet Captions 1st and 2nd validation sets. All methods use ground truth temporal proposals, and our results are evaluated using the code provided in Krishna et al. (2017). Our results with ResNeXt spatial features use videos sampled at maximum 1 FPS only.
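As a quick consistency check, the validation scores reported in Table 1 should equal the mean of the 1st and 2nd validation set scores in Table 2. The sketch below verifies this for the full img+obj (ResNeXt) model, with the numbers copied from the two tables; the variable names are illustrative.

```python
# Scores of SINet-Caption | img+obj (ResNeXt), copied from Tables 1 and 2.
first = {"B@1": 19.63, "B@2": 9.87, "B@3": 4.52, "B@4": 2.17,
         "ROUGE-L": 21.22, "METEOR": 9.73, "CIDEr-D": 44.14}
second = {"B@1": 19.92, "B@2": 9.90, "B@3": 4.52, "B@4": 1.79,
          "ROUGE-L": 21.28, "METEOR": 9.95, "CIDEr-D": 45.54}
reported = {"B@1": 19.78, "B@2": 9.89, "B@3": 4.52, "B@4": 1.98,
            "ROUGE-L": 21.25, "METEOR": 9.84, "CIDEr-D": 44.84}

# Mean of the two validation sets, compared with the averaged row in Table 1.
averages = {name: (first[name] + second[name]) / 2 for name in first}
max_gap = max(abs(averages[name] - reported[name]) for name in first)
print(max_gap < 0.0051)  # True
```

The largest difference is about half a unit in the second decimal place, which is what rounding to two decimals would produce.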

Qualitative Analysis

To further validate the proposed method, we qualitatively show how the SINet-Caption selectively attends to various video frames (temporal), objects, and object relationships (spatial) during each word generation. Several examples are shown in Figures 4, 5, 6, and 7. Note that at each word generation step, the SINet-Caption uses the weighted sum of the video frame representations and the weighted sum of object interactions at the corresponding timesteps (co-attention). Also, since we aggregate the detected object interactions via an LSTM cell through time, the feature representation of the object interactions at each timestep can be seen as a fusion of interactions at the present and past time. Thus, if the temporal attention has its highest weight on a given timestep, it may actually attend to the interactions aggregated from the first timestep up to that timestep. Nonetheless, we only show the video frame with the highest temporal attention for convenience. We use red and blue to represent the two selected sets of objects.

In each of the figures, the video frames (with maximum temporal attention) at different timesteps are shown along with each word generation. All ROIs in the top or bottom images are weighted with their attention weights. In the top image, ROIs with weighted bounding box edges are shown, whereas in the bottom image, we set the transparency ratio equal to the weight of each ROI. The brighter the region is, the more important the ROI is. Therefore, less important ROIs (with smaller attention weights) will disappear in the top image and be completely black in the bottom image. When generating a word, we traverse the selection of beam search at each timestep.

Figure 4: The man is then shown on the water skiing. We can see that the proposed SINet-Caption often focuses on the person and the wakeboard, and most importantly it highlights the interaction between the two, i.e. the person steps on the wakeboard.
Figure 5: Two people are seen playing a game of racquetball. The SINet-Caption is able to identify that two persons are playing racquetball and highlights the corresponding ROIs in the scene.
Figure 6: People are surfing on the water. At the beginning, the SINet-Caption identifies that multiple people are in the ocean. The person who is surfing on the water is then successfully identified, and the rest of the irrelevant objects and background are completely ignored.
Figure 7: A man is sitting on a camel. The SINet-Caption is able to detect the ROIs containing both the persons and the camel. We can also observe that it highlights both the ROIs for the persons who sit on the camel and the camel itself at frames 3 and 9. However, the proposed method failed to identify that there are multiple people sitting on two camels. Furthermore, in some cases, it selects the person who leads the camels. This seems to be because the same video is also annotated with another caption focusing on that particular person: A short person that is leading the camels turns around.