Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

by Zhiyuan Fang, et al.
Arizona State University

Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. These changes can be observable, such as movements, manipulations, and transformations of the objects in the scene; these are reflected in conventional video captioning. However, unlike images, actions in videos are also inherently linked to social and commonsense aspects such as intentions (why the action is taking place), attributes (such as who is doing the action, on whom, where, using what etc.) and effects (how the world changes due to the action, the effect of the action on other agents). Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, in order to describe latent aspects such as intentions, attributes, and effects. We present a new dataset "Video-to-Commonsense (V2C)" that contains 9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally, we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. We finetune our commonsense generation models on the V2C-QA task, where we ask questions about the latent aspects in the video. Both the generation task and the QA task can be used to enrich video captions.




1 Introduction

When humans watch videos they can typically understand and reason about various aspects of the scene beyond the visible objects and actions. This involves understanding that some objects are active agents who not only perform actions and manipulate objects, but are motivated by intentions, have pre-conditions, and their actions have an effect on their mental states. For instance consider the example video in Figure 1. In analyzing this video clip, humans employ various capabilities such as perception, reasoning, inference, and speculation, to come up with a description for the observable sequence of events, but also reason about latent aspects such as the intention of the group of runners “to win the medal”, the effect of being “congratulated at the finish line”, and the attribute “athletic”.

The above example also illustrates that recognition of objects, actions, and events is often not enough; understanding the causal relationships, social interactions, and commonsense aspects behind them provides more context and a much more semantic interpretation of the video [gupta2009understanding]. A model that can provide such detailed interpretations facilitates answering inferential questions such as “Will the player get angry later?”. However, existing visual understanding systems are unable to perform such tasks that require reasoning and inference [marcus2018deep]. Inspired by this, we argue that a critical yet missing element in complex video understanding is the capability of performing commonsense inference, especially in a generative form. Existing efforts seek to find the textual explanations or intentions of human activities as a classification task [vondrick2016predicting] or a vision-to-text alignment problem [zhu2015aligning]. In this paper we propose the Video to Commonsense (V2C) framework to generate commonsense descriptions about the underlying event in the video, enriching the factual description provided by a caption. Our generative model is more grounded and expressive due to the diverse and rich commonsense that it generates, and can also be used as a building block for downstream tasks such as video question answering.

The V2C task requires a system to generate captions as well as three types of commonsense descriptions (intention, effect, attribute) directly from an input video. Our model, the “V2C-Transformer”, utilizes an encoder-decoder design, with a video encoder that extracts global representations of the video, and a transformer decoder that generates captions and commonsense descriptions. Recent efforts [bosselut2019comet] indicate that transformer decoders are applicable to textual commonsense inference and completion tasks. Inspired by this, we use a cross-modal self-attention module that exploits the joint visual-textual embeddings.

For training our models, we curate the V2C dataset. We adopt the MSR-VTT video description dataset as a source of videos and captions. We first utilize the ATOMIC [sap2018atomic] machine commonsense dataset to get a list of candidate commonsense texts (intentions, effects, and attributes), and rank these using a BERT-based [devlin2018bert] model. These candidates are retrieved without using the video and therefore may not be accurate. So we instruct human annotators (Amazon Mechanical Turk workers) to annotate these videos or to select, remove, or rewrite the texts retrieved from ATOMIC. The text retrieved from ATOMIC helps our human annotators understand the format of the desired annotations, and also gives them a list of suggestions. The human component in our annotation procedure makes our data natural, relevant, and closer to human understanding of video events.

We additionally explore the use of our V2C-Transformer architecture for an open-ended video question answering task, where the questions are about commonsense aspects of the video. For this, we create a QA addendum of the V2C dataset called V2C-QA. By asking questions about the latent aspects in the video, our models are able to enrich caption generation with three specific types of commonsense knowledge.

To summarize our contributions:

  1. We formulate the novel and challenging “V2C” task for enriching video captioning by generating descriptions of commonsense aspects.

  2. We curate a new video dataset annotated with captions and commonsense descriptions for the purpose of training models.

  3. We present our V2C-Transformer architecture that effectively generates relevant commonsense descriptions, and serves as a strong baseline.

  4. We pose V2C as a question-answering task and show that asking specific questions about the video can be used for commonsense caption generation.

2 Video to Commonsense (V2C)

2.1 Problem Formulation

Consider a video $V$ consisting of $N_v$ frames and described by a sentence $S$. Traditional video captioning models are formulated as a visual-to-text translation function $f: V \rightarrow S$. Our Video-to-Commonsense (V2C) framework can be used for generating commonsense descriptions $C$ under two different settings. In the first setting (V2C-Completion), we use ground-truth captions to guide commonsense-enriched caption generation, in short Commonsense Generation. Under this setting the task can be viewed as providing supplementary explanations to the caption. In the second setting (V2C-Generation), we first learn to generate captions from videos and then use them to generate commonsense descriptions. Under this setting we learn to generate commonsense-enriched captions.

V2C-Completion: $f: \langle V, S \rangle \rightarrow C$ (1)
V2C-Generation: $f: V \rightarrow \langle S, C \rangle$ (2)

2.2 V2C-Transformer

Figure 2: V2C-Transformer model architecture. (a) Our Video Encoder takes video frames as input and encodes them to obtain frame-wise representations. The final video encoding is collected by concatenating the hidden outputs from each LSTM module. (b) Our Decoder module consists of a Caption Decoder and a Commonsense Decoder, both with architectures identical to the Transformer Decoder. The ground-truth caption/commonsense is encoded and fed into the block as input during training, together with the video encoding. The concatenation symbol in the figure represents the vector concatenation operation. (c) The Transformer Decoder module is a stack of N consecutive transformer blocks (shown inside the dashed area). The output from each transformer block is the input query for the next block.

We propose our Video2Commonsense Transformer, a cross-modal model that generates captions and commonsense-enriched descriptions from videos. Our approach (Figure 2) adopts the “encoder-decoder” design: a video encoder that extracts global representations of the input video, and a transformer decoder that produces relevant commonsense knowledge along with captions.

Video Encoder: Given an input video, we produce frame-wise feature vectors by using a ResNet-152 [he2016deep] pre-trained on ImageNet [deng2009imagenet]. Then we process this sequence of frame-wise features with an LSTM model [sundermeyer2012lstm], which is considered to have excellent ability for modeling long temporal sequences. We extract the last hidden states from the LSTM unit as the video representations. In order to encourage the decoder to select and emphasize more expressive moments, we concatenate all the previous hidden states from each LSTM module into a final global video encoding, which provides the model with explicit context through the temporal attention mechanism.

Decoder: The video encoding is used as input to two decoder networks that use a transformer language model [radford2018improving] to generate a caption and a commonsense description, using an inference mechanism similar to [bosselut2019comet]. However, unlike this method, our model is a two-stage process that first predicts the current events directly from videos, and then produces the corresponding commonsense descriptions. During training, the caption decoder takes the video encoding and the ground-truth caption as input to generate the caption encoding, while the commonsense decoder utilizes the concatenation of the video encoding and the caption encoding for commonsense decoding (see Figure 2 (b)). This arrangement enables the attention module in the commonsense decoder to attend to both the video and the caption context. For the input embedding, a positional encoding based on the relative positions of the words is added to the word embedding, which yields the final embeddings for the caption and the commonsense description.


The Transformer Decoder is composed of a stack of transformer blocks (see the dashed area in Figure 2 (c)), whose main component is a self-attention architecture. It takes as input the summation of the word embedding and the positional encoding, offset by one position, and passes it through masked multi-head attention, which prevents future words from being seen. In our model, we deploy two stacked decoder architectures, one for caption decoding and one for commonsense knowledge decoding. The Transformer Block consists of consecutive transformations: a multi-head attention module, a two-layer feed-forward network, a layer normalization operation, and a residual connection (see Figure 2 (c)).
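The one-position offset and the masking of future words can be illustrated with a small, library-free sketch. The helper names `shift_right` and `causal_mask` and the `<bos>` start token are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def shift_right(tokens, bos="<bos>"):
    """Offset the target sequence by one position (teacher forcing)."""
    return [bos] + list(tokens[:-1])


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


caption = ["a", "man", "is", "running"]
decoder_input = shift_right(caption)   # what the decoder sees at each step
mask = causal_mask(len(caption))       # lower-triangular visibility pattern
```

With this pattern, the prediction at position i only depends on the words up to position i, which is what allows the decoder to be trained on whole sequences while still generating word by word at test time.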


Multi-head Attention module: To enable our transformer decoder to generate commonsense descriptions by using both the visual and textual content, we modify the multi-head attention module (which acts as the basic unit in recent transformer-based language generation models [radford2018improving, radford2019language]) into a cross-modal module. The module takes as input the embeddings of the key (K), value (V) and query (Q). The key and value in a transformer block are the video encoding (caption decoder) or the concatenation of the video and caption encodings (commonsense decoder), while the query is the output of the previous transformer block. In the masked multi-head attention module, K, V and Q are all the same input embedding. For a self-attention block with $h$ heads, the output is computed by the scaled dot-product attention operation:

$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{(Q W_i^{Q})(K W_i^{K})^{\top}}{\sqrt{n}}\right) V W_i^{V}, \qquad \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$$

for head-index $i \in \{1, \ldots, h\}$, key-dimension $n$, and transformation parameters $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$.
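As a rough illustration of the cross-modal attention described above, the following sketch computes single-head scaled dot-product attention in plain Python. It omits the learned per-head projections, and it assumes that the video and caption encodings are concatenated along the sequence axis; both simplifications are assumptions, and the toy vectors are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(n)) V for lists of vectors."""
    n = len(keys[0])  # key dimension
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(n) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)]
        )
    return outputs


# Keys/values stand in for the video (and caption) encodings; the query
# stands in for the output of the previous transformer block.
video_encoding = [[1.0, 0.0], [0.0, 1.0]]
caption_encoding = [[1.0, 1.0]]
memory = video_encoding + caption_encoding  # assumed sequence-axis concatenation
queries = [[1.0, 0.0]]
attended = scaled_dot_product_attention(queries, memory, memory)
```

Passing only `video_encoding` as keys and values corresponds to the caption decoder, while passing `memory` corresponds to the commonsense decoder, which can attend to both modalities.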

3 The V2C Dataset

Figure 3: The overall three-step pipeline to construct our V2C dataset.
| Type | Video Caption | Commonsense |
| Intention | Two guys are wrestling in a competition | to beat the opponent |
| Intention | Woman and man are singing | to express themselves musically |
| Attribute | A guy is singing in a crowd | outgoing |
| Attribute | Group of riders race on tiny motorcycles. | adventurous |
| Effect | A person is making a paper airplane | gets excited to let the plane fly |
| Effect | A man and a woman are talking to each other | shares ideas and opinions |
Table 1: Examples of commonsense annotations (intentions, attributes and effects) retrieved from ATOMIC for captions in MSR-VTT.

For the V2C task we need video clips annotated with commonsense descriptions about the agents in the video, as shown in Figure 1. While there are quite a few video captioning datasets such as MSR-VTT [xu2016msr], the captions in these datasets describe only the observable objects in the video and do not include commonsense-based predictions. We are the first to curate such a dataset, with annotations describing the intention of the agent to perform an action, the effect of the action, and the attribute of the agent given the action.

MSR-VTT contains around 10k videos accompanied by 20 human-annotated textual descriptions per video on average. Each video is 10 to 30 seconds long. The content of the videos covers a variety of topics and scenes (news, sports, music etc.). For training and benchmarking the novel V2C task, we further complement MSR-VTT with event-level commonsense annotations, i.e. event descriptions with intentions, effects and attributes.

ATOMIC [sap2018atomic] is an atlas of everyday commonsense knowledge and contains 877k triplets about causes and effects of human activities, organized as if-then relations, annotated by crowd-sourced workers. This data can be categorized based on causal relations, thereby giving us the categories “cause”, “effect” and “attribute”, e.g., “if X wants to relax, then he will play a video game.”

3.1 Querying from ATOMIC and Re-ranking using BERT

Since inferential knowledge in ATOMIC only covers human activities, we first retain only those captions in MSR-VTT that describe human activities. We then select the three queries from ATOMIC most similar to the caption, and extract the commonsense descriptions corresponding to these queries. In order to select a more reasonable subset of commonsense descriptions, we first train a ranking model. Our ranking model is based on the BERT [devlin2018bert] architecture, trained for a binary classification task to predict the relevance of a candidate commonsense description with respect to the event. We select the top-3 relevant intentions/effects/attributes for each caption. This allows us to obtain a preliminary set of commonsense annotations directly from the ATOMIC dataset, relevant to the caption, albeit with noise and with annotations that are not relevant to the video.
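The final selection step can be sketched as follows. The relevance scores here are placeholders standing in for the output of the BERT-based classifier, and the function name `top_k_per_type` and all candidate texts other than “to beat the opponent” are hypothetical.

```python
def top_k_per_type(candidates, k=3):
    """Keep the k highest-scoring candidates for each commonsense type.

    `candidates` is a list of (type, text, relevance_score) tuples; the
    scores are placeholders for the output of a relevance classifier.
    """
    by_type = {}
    for ctype, text, score in candidates:
        by_type.setdefault(ctype, []).append((score, text))
    return {
        ctype: [text for _, text in sorted(items, key=lambda x: x[0], reverse=True)[:k]]
        for ctype, items in by_type.items()
    }


# Hypothetical candidates for the caption "two guys are wrestling in a competition".
candidates = [
    ("intention", "to beat the opponent", 0.92),
    ("intention", "to relax", 0.10),
    ("intention", "to win", 0.85),
    ("intention", "to stay fit", 0.40),
    ("attribute", "competitive", 0.88),
]
selected = top_k_per_type(candidates)
```

Because the scores depend only on the caption text, the selected candidates may still be irrelevant to the video, which is why the human annotation step that follows is needed.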

| Type | Model | PPL | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Meteor | Rouge |
| Attribute | Textual Retrieval | 1.85 | 4.48 | 7.90 | - | - | - | - |
| Attribute | Attention-Enc-Dec [gao2017video] | 3.24 | 6.99 | 0.00 | - | - | - | - |
| Attribute | Dense Captioning [zhou2018end] | 2.05 | 11.00 | 9.10 | - | - | - | - |
| Attribute | Video CMS Transformer | 1.79 | 15.41 | 13.32 | - | - | - | - |
| Effect | Textual Retrieval | 5.32 | 18.73 | 15.25 | 13.20 | 12.18 | 19.42 | 14.16 |
| Effect | Attention-Enc-Dec [gao2017video] | 6.83 | 6.46 | 2.43 | 0.00 | 0.00 | 3.94 | 5.48 |
| Effect | Dense Captioning [zhou2018end] | 6.53 | 9.67 | 5.40 | 3.72 | 2.86 | 6.66 | 8.77 |
| Effect | Video CMS Transformer | 6.12 | 11.80 | 7.81 | 6.21 | 5.25 | 7.82 | 10.33 |
| Intention | Textual Retrieval | 5.62 | 20.22 | 15.06 | 11.83 | 10.29 | 22.41 | 17.55 |
| Intention | Attention-Enc-Dec [gao2017video] | 6.27 | 27.87 | 15.33 | 8.60 | 4.44 | 9.56 | 24.15 |
| Intention | Dense Captioning [zhou2018end] | 5.88 | 31.84 | 17.87 | 10.53 | 7.37 | 10.21 | 29.40 |
| Intention | Video CMS Transformer | 5.21 | 36.54 | 21.69 | 13.87 | 10.73 | 12.65 | 32.77 |

Table 2: Evaluation scores of the commonsense completion task using BLEU, Perplexity (PPL), Rouge and Meteor metrics. We use only BLEU-1,2 to evaluate attribute generation, since the average length of the ground truth is less than 4.

3.2 Detailed Human Annotation

Since the commonsense descriptions are retrieved from ATOMIC without using the video, we employ human workers to annotate our dataset. We recruit two sets of human workers to watch the video, read the caption and select/annotate the relevant commonsense descriptions for each video. The first set consists of Amazon Mechanical Turk (AMT) workers who select relevant descriptions. The second set consists of skilled human annotators, screened and chosen from a set of undergraduate and graduate students proficient in English, who are asked to provide annotations in their own words, and to remove or edit irrelevant annotations that were provided by ATOMIC. This makes our annotations not only grounded in the video, but also more descriptive, linguistically diverse, and of higher quality (see Figure 3). The descriptions from ATOMIC, although not relevant to the video in some cases, give our workers an idea about the format of annotations desired. The skilled humans reported that of the captions were correct, and that of the ATOMIC descriptions were useful in understanding the annotation task.

Through this procedure, we obtain 6,819 videos for training and 2,903 videos for testing, with 121,651 captions in total, each caption accompanied by 3 intentions/effects/attributes (as in Figure 1). In addition, we also collect a subset of sentences containing commonsense annotations for each video by instructing human annotators to select one of the commonsense candidates and rewrite the raw phrase into complete sentences that complement the captions (see Figure 3). In total we have 3 complete sentences per video for intention/effect/attribute respectively, and this yields a subset that allows our model to generate complete story-like sentences. Table 1 shows illustrative examples from the newly compiled dataset. Finally, we conduct rigorous human evaluations to evaluate the quality of our V2C dataset through AMT (see “Gold Annotations” in Table 3). Details about dataset construction and quality evaluation are in the supplementary material.

4 Experiments

4.1 Training

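The decoder is trained to maximize the log-likelihood objective $\mathcal{L} = \mathcal{L}_{cap} + \mathcal{L}_{cms}$, where

$$\mathcal{L}_{cap} = \sum_{t=1}^{N_S} \log \Pr(y_t \mid y_{<t}, V; \Theta), \qquad \mathcal{L}_{cms} = \sum_{t=1}^{N_C} \log \Pr(y_t \mid y_{<t}, V, S; \Theta)$$

where $\Theta$ are the model parameters, $y_t$ denotes the one-hot vector probability of each word at time $t$, $V$ is the input video, $S$ is the caption, and $N_S$, $N_C$ denote the length of the caption and commonsense.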
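As a worked illustration of a two-term objective of this kind, the sketch below sums the log-probabilities that a model assigns to the ground-truth words of the caption and of the commonsense description. The additive decomposition and the function names are assumptions, and the per-token probabilities are invented.

```python
import math


def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities assigned to the ground-truth tokens."""
    return sum(math.log(p) for p in token_probs)


def v2c_objective(caption_probs, commonsense_probs):
    """Caption term plus commonsense term; training would maximize this value."""
    return sequence_log_likelihood(caption_probs) + sequence_log_likelihood(
        commonsense_probs
    )


# Hypothetical probabilities of the correct word at each time step.
caption_probs = [0.5, 0.25]
commonsense_probs = [0.5]
objective = v2c_objective(caption_probs, commonsense_probs)
```

The value is always non-positive and approaches zero as the model becomes more confident in the correct words, so maximizing it is equivalent to minimizing the usual cross-entropy loss.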

4.1.1 Hyperparameters

Our decoder is a lightweight transformer decoder consisting of 6 transformer blocks with 8 attention heads each. We use the Adam optimizer with 5000 warm-up steps, a learning rate initialized at -4, and a dropout probability of 0.1 after the residual layer. Our model is trained on a machine with a single NVIDIA 1080-Ti GPU.
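The text does not specify the shape of the warm-up schedule. A common choice, shown here purely as an assumption, is a linear ramp to the base learning rate over the warm-up steps. The function name and the `base_lr` value are placeholders, not the settings used in the paper.

```python
def warmup_learning_rate(step, base_lr, warmup_steps=5000):
    """Linearly ramp the learning rate to `base_lr`, then hold it constant."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# `base_lr=1.0` is a placeholder so the values read as fractions of the base rate.
schedule = [warmup_learning_rate(s, base_lr=1.0) for s in (0, 2499, 4999, 10000)]
```

Starting from a small learning rate is often used with transformer decoders and Adam, since early gradient estimates are noisy; other schedules (for example, warm-up followed by decay) would fit the description equally well.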

4.2 Setting

In order to obtain video representations, we uniformly sample 40 frames from each video and feed them to a ResNet [he2016deep] pre-trained on the ImageNet ILSVRC12 dataset [deng2009imagenet], taking the 2048-d output of the last layer as features. We use a one-hot (1-of-N) encoding of the text input and pass it through an embedding layer to produce a 1028-d hidden vector. We use independent vocabularies for captioning and commonsense generation, with sizes 27,603 and 24,010 respectively.
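Uniform frame sampling can be sketched as below. The function name and the rounding rule are assumptions; the source only states that 40 frames are sampled uniformly from each video.

```python
def uniform_frame_indices(num_frames, num_samples=40):
    """Pick `num_samples` frame indices spread evenly over the video."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


# Example: a clip with 300 frames (the frame count is illustrative).
indices = uniform_frame_indices(num_frames=300)
```

Each selected frame would then be passed through the image network, giving a sequence of 40 feature vectors per video. For clips shorter than 40 frames, this particular sketch repeats some indices rather than failing.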

Baseline Model: We compare our method with several strong video captioning baseline models. “Attention-Enc-Dec” [gao2017video] is an LSTM-based model that is among the top performers on the MSR-VTT dataset. To compare V2C-Transformer against a model with a similar number of parameters, we also report the performance of “Dense Captioning” [zhou2018end], which is a transformer-based video captioning model. We also report results for the textual retrieval method, which follows the initial step of our data construction: we query ATOMIC and extract the commonsense description with the highest probability as the answer.

| Task | Model | Effect | Attribute | Intention | Average | Caption |
| E2C-Completion (Text-Only) | 9Enc9Dec [sap2018atomic] | 47.52 | 52.20 | 51.70 | 50.47 | - |
| E2C-Completion (Text-Only) | COMET [bosselut2019comet] | 55.50 | 57.48 | 68.32 | 60.43 | - |
| V2C-Completion | Att-Enc-Dec [gao2017video] | 66.09 | 52.40 | 56.26 | 58.25 | - |
| V2C-Completion | VCT-Completion | 66.83 | 63.45 | 67.37 | 65.88 | - |
| V2C-Generation | Att-Enc-Dec [gao2017video] | 55.93 | 74.87 | 65.54 | 64.78 | 74.67 |
| V2C-Generation | VCT-Generation | 62.99 | 73.54 | 66.74 | 67.76 | 73.17 |
| Gold Annotations | V2C Dataset | 75.19 | 83.03 | 80.11 | 79.44 | 95.01 |
Table 3: Human evaluation scores for V2C. Captions are an input for the V2C-Completion task, and generated for the V2C-Generation task. The best-performing model for each description category and task is given in bold, while the overall best is underlined.

4.3 Metrics

We report performance under both automatic scores and human evaluations, following the protocols from [bosselut2019comet, sap2018atomic]. We evaluate our method using BLEU (n=1-4) [papineni2002bleu], Meteor [banerjee2005meteor], Rouge [lin2004rouge], as well as the perplexity score of the generation on its corpus. During our dataset construction, we put aside all captions and videos that do not have clear human activities. Because removing them leads to an imbalance in the number of captions for each video, it is inappropriate to evaluate caption generation using BLEU scores alone. Thus, besides the aforementioned metrics, we further conduct human evaluations using AMT workers, who are asked to identify whether the generated commonsense justifiably completes the events (V2C-completion). We follow the setup in [sap2018atomic] and randomly sample 100 videos from the test set and collect 10 generations for each. To guarantee the objectivity of the human evaluations, we hire 5 workers for each sample, yielding 30k ratings in total for each model.
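To make the BLEU-n scores concrete, the sketch below computes the clipped (modified) n-gram precision that sits at the core of BLEU. It deliberately leaves out the brevity penalty and the geometric mean over n, so it is not a full BLEU implementation, and the candidate and reference phrases other than “to win the medal” are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision, the core quantity inside BLEU-n."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "to win the race".split()
references = ["to win the medal".split(), "to beat the opponent".split()]
p1 = modified_precision(candidate, references, 1)  # 3 of 4 unigrams match
p2 = modified_precision(candidate, references, 2)  # 2 of 3 bigrams match
p3_short = modified_precision(["athletic"], [["athletic"]], 3)  # no trigrams exist
```

The last line shows why higher-order BLEU is uninformative for very short outputs: a one-word attribute has no trigrams at all, so its trigram precision is zero even when the prediction is exactly right.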

4.4 Results

4.4.1 Natural Language Generation Metrics

We show the evaluation of the commonsense completion task in Table 2. Compared to the baseline models, our method exhibits a consistent overall improvement on almost all metrics. Our V2C-Transformer significantly outperforms the LSTM-based model in [gao2017video] by 6.3% at BLEU-4 for intention prediction, and surpasses the performance of Textual Retrieval, which follows our data construction procedure. It is worth noting that retrieving commonsense knowledge from ATOMIC also yields scores competitive with the learning methods, but since this is not learnt, it cannot be applied to novel videos at test time. Because the V2C-Transformer and the LSTM model share a similar video encoder, our performance improvement could be attributed to the use of self-attention mechanisms in the transformer block in the decoding phase. This observation is consistent with the conclusion from [bosselut2019comet], and lends further support to the transformer architecture being suited for commonsense inference tasks. Moreover, when compared with the Dense Captioning model, which has a similar transformer architecture and parameter count, our model exhibits better evaluation scores, verifying it as a strong baseline model for the V2C task.

4.4.2 Human Evaluation

In Table 3, E2C (Event to Commonsense) is the task of commonsense completion given only textual events [sap2018atomic, bosselut2019comet]. Our V2C completion task differs from E2C since our generation and inference are based on visual as well as textual modalities. Nevertheless, E2C provides us with a good starting point for comparison. 9Enc9Dec [sap2018atomic] is composed of nine GRU-based encoder-decoders and serves as a baseline model for commonsense completion on text, and COMET [bosselut2019comet] is a large-scale generative pre-trained transformer (GPT) model [radford2018improving]. We would like to highlight that our transformer model is lightweight, with only half of the parameters of GPT and without any pre-training.

We evaluate our model on the tasks of caption generation with human evaluations, and also compare it with the gold annotations. Our gold annotation for ground-truth captions (sourced from the MSR-VTT dataset) points to the fact that a small percentage of captions from MSR-VTT are not relevant to the video, and this is amended by our human workers.

For the V2C-Completion task, our V2C-Transformer model is substantially better (by 7.63% on average, per Table 3) than the LSTM-based model from [gao2017video], and shows a consistent lead on each dimension. Thus, when the ground-truth caption is given, our model is able to generate much more relevant commonsense descriptions, thereby consolidating its ability of commonsense generation.

For the task of V2C-Generation, the difference between human scores for the LSTM-based model and the V2C-Transformer is reduced, but our V2C-Transformer still outperforms it on average by 2.98%. The smaller gap may be attributed to the fact that the LSTM-based model is slightly better at generating captions.

4.4.3 Generating Textual Stories with Commonsense:

In order to generate story-like textual descriptions that complement the factual captions, we conduct an additional experiment that uses our collected rewritten sentences to train our V2C-Transformer model. Specifically, instead of producing the commonsense knowledge given the videos and captions, we finetune our pre-trained V2C-Transformer model on predicting the human-rewritten texts, and generate complete story-like captions. Since we do not have enough annotations per sample to compute a fair BLEU score for comparisons, we showcase some sample generated descriptions for qualitative analysis. We observe that V2C-Transformer is able to produce complete stories that contain simple yet logically consistent storylines that complement both the visual content and the factual descriptions. We believe that collecting a set of story-like sentences will further enrich our models, and allow us to generate much more contextual, creative, and natural commonsense descriptions from a video.

5 V2C-QA

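| Type | Model | top-1 p | top-1 r | top-3 p | top-3 r | top-5 p | top-5 r |
| Intention | MSR-VTT QA [xu2017video] | 9.68 | 2.13 | 7.15 | 4.68 | 6.07 | 6.60 |
| Intention | V2C-T | 10.34 | 2.31 | 7.69 | 5.03 | 6.37 | 6.87 |
| Intention | V2C-T + Captions | 10.72 | 2.54 | 8.08 | 5.47 | 6.39 | 7.20 |
| Intention | Pretrained V2C-T | 10.77 | 2.69 | 8.01 | 5.58 | 6.71 | 7.88 |
| Intention | Pretrained V2C-T + Captions | 11.04 | 2.68 | 7.96 | 5.70 | 6.63 | 7.79 |
| Effect | MSR-VTT QA | 19.89 | 5.02 | 8.04 | 5.91 | 5.30 | 6.49 |
| Effect | V2C-T | 20.95 | 5.43 | 8.65 | 6.57 | 5.65 | 7.06 |
| Effect | V2C-T + Captions | 20.95 | 5.32 | 8.50 | 6.48 | 5.76 | 7.26 |
| Effect | Pretrained V2C-T | 20.95 | 5.32 | 8.63 | 6.55 | 5.82 | 7.49 |
| Effect | Pretrained V2C-T + Captions | 21.12 | 5.60 | 8.70 | 6.89 | 5.83 | 7.68 |
| Attribute | MSR-VTT QA | 46.10 | 37.22 | 16.02 | 49.45 | 7.49 | 41.03 |
| Attribute | V2C-T | 59.52 | 48.30 | 22.39 | 51.40 | 13.97 | 52.57 |
| Attribute | V2C-T + Captions | 59.74 | 48.22 | 23.12 | 52.44 | 14.64 | 54.35 |
| Attribute | Pretrained V2C-T | 60.72 | 49.00 | 23.18 | 52.73 | 14.98 | 55.40 |
| Attribute | Pretrained V2C-T + Captions | 59.57 | 48.24 | 23.10 | 52.54 | 14.94 | 54.91 |

Table 4: Precision (p) and Recall (r) for V2C-QA for each type of question.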
Figure 4: Example questions from V2C-QA. As opposed to conventional visual question answering, our questions are about the unobservable commonsense aspects such as intention, effect, and attribute.

Another way of generating commonsense descriptions about the video is by asking pointed questions. Consider the example in Figure 1, where we ask the question “What happens next to the runners”, about the effect of the action “prepare” performed by the agents “group of runners” observed in the video. We propose V2C-QA, an open-ended commonsense video question-answering task, where we ask questions about the intents, effects and attributes of the agents in the video.

5.0.1 Dataset

We use the caption and commonsense annotations in the V2C dataset, and extract the action and subject from the caption using SpaCy linguistic features [honnibal-johnson:2015:EMNLP]. For each intention, attribute and effect for a video, we use SpaCy and template-based generation to get 7 types of questions, yielding 21 questions per sample. Similar to the work on answering negations of questions in [gokhale2020vqalol], we also create negative questions in V2C-QA. In total, we have 1250 training videos and 250 test videos, and a total of 37k questions. We have a set of 5555 unique answers for our questions. Each question can have multiple possible true answers, as shown in the example in Figure 4. The V2C-QA task asks questions that require commonsense reasoning about internal mental states, motivations, and latent aspects of intention, effect and attribute of agents in the video, as opposed to the conventional video-QA task that contains questions about visible objects and actions.
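Template-based question generation of this kind can be sketched as follows. The templates below are hypothetical, since the actual V2C-QA templates are not given in this text, and the subject and action are passed in directly rather than extracted with SpaCy. Two templates per type are shown; with 7 templates per type the same loop would yield the 21 questions per sample mentioned above.

```python
# Hypothetical templates; the real V2C-QA templates are not listed in the text.
TEMPLATES = {
    "intention": [
        "What is the intention of the {subject} {action}?",
        "Why are the {subject} {action}?",
    ],
    "effect": [
        "What happens to the {subject} after {action}?",
        "What is the result of the {subject} {action}?",
    ],
    "attribute": [
        "How would you describe the {subject} {action}?",
        "What trait do the {subject} {action} show?",
    ],
}


def generate_questions(subject, action, templates=TEMPLATES):
    """Fill every template of every commonsense type with the subject and action."""
    return [
        (ctype, template.format(subject=subject, action=action))
        for ctype, group in templates.items()
        for template in group
    ]


questions = generate_questions("runners", "preparing")
```

Each generated question would be paired with the annotations of the matching type as its set of acceptable answers, which is why a single question can have several true answers.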

Models: We utilize our V2C-Encoder followed by an open-ended answering module. We use an attention module over the question type, which provides feature-rich representations that attend to answers corresponding to the question type. For textual features, we use embeddings from BERT-base [devlin2018bert]. Our models are trained on the open-ended QA task, set up as a multi-label classification task similar to VQA [antol2015vqa], with an answering module design inspired by LXMERT [tan2019lxmert]. Our loss function includes the classification loss for answering, the attention loss for question type, and a label-ranking loss.

Results: MSR-VTT QA [xu2017video] acts as a good baseline since it is trained on a conventional video QA task on the MSR-VTT videos. However, this model is trained for a multiple-choice QA scheme, so we modify it with our open-ended answering module. We also compare against variants in which our encoder is pretrained on the V2C caption generation task and then finetuned on the V2C-QA task, and against variants trained with ground-truth factual captions as input. Our results are shown in Table 4, where we evaluate on the prediction of top-1, top-3, and top-5 answers, and report precision and recall. It can be seen that using our encoder pre-trained on the V2C task outperforms all other models. Attribute-related questions are easier to answer, while the models struggle the most with questions about intention. Captions help in questions about effects.
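Precision and recall over the top-k predicted answers can be sketched as follows for a single question with several acceptable answers. The per-question definition used here is an assumption, since the exact averaging behind Table 4 is not described in this text, and the ranked predictions and ground-truth answers are invented.

```python
def precision_recall_at_k(ranked_answers, true_answers, k):
    """Precision and recall of the top-k predictions against a set of true answers."""
    top_k = ranked_answers[:k]
    hits = sum(1 for answer in top_k if answer in true_answers)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_answers) if true_answers else 0.0
    return precision, recall


# Hypothetical ranked predictions and ground-truth answers for one question.
ranked = ["athletic", "tired", "competitive", "happy", "fast"]
truth = {"athletic", "competitive", "determined", "fit"}
p1, r1 = precision_recall_at_k(ranked, truth, 1)
p3, r3 = precision_recall_at_k(ranked, truth, 3)
```

In this toy case precision falls and recall rises as k grows, which is the usual trade-off when only a few of the candidate answers are correct.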

6 Related Work

6.0.1 Textual Commonsense

Commonsense-based text generation has recently been explored in the natural language processing community. The ATOMIC dataset introduced by [sap2018atomic] is a corpus of 877k textual descriptions of inferential knowledge organized as if-then relations (e.g., “If PersonX wants to report a crime, PersonX calls the police.”). It covers 24K common event phrases with 9 types of relations, grouped into “causes”, “effects” and “statives”. The commonsense ground truths are gathered by crowd-sourcing from Amazon Mechanical Turk (AMT) workers, who are asked to describe multiple plausible intentions, effects and attributes. [bosselut2019comet] adopts the ATOMIC dataset to learn a generative model of commonsense knowledge. We recognize ATOMIC as a critical milestone in incorporating textual commonsense in contemporary machine learning algorithms for processing and understanding natural language. Unfortunately, we do not find similar works in the domain of computer vision that attempt to generate visual commonsense for visual understanding. Our work is the first step towards this end: a generative model to learn commonsense textual descriptions directly from videos.

Video Captioning: Video captioning is crucial for understanding videos and describing their content in natural language. Recent advances in video captioning have been able to generate captions that describe humans and objects in the video. Works such as [krause2016paragraphs, krishna2017dense] seek to generate paragraphs or multi-sentence captions about the image or video. It is important to note that video captioning systems have a major limitation in that they can only generate factual descriptions about observable objects or events in the video. However, for detailed video understanding, one needs to obtain descriptions that go beyond observable visual entities and use background knowledge and commonsense to reason about objects and actions in the video. Attempts to infer the motivations of human actions are also reflected in [pirsiavash2014inferring, vondrick2016predicting], which seek to incorporate commonsense knowledge extracted from text into predicting motivations on static images. Commonsense caption generation has been approached on abstract scenes and clipart images in [vedantamLinICCV15]. We present the first generative model for commonsense video captioning.

Video QA Since caption generation can only describe observable events, recent work seeks to move closer to comprehension and understanding, by learning to answer complex questions about videos. However, these datasets focus only on existing visual concepts that are directly evident from the video and construct the questions mostly about “where” and “what”  [yang2003videoqa, zhu2017uncovering]. For creating questions about high-level semantics such as motivation and intent of human actions,  [tapaswi2016movieqa, lei2018tvqa] collect limited questions about “why” and “how”, which can only be solved by reasoning on contextualized knowledge. We introduce a novel open-ended video question answering task in this paper, where the questions are about three aspects of commonsense human behavior.

Reasoning in Visual Question Answering  [zellers2019recognition] propose a challenging visual question-answering task that requires answering a question and providing a rationale behind the answer. This requires understanding of both visual and linguistic cues along with commonsense reasoning. Similarly, [lei2018tvqa] and [zadeh2019social] propose video-based QA tasks with open-ended high-order questions that need multi-modal understanding, social intelligence modeling, and visual reasoning. Spatial and compositional reasoning is required to answer questions about synthetic images in CLEVR[johnson2017clevr] and about natural images in GQA[hudson2019gqa].

Reasoning about Actions Another aspect of visual reasoning is the ability to predict a sequence of actions (procedure planning), or to reason about intermediate configurations (walkthrough planning) between two image frames. This topic has been explored for images and instructional videos in [gokhale2019blocksworld, chang2019procedure] respectively. However, these works develop reasoning capabilities in order to answer questions or predict plans, not to generate commonsense descriptions.

7 Outlook

A video typically contains one or many objects (sometimes moving or performing actions) in different backgrounds, scenes, or situations. Some objects may be “passive” such as trees or buildings, while some objects may be “active” such as people performing actions like walking, singing, and driving. This paper is focused on describing such active agents in terms of their intentions, effects of their actions, and attributes that characterize these agents.

Video Captioning vs Commonsense Description: We distinguish V2C from the traditional computer vision task of video captioning. Video captions describe observable objects, background, and actions. On the other hand, commonsense descriptions in our task seek to describe the latent, i.e. unobservable aspects of the video, such as intentions of the agent (a pre-condition or mental condition), effects of the action (that happen in the future and thus cannot be “seen” in the video), and attributes which characterize the agent and their qualities. Thus commonsense generation goes beyond the visible.

Structure of Commonsense Descriptions: Ours is the first attempt at developing a generative vision-commonsense model. No other traditional reasoning models are generative, nor do they have the capability of taking video input. Our work may pave the way for future work that explores other aspects of commonsense generation, with a scope for making the description more natural and linguistically diverse.

Potential Applications: We anticipate that our V2C-Transformer framework can be utilized for exciting real-life applications, such as a sports commentator engine which produces unique opinions about the competition (as is shown in Figure 1). Household assistive robots may also benefit from this capability, by being able to understand intentions from your action (e.g. the robot that can look at your action of reaching for your coffee mug, understand it as the intention of drinking more coffee, and fetch coffee for you).

8 Conclusion

In this paper, we explore a novel and challenging task: generating video descriptions with rich commonsense descriptions that complement the factual captions. We expand an existing video captioning dataset for the V2C task through extensive human augmentation, and present a novel V2C-Transformer model to serve as a strong baseline method for the V2C task. Our evaluation verifies the effectiveness of our method, while also indicating scope for further study, enhancement, and extensions in the future. Our experiments on using the V2C-Transformer as a component for the V2C-QA task show that the model has transfer learning capabilities that can be applied to vision-and-language tasks that require commonsense reasoning.


Appendix 0.A V2C Dataset Construction

Our dataset creation methodology is a three step procedure as shown in Figure 5. In the first step, we use the caption to query ATOMIC [sap2018atomic] and retrieve the top-3 intentions, effects, and attributes, which are then re-ranked by a BERT based model in the second step. Our third and final step involves humans in the annotation process. We ask human annotators to select the most relevant descriptions, and to provide additional descriptions in their own words. The annotators also convert a subset of our dataset into complete sentence descriptions. We describe this procedure in detail below.

Figure 5: The data creation flow for V2C. We use the retrieved videos and captions from MSR-VTT and use the BERT re-ranking module to obtain a list of top-3 intentions, effects, and attributes. These are then further improved by human labeling. A subset of annotations is also converted to full sentences by human annotators.
Figure 6: Next sentence prediction task in the BERT model. Sentences A and B are separated by the special token [SEP].
Commonsense Type | Accuracy (%)
Intention | 84.87
Effect | 86.53
Attribute | 87.23
Average | 86.21
Table 5: Accuracy of our BERT model for the task of next sentence prediction on the Atomic test dataset split

0.a.1 Querying from ATOMIC

For every video-caption pair in the Msr-vtt dataset, we select the 3 most similar events from Atomic. These are then used to retrieve textual descriptions of three types (intentions, effects, and attributes) from Atomic.
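The retrieval step can be illustrated with a small sketch. The similarity measure used to compare captions with Atomic events is not specified here, so the code below assumes a simple Jaccard overlap between token sets; the function names `jaccard` and `top_k_events` and the toy event list are hypothetical and only serve the illustration.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity (an assumed stand-in for the real measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def top_k_events(caption, events, k=3):
    """Return the k events most similar to the caption, best first."""
    return sorted(events, key=lambda e: jaccard(caption, e), reverse=True)[:k]


# Toy event phrases written in the Atomic style (hypothetical examples).
events = [
    "PersonX plays the guitar",
    "PersonX is playing a guitar",
    "PersonX bakes a cake",
    "PersonX calls the police",
]
top = top_k_events("a man is playing a guitar", events)
print(top)
```

The intentions, effects, and attributes attached to the returned events would then form the candidate pool that is passed on to re-ranking.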

Figure 7: Qualitative examples of our V2C dataset.

0.a.2 BERT Ranking Model

We implement a Bidirectional Encoder Representations from Transformers (BERT) [devlin2018bert] model as a ranking model to rank and retrieve the top-3 most plausible commonsense aspects completing the ground truth caption. More formally, we treat the ranking training as a binarized next sentence prediction (NSP) task [devlin2018bert] whose training pairs can be trivially generated from the Atomic [sap2018atomic] dataset. To be specific, when choosing the sentences A and B for each training pair, for 50% of the training pairs we choose the actual next sentence that follows A, and for the remaining 50% we choose a random sentence from Atomic as a negative sentence. This setting is consistent with the NSP task in [devlin2018bert]. We train our model on Atomic, and use it to expand video captions from Msr-vtt [xu2016msr]. Our BERT model consists of 12 transformer blocks, 12 attention heads, and 768 hidden dimensions (110M parameters in total). In total, we have 115,312 pairs for training/testing. We evaluate our model using the accuracy of its predictions on the test set of Atomic, which is 30% of the entire set. BERT achieves 86.21% accuracy on the NSP task on average (Table 5). In addition, we also conduct human evaluations to measure the overall quality of the expanded V2C dataset (see “gold annotations” in Table 3, main paper).
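To make the pair construction concrete, the following is a minimal sketch of how binarized NSP training pairs could be built from event/description data. The dictionary layout, the function name `build_nsp_pairs`, and the toy entries are assumptions made for illustration; the sketch also assumes that negatives are drawn from descriptions of other events.

```python
import random


def build_nsp_pairs(event_to_descriptions, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples for binarized NSP.

    event_to_descriptions maps an event phrase (sentence A) to the list of
    commonsense descriptions that follow it (hypothetical data layout).
    """
    rng = random.Random(seed)
    all_descriptions = [
        d for descs in event_to_descriptions.values() for d in descs
    ]
    pairs = []
    for event, descs in event_to_descriptions.items():
        for positive in descs:
            if rng.random() < 0.5:
                # Positive pair: B is the description that really follows A.
                pairs.append((event, positive, 1))
            else:
                # Negative pair: B is a random description of another event.
                candidates = [d for d in all_descriptions if d not in descs]
                pairs.append((event, rng.choice(candidates), 0))
    return pairs


toy = {
    "PersonX wants to report a crime": ["PersonX calls the police"],
    "PersonX plays the guitar": ["to entertain people", "gets applause"],
}
pairs = build_nsp_pairs(toy)
for sentence_a, sentence_b, is_next in pairs:
    print(is_next, "|", sentence_a, "|", sentence_b)
```

With real data, roughly half of the pairs end up positive and half negative. The sketch needs at least two events with distinct descriptions so that a negative candidate always exists.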

0.a.3 Human Labeling

With querying from ATOMIC and BERT re-ranking, we obtain commonsense descriptions that are relevant to the caption. However, we want to make sure that these descriptions are also relevant to the video. Thus we utilize human workers from Amazon Mechanical Turk (AMT) for selecting the most relevant commonsense descriptions. Our annotation interface is shown in Figure 8. We ask the annotators to select descriptions that are most relevant to the video and to the caption, and also encourage them to add their own commonsense descriptions. This makes our dataset more natural and human-like. This also allows us to remove noisy annotations that may be produced due to text-only ATOMIC querying.

Figure 8: Our human labeling interface. We ask human workers to select relevant commonsense descriptions as well as provide additional texts in their own words.
Figure 9: Word cloud figure of the intention commonsense annotations from our V2C dataset.

We show additional samples of our constructed V2C dataset in Figure 7. In addition, Figures 9 and 10 show the word cloud and the frequency of words in the V2C dataset.

Figure 10: Top-100 most frequent words in our V2C dataset (stop words are ignored).

0.a.4 Benefits of the Three-Step Pipeline

Since our videos are annotated with captions, we use the captions to retrieve commonsense descriptions from ATOMIC. The Atomic dataset has comprehensive annotations for human activities, actions, and events and as such covers most of the events in Msr-vtt. Thus using these two datasets together is a natural step for creating our V2C dataset.

This purely caption-based retrieval unfortunately does not incorporate the latent aspects of the video, but only those from the caption. Moreover, since the video is not used for retrieving these, the commonsense annotations may be out-of-context. Thus, we bring in human annotators to watch the video, read the caption, and then use the set of descriptions from ATOMIC to select the relevant ones and to discard the irrelevant or out-of-context descriptions. The human annotators then provide annotations about intention, effect, and attribute in their own words. The ATOMIC retrieved descriptions help the human annotators to get an idea about the task and also get a glimpse of the format of the desired annotations. This significantly reduces the noise in human annotations.

To measure and ensure the overall quality of our V2C dataset, we have conducted human evaluations on the V2C annotations. Our results show that 86.29% of the video-caption-commonsense triples are labeled as reasonable samples (see “Gold Annotations” in main paper, Table 3), verifying the quality of our dataset.

Appendix 0.B V2C-QA Dataset

For the V2C Question Answering task, we repurpose our V2C dataset and convert it to a question-answering dataset. We choose a subset of 1500 videos: 1250 for training and 250 for testing, following the same train-test split as MSR-VTT. We use SpaCy linguistic features [spacy2] along with the LemmInflect library and template-based generation to convert the captions, intentions, effects, and attributes from V2C into questions and ground-truth answers. Our templates are linguistically diverse, natural, and grammatically sound. We have 21 types of templates, each with numerous possible combinations of its slots. Thus we get 21 types of questions (7 each for intention, effect, and attribute) as shown in Table 6. Since our task is open-ended question-answering, our questions are annotated with all possible correct answers for that question. To get answers for the “negative” questions shown in Table 6, we use an adversarial matching strategy similar to [zellers2019recognition], based on RoBERTa [liu2019roberta] similarity. We will release our V2C-QA question and answer generation code publicly.

Question Type | Question | Answer
Intention | What might be the goal of the person? | to record a music video
Intention (Negative) | What could the person not want to achieve? | to bake a cake
Intention (Action) | What prompts the person to do the action? | to express themselves
Intention (Action, Negative) | What did not lead the person to act like that? | to feed the dog
Intention (Why) | Why might the person be doing the action? | to entertain viewers
Intention (Yes-No) | Does the person wish to express himself? | Yes
Intention (Yes-No, Negative) | Does the person want to not get recognition? | No
Effect | What will the person do after this? | puts the video on YouTube
Effect (Negative) | What does not happen as a result? | the person gets sad
Effect (Action) | What does the dancing end up in? | becomes tired
Effect (Action, Negative) | What will not happen due to the action? | feels tense
Effect (How) | How does the person feel after performing? | feels accomplised
Effect (Yes-No) | Could the person put it on YouTube as a result? | Yes
Effect (Yes-No, Negative) | Will the person not learn a new dance? | No
Attribute | What trait does the man possess? | musical
Attribute (Negative) | What attribute does not match with the person? | angry
Attribute (How) | How can the person be described? | entertaining
Attribute (Action, How) | How can the dancing person be characterized? | rhythmic
Attribute (Yes-No, Action) | Is the person who is singing smiling? | Yes
Attribute (Yes-No) | Is the person entertaining? | Yes
Attribute (Yes-No, Negative) | Is the person not tense? | Yes
Table 6: Examples of open-ended V2C-QA samples
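The template-based conversion described in this appendix can be sketched as simple slot filling. The three templates below are hypothetical and only modeled on the question styles in Table 6. The real pipeline relies on SpaCy and LemmInflect to extract and inflect the action, whereas this sketch assumes the slot value is supplied already inflected. The names `TEMPLATES` and `make_qa` are illustrative.

```python
# Hypothetical templates keyed by (commonsense aspect, question kind).
TEMPLATES = {
    ("intention", "why"): "Why might the person be {verb_ing}?",
    ("effect", "action"): "What does the {verb_ing} end up in?",
    ("attribute", "action_how"): "How can the {verb_ing} person be characterized?",
}


def make_qa(aspect, kind, verb_ing, answers):
    """Fill one template and attach every acceptable answer (open-ended QA)."""
    question = TEMPLATES[(aspect, kind)].format(verb_ing=verb_ing)
    return {"question": question, "answers": sorted(set(answers))}


qa = make_qa("effect", "action", "dancing", ["becomes tired", "gets applause"])
print(qa["question"])
print(qa["answers"])
```

Keeping a list of answers per question reflects the open-ended setting, in which a question is annotated with all of its correct answers.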

Appendix 0.C Qualitative Generation Results

We show additional V2C-Completion samples generated by our V2C-Transformer model in Table 7.

Intention | Caption | Effect | Attribute
to entertain people | a band is performing for a crowd | gets applause | acting
to try out PersonY’s new car | a man checks out detail on a car | gets a speeding ticket | helpful
to learn about current events | a complex news host gives an update on rappers. | gets informed about current political events | talkative
to be in a good mood | a group of people trying to perform an exorcism on a girl | gets applause | fun
to show his knowledgeable | there is an old man is answering to somebody questions | gets another question | sporty
to score a point | a man is shooting a basketball ground | gets exercise | helpful
to share their message | a man giving a speech to important people | gets applause | orator
to be safe from anything that lurks in the dark | a group of people are being chased by crocodiles | gets tired from taking pictures | scared
to be informed about the world | a girl is describing about hot news | learns about whats happening worldwide | gossipy
to watch something interesting | a children s television show clip | smiles at the screen | entertained
to enjoy the evening with the concert band | a band composed of older gentlemen are playing blue grass music on a small stage and people are dancing along to the music swing-style | gets tired form dancing | fun
to be part of the team | there is a woman playing badminton in a court | gets tired after exercise | athletic
to try out person ys new car | a boy explaining the features of a car | they check car websites online to look at deals | helpful
to escape reality | a man explaining a video game | takes the video game home | gamer
to cook something | there is a man in black cutting the green leaves on the desk | gets clean dishes | hungry
Table 7: Illustrative samples generated by our V2C-Transformer model on V2C-completion task.

Appendix 0.D Human Evaluation

Human evaluation is an important part of verifying both the performance of our model and the quality of the V2C dataset. In this section we describe our setup for human evaluation of the captions and commonsense descriptions in our dataset as well as those generated by our models.

0.d.1 Amazon Mechanical Turk Interface

We conduct our human evaluations by crowdsourcing ratings from workers on Amazon Mechanical Turk (AMT). We do these human evaluations on the same test set used for our automated metrics. Figures 16 and 17 show screenshots of the rating task as seen by the workers. The workers are given explicit instructions about this rating task, and depending on the task are asked to rate the commonsense descriptions and the caption. For the V2C-Completion task, the workers are provided with the video and the ground-truth caption and asked to rate only the generated commonsense (intention, effect, or attribute) on a scale of 1 to 5. The workers are asked to provide this rating on the basis of whether the generated text is relevant to the video, i.e., whether the caption/commonsense can plausibly complete the given event. For the V2C-Generation task, the workers are asked to rate the caption as well as the commonsense texts with respect to the video. The workers are also asked to conduct identical tasks for the gold (ground-truth) annotations in our new V2C dataset.

0.d.2 Scheme for Validity

Our ratings are measured on a scale of 1 to 5. Annotations which receive a score greater than 3 are considered “valid”. We do so to be consistent with the metrics used by  [bosselut2019comet] for their experiments which use a binary rating for each sample. We then compute average validity scores for each commonsense aspect: intention, attribute and effect.
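As an illustration of this scheme, the sketch below binarizes 1-to-5 ratings with the "greater than 3" rule and reports the percentage of valid ratings per commonsense aspect. It assumes that every individual rating is counted on its own; whether multiple ratings of one sample are aggregated first is not stated here. The function name `validity_scores` and the toy ratings are hypothetical.

```python
from collections import defaultdict


def validity_scores(ratings, threshold=3):
    """Percentage of ratings above the threshold, per commonsense aspect.

    ratings is a list of (aspect, score) pairs with score on a 1-5 scale.
    """
    counts = defaultdict(lambda: [0, 0])  # aspect -> [valid, total]
    for aspect, score in ratings:
        counts[aspect][0] += int(score > threshold)
        counts[aspect][1] += 1
    return {
        aspect: 100.0 * valid / total
        for aspect, (valid, total) in counts.items()
    }


toy_ratings = [
    ("intention", 5),
    ("intention", 3),  # exactly 3 is not "greater than 3", so not valid
    ("effect", 4),
    ("attribute", 2),
]
scores = validity_scores(toy_ratings)
print(scores)
```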

0.d.3 Statistics of Human Evaluations

In order to further analyze the human evaluations on our generated outputs, we use three metrics - standard deviation of the ratings, inter-rater agreement score (IRAS) and a smooth version of IRAS. Standard Deviation was calculated per sample based on the evaluations provided by multiple workers on each sample. We do so to evaluate how consistent our AMT workers are and how much they deviate or agree with each other. We use three different metrics so as to analyze our data and generations through multiple lenses, to be certain that the outputs and annotations are high-quality.

Figure 11: V2C-Completion task using the AttEncDec model.
Figure 12: V2C-Completion task using our V2C-Transformer model.
Figure 13: V2C-Completion task using the AttEncDec model.
Figure 14: V2C-Generation task using our V2C-Transformer model.
Figure 15: Standard deviation histograms of human ratings across models and split (From left to right: Intention, Attribute, Effect). X-axis denotes standard deviation value and Y-axis denotes percentage of test set samples.

Inter-Rater Agreement Score is computed as the average of the percentage of raters for each sample that agree with the majority opinion. Let $R_i$ be the set of ratings for test sample $i$. Let $N$ be the size of the test-set.

Then the mode $m_i$ is defined as the most frequently occurring (majority) rating in the set of ratings $R_i$:

$$m_i = \arg\max_{r} \sum_{r' \in R_i} \mathbb{1}(r' = r)$$

Inter-Rater Agreement Score is the average percentage of raters that agree with the majority opinion $m_i$:

$$\mathrm{IRAS} = \frac{100}{N} \sum_{i=1}^{N} \frac{1}{|R_i|} \sum_{r \in R_i} \mathbb{1}(r = m_i)$$

where $\mathbb{1}(\cdot)$ is the indicator function.
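The definition above translates directly into a few lines of code. The following sketch computes the score for a toy set of ratings; the function name `iras` and the example ratings are hypothetical.

```python
from collections import Counter


def iras(ratings_per_sample):
    """Average percentage of raters that agree with the majority rating."""
    total = 0.0
    for ratings in ratings_per_sample:
        # most_common(1) yields the majority rating and how often it occurs.
        _, majority_count = Counter(ratings).most_common(1)[0]
        total += majority_count / len(ratings)
    return 100.0 * total / len(ratings_per_sample)


toy = [[5, 5, 4], [1, 5, 5], [3, 3, 3]]
score = iras(toy)
print(round(score, 2))
```

In the first two toy samples two of three raters agree with the majority, and in the third all raters agree, so the score is the mean of 2/3, 2/3, and 1, expressed as a percentage. When several ratings tie for the majority, the choice among them does not change this score, because only the count of the majority rating enters the sum.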

Smooth Inter-Rater Agreement Score While IRAS is a good measure of rater agreement on our dataset, it has a flaw. The indicator function ignores how far a rating is from the majority: it returns 0 for the tuple of ratings (4, 5) as well as for (1, 5), although the ratings 4 and 5 are close to each other while 1 and 5 are opposite. To avoid this, we replace the indicator function with a smooth exponential term. The smooth inter-rater agreement score is given by:
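A minimal sketch of such a smoothed score is given below. It assumes that the exponential term is exp(-|r - m|), where m is the majority rating of the sample. This specific kernel is an assumption made for illustration, and other decaying functions of the distance would fit the description equally well. The function name `smooth_iras` is hypothetical.

```python
import math
from collections import Counter


def smooth_iras(ratings_per_sample):
    """IRAS with the indicator replaced by exp(-|r - m|) (assumed kernel)."""
    total = 0.0
    for ratings in ratings_per_sample:
        # Majority rating; ties are broken by first occurrence in the list.
        majority = Counter(ratings).most_common(1)[0][0]
        total += sum(math.exp(-abs(r - majority)) for r in ratings) / len(ratings)
    return 100.0 * total / len(ratings_per_sample)


close = smooth_iras([[4, 5, 5]])  # one rater is one point away from the majority
far = smooth_iras([[1, 5, 5]])    # one rater is four points away from the majority
print(round(close, 2), round(far, 2))
```

Under the plain indicator both toy samples would score the same, while the smoothed version gives the near-agreement case a higher score. Unlike the plain score, this variant can depend on how ties for the majority rating are broken.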


0.d.3.1 Results

Type | Std. Dev (%) AttEncDec | Std. Dev (%) V2C-Transformer | IRAS (%) AttEncDec | IRAS (%) V2C-Transformer | smooth-IRAS (%) AttEncDec | smooth-IRAS (%) V2C-Transformer

Intention | 17.99 | 15.02 | 56.02 | 59.80 | 69.43 | 73.36
Effect | 19.63 | 18.39 | 58.03 | 56.76 | 69.28 | 69.47
Attribute | 10.54 | 9.74 | 69.06 | 71.28 | 80.24 | 81.83
Average | 16.05 | 14.38 | 61.04 | 62.61 | 72.98 | 74.89

Intention | 17.60 | 16.27 | 57.84 | 58.47 | 70.66 | 72.10
Effect | 18.54 | 17.56 | 56.69 | 57.40 | 69.54 | 70.21
Attribute | 15.42 | 13.16 | 59.80 | 62.25 | 73.51 | 76.12
Average | 17.19 | 15.66 | 58.11 | 59.37 | 71.24 | 72.81
Table 8: A comparison of the statistics of human evaluation scores for both tasks using the baseline model (AttEncDec) vs. our model (V2C-Transformer).

Table 8 shows our analysis in terms of the three metrics described above. Our V2C-Transformer architecture outperforms the baseline model AttEncDec [gao2017video] on all three metrics for nearly every type of commonsense; the one exception is IRAS for Effect in the first block of rows (56.76 vs. 58.03). This means that raters are generally more consistent with their ratings (in terms of deviation or agreement) when it comes to commonsense descriptions generated by our model.

Figure 16: Snapshot of our AMT human evaluation interface for V2C-completion task.
Figure 17: Snapshot of our AMT human evaluation interface for V2C-generation task.