Humans possess basic commonsense knowledge about causality in everyday life. For example, if we leave five minutes late, we will be late for the bus; if the sun is out, it is not likely to rain; and if we are hungry, we need to eat. Such causality knowledge has been shown to be helpful for many NLP tasks [1, 2, 3]. Thus, it is valuable to teach machines to understand causality.
Causal relations in the commonsense domain are typically contributory and contextual. By contributory, we mean that the cause is neither necessary nor sufficient for the effect, but it strongly contributes to the effect. (The other two levels are absolute causality, where the cause is necessary and sufficient for the effect, and conditional causality, where the cause is necessary but not sufficient for the effect; both commonly appear in the scientific domain rather than in daily life.) By contextual, we mean that some causal relations only make sense in a certain context. The contextual property of causal relations is important for both the acquisition and application of causal knowledge. For example, if some people tell an AI assistant (e.g., Siri) that they are hungry during a meeting, a basic assistant may suggest that they order food because it knows that ‘being hungry’ causes ‘eat food’. A better assistant may suggest ordering food after the meeting because it knows that the causal relation between ‘being hungry’ and ‘eat food’ may not be plausible in the meeting context. Without understanding the contextual property of causal knowledge, achieving such a level of intelligence would be challenging.
To help machines better understand commonsense causality, many efforts have been devoted to developing causality knowledge bases. For example, ConceptNet [6] and ATOMIC [7] leverage human annotation to acquire small-scale but high-quality causality knowledge. Later work leverages linguistic patterns (e.g., two events connected with “Because”) [8, 9, 10] to acquire causality knowledge from textual corpora. However, causality knowledge, especially knowledge that is trivial for humans, is rarely formally expressed in documents, so a purely text-based approach might struggle to cover all causality knowledge. Besides that, none of these approaches takes the aforementioned contextual property of causal knowledge into consideration, which may restrict their usage in downstream tasks.
In this paper, we propose to ground causality knowledge in the real world and explore the possibility of acquiring causality knowledge from visual signals (i.e., images in time sequence, which are cropped from videos). Doing so has three major advantages: (1) Videos can be easily acquired and can cover rich commonsense knowledge that may not be mentioned in textual corpora; (2) Events contained in videos are naturally ordered by time. As prior work has discussed, there exists a strong correlation between temporal and causal relations, and thus such time-consecutive images can become a dense causality knowledge resource; (3) Objects from the visual signals can act as the context for detected causality knowledge, which can remedy the lack of contextual information in existing approaches.
To be more specific, we first define the task of mining causality knowledge from time-consecutive images and propose a high-quality dataset (Vis-Causal). To study the contextual property of causal relations, for each pair of events, we provide two kinds of causality annotations: the causality given a certain context and the causality without context. Distribution analysis and case studies are conducted to analyze the contextual property of causality. An example from Vis-Causal is shown in Figure 1, where the causal relation between “dog is running” and “blowing leaves” only makes sense when the context is provided: because the dog is running on the leaves, its high speed and quickly moving paws cause the leaves to blow around. Without the context “leaves on the ground”, this causal relation is implausible. After that, we propose a Vision-Contextual Causal (VCC) model, which effectively leverages both the pre-trained textual representation and the visual context to acquire causality knowledge and can be used as a baseline method for future work. Experimental results demonstrate that even though the task is still challenging, by jointly leveraging the visual and contextual representations, the proposed model can better identify meaningful causal relations from time-consecutive images. To summarize, the contributions of this paper are three-fold: (1) We formally define the task of mining contextual causality from the visual signal; (2) We present a high-quality dataset, Vis-Causal; (3) We propose a Vision-Contextual Causal (VCC) model to demonstrate the possibility of mining contextual causality from the visual signal.
The rest of the paper is organized as follows. In Section 2, we formally define the task of learning contextual causality from the visual signal. After that, we present the construction details of Vis-Causal in Section 3. In Sections 4 and 5, we present the details of the proposed VCC model and the experiments. Finally, we introduce related work in Section 6 and conclude the paper in Section 7.
2 The Task Definition
As discussed in the introduction, the ultimate goal of this work is to acquire contextual causality knowledge from videos. However, as current models cannot afford to process videos directly, we simplify the task into mining causality knowledge from time-consecutive images, which are cropped from the video. Thus, we formally define the task as follows. Each image pair $P \in \mathcal{P}$, where $\mathcal{P}$ is the overall image pair set, consists of two images $I_1$ and $I_2$, sampled from the same video $V$, in temporal order (i.e., $I_1$ appears before $I_2$). For each $P$, our goal is to identify all possible causal relations between the contained images. Normally, this task contains two sub-tasks: identifying events in images and identifying causal relations between the contained events. As there is a large overlap between the event identification task and the scene graph generation task in the computer vision (CV) community, which has been extensively studied [12, 13], in this work we focus on the second sub-task. We denote the set of events contained in $I_1$ as $\mathcal{E}_1$ and the set of events contained in all images sampled from $V$ as $\mathcal{E}_V$. For each event $e_1 \in \mathcal{E}_1$, our goal is to find all events $e_2 \in \mathcal{E}_V$ such that $e_1$ causes $e_2$.
3 The Vis-Causal Dataset
In this section, we introduce the details of the creation of Vis-Causal, which was carried out in four steps: (1) Pre-processing the raw video data into frames for further annotation; (2) Identifying the contained events from the frames; (3) First-round causal annotation, which requires annotators to write down events that can be caused by the given event; (4) Second-round causal annotation, which refines the quality of the annotation from step three via more fine-grained annotation under two settings (one with the visual context and the other without any context). We select Amazon MTurk (https://www.mturk.com/) as the annotation platform and show survey examples in the appendix. We elaborate on each step as follows.
3.1 Data Source
In order to acquire broad coverage of daily-life causality, we chose ActivityNet [14], which contains short videos from YouTube, as the video resource. We randomly selected 1,000 videos. For each video, we took five uniformly sampled screenshots and paired adjacent screenshots as time-consecutive images to better capture chained events. As a result, we collected 4,000 image pairs (four pairs per video).
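The sampling and pairing step can be sketched as follows. This is a minimal illustration rather than the authors' pre-processing code: the function names are hypothetical, and taking the centre of each of five equal segments is only one assumed reading of “uniformly sampled”.

```python
from typing import List, Tuple


def uniform_frame_indices(num_frames: int, num_samples: int = 5) -> List[int]:
    """Pick `num_samples` frame indices spread evenly over a video.

    Assumption: each index is the centre of one of `num_samples` equal segments.
    """
    if num_frames < num_samples:
        raise ValueError("video is shorter than the number of requested samples")
    segment = num_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def adjacent_pairs(indices: List[int]) -> List[Tuple[int, int]]:
    """Pair each sampled frame with the next one, preserving temporal order."""
    return list(zip(indices, indices[1:]))


# A hypothetical 300-frame video yields five screenshots and four pairs.
indices = uniform_frame_indices(num_frames=300)
pairs = adjacent_pairs(indices)
```

Five screenshots give four adjacent pairs per video, which matches the reported 4,000 pairs for 1,000 videos.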
3.2 Event Identification
In the first step, for each image pair, we invited three annotators to write down any events they could identify in the first image. Clear instructions and examples were given to help annotators understand the definition of events and our task. This resulted in 12,000 events in total for the 4,000 image pairs.
3.3 First-round Causal Annotation
After identifying events, we invited annotators to identify related event pairs, which are used as candidates for the next-step causal relation annotation. For each pair of time-consecutive images, we selected all three identified events in the first image, and for each of them, we asked annotators to describe one event that happens in the second image and is caused by the selected event in the first image. If no suitable event was found, which is possible due to the sparseness of causal relations, the annotators could choose ‘None’ as the answer. For each question, we invited three different annotators to provide annotations. After filtering out answers that contain ‘None’ or have fewer than two words, we obtained 23,558 event pair candidates. On average, we kept 5.89 candidate event pairs for each image pair.
3.4 Second-round Causal Annotation
Crowd-workers, unlike experts, are less likely to distinguish inference from causality. For example, ‘A cat is running’ implies ‘An animal is running’, but ‘A cat is running’ does not cause ‘An animal is running’. Based on our observation, such relations are often mistakenly annotated as causal relations in the first-round annotation. To remedy this problem, in the second round of annotation, we first present a clear definition and several examples of all possible relations (e.g., ‘Inference’ and ‘Causality’) and then ask annotators to select the most plausible relation for each candidate event pair; this more fine-grained procedure yields better annotation quality. Besides that, to investigate the contextual property of causality knowledge, two settings are considered for the annotation (one with the context and one without). For each setting, we invite five annotators to annotate each event pair. Following previous work [15, 16], we employ Inter Annotator Agreement (IAA), which computes the average agreement of an annotator with the average of all other annotators, to evaluate the overall annotation quality. We achieve 78% and 76% IAA scores for the “with context” and “without context” settings, respectively. The agreement is slightly lower in the “without context” setting. One possible explanation is that when no context is provided, different people may think about different contexts, and thus their annotations of causal relations could differ slightly.
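The exact IAA formula is not spelled out here, so the following is only one plausible reading for categorical labels: each annotator's agreement on an item is the fraction of the other annotators who chose the same label, and the score is the average over all annotators and items. The function name and the label strings are hypothetical.

```python
from typing import List


def inter_annotator_agreement(annotations: List[List[str]]) -> float:
    """Average agreement of each annotator with all other annotators.

    annotations[i][j] is the label that annotator j assigned to item i.
    """
    scores = []
    for labels in annotations:
        for j, label in enumerate(labels):
            others = labels[:j] + labels[j + 1:]
            if not others:
                continue  # a single annotator has nobody to agree with
            scores.append(sum(other == label for other in others) / len(others))
    if not scores:
        raise ValueError("need at least one item with two or more annotators")
    return sum(scores) / len(scores)


# Two toy items, five annotators each.
toy_annotations = [
    ["causal", "causal", "causal", "causal", "causal"],
    ["causal", "causal", "causal", "causal", "inference"],
]
iaa = inter_annotator_agreement(toy_annotations)
```

On the toy data, the first item contributes full agreement for every annotator, while the dissenting vote on the second item lowers the average.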
3.5 Annotation Analysis
The distributions of annotation results for the two settings are shown in Figure 2(a) and Figure 2(b), respectively. For each pair of events, we compute the plausibility based on voting. For example, if four out of five annotators vote “causal”, its causal plausibility is 0.8. In general, we can see that for both settings, the majority of the candidate event pairs have weak causal relations and only a small portion of the candidates have strong causal relations, especially in the “with context” setting. One possible explanation is that when no context is provided, humans can think about multiple contexts and find the most suitable one such that the causal relation is plausible in that scenario. However, when the visual context is provided, the scenario is fixed, and humans do not have the freedom to choose a suitable scenario by themselves. As a result, annotators tend to annotate more plausible causal relations in the “without context” setting.
To investigate the contextual property of causal relations, we show the distribution of the plausibility difference (“with context” minus “without context”) in Figure 2(c). From the result, we can observe that about 6% of event pairs, indicated with the dashed box, have stronger causal relations without any context, while about 1% of event pairs, indicated with the solid box, have stronger causal relations when the visual context is provided. Examples of both cases are shown in Figure 2(d) and 2(e), respectively.
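The voting-based plausibility and the plausibility difference described above can be written down directly. The sketch below assumes each vote is stored as a label string and that `"causal"` marks a causal vote; both the names and the label format are illustrative.

```python
from typing import List


def causal_plausibility(votes: List[str]) -> float:
    """Fraction of annotators who voted 'causal' for one event pair."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(vote == "causal" for vote in votes) / len(votes)


def plausibility_difference(with_context: List[str], without_context: List[str]) -> float:
    """Plausibility 'with context' minus plausibility 'without context'."""
    return causal_plausibility(with_context) - causal_plausibility(without_context)


# Toy votes for one event pair whose causal reading depends on the visual context.
with_ctx = ["causal", "causal", "causal", "causal", "other"]
without_ctx = ["causal", "other", "other", "other", "other"]
difference = plausibility_difference(with_ctx, without_ctx)
```

A positive difference corresponds to a pair that is more plausibly causal once the visual context is shown; a negative one corresponds to a pair that annotators accept more readily without context.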
3.6 Dataset Splitting and Statistics
We split the dataset into train, dev, and test sets based on the original split of ActivityNet [14] and collect 800, 100, and 100 videos for the train, dev, and test sets, respectively. We select positive causal relations based on the annotation under the “with context” setting. If at least four of five annotators think there exists a causal relation between a pair of events given the context, we treat it as a positive example. As a result, we obtained 2,599, 329, and 282 positive causal pairs for the train, dev, and test sets, respectively. On average, each event pair contains 11.41 words and the total vocabulary size is 10,566. We summarize the detailed dataset statistics in Table 1.
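The positive-example rule is a simple vote threshold. A minimal sketch, assuming votes are stored as label strings and using a hypothetical helper name:

```python
from typing import List


def is_positive_pair(votes_with_context: List[str], min_causal_votes: int = 4) -> bool:
    """Treat an event pair as positive if enough annotators voted 'causal' given the context."""
    causal_votes = sum(vote == "causal" for vote in votes_with_context)
    return causal_votes >= min_causal_votes


four_of_five = is_positive_pair(["causal", "causal", "causal", "causal", "other"])
three_of_five = is_positive_pair(["causal", "causal", "causal", "other", "other"])
```

With five annotators, this threshold corresponds to a causal plausibility of at least 0.8 under the “with context” setting.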
4 The VCC Model
In this section, we introduce the proposed Vision-Contextual Causal (VCC) model, which leverages both the visual context and the contextual representation of events to predict causal relations. The overall framework is shown in Figure 3. In total, we have three major components: event encoding, which encodes the two events into vectors for further prediction; visual context encoding, which encodes the context frame such that the context can be utilized in the model; and cross attention, which aims at finding the best context and event representation via the attention mechanism. The details of these components are introduced as follows.
4.1 Textual Event Encoding
As both events $e_1$ and $e_2$ are represented with natural language, we begin by converting them into vector representations. In this work, we leverage a pre-trained language representation model, BERT [18], to encode all events. Assuming that after tokenization, event $e$ contains $n$ tokens $w_1, \ldots, w_n$, we denote their contextualized representations after BERT as $\mathbf{w}_1, \ldots, \mathbf{w}_n$.
4.2 Visual Context Encoding
Following the common practice in multi-modal approaches [11, 19], we first leverage an object detection module to detect objects from images and use all extracted objects to represent the visual context. Assume that for the two images $I_1$ and $I_2$, we extract $m_1$ and $m_2$ objects, respectively. After combining all objects from the two images and sorting them based on the confidence score provided by the object detection module, we keep the top $k$ objects and denote them as $o_1, \ldots, o_k$. The motivation for this operation is to reduce the influence of noise introduced by the object detection module. As all objects are in the form of words, to align with events, we use the same pre-trained language representation model to extract the vector representations of the selected objects and denote them as $\mathbf{o}_1, \ldots, \mathbf{o}_k$. (If an object word is tokenized into multiple tokens, we take their average representation as the object representation.)
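The object-selection step can be illustrated with a short sketch. It assumes each detection is a (label, confidence) pair, which is a simplification of real detector output, and it does not merge duplicate labels because the text does not say whether duplicates are removed.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (object label, detector confidence); illustrative format


def select_context_objects(
    first_image: List[Detection],
    second_image: List[Detection],
    k: int = 10,
) -> List[str]:
    """Merge detections from both images and keep the k most confident labels."""
    merged = first_image + second_image
    ranked = sorted(merged, key=lambda detection: detection[1], reverse=True)
    return [label for label, _ in ranked[:k]]


# Toy detections; the default k=10 matches the setting reported in Section 5.3.
detections_1 = [("dog", 0.95), ("tree", 0.40)]
detections_2 = [("leaf", 0.80), ("dog", 0.90)]
context_objects = select_context_objects(detections_1, detections_2, k=3)
```

Low-confidence detections such as the toy "tree" are dropped, which is the noise-reduction effect the paragraph describes.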
4.3 Cross-Attention Module
The purpose of the cross-attention module is to minimize the influence of noise by using the events to select important context objects and using the context to select informative tokens in the events. Thus, the cross-attention module contains two sub-steps: (1) context representation; (2) event representation.
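Context Representation: For each event $e$, whose tokens' vector representations are $\mathbf{w}_1, \ldots, \mathbf{w}_n$, we first take the average of all tokens and denote the resulting average vector as $\hat{\mathbf{e}}$. As the vector representations of all selected objects are denoted as $\mathbf{o}_1, \ldots, \mathbf{o}_k$, we compute the overall context representation as:
$$\mathbf{o}_e = \sum_{i=1}^{k} a_{\hat{\mathbf{e}}, \mathbf{o}_i} \cdot \mathbf{o}_i,$$
where $a_{\hat{\mathbf{e}}, \mathbf{o}_i}$ is the attention weight of $\hat{\mathbf{e}}$ on object $\mathbf{o}_i$. Here we compute the attention weight as:
$$a_{\hat{\mathbf{e}}, \mathbf{o}_i} = \mathrm{NN}_a([\hat{\mathbf{e}}; \mathbf{o}_i]),$$
where $\mathrm{NN}_a$ is a standard two-layer feed-forward neural network and $[\cdot\,;\cdot]$ indicates concatenation.
Event Representation: After getting the context representation, the next step is computing the event representation. Assuming that the vector set of $e$ is $\mathbf{w}_1, \ldots, \mathbf{w}_n$, we can get the event representation with a similar attention structure:
$$\mathbf{e} = \sum_{j=1}^{n} b_{\mathbf{o}_e, \mathbf{w}_j} \cdot \mathbf{w}_j,$$
where $b_{\mathbf{o}_e, \mathbf{w}_j}$ is the attention weight of the context representation $\mathbf{o}_e$ on token $\mathbf{w}_j$, computed with another feed-forward neural network.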
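The two attention steps can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: the weights are random and untrained, the scorer omits bias terms, the tanh activation is an assumption, and the softmax normalization of the attention weights is also an assumption, since the text does not state how the weights are normalized. All class and function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


class TwoLayerScorer:
    """Two-layer feed-forward scorer mapping a concatenated pair to one scalar."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random) -> None:
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]

    def __call__(self, query: Vector, key: Vector) -> float:
        x = query + key  # list concatenation plays the role of [query; key]
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in self.w1]
        return sum(w * h for w, h in zip(self.w2, hidden))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def attend(query: Vector, items: List[Vector], scorer: TwoLayerScorer) -> Vector:
    """Weighted sum of `items`, with weights from scoring each item against `query`."""
    weights = softmax([scorer(query, item) for item in items])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(len(query))]


def cross_attention(
    tokens: List[Vector],
    objects: List[Vector],
    context_scorer: TwoLayerScorer,
    event_scorer: TwoLayerScorer,
) -> Tuple[Vector, Vector]:
    event_average = mean_vector(tokens)                        # averaged token vector
    context = attend(event_average, objects, context_scorer)   # context representation
    event = attend(context, tokens, event_scorer)              # event representation
    return context, event


rng = random.Random(0)
dim = 4
tokens = [[rng.random() for _ in range(dim)] for _ in range(3)]
objects = [[rng.random() for _ in range(dim)] for _ in range(5)]
context_scorer = TwoLayerScorer(2 * dim, 8, rng)
event_scorer = TwoLayerScorer(2 * dim, 8, rng)
context, event = cross_attention(tokens, objects, context_scorer, event_scorer)
```

Because the assumed softmax weights are positive and sum to one, the context vector is a convex combination of the object vectors, and the event vector is a convex combination of the token vectors.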
4.4 Causality Prediction
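Assuming that the context representations with events $e_1$ and $e_2$ as the attention signal are denoted as $\mathbf{o}_{e_1}$ and $\mathbf{o}_{e_2}$, respectively, and the overall representations of $e_1$ and $e_2$ are $\mathbf{e}_1$ and $\mathbf{e}_2$, we can then predict the final causality score from their concatenation as follows:
$$F(e_1, e_2) = \mathrm{NN}_c([\mathbf{e}_1; \mathbf{e}_2; \mathbf{o}_{e_1}; \mathbf{o}_{e_2}]),$$
where $\mathrm{NN}_c$ is a feed-forward neural network.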
5 The Experiment
In this section, we present experiments and analysis to show that both the pre-trained textual representation and visual context can help learn causality knowledge from time-consecutive images.
5.1 Evaluation Metric
As each event in the first image could cause multiple events in the second image, following previous work, we evaluate different causality extraction models with a ranking-based evaluation metric. Given each event $e_1$ in the first image, models are required to rank all candidate events based on how likely they are to be caused by $e_1$. We then evaluate different models based on whether the correct caused event is covered by the top one, five, or ten ranked events. We denote these evaluation metrics as Recall@1, Recall@5, and Recall@10. In our experiment, all detected events in the same video are considered as negative examples.
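The ranking metric can be sketched as follows. The sketch assumes one query per (cause, correct effect) pair and represents a model's output as a dictionary from candidate event to score; both choices, like the function names, are illustrative.

```python
from typing import Dict, List, Tuple


def rank_candidates(scores: Dict[str, float]) -> List[str]:
    """Order candidate events from most to least likely to be caused."""
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(queries: List[Tuple[Dict[str, float], str]], k: int) -> float:
    """Fraction of queries whose correct effect appears among the top-k candidates."""
    if not queries:
        raise ValueError("at least one query is required")
    hits = sum(gold in rank_candidates(scores)[:k] for scores, gold in queries)
    return hits / len(queries)


# Two toy queries with made-up scores.
toy_queries = [
    ({"ground becomes wet": 0.9, "car changes color": 0.5, "man sees the car": 0.2},
     "ground becomes wet"),
    ({"leaves blow around": 0.3, "dog sits down": 0.7, "man waves": 0.1},
     "leaves blow around"),
]
recall_1 = recall_at_k(toy_queries, k=1)
recall_2 = recall_at_k(toy_queries, k=2)
```

In the second toy query the correct effect is ranked second, so it counts as a miss at k=1 and as a hit at k=2.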
5.2 Baseline Methods
To show that the context is crucial and the proposed cross-attention module is helpful, we compare VCC with the following models:
No Visual Context: Directly predicts the causal relation between events without considering the visual context. For each event, we take the average of word representations as the event representation and concatenate the representations of two events together for the final prediction.
No Attention: Removes the cross-attention module and uses the average word embeddings of all selected objects to represent the context.
ResNet as Context: Removes the object detection module and uses the average image representation extracted by ResNet-152 [20] as the context representation.
Besides the aforementioned baselines, we also present the performance of a “random guess” baseline, which randomly ranks all candidate events and can be used as a performance lower-bound for all causality extraction models.
| No Visual Context | R@1 | 6.76 | 9.09 | 8.47 | 0.00 | 9.09 | 7.45 |
| ResNet as Context | R@1 | 7.43 | 9.09 | 1.69 | 0.00 | 9.09 | 6.38 |
| The proposed VCC Model | R@1 | 8.78 | 7.27 | 6.78 | 11.11 | 27.27 | 8.87 |
5.3 Implementation Details
Model Details: We use BERT [18] as the textual representation model to encode both the events $e_1$ and $e_2$ and the objects detected from images. Following previous scene graph generation work, we leverage a Faster R-CNN network [21], which is trained on MS-COCO, to detect objects from the images.
We set the hidden state size in the feed-forward neural network to be 200 and the number of selected objects to be 10. The total number of trainable parameters is 109.9 million (including 109.48 million from BERT-base).
During the training phase, for each positive example, we randomly select one negative example and use cross-entropy as the loss function. We employ stochastic gradient descent (SGD) as the optimizer. All parameters are initialized randomly and the learning rate is set to be […]. All models are trained for up to ten epochs (all models converge before ten epochs), and the models that perform best on the dev set are evaluated on the test set. The experiments are run on an Intel(R) Xeon(R) E5 CPU and one GTX-1080 GPU. Each training epoch takes 18 minutes on average.
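The one-negative-per-positive training signal can be sketched as below. Two points are assumptions rather than statements from the paper: that the negative is drawn from other events of the same video, and that the cross-entropy is the binary form applied to a sigmoid of the causality score. The function names are hypothetical.

```python
import math
import random
from typing import List


def sample_negative(positive_effect: str, video_events: List[str], rng: random.Random) -> str:
    """Draw one event from the same video that is not the annotated effect."""
    candidates = [event for event in video_events if event != positive_effect]
    if not candidates:
        raise ValueError("no negative candidate available for this video")
    return rng.choice(candidates)


def binary_cross_entropy(score: float, label: int) -> float:
    """Cross-entropy of a raw causality score against a 0/1 label."""
    probability = 1.0 / (1.0 + math.exp(-score))
    return -math.log(probability) if label == 1 else -math.log(1.0 - probability)


rng = random.Random(0)
video_events = ["ground becomes wet", "car changes color", "man sees the car"]
negative = sample_negative("ground becomes wet", video_events, rng)

# Loss for one positive and one sampled negative, with made-up scores.
pair_loss = binary_cross_entropy(2.0, 1) + binary_cross_entropy(-1.0, 0)
```

Pairing each positive with a single sampled negative keeps the two classes balanced within every training step.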
5.4 Result Analysis
We report the performance of all models on all categories and show the results in Table 2, from which we can make the following observations:
All models significantly outperform the “random guess” baseline in almost all settings, which shows that models can learn to extract meaningful causality knowledge from these time-consecutive images and that learning causality knowledge from the visual signal can be a good supplement to the current text-based approaches to acquiring causality knowledge.
With the help of the context information, VCC outperforms the “No Visual Context” baseline in most experiment settings, which supports the importance of context and is consistent with our previous observation that some causal relations only make sense in certain contexts.
The proposed VCC model outperforms the “No Attention” model significantly, which demonstrates the influence of noise introduced by the object detection module and the effectiveness of the proposed cross-attention module.
The proposed VCC model, which first extracts objects from images and then uses the extracted objects as the context, outperforms the “ResNet as Context” baseline, which directly uses the ResNet-encoded image representation as the context, even though the latter does not suffer from the noise introduced by the object detection module at all. One possible explanation is that the event textual representation and the ResNet-encoded image representation are vectors in different semantic spaces (one in language and one in vision), which may not perfectly align with each other. By comparison, in the proposed model, we first extract objects from images and then encode them as text, so both representations lie in the same space and the alignment issue is avoided.
In general, even though the proposed model outperforms all baseline methods and can learn to extract meaningful causality knowledge from the training data, the task is still challenging for the following reasons: (1) Missing information about the visual context: the performance of the current object detection module is not good enough, and many important objects related to the two events could be missing; (2) Correct resolution of pronouns: pronouns appear frequently in events, and without the correct resolution of those pronouns, it is hard to fully understand the semantic meaning of events; (3) Lack of support from external knowledge, especially commonsense knowledge: the proposed dataset is a useful test bed for evaluating models' abilities to understand causality, but its scale is not large enough to cover all scenarios. It is important to include more knowledge from other resources.
5.5 Can Language Representation Models Understand Causality?
As observed in [23], language representation models can preserve rich knowledge. In this subsection, we conduct experiments to investigate whether pre-trained language representation models (i.e., BERT [18] and GPT-2 [24]) can understand causality without any training. For each candidate event pair (e.g., (“A dog is running”, “The leaves blow around”)), we convert it into a natural sentence (e.g., “A dog is running, so the leaves blow around”) and then input it into the language representation model. The overall probability returned by a model is used as its causality prediction, where a higher probability indicates a higher predicted plausibility. We rank all candidates by their probabilities and evaluate the models in the same way as in the previous experiments. We conduct experiments on BERT-base (we also tried BERT-large, but it does not bring a significant improvement over the base model) and GPT-2 (774 million parameters) with the Hugging Face implementation (https://github.com/huggingface/transformers). Besides the unsupervised approaches, we also present the performance of replacing the language representation module in VCC with different pre-trained models.
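The template conversion and ranking can be sketched independently of any particular language model. In the sketch below, `sentence_log_prob` is a hypothetical callable standing in for whatever probability a pre-trained model assigns to a sentence; the toy scorer used here is only a placeholder and carries no linguistic meaning. Lower-casing the first character of the effect is a simplification that would be wrong for proper nouns.

```python
from typing import Callable, List


def to_causal_sentence(cause: str, effect: str) -> str:
    """Join two event descriptions with the connective 'so'."""
    cause = cause.rstrip(".")
    if effect:
        effect = effect[0].lower() + effect[1:]
    return f"{cause}, so {effect}"


def rank_effects(
    cause: str,
    candidates: List[str],
    sentence_log_prob: Callable[[str], float],
) -> List[str]:
    """Rank candidate effects by the score of the templated sentence, best first."""
    return sorted(
        candidates,
        key=lambda effect: sentence_log_prob(to_causal_sentence(cause, effect)),
        reverse=True,
    )


def toy_log_prob(sentence: str) -> float:
    """Placeholder scorer (shorter sentences score higher); NOT a language model."""
    return -float(len(sentence))


sentence = to_causal_sentence("A dog is running", "The leaves blow around")
ranking = rank_effects(
    "A dog is running",
    ["The leaves blow around", "The dog barks"],
    toy_log_prob,
)
```

Keeping the scorer as a parameter lets the same ranking code be reused with different models, with the resulting ranking passed to the Recall@N evaluation.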
From the experimental results in Table 3, we can see that, compared with the “random guess” baseline, unsupervised BERT and GPT-2 achieve only slightly better performance. A likely reason is that even though these pre-trained language representation models contain rich semantics about events, they can only distinguish which two events are more relevant rather than identify the causality between them. As all negative examples are also selected from the same video, which makes them very relevant to the target event, the task becomes too challenging for unsupervised models. However, if we incorporate them into the VCC model and further train them, they achieve much better performance, which suggests that with further training, the model learns to make better use of the rich semantics these representations contain.
5.6 Case Study
To further analyze the successes and limitations of the proposed VCC model, we present a case study in Figure 4, where predictions are sorted by their prediction scores. We can see that VCC successfully predicts that “wash car” can cause “ground becomes wet” and “car is cleaned”, but it also makes mistakes. For example, based on these two images, humans know that the car does not change color, but VCC may mistakenly connect the event “change color” with “wash” based on some other training examples. In this case, using a few objects to represent images may not be enough to cover all the visual information. Besides that, VCC also predicts “Man sees the car”, which indeed happens in the second image but is not caused by “wash”. Understanding this might require inference over external knowledge. How to leverage external knowledge to better understand the visual signal and thus acquire more accurate causal knowledge is left for future investigation.
6 Related Works
In this section, we introduce related work about causality acquisition and visually-grounded NLP.
6.1 Causality Acquisition
As crucial knowledge for many artificial intelligence (AI) systems, causality has long been an important research topic in many communities with different focuses. For example, in the machine learning community, researchers [4, 25] focus on modeling causality from structured data (e.g., directed acyclic graphs). In contrast, researchers from the computer vision community [26, 27] focus on identifying key objects or events in images that can cause certain decisions. Last but not least, previous work in the natural language processing (NLP) community mostly acquires causality knowledge via either crowd-sourcing [6, 7] or linguistic pattern mining and then applies the acquired knowledge to understanding human language. The ultimate goal of this paper is the same as that of previous NLP work: to acquire causality knowledge that can be stored and used for downstream tasks. The approach, however, is different. To the best of our knowledge, this is the first work exploring the possibility of directly acquiring causality knowledge from the visual signal. Another related work from the CV community is Visual-COPA [28], which asks models to identify whether one image can cause another. The major difference is that our paper tries to extract causality knowledge rather than leverage external knowledge to predict image relations.
6.2 Visually-grounded NLP
As the intersection of computer vision (CV) and natural language processing (NLP), visually-grounded NLP research topics are popular in both communities. For example, image captioning aims at generating captions for images, and scene graph generation tries to detect not just entities but also events or states from images. Besides these basic tasks, some other visually-grounded NLP tasks (e.g., visual question answering (VQA) [30] and visual dialogue [19]) have also been created to test how well models can understand human language and visual signals jointly. Another line of related work is visual commonsense reasoning [31, 32, 33, 34], which aims at either extracting commonsense knowledge from images or evaluating models’ commonsense reasoning abilities over images. Considering that causality often holds between events that appear in temporal order, which are unlikely to appear in the same image, we choose to work on time-consecutive image pairs rather than a single image.
7 Conclusion
In this paper, we explore the possibility of learning causality knowledge from time-consecutive images. To do so, we first formally define the task and then create a high-quality dataset, Vis-Causal, which contains 4,000 image pairs, 23,558 event pairs, and causal relation annotations under two settings. On top of the collected dataset, we propose a Vision-Contextual Causal (VCC) model to demonstrate that with the help of strong pre-trained textual and visual representations and careful training, it is possible to directly acquire contextual causality from visual signals. Further analysis shows that even though VCC can outperform all baseline methods, it is still not perfect. As the visual signal could serve as an important causality knowledge resource, we will keep exploring how to better acquire causal knowledge from the visual signal (e.g., by leveraging external knowledge) in the future. Both the dataset and code will be released to encourage research on causality acquisition.
Acknowledgements
This paper was supported by Early Career Scheme (ECS, No. 26206717), General Research Fund (GRF, No. 16211520), and Research Impact Fund (RIF, No. R6020-19) from the Research Grants Council (RGC) of Hong Kong. This research was also supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the BETTER Program, and by contract FA8750-19-2-1004 with the US Defense Advanced Research Projects Agency (DARPA). The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.
References
[1] Jong-Hoon Oh, Kentaro Torisawa, Chikara Hashimoto, Motoki Sano, Stijn De Saeger, and Kiyonori Ohtake. Why-question answering using intra- and inter-sentential causal relations. In Proceedings of ACL 2013, pages 1733–1743, 2013.
[2] Chikara Hashimoto, Kentaro Torisawa, Julien Kloetzer, Motoki Sano, István Varga, Jong-Hoon Oh, and Yutaka Kidawara. Toward future scenario generation: Extracting event causality exploiting semantic relation, context, and association features. In Proceedings of ACL 2014, pages 987–997, 2014.
[3] Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. Joint reasoning for temporal and causal relations. In Proceedings of ACL 2018, pages 2278–2288, 2018.
[4] Judea Pearl and Dana Mackenzie. The book of why: the new science of cause and effect. Basic Books, 2018.
[5] Mario Bunge. Causality and modern science. Routledge, 2017.
[6] Hugo Liu and Push Singh. ConceptNet - a practical commonsense reasoning tool-kit. BT Technology Journal, 22(4):211–226, 2004.
[7] Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proceedings of AAAI 2019, pages 3027–3035, 2019.
[8] Christopher Hidey and Kathy McKeown. Identifying causal relations using parallel Wikipedia articles. In Proceedings of ACL 2016, 2016.
[9] Zhiyi Luo, Yuchen Sha, Kenny Q. Zhu, Seung-won Hwang, and Zhongyuan Wang. Commonsense causal reasoning between short texts. In Proceedings of KR 2016, pages 421–431, 2016.
[10] Hongming Zhang, Xin Liu, Haojie Pan, Yangqiu Song, and Cane Wing-Ki Leung. ASER: A large-scale eventuality knowledge graph. In Proceedings of WWW 2020, pages 201–211, 2020.
[11] Danfei Xu, Yuke Zhu, Christopher B. Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. In Proceedings of CVPR 2017, pages 3097–3106, 2017.
[12] Jianwei Yang, Jiasen Lu, Stefan Lee, Dhruv Batra, and Devi Parikh. Graph R-CNN for scene graph generation. In Proceedings of ECCV 2018, pages 690–706, 2018.
[13] Yikang Li, Wanli Ouyang, Bolei Zhou, Jianping Shi, Chao Zhang, and Xiaogang Wang. Factorizable net: An efficient subgraph-based framework for scene graph generation. In Proceedings of ECCV 2018, pages 346–363, 2018.
[14] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. ActivityNet: A large-scale video benchmark for human activity understanding. In Proceedings of CVPR 2015, pages 961–970, 2015.
[15] Joseph Reisinger and Raymond J. Mooney. A mixture model with sharing for lexical semantics. In Proceedings of EMNLP 2010, pages 1173–1182, 2010.
[16] Felix Hill, Roi Reichart, and Anna Korhonen. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695, 2015.
[17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NIPS 2017, pages 5998–6008, 2017.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186, 2019.
[19] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of CVPR 2017, pages 1080–1089, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of CVPR 2016, pages 770–778, 2016.
[21] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[23] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP 2019, pages 2463–2473, 2019.
[24] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[25] Bernhard Schölkopf. Causality for machine learning, 2019.
[26] Amy Sue Fire and Song-Chun Zhu. Learning perceptual causality from video. ACM TIST, 7(2):23:1–23:22, 2016.
[27] Kiana Ehsani, Hessam Bagherinezhad, Joseph Redmon, Roozbeh Mottaghi, and Ali Farhadi. Who let the dogs out? Modeling dog behavior from visual data. In Proceedings of CVPR 2018, pages 4051–4060, 2018.
[28] Jinyoung Yeo, Gyeongbok Lee, Gengyu Wang, Seungtaek Choi, Hyunsouk Cho, Reinald Kim Amplayo, and Seung-won Hwang. Visual choice of plausible alternatives: An evaluation of image-based commonsense causal reasoning. In Proceedings of LREC 2018, 2018.
[29] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of ICML 2015, pages 2048–2057, 2015.
[30] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: visual question answering. In Proceedings of ICCV 2015, pages 2425–2433, 2015.
[31] Mark Yatskar, Vicente Ordonez, and Ali Farhadi. Stating the obvious: Extracting visual common sense knowledge. In Proceedings of NAACL-HLT 2016, pages 193–198, 2016.
[32] Shaohua Yang, Qiaozi Gao, Sari Saba-Sadiya, and Joyce Yue Chai. Commonsense justification for action explanation. In Proceedings of EMNLP 2018, pages 2627–2637, 2018.
[33] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. From recognition to cognition: Visual commonsense reasoning. In Proceedings of CVPR 2019, pages 6720–6731, 2019.
[34] Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, and Yejin Choi. VisualCOMET: Reasoning about the dynamic context of a still image. In Proceedings of ECCV 2020, pages 508–524, 2020.