Constructing Hierarchical Q&A Datasets for Video Story Understanding

04/01/2019 · Yu-Jung Heo, et al. · Seoul National University, HONGIK UNIVERSITY, NAVER Corp.

Video understanding is emerging as a new paradigm for studying human-like AI. Question-and-Answering (Q&A) is used as a general benchmark to measure the level of intelligence for video understanding. While several previous studies have proposed datasets for video Q&A tasks, they do not really incorporate story-level understanding, and as a result are highly biased and show little variance in question difficulty. In this paper, we propose a method for building Q&A datasets with hierarchical difficulty levels. We introduce three criteria for video story understanding: memory capacity, logical complexity, and the DIKW (Data-Information-Knowledge-Wisdom) pyramid. We discuss how a three-dimensional map constructed from these criteria can be used as a metric for evaluating the levels of intelligence involved in video story understanding.

Introduction

In narratology, story is often differentiated from discourse: story refers to content (i.e., what to tell), while discourse denotes expression or representation (i.e., how to tell it) [Chatman1978, Genette1980]. While the representation can vary from text and oral storytelling to films, dramas, and virtual environments including virtual reality (VR), understanding a given story shares common key aspects regardless of the representing medium.

According to computational linguists, narrative theorists, and cognitive scientists, narrative understanding is linked with the measurement of the reader's intelligence. For example, readers can understand a story as a form of problem solving in which they focus on how the main characters overcome obstacles throughout the story [Black and Bower1980]. Thus readers, while reading, make inferences both in prospect and in retrospect about what events will occur and how they could occur, considering the causal relationships between different events in the story [Trabasso and Van Den Broek1985, McKoon and Ratcliff1992, Graesser, Singer, and Trabasso1994]. Inferring causal relationships between events is a key element that lets the reader reconstruct a given narrative as a mental model in the reader's mind [Zwaan1999, Zwaan, Langston, and Graesser1995]. This natural human capability of "organizing our experience into narrative form" is known as narrative intelligence [Blair and Meyer1997, Mateas and Sengers1999].

Recently, video stories have served as a testbed of real-world data for constructing human-level AI, for two reasons. First, video data has various modalities, such as sequences of images, audio (including dialogue, sound effects, and background music), and often text (subtitles or added comments). Second, video shows a cross-section of everyday life. Understanding a video story involves analyzing and simulating human vision, language, thinking, and behavior, which is a significant challenge for current machine learning technology.

To measure human-level machine intelligence, we adopt the video Question-and-Answering (video Q&A) task as a proxy for video story understanding. The task can be regarded as a Turing Test for video story understanding [Turing1950]. While several previous studies have proposed datasets for the video Q&A task [Tapaswi et al.2015, Kim et al.2017, Mun et al.2017, Jang et al.2017, Lei et al.2018], these datasets were built without careful consideration of what it means to understand a video story. As a result, previously released video Q&A datasets are highly biased and show little variance in question difficulty. Constructing a Q&A dataset with hierarchical difficulty levels in terms of story understanding is crucial, since people with different perspectives (or different levels of intelligence) understand a given video story differently.

In this paper, we propose three criteria for video story understanding, namely memory capacity, logical complexity, and the DIKW hierarchy, and construct a three-dimensional hierarchical map of video story understanding from these criteria. The constructed map can be related to the developmental stages of human intelligence. We expect that the proposed hierarchical criteria can later be utilized as a metric for evaluating the levels of intelligence involved in video story understanding. Our main contributions are twofold. First, we suggest three criteria for constructing hierarchical video Q&A datasets; these criteria can be used to analyze the quality of a video Q&A dataset in terms of the bias and variance of its difficulty. Second, we link the proposed criteria to neo-Piagetian theory, which helps relate story-enabled intelligence to the stages of human cognitive development.

Related Work

While video understanding is still in its early stages, researchers have proposed video Question-and-Answering (video Q&A) datasets as general benchmarks for measuring video understanding intelligence. The most notable datasets proposed so far are MovieQA [Tapaswi et al.2015], PororoQA [Kim et al.2017], MarioQA [Mun et al.2017], TGIF-QA [Jang et al.2017], and TVQA [Lei et al.2018]. Here, we review these video Q&A datasets and position our contribution.

MovieQA aims to evaluate story understanding from the video and text of movies. Its question-and-answer pairs were collected from annotators who read plot synopses of the movies rather than watching the movies themselves. PororoQA consists of cartoon videos, whose content is simpler and easier to understand than that of MovieQA. MarioQA is likewise based on synthetic videos, constructed automatically from gameplay videos of the popular Mario games; it focuses on understanding temporal relationships between multiple events, and its question-and-answer pairs are generated from extracted events using templates. TGIF-QA focuses only on the visual information in GIF-format clips and limits questions to three types: repetition count, repeating action, and state transition, all of which require spatio-temporal reasoning over the video. TVQA is a large-scale video Q&A dataset based on six popular TV shows spanning sitcoms and medical and crime dramas. Its questions and answers are attached to 60-90 second video clips, and answering them requires both comprehension of subtitle-based dialogue and recognition of relevant visual concepts.

Our work contributes to this line of research; however, instead of introducing a new dataset, we propose new criteria for constructing video Q&A datasets with careful consideration of video story understanding.

Three-dimensional Video Q&A Hierarchy

This section describes three criteria that serve as measures of video story understanding: memory capacity, logical complexity, and the DIKW (Data-Information-Knowledge-Wisdom) hierarchy. These three criteria are combined to construct a three-dimensional video understanding map. Every question in a Q&A dataset is classified into a level under each criterion and is then represented as a point on the three-dimensional map whose axes correspond to the three criteria. Finally, every point on the map is assigned to a cognitive development stage according to Piaget's theory [Piaget1972, Collis1975b]. We explain this process in detail in the following subsections.
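As a concrete illustration of this pipeline, here is a minimal sketch in Python (all names are hypothetical illustrations, not part of a released implementation) of how a question annotated under the three criteria could be represented as a point on the map:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QAAnnotation:
    """A Q&A item annotated with one level per criterion.

    Level ranges follow this section: memory capacity and logical
    complexity take levels 1-5; the DIKW hierarchy takes levels 1-4.
    """
    question: str
    memory_capacity: int     # 1=frame, 2=shot, 3=scene, 4=sequence, 5=entire
    logical_complexity: int  # 1=simple recall, ..., 5=creative thinking
    dikw: int                # 1=data, 2=information, 3=knowledge, 4=wisdom

    def coordinates(self):
        """The question's point on the three-dimensional hierarchy map."""
        return (self.memory_capacity, self.logical_complexity, self.dikw)

# Example annotation, using a question discussed later in the paper:
q = QAAnnotation("What did Joey and Chandler do at Ross's house?", 3, 2, 1)
assert q.coordinates() == (3, 2, 1)
```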

Criterion 1: Memory Capacity

When determining the difficulty of questions collected for a video, the length of video required for reasoning and finding the correct answer is crucial from a machine learning perspective. The longer the video required to answer a question, the more difficult the question can be considered, and vice versa. For example, a question about a short video is much more difficult than a question about a single image frame, and a question about an entire video is much more difficult than a question about one segment. This criterion can also be interpreted as the memory capacity of humans. We define memory capacity as the length of the target video that must be considered to answer a given question. We use the terms defined at each level consistently with the terms in [Zhai and Shah2006]. The levels are as follows (a heuristic classification sketch follows the list).

  • Level 1 (frame): Questions at this level are based on a single video frame. This level has the same difficulty as Visual Question Answering (VQA) datasets [Malinowski, Rohrbach, and Fritz2015, Ren, Kiros, and Zemel2015, Agrawal et al.2017, Zhu et al.2016, Johnson et al.2017, Wang et al.2018].

  • Level 2 (shot): Questions at this level are based on video less than about 10 seconds long without a change of viewpoint. Such video can contain an atomic or functional/meaningful action, and most recent video datasets belong to this level [Jang et al.2017, Maharaj et al.2017, Mun et al.2017]. Questions at this level aim to evaluate understanding of video-specific information that is absent at Level 1 (frame level). One important point is that both atomic actions and meaningful actions can appear at this level, and the boundary between them is vague. For example, waving hands (an atomic action) and gesturing goodbye (a meaningful action) look similar, but their meanings differ depending on the situation, not on the video length. This level contains both kinds of actions, even though their difficulties clearly differ.

  • Level 3 (scene): Questions at this level are based on clips 1-3 minutes long without a change of place. Videos at this level contain sequences of actions, which raises the difficulty above Level 2. We consider this the "story" level according to our working definition of story; MovieQA [Tapaswi et al.2015] and TVQA [Lei et al.2018] are the only datasets that belong to it. For reference, the popular TV sitcom Friends has 13 scenes per episode on average, and a movie has about 120 scenes on average.

  • Level 4 (sequence): Questions at this level relate to two or more scenes, but less than the entire video. To the best of our knowledge, no existing dataset deals with video at this level.

  • Level 5 (entire): Questions at this level are based on an entire story from beginning to end, i.e., a whole video such as an entire movie or a full episode of a drama.
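As referenced above, the following rough heuristic sketch assigns a memory capacity level automatically, assuming each question is annotated with the duration and scene span of the target video needed to answer it (the thresholds are taken from the level definitions above; the function name and signature are hypothetical):

```python
def memory_capacity_level(duration_sec: float, num_scenes: int,
                          spans_entire_video: bool) -> int:
    """Heuristic level assignment from the annotated target-video span."""
    if spans_entire_video:
        return 5           # Level 5: whole movie or full episode
    if num_scenes >= 2:
        return 4           # Level 4: sequence spanning two or more scenes
    if duration_sec > 10:
        return 3           # Level 3: scene, roughly 1-3 minutes long
    if duration_sec > 0:
        return 2           # Level 2: shot, under about 10 seconds
    return 1               # Level 1: single frame, no temporal extent
```

In practice, shot and scene boundaries would come from annotation or from scene segmentation such as [Zhai and Shah2006]; duration thresholds alone cannot separate a long shot from a short scene.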

Criterion 2: Logical Complexity

Complicated questions often require more (or higher-level) logical reasoning steps than simple questions. In other words, if answering a question requires multiple interrelated supporting facts, we regard the question as having high logical complexity. Story-enabled intelligence must trace several logical reasoning steps, combining multiple supporting facts, to give a correct answer to such a question. In a similar vein, if a question needs only a single supporting fact with a single relevant datum, we regard it as having low logical complexity; it may need only one reasoning step or one perception step to answer.

This subsection describes the second criterion, logical complexity, which defines another dimension of question difficulty. We define five logical complexity levels based on the Stanford Mobile Inquiry-based Learning Environment (SMILE) [Seol, Sharp, and Kim2011]. In the SMILE project, students study online lectures or documents via a mobile platform and generate relevant questions based on what they have learned. Each question made by a student is classified into one of five logical complexity levels, as follows (a sketch of the supporting-fact representation follows the list):

  • Level 1 (Simple recall on one cue): Questions at this level can be answered with minimal cognitive effort, involving simple recall or simple arithmetic. They require only one supporting fact, where a supporting fact is a triplet of the form {subject-relationship-object}, such as {person-hold-cup}. Because questions at this level are very simple, they may not trigger much interaction.

  • Level 2 (Simple analysis on multiple cues): Questions at this level can be answered with simple analysis of the question type or with simple reasoning. They ask for factual information involving recall of multiple independent supporting facts, triggering simple inference or quick interpretation. For example, the two supporting facts {tom-in-kitchen} and {tom-grab-tissue} are needed to answer "Where does Tom grab the tissue?". Questions at this level begin with simple interrogatives such as "Who", "What", "When", "Where", and "How many", and their answers come from a clearly defined scope with little room for dispute.

  • Level 3 (Intermediate cognition on dependent multiple cues): Questions at this level can be answered with an intermediate level of cognition and analysis. They require multiple supporting facts connected by a time factor, i.e., a sequence of situations or actions. Accordingly, questions at this level cover how situations have changed and how subjects have acted, and answering them requires cognitive operations such as comparison, classification, or categorization.

  • Level 4 (High-level reasoning for causality): Questions at this level require higher-level analysis and reasoning than lower-level questions. They cover reasoning about causality, typically beginning with "Why". Causal reasoning is the process of identifying causality, i.e., the relationship between cause and effect in actions or situations. Answering questions at this level requires one's own interpretation or synthesis.

  • Level 5 (Creative thinking): Questions at this level require imagination and the creation of a new theory or hypothesis with supporting rationale. They cover creative thinking and reasoning that may help define a solution or concept that did not previously exist. For example, questions #19 and #20 in Table 1 in the appendix elicit unique solutions by formulating rational hypotheses about situations that did not occur.
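As a sketch of the supporting-fact representation referenced in the levels above (the triplet form and the example facts come from the text; the type and variable names are hypothetical):

```python
from typing import NamedTuple

class Fact(NamedTuple):
    """A supporting fact in the {subject-relationship-object} triplet form."""
    subject: str
    relation: str
    obj: str

# Level 1 needs one fact; Level 2 needs multiple independent facts, e.g.,
# the two facts that answer "Where does Tom grab the tissue?":
level2_facts = [Fact("tom", "in", "kitchen"), Fact("tom", "grab", "tissue")]

# Note that Levels 3-5 are not distinguished by fact count alone: they add
# temporal dependencies (Level 3), causal links (Level 4), and hypothesis
# generation (Level 5) on top of this representation.
```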

Criterion 3: DIKW Hierarchy

The DIKW (Data, Information, Knowledge, and Wisdom) hierarchy is widely accepted as a way of representing different levels of what we see and what we know [Schumaker2011]. In terms of video Q&A, a level of understanding can likewise be identified by answering questions posed at the levels of data, information, knowledge, and wisdom. In this section, we discuss the four corresponding levels of video understanding in detail.

  • Level 1 (Data-level): Data are observations of the physical world [Schumaker2011, Carlisle2006] and symbolic representations of things, events, and activities [Ackoff1989, Rowley2007]. In terms of video Q&A, data-level questions cover characters, characters' lines, objects, sounds, locations, and simple behaviors that carry no subjective meaning or goal with respect to the environment (such as standing, walking, or making a call).

  • Level 2 (Information-level): Information refers to data that has been shaped into a meaningful and useful form [Rowley2007]; specifically, it adds relationships between data [Barlas, Ginart, and Dorrity2005]. In terms of video Q&A, information-level questions focus on meaningful interactions between characters and objects, such as actions, emotions, and situations, that can be obtained from a scene of the video.

  • Level 3 (Knowledge-level): Knowledge refers to the aggregation of related information that provides a clear understanding of that information [Barlas, Ginart, and Dorrity2005, Schumaker2011], and involves the synthesis of multiple sources of information over time [Rowley2007, Despres and Chauvel2012]. In terms of video Q&A, knowledge-level questions can be answered only with information accumulated across multiple scenes of the video, together with knowledge of the content's fictional universe and commonsense.

  • Level 4 (Wisdom-level): Wisdom refers to accumulated knowledge that makes it possible to apply concepts understood in one domain to new situations or problems [Rowley2007]. In terms of video Q&A, wisdom-level questions can be answered by utilizing meta-level relationships among knowledge, including nonsense and humor. For example, question #28 in Table 1 in the appendix requires understanding the character "Chandler" in terms of his sense of humor.

Figure 1: Interpretation of the proposed three criteria (i.e., memory capacity, logical complexity, and the DIKW hierarchy) in terms of the cognitive development stages proposed by Piaget and recast by Collis. A highlighted bar indicates that the cognitive operations available from that developmental stage suffice to answer questions at the corresponding level of a criterion.

Interpretation as Cognitive Development Stage

In this section, we interpret the proposed three criteria (i.e., memory capacity, logical complexity, and the DIKW pyramid) from the viewpoint of the cognitive development of human intelligence. We first introduce the cognitive development stages defined by Piaget, and then apply them to the criteria of the three-dimensional video Q&A hierarchy.

Piaget’s theory of cognitive development

In this section, we explain human cognitive development based on a neo-Piagetian theory [Collis1975b] that recasts Piaget's theory of developmental stages [Piaget1972]. Piaget's theory explains in detail, in conjunction with information processing models, the process by which human cognitive ability develops. To justify the three criteria proposed in this paper in terms of the development of human intelligence, we examine the developmental stages of Piaget's theory in detail. Piaget's original model includes a sensory-motor stage that begins at birth; however, that stage involves only representations related to sensory-motor activity. We therefore focus on the stages from the pre-operational stage onward, in which a child exhibits understanding behavior [Collis1975a].

  • Stage 1 (Pre-Operational Stage; 4 to 6 years): At this stage, a child thinks at a symbolic level but does not yet use cognitive operations. The child cannot transform, combine, or separate ideas. Thinking at this stage is not logical and often unreasonable; associations are made on the basis of emotion and preference, and the child has a very egocentric view of the world.

  • Stage 2 (Early Concrete Stage; 7 to 9 years) : At this stage, a child can utilize only one relevant operation. Thinking at this stage has become detached from instant impressions and is structured around a single mental operation, which is a first step towards logical thinking.

  • Stage 3 (Middle Concrete Stage; 10 to 12 years): At this stage, a child can think using two or more relevant cognitive operations and can acquire the facts of dialogues. This is regarded as the foundation of proper logical functioning. However, a child at this stage lacks the ability to identify a general fact that integrates relevant facts into a coherent whole, and thinking at this stage is still concrete, not abstract.

  • Stage 4 (Concrete Generalization Stage; 13 to 15 years): Piaget referred to this stage as the early formal stage, particularly with respect to abstract thinking. A child at this stage, however, can generalize only from personal and concrete experiences; the child does not have the ability to hypothesize concepts or knowledge that are truly abstract.

  • Stage 5 (Formal Stage; 16 years onward): This stage is characterized by purely abstract thought. Rules can be integrated to obtain novel results that go beyond the individual's own personal experiences. However, not every person reaches this stage.

Figure 2: Example of the three-dimensional video Question-and-Answering (video Q&A) hierarchy. Each point represents a question from Table 1 in the appendix. The three coordinates of each point are assigned by the level definitions of the three criteria, and the point is matched to the highest of the three developmental stages derived via the interpretation in Figure 1.

Applying Piaget’s human developmental stage to criteria of three-dimensional hierarchy

Human understanding, as Piaget stated, can be classified into different stages. We propose that concepts from Piaget's theory of development correspond to the criteria of the three-dimensional video Q&A hierarchy.

First, the developmental stages can be explained from the perspective of the memory capacity criterion. Memory capacity corresponds to working memory in models of cognitive processing. [Case1980a] suggested that the working memory available for problems increases with age, as does the space required for higher-level responses. This relationship between working memory and age leads to the proposition that cognitive developmental stages can be explained by increasing attention span, or working memory capacity [Case1980b, Mclaughlin1963, Pascual-Leone1969]. Thus, we assume that Piaget's account of how human intelligence develops with age accords with the memory capacity criterion. For example, understanding a static image is possible from Stage 1 (pre-operational), as is understanding a video shorter than 10 seconds. Understanding spans of minutes, e.g., a scene within 3 minutes, is possible from Stage 2 (early concrete). Beyond this, understanding two or more scenes (e.g., sequences that change time and place) is possible from Stage 3 (middle concrete). Finally, knowing and understanding a whole video in its entirety is possible from Stage 4 (concrete generalization). While this mapping is not perfectly sharp, a clear hierarchy exists, as shown in Figure 1.

Piaget's developmental stages are also consistent with the logical complexity criterion. As the SMILE project proposes, question types ranging from simple recall to assumption-based reasoning form a hierarchy that is closely related to a person's developmental stage. For example, Levels 1 and 2 are available from Stage 1 (pre-operational), in that they need only simple recall; specifically, Level 1 requires one supporting fact (e.g., {jacket-is-black}), whereas Level 2 requires multiple independent supporting facts. Level 3 is available from Stage 3 (middle concrete), in that it requires understanding dependent multiple supporting facts across time. Level 4 is available from Stage 4 (concrete generalization), because it requires higher-order thought about causality in relation to "Why". Finally, Level 5 is available from Stage 5 (formal), as it requires creativity and abstract thinking about new ideas. As such, each level of SMILE can be regarded as roughly equivalent to a human developmental stage postulated by Piaget [Collis1972].

Piaget's human developmental stages are not exactly consistent with the DIKW hierarchy criterion. Nevertheless, the data level is possible from Stage 1 (pre-operational), in that it involves only simple factual data. The information level is possible from Stage 3 (middle concrete), because it requires identifying relationships between real-world entities. The knowledge level corresponds roughly to Stage 4 (concrete generalization), as both stop short of inferring information from abstract variables. Finally, the wisdom level is possible from Stage 5 (formal), in that knowledge can be inferred and applied to new situations, as in knowledge transfer [Collis1975a]. Figure 2 shows the three-dimensional hierarchical map for the questions listed in Table 1 in the appendix. The three coordinates of each point are assigned by the level definitions of the three criteria, and the point is matched to the highest of the three developmental stages derived via the interpretation in Figure 1. For example, the question "What did Joey and Chandler do at Ross's house?" is assigned levels {3, 2, 1} for the three criteria. These levels are answerable from stages {2, 1, 1}, respectively, so the question is mapped to Stage 2, the highest of the three.
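The stage-assignment rule just described can be written down directly. The following minimal sketch encodes, as lookup tables, the minimal stage for each criterion level as described in the three paragraphs above (the table values are our reading of Figure 1):

```python
# Minimal developmental stage from which each criterion level is answerable.
MEMORY_STAGE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4}  # frame, shot, scene, sequence, entire
LOGIC_STAGE  = {1: 1, 2: 1, 3: 3, 4: 4, 5: 5}  # recall, ..., creative thinking
DIKW_STAGE   = {1: 1, 2: 3, 3: 4, 4: 5}        # data, information, knowledge, wisdom

def piaget_stage(memory: int, logic: int, dikw: int) -> int:
    """Assign the highest of the three per-criterion minimal stages."""
    return max(MEMORY_STAGE[memory], LOGIC_STAGE[logic], DIKW_STAGE[dikw])

# Worked example from the text: levels {3, 2, 1} -> stages {2, 1, 1} -> Stage 2.
assert piaget_stage(3, 2, 1) == 2
```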

Discussion and Future Works

In this paper, we proposed a theoretical framework with three criteria (i.e., memory capacity, logical complexity, and the DIKW hierarchy) for constructing hierarchical Q&A datasets for video story understanding. A key contribution of our work is an approach for classifying the difficulty of questions about video stories according to these criteria. Interestingly, the three criteria can be mapped onto the five stages of human development in Piaget's theory, which can serve as a basis for Q&A systems that evaluate video story understanding.

While the suggested criteria can be linked to human developmental stages, the stages are hard to delimit separately and exactly. This is due to a limitation of Piaget's theory: human development is not discontinuous, so the developmental stages are not always precisely differentiated by age and cognitive abilities. In particular, agreement on the developmental stage of high-level cognition remains controversial. Nevertheless, Collis's stages [Collis1972, Collis1975a, Collis1975b] are an appropriate attempt to classify understanding according to human developmental stages. Furthermore, the connection from the three criteria to the five developmental stages suggests that story-enabled intelligence can be associated with the stages of human cognitive development. Applying knowledge about human cognitive development will help set the direction of human-level AI research in detail.

Moreover, compared with human developmental stages, a machine learning approach requires a clearer and more explicit specification. For example, in Collis's classification each stage spans roughly two years, whereas a machine needs a more detailed and specific classification criterion, such as one at the granularity of a month or a season.

As future work, we plan to extend the proposed criteria to reflect viewpoints from cognitive narratology (e.g., Zwaan's five-index model of narrative understanding, covering space, time, characters, goals, and causation [Zwaan1999]). We also plan to use the proposed criteria as a guideline for constructing a carefully designed hierarchical dataset for video story understanding. The proposed video Q&A hierarchy can serve both as a metric of the developmental level of machine intelligence and as guidance on what data should be collected to study a desired level of machine intelligence.

Acknowledgments

The authors would like to thank Woosuk Choi and Chris Hickey for helpful comments and editing. This work was partly supported by the Korea government (No. 2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test; IITP-R0126-16-1072-SW.StarLab; 2018-0-00622-RMI; KEIT-10060086-RISF; NRF-2016R1D1A1B03936326).

References

  • [Ackoff1989] Ackoff, R. L. 1989. From data to wisdom. Journal of applied systems analysis 16(1):3–9.
  • [Agrawal et al.2017] Agrawal, A.; Lu, J.; Antol, S.; Mitchell, M.; Zitnick, C. L.; Parikh, D.; and Batra, D. 2017. VQA: Visual Question Answering. International Journal of Computer Vision 123(1):4–31.
  • [Barlas, Ginart, and Dorrity2005] Barlas, I.; Ginart, A.; and Dorrity, J. L. 2005. Self-evolution in knowledge bases. In Autotestcon, 2005. IEEE, 325–331. IEEE.
  • [Black and Bower1980] Black, J., and Bower, G. 1980. Story understanding as problem solving. Poetics 9:223–250.
  • [Blair and Meyer1997] Blair, D., and Meyer, T. 1997. Tools for an interactive virtual cinema. Berlin, Heidelberg: Springer Berlin Heidelberg. 83–91.
  • [Carlisle2006] Carlisle, J. P. 2006. Escaping the veil of maya: Wisdom and the organization. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS’06), volume 7, 162a–162a.
  • [Case1980a] Case, R. 1980a. Implications of neo-piagetian theory for improving the design of instruction. Cognition, development, and instruction 161–186.
  • [Case1980b] Case, R. 1980b. The underlying mechanism of intellectual development. Cognition, development, and instruction.
  • [Chatman1978] Chatman, S. 1978. Story and Discourse: Narrative Structure in Fiction and Film. Cornell paperbacks. Cornell University Press.
  • [Collis1972] Collis, K. 1972. Concrete to abstract–a new viewpoint. Australian Mathematics Teacher 28(3):113–118.
  • [Collis1975a] Collis, K. F. 1975a. The development of formal reasoning. University of Newcastle.
  • [Collis1975b] Collis, K. F. 1975b. A Study of Concrete and Formal Operations in School Mathematics: A Piagetian Viewpoint. Hawthorn Vic : Australian Council for Educational Research.
  • [Despres and Chauvel2012] Despres, C., and Chauvel, D. 2012. Knowledge horizons. Routledge.
  • [Genette1980] Genette, G. 1980. Narrative Discourse: An Essay in Method. Cornell paperbacks. Cornell University Press.
  • [Graesser, Singer, and Trabasso1994] Graesser, A. C.; Singer, M.; and Trabasso, T. 1994. Constructing inferences during narrative text comprehension. Psychological Review 101:371–395.
  • [Jang et al.2017] Jang, Y.; Song, Y.; Yu, Y.; Kim, Y.; and Kim, G. 2017. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering.
  • [Johnson et al.2017] Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C. L.; and Girshick, R. 2017. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, 1988–1997. IEEE.
  • [Kim et al.2017] Kim, K.-M.; Heo, M.-O.; Choi, S.-H.; and Zhang, B.-T. 2017. DeepStory: Video story QA by deep embedded memory networks. In IJCAI International Joint Conference on Artificial Intelligence.
  • [Lei et al.2018] Lei, J.; Yu, L.; Bansal, M.; and Berg, T. L. 2018. Tvqa: Localized, compositional video question answering. In EMNLP.
  • [Maharaj et al.2017] Maharaj, T.; Ballas, N.; Rohrbach, A.; Courville, A.; and Pal, C. 2017. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 2017-January:7359–7368.
  • [Malinowski, Rohrbach, and Fritz2015] Malinowski, M.; Rohrbach, M.; and Fritz, M. 2015. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, 1–9.
  • [Mateas and Sengers1999] Mateas, M., and Sengers, P. 1999. Narrative intelligence. In Proceedings of the AAAI Fall Symposium on Narrative Intelligence, 1–10.
  • [McKoon and Ratcliff1992] McKoon, G., and Ratcliff, R. 1992. Inference during reading. Psychological Review 99(3):440–466.
  • [Mclaughlin1963] Mclaughlin, G. H. 1963. Psycho-logic: A possible alternative to piaget’s formulation. British Journal of Educational Psychology 33(1):61–67.
  • [Mun et al.2017] Mun, J.; Seo, P. H.; Jung, I.; and Han, B. 2017. Marioqa: Answering questions by watching gameplay videos. In ICCV.
  • [Pascual-Leone1969] Pascual-Leone, J. 1969. Cognitive development and cognitive style: A general psychological integration.
  • [Piaget1972] Piaget, J. 1972. Intellectual evolution from adolescence to adulthood. Human development 15(1):1–12.
  • [Ren, Kiros, and Zemel2015] Ren, M.; Kiros, R.; and Zemel, R. 2015. Exploring models and data for image question answering. In Advances in neural information processing systems, 2953–2961.
  • [Rowley2007] Rowley, J. 2007. The wisdom hierarchy: representations of the dikw hierarchy. Journal of information science 33(2):163–180.
  • [Schumaker2011] Schumaker, R. P. 2011. From data to wisdom: the progression of computational learning in text mining. Communications of the IIMA 11(1):4.
  • [Seol, Sharp, and Kim2011] Seol, S.; Sharp, A.; and Kim, P. 2011. Stanford mobile inquiry-based learning environment (smile): using mobile phones to promote student inquires in the elementary classroom. In Proceedings of the International Conference on Frontiers in Education: Computer Science and Computer Engineering (FECS),  1. The Steering Committee of The World Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp).
  • [Tapaswi et al.2015] Tapaswi, M.; Zhu, Y.; Stiefelhagen, R.; Torralba, A.; Urtasun, R.; and Fidler, S. 2015. MovieQA: Understanding Stories in Movies through Question-Answering.
  • [Trabasso and Van Den Broek1985] Trabasso, T., and Van Den Broek, P. 1985. Causal thinking and the representation of narrative events. Journal of memory and language 24(5):612–630.
  • [Turing1950] Turing, A. M. 1950. Computing machinery and intelligence. Mind 59(236):433–460.
  • [Wang et al.2018] Wang, P.; Wu, Q.; Shen, C.; Dick, A.; and van den Hengel, A. 2018. Fvqa: Fact-based visual question answering. IEEE transactions on pattern analysis and machine intelligence 40(10):2413–2427.
  • [Zhai and Shah2006] Zhai, Y., and Shah, M. 2006. Video scene segmentation using Markov chain Monte Carlo. Multimedia, IEEE Transactions on 8:686–697.
  • [Zhu et al.2016] Zhu, Y.; Groth, O.; Bernstein, M.; and Fei-Fei, L. 2016. Visual7w: Grounded question answering in images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4995–5004.
  • [Zwaan, Langston, and Graesser1995] Zwaan, R. A.; Langston, M. C.; and Graesser, A. C. 1995. The construction of situation models in narrative comprehension: An event-indexing model. Psychological Science 6(5):292–297.
  • [Zwaan1999] Zwaan, R. A. 1999. Situation models: The mental leap into imagined worlds. Current Directions in Psychological Science 8(1):15–18.

Appendix