Understanding visual scenes is one of the ultimate goals of computer vision. Many intermediate and low-level tasks, such as object detection, recognition, segmentation and tracking, have been studied toward this goal. One of the high-level tasks toward scene understanding is visual question answering [Antol et al.2015], which aims at understanding scenes by answering questions about the visual data. It also has a wide range of applications, from aiding the visually impaired and analyzing surveillance data to domestic robots.
The visual data we face every day are mostly dynamic videos. However, most current visual question answering works focus only on images [Bigham et al.2010, Geman et al.2015, Gao et al.2015, Yang et al.2016, Noh et al.2016, Ma et al.2016]. Images are static and contain far less information than videos, so image-based question answering cannot fit real-world applications well: it ignores the temporal coherence of the scenes.
Existing video-related question answering works usually rely on additional information. The Movie-QA dataset [Tapaswi et al.2016] contains multiple sources of information: plots, subtitles, video clips, scripts and DVS transcriptions. Such extra information is hard to obtain in the real world, making these datasets and approaches difficult to extend to general videos.
Unlike previous works, we consider the more ubiquitous task of video question answering with only the visual data and the natural language questions. In our task, only the videos, the questions and the corresponding answer choices are presented. We first introduce a dataset collected on our own. Collecting such a dataset is not easy. In image-based question answering (Visual-QA) [Antol et al.2015], most collection methods require humans to generate the question-answer pairs [Antol et al.2015, Malinowski and Fritz2014], which costs a significant amount of human labor. To make things worse, a video has a temporal dimension that an image lacks, which multiplies the labor of the human annotators. To avoid this significant increase of human labor, our solution is to employ question generation approaches [Heilman and Smith2010] to produce question-answer pairs directly from the texts accompanying the videos. The collection problem thus reduces to collecting videos with descriptions, which inspires us to utilize existing video description datasets. The TGIF (Tumblr GIF) dataset [Li et al.2016] is a large-scale video description dataset whose ground-truth descriptions provide the texts we need to produce question-answer pairs. From it we form our TGIF-QA dataset; details are described in the Dataset section.
Existing visual question answering approaches are not suitable for video question answering because of the following features of the video task: First, a question may relate to an event that spans multiple frames, so information must be gathered across frames to answer it. A typical case is a question about numbers (see Fig 1). The question asks about the number of men in the video, but at the beginning we cannot see the correct number of men; only by watching the video frame by frame can we tell the answer is four. The same holds for the case in the top-right of Fig 2. Existing Visual-QA approaches cannot be applied here as they only utilize the spatial information of a static image. Second, there may be a lot of redundancy among the video frames. In addition, our task faces another challenge: the candidate answers are mostly phrases. To tackle these problems, we propose two models: the re-watcher and the re-reader. The re-watcher model meticulously processes the video frames, gathering information from the relevant frames and recurrently accumulating it as the question is read. The re-reader model handles phrasal answers and concentrates on the important words in them. We then combine these two models into the forgettable-watcher model.
Our contribution can be summarized in two aspects: First, we introduce a Video-QA dataset, TGIF-QA. Second, we propose models that exploit the temporal structure of the videos and the phrasal structure of the candidate answers. We also extend the VQA model [Antol et al.2015] as a baseline method. We evaluate our models on the proposed TGIF-QA dataset, and the experimental results show their effectiveness.
In this section, we introduce our dataset for Video Question Answering, converted from an existing video description dataset. The TGIF dataset [Li et al.2016] was collected by Li et al. from Tumblr. Animated GIFs are essentially short video clips. Li et al. cleaned up the original data and ruled out GIFs with cartoon, static or textual content; the remaining GIFs were annotated via a crowd-sourcing service. The TGIF dataset contains 102,068 GIFs and 287,933 descriptions in total, where each GIF corresponds to several descriptions and each description consists of one or two sentences.
To generate question-answer pairs from the descriptions, we employ the state-of-the-art question generation approach of [Heilman and Smith2010]. We focus on questions of the types What/When/Who/Whose/Where and How Many. Our question answering task is of the multiple-choice type, while the generated data only contains question and ground-truth answer pairs, so we need to generate wrong alternatives for each question. We provide each question with 8 candidate answers. We describe how we generate the alternative answers for each type of question in the following subsections.
2.1.1 How Many
The How Many questions relate to counting objects in the video. To generate reasonable alternatives, we first collect all the How Many questions in our dataset and gather the numbers in their answers. All Arabic numerals are converted to English word representations. After eliminating numbers with low occurrence frequency, we find that most answers contain numbers from one to eight. We discard the questions whose answers exceed eight and replace the ground-truth number with each of [one, two, three, four, five, six, seven, eight] to generate the 8 candidate answers. A typical example is shown in the top-right of Fig 1.
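The substitution scheme above can be sketched as follows; the function name and the discard-by-returning-None convention are illustrative assumptions, not the paper's implementation.

```python
NUMBER_WORDS = ["one", "two", "three", "four", "five", "six", "seven", "eight"]

def howmany_candidates(answer):
    """Substitute the number word in a ground-truth answer phrase with each
    of one..eight, yielding 8 candidates that include the ground truth."""
    tokens = answer.split()
    idx = next((i for i, t in enumerate(tokens) if t in NUMBER_WORDS), None)
    if idx is None:
        return None  # no in-range number word: the question is discarded
    return [" ".join(tokens[:idx] + [n] + tokens[idx + 1:]) for n in NUMBER_WORDS]
```

Note that the ground-truth answer is always among the 8 candidates, since its number word is one of the substituted values.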
2.1.2 Who
Questions starting with Who usually relate to humans. We collect the words in the answers of all the Who questions and filter them to obtain the nouns. After that, we filter out all abstract and low-frequency nouns to form an entity list. The entity word in the ground-truth answer is then replaced with random samples from the entity list to generate the 8 alternatives. An example is provided in the top-left of Fig 1.
2.1.3 Whose
The Whose questions relate to facts about belongings, which can be expressed in two ways. One is through a possessive pronoun like “my”, “your”, “his”, etc. The other is the possessive case of nouns such as “man’s”, “girl’s”, “cats’”, etc. For the former, we replace the pronoun with random samples from the possessive pronoun list to generate the alternatives. For the latter, we replace the nouns in the same way as for the Who questions.
2.1.4 Other Questions
For the remaining types of questions, we simply replace the nouns in the answer phrases to generate the candidate choices.
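The shared noun-replacement scheme used by the Who, Whose and remaining question types can be sketched as follows; the function signature, the fixed seed and the shuffling step are illustrative assumptions.

```python
import random

def make_alternatives(answer_tokens, noun_index, entity_list, k=8, seed=0):
    """Keep the ground-truth phrase and draw k-1 distractors by swapping
    the noun at noun_index with samples from a curated entity list."""
    rng = random.Random(seed)
    truth = " ".join(answer_tokens)
    noun = answer_tokens[noun_index]
    pool = [e for e in entity_list if e != noun]  # never re-sample the truth
    distractors = rng.sample(pool, k - 1)
    swapped = [" ".join(answer_tokens[:noun_index] + [e] + answer_tokens[noun_index + 1:])
               for e in distractors]
    candidates = [truth] + swapped
    rng.shuffle(candidates)  # hide the position of the correct answer
    return candidates
```

Sampling without replacement guarantees the 8 candidates are distinct whenever the entity list is large enough.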
2.2 Dataset Split
We discard the videos whose descriptions generate no questions or whose questions were discarded in the previous processing. In the end our dataset contains 117,854 videos and 287,763 question-candidates pairs. We split the TGIF-QA dataset into three parts for training, validation and testing. The training set contains 230,689 question-answer pairs from 94,425 videos, the validation set contains 24,696 pairs from 10,096 videos, and the testing set has 32,378 pairs from 13,333 videos.
3.1 Task Description
Multiple-Choice Video-QA is to select an answer given a question $Q$, video information $V$ and candidate choices (alternatives) $\{A_k\}_{k=1}^{8}$. A video is a sequence of image frames $V = (v_1, \dots, v_N)$. A question is a sequence of natural language words $Q = (q_1, \dots, q_T)$, and each alternative $A_k$ is likewise a sequence of natural language words. We formulate the Video-QA task as selecting the best answer among the alternatives: we define a loss function, regard the QA problem as a classification problem, and select as the best answer the alternative that achieves the minimal loss.
The major difficulty of Video-QA compared with Image-QA is that a video has many frames. Information is scattered among the frames and an event can last across several of them. To answer a question, we must find the most informative frame or combine information from several frames.
In the following we propose three models: the re-watcher, the re-reader and the forgettable-watcher. The re-watcher model processes the question word sequence meticulously, the re-reader model fully utilizes the information of the video frames along the temporal dimension, and the forgettable-watcher combines both.
We denote by $s$ the concatenation of a question and an answer after word embedding. All our models take the question and one alternative as a whole sentence, which we call the QA-sentence. The QA-sentence and the visual feature sequence are then fed into our models to produce a score for the alternative answer in the sentence. In the following sections, we denote the QA-sentence as $s$ when it does not introduce confusion.
3.2.1 Re-watcher
Our model first encodes the video features and QA features with two separate bi-directional single-layer LSTMs [Graves2012]. The LSTMs contain an embedding layer that maps the video features and textual features into a joint feature space. We denote the outputs of the forward and backward LSTMs as $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$. The encoding of a QA-sentence of length $T$ is formed by concatenating the final outputs of the forward and backward LSTMs, $[\overrightarrow{h}_T; \overleftarrow{h}_1]$. For the video frames, we denote the encoding output for the frame at time $j$ as $h^v_j$. The representation of the videos for each QA-sentence token $t$ is formed by a weighted sum of these output vectors (similar to the attention mechanism in Image-QA [Yang et al.2016]) combined with the previous representation $r_{t-1}$.
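As a concrete sketch of this attend-and-accumulate step: the dot-product attention scoring and the tanh accumulation below are illustrative assumptions, since the exact update rule and its parameterization are not reproduced here.

```python
import numpy as np

def attend(query, frame_feats):
    """Soft attention over frame encodings: weight each frame by its
    dot-product similarity with the current token state, then sum."""
    logits = frame_feats @ query
    w = np.exp(logits - logits.max())  # numerically stable softmax
    w /= w.sum()
    return w @ frame_feats

def rewatch(token_states, frame_feats):
    """For every QA-sentence token, re-attend over all frames and fold
    the attended vector into the running representation."""
    r = np.zeros(frame_feats.shape[1])
    for q in token_states:
        r = np.tanh(r + attend(q, frame_feats))
    return r
```

The key property is that every token triggers a fresh pass over all frame encodings, so frame information relevant to late words in the sentence is not lost.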
The mechanism of the re-watcher model is that every word combines with the whole video sequence to generate a state, and the states of the word sequence are accumulated into a combined feature (see Fig 2, left). This model mimics a person with a bad memory for the video he watches: every time he reads a word of the QA-sentence, he goes back and watches the whole video to make out what the word is about. During this procedure, he selects from the video the information most related to the current QA-sentence token and recurrently accumulates it as the whole QA-sentence is read. Finally, a joint video and QA-sentence representation is formed and passed through three fully-connected layers to produce a score measuring how well the question-answer pair (QA-sentence) matches the video; the first two fully-connected layers use ReLU activations and the last layer has no activation. Since each question has 8 alternatives, it corresponds to 8 QA-sentences, so we generate 8 such scores, which fill an 8-dimensional score vector. The score vector is then put through a softmax layer.
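The final selection step can be sketched as follows; the score values are made up purely for illustration.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# hypothetical scores for the 8 QA-sentences of one question
scores = np.array([0.3, 2.1, -0.5, 0.0, 1.2, -1.0, 0.4, 0.7])
probs = softmax(scores)
pred = int(np.argmax(probs))  # index of the selected alternative
```

At training time the softmax output can be paired with a cross-entropy loss against the index of the correct alternative, which matches the classification formulation of the task.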
3.2.2 Re-reader
The re-watcher model mimics a person who continuously re-watches the video as he reads the QA-sentence, accumulating the video features related to the QA-sentence. The re-reader model is designed from the opposite view (see Fig 2, middle).
This model mimics a person who cannot remember the whole question well: every time he watches a frame, he looks back over the whole question. We denote the encoding output of the QA-sentence at token $t$ as $h^q_t$. The representation of the video frames at time $j$ is computed from a weighted sum of the QA-sentence encoding outputs combined with the previous representation $m_{j-1}$, for $j = 1, \dots, N$, where $N$ is the number of frames. The score of the QA-sentence is then computed in the same manner as in the re-watcher model.
3.2.3 Forgettable-watcher
We combine the re-watcher and the re-reader into the forgettable-watcher model (see Fig 2, right). This model meticulously combines the visual features and the textual features: the whole video is re-watched when a word is read, and the whole sentence is re-read while a frame is being watched. The two representations are then combined to produce the score.
3.2.4 Baseline Method
To show the effectiveness of the re-reading and re-watching mechanisms of our models, we also employ a straightforward model (see Fig 3). This model is extended from the VQA model [Antol et al.2015], which is designed for question answering given only a single image. We extend it to our task by directly encoding the video frames and the QA-sentences with two separate bidirectional LSTMs; the final encoding outputs of both are then combined to produce the score.
4 Experiments and Results
We evaluate all the methods on our TGIF-QA dataset described in the Dataset section.
4.1 Data Preprocessing
We sample all the videos to the same number of frames in order to reduce redundancy. If a video does not contain enough frames for sampling, its last frame is repeated. We extract the visual features of each frame with VGGNet [Simonyan and Zisserman2014], taking the 4096-dimensional feature vector of the first fully-connected layer as the visual feature. For the questions and answers, Word2Vec [Mikolov et al.2013] trained on GoogleNews is employed to embed each word as a 300-dimensional real vector. Then we concatenate each question with each of its 8 alternatives to generate 8 candidate QA-sentences.
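The sampling-with-padding step can be sketched as below; uniform striding is an assumption, as the sampling policy is not specified beyond repeating the last frame.

```python
def sample_frames(frames, n):
    """Uniformly sample n frames from a video; if the video is too
    short, pad by repeating the last frame."""
    if len(frames) >= n:
        step = len(frames) / n
        return [frames[int(i * step)] for i in range(n)]
    return frames + [frames[-1]] * (n - len(frames))
```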
4.2 Implementation Details
We train all our models using the Adam optimizer [Kingma and Ba2015] to minimize the loss. The initial learning rate is set to 0.002, the exponential decay rates for the first and second moment estimates are set to 0.1 and 0.001 respectively, and the batch size is set to 100. A gradient clipping scheme is applied to keep the gradient norm within 10, and an early stopping strategy halts training when the validation accuracy no longer improves.
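The norm-clipping scheme can be sketched as a rescaling of the gradients whenever their global L2 norm exceeds the bound of 10; this standalone NumPy version is illustrative of the idea rather than the training framework actually used.

```python
import numpy as np

def clip_by_norm(grads, max_norm=10.0):
    """Rescale a list of gradient arrays so that their global L2 norm
    does not exceed max_norm; leave them untouched otherwise."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total <= max_norm:
        return grads
    return [g * (max_norm / total) for g in grads]
```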
4.2.2 Model Details
The visual features and the word embeddings are encoded by two separate bidirectional LSTMs to dimensionalities 2048 and 1024 respectively, then mapped to a joint feature space of dimensionality 1024. The re-watcher (re-reader) component keeps a memory of size 512 and outputs the final combined feature of dimensionality 512. Finally, the combined feature is put through the three fully-connected layers.
4.2.3 Evaluation Metrics
Since our task is multiple-choice question answering, we employ classification accuracy to evaluate our models. However, there are a few cases where more than one choice can answer the question. This motivates us to also apply the WUPS measure [Malinowski and Fritz2014] with 0.0 and 0.9 as the threshold values, as in open-ended tasks.
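The thresholded WUPS measure of Malinowski and Fritz can be sketched as below. The Wu-Palmer word similarity itself (computed via WordNet) is abstracted into a generic `sim` function here, so this is a structural sketch rather than a complete implementation.

```python
def wups_score(pred_tokens, truth_tokens, sim, threshold=0.9):
    """Thresholded WUPS: similarities below the threshold are
    down-weighted by a factor of 0.1, and the score is the minimum of
    the two directional coverage products."""
    def s(a, b):
        v = sim(a, b)
        return v if v >= threshold else 0.1 * v

    def cover(xs, ys):
        p = 1.0
        for x in xs:
            p *= max(s(x, y) for y in ys)  # best match for each token
        return p

    return min(cover(pred_tokens, truth_tokens),
               cover(truth_tokens, pred_tokens))
```

With threshold 0.0 every similarity passes unweighted, which recovers the more lenient WUPS@0.0 variant reported in Table 1.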
Table 1: Accuracy (%), WUPS@0.0 (%) and WUPS@0.9 (%) of all compared models.

4.3 Results
4.3.1 Baseline Method
The baseline method is an extension of the VQA model [Antol et al.2015] without our re-watching or re-reading mechanisms; we call it the straightforward model. Its result is shown in the Straightforward entry of Table 1. The straightforward model performs much worse than the other three models.
4.3.2 Our Methods
The forgettable-watcher model outperforms the other two models since it jointly employs the re-watching and re-reading mechanisms. Meanwhile, the re-reader model performs worse than the re-watcher model, which implies that the re-watching mechanism is more important.
4.3.3 Results on Different Questions
We also report the accuracy of the models on different types of questions; the results are shown in Table 2. All the methods perform better on “Where” and “When” questions than on the others. This may be attributed to two reasons. First, the “Where” and “When” questions are easier to answer because they usually relate to the overall scene of the video, so in most cases a single frame may be enough to answer them. Second, the candidate alternatives produced during dataset generation may be too simple to discriminate: the generation method is less effective at producing good alternatives for these two question types, while it produces high-quality alternatives for the other types. We also report the results on the questions besides these two in Table 2.
Finally, we exhibit some typical Video-QA results in Fig 5.
5 Related Work
5.1 Image Question Answering
Image-based visual question answering has attracted significant research interest recently [Bigham et al.2010, Geman et al.2015, Antol et al.2015, Gao et al.2015, Yang et al.2016, Noh et al.2016]. The goal of Image-QA is to answer questions given only one image and no additional information. Image-QA tasks can be categorized into two types according to how answers are generated. The first type is Open-Ended question answering [Ren et al.2015, Malinowski and Fritz2014], where answers are produced given only the questions. As the answers generated by algorithms are usually not the exact words people expect, measures such as WUPS 0.9 and WUPS 0.0, based on the Wu-Palmer similarity (WUPS) [Malinowski and Fritz2014], are employed to measure answer accuracy. The other type is Multiple-Choice question answering [Zhu et al.2016], where both the question and several candidate answers are presented, and predicting the answer means picking the correct one among the candidates. It has been observed that Open-Ended approaches usually cannot produce high-quality long answers [Zhu et al.2016], and most of them only focus on one-word answers [Ren et al.2015, Malinowski and Fritz2014]. As a result, we adopt the Multiple-Choice type for our video question answering task; our candidate answer choices are mainly phrases rather than single words.
Many efforts have been made to tackle the Image-QA problem, and several works have collected their own datasets. Malinowski et al. [Malinowski and Fritz2014] collected their dataset with human annotations, focusing only on basic colors, numbers and objects. Antol et al. [Antol et al.2015] manually collected a large-scale free-form Image-QA dataset. Gao et al. [Gao et al.2015] also manually collected the FM-IQA (Freestyle Multilingual Image Question Answering) dataset with the help of the Amazon Mechanical Turk platform. Most of these methods require a large amount of human labor for data collection. In contrast, we propose to automatically convert an existing video description dataset [Li et al.2016] into a question answering dataset.
5.2 Question Generation
Automatic question generation is an open research topic in natural language processing, originally proposed for educational purposes [Gates2008]. In our situation, we need the generated questions to be as diverse as possible so that they match the properties of questions generated by human annotators. Among the question generation approaches [Rus and Arthur2009, Gates2008, Heilman and Smith2010], we employ the method of Heilman and Smith [Heilman and Smith2010] to generate our Video-QA pairs from video description datasets, as it generates questions in open domains. A similar idea was used by Ren et al. [Ren et al.2015] to turn image description datasets into Image-QA datasets, but they generate only four types of questions (objects, numbers, colors and locations) and their answers are mostly single words. In contrast, we generate a large number of open-domain questions whose corresponding answers are mostly phrases.
5.3 Video Question Answering
Video-based question answering is a largely unexplored problem compared with Image-QA. Previous works usually combine other text information. Tapaswi et al. [Tapaswi et al.2016] combine videos with plots, subtitles and scripts to generate answers. Tu et al. [Tu et al.2014] also combine video and text data for question answering. Zhu et al. [Zhu et al.2015] collect a dataset for “fill-in-the-blank” questions, and Mazaheri et al. [Mazaheri et al.2016] also consider the fill-in-the-blank problem. Compared with Image-QA, Video-QA is more challenging because of the additional temporal dimension: the useful information is scattered across frames, and the temporal coherence must be well addressed to better understand the videos. Our proposed method focuses on Multiple-Choice questions whose candidate answers are mostly phrases. We build our dataset by automatically turning an existing video description dataset into a Video-QA dataset, which saves a great deal of human labor. Moreover, we propose a model that better utilizes the temporal property of the videos and handles answers in phrase form.
6 Conclusion and Future Works
We propose to collect a large-scale Video-QA dataset by automatic conversion from a video description dataset. To tackle the Video-QA task, we propose the re-watching and re-reading mechanisms and combine them into an effective forgettable-watcher model. In the future, we will improve the quality and increase the quantity of our dataset, and will also consider more QA types, especially open-ended QA.
- [Antol et al.2015] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- [Bigham et al.2010] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 333–342. ACM, 2010.
- [Gao et al.2015] Haoyuan Gao, Junhua Mao, Jie Zhou, Zhiheng Huang, Lei Wang, and Wei Xu. Are you talking to a machine? dataset and methods for multilingual image question. In Advances in Neural Information Processing Systems, pages 2296–2304, 2015.
- [Gates2008] Donna M Gates. Automatically generating reading comprehension look-back strategy questions from expository texts. Technical report, DTIC Document, 2008.
- [Geman et al.2015] Donald Geman, Stuart Geman, Neil Hallonquist, and Laurent Younes. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
- [Graves2012] Alex Graves. Supervised Sequence Labelling with Recurrent Neural Networks, pages 15–35. Springer, 2012.
- [Heilman and Smith2010] Michael Heilman and Noah A Smith. Good question! statistical ranking for question generation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617. Association for Computational Linguistics, 2010.
- [Kingma and Ba2015] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
- [Li et al.2016] Yuncheng Li, Yale Song, Liangliang Cao, Joel Tetreault, Larry Goldberg, Alejandro Jaimes, and Jiebo Luo. Tgif: A new dataset and benchmark on animated gif description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [Ma et al.2016] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- [Malinowski and Fritz2014] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In Advances in Neural Information Processing Systems, pages 1682–1690, 2014.
- [Mazaheri et al.2016] Amir Mazaheri, Dong Zhang, and Mubarak Shah. Video fill in the blank with merging lstms. arXiv preprint arXiv:1610.04062, 2016.
- [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013.
- [Noh et al.2016] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Image question answering using convolutional neural network with dynamic parameter prediction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [Ren et al.2015] Mengye Ren, Ryan Kiros, and Richard Zemel. Exploring models and data for image question answering. In Advances in Neural Information Processing Systems, pages 2953–2961, 2015.
- [Rus and Arthur2009] Vasile Rus and C Graesser Arthur. The question generation shared task and evaluation challenge. In The University of Memphis. National Science Foundation. Citeseer, 2009.
- [Simonyan and Zisserman2014] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- [Tapaswi et al.2016] Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Movieqa: Understanding stories in movies through question-answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [Tu et al.2014] Kewei Tu, Meng Meng, Mun Wai Lee, Tae Eun Choe, and Song-Chun Zhu. Joint video and text parsing for understanding events and answering queries. IEEE MultiMedia, 21(2):42–70, 2014.
- [Yang et al.2016] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
- [Zhu et al.2015] Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G Hauptmann. Uncovering temporal context for video question and answering. arXiv preprint arXiv:1511.04670, 2015.
- [Zhu et al.2016] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.