Transformers are now the de facto standard for language modeling and have recently been extended to the vision and multimodal domains [vaswani2017, chen2020uniter]. Transformers in the vision-and-language domain are usually pretrained on large-scale datasets and then applied to various downstream tasks. Among these tasks, video question answering evaluates whether a model understands various dimensions of video content, and it is usually posed in a multiple-choice format. However, a model trained for multiple-choice video question answering selects the correct answer by comparing the similarity between the question and each answer candidate rather than by inferring the answer to the question. Selecting an answer by comparison with the candidates does not perform the reasoning that the question requires, which makes it difficult to generalize to other tasks.
In this paper, we tackle this limitation by converting an existing multiple-choice video question answering dataset into an open-ended format. In open-ended multimodal video question answering, no answer candidates are given, so the model must infer the correct answer through reasoning. In addition, apart from the decoder that generates the answer, the resulting model can be applied to other tasks.
To address open-ended multimodal video question answering, we propose an extended model that jointly learns from multiple modalities, built on a recently proposed Transformer language model. The proposed model receives various video metadata together with the language input. The results show that combining multiple kinds of metadata improves performance more than using features from raw videos.
This paper is organized as follows. Section 2 reviews related work on video question answering and open-ended question answering. Section 3 describes the proposed model and learning strategy. Section 4 describes the dataset and experimental settings and reports the quantitative results. Finally, Section 5 presents the conclusion and future research directions.
2 Related Work
2.1 Video Question Answering
A variety of video question answering datasets have been proposed, including MovieQA [tapaswi2016movieqa], PororoQA [kim2017deepstory], TGIF-QA [jang2017tgif], TVQA [lei2019tvqa], and DramaQA [choi2020dramaqa], and most of them use the multiple-choice format. The AVSD dataset [alamri2019audiovisual] differs in that question answering about a video takes the form of a dialogue, departing from the usual multiple-choice form.
Recently, various approaches have been proposed for video story question answering, and they can be divided into three categories: techniques using Memory Networks [tapaswi2016movieqa, kim2017deepstory], Attention [kim2017deepstory, lei2019tvqa], and Transformers [yang2020bert]. Memory networks store key information for question answering in a memory so that it can be retrieved from the large amount of information in a long video. Attention-based methods distill the core visual and linguistic information by applying attention across layers. Techniques that use attention for context matching achieved high performance by comparing the context of the question and answer with the context of the given video in detail. More recently, researchers have proposed Transformer-based models for video question answering. The Transformer [vaswani2017attention] brought a large performance improvement in language modeling, and there is a move to extend it to the video domain. Recent state-of-the-art models show that these techniques can model video as well as language.
2.2 Open-Ended Question Answering
H. Xue et al. [xue2017unifying] and Z. Zhao et al. [zhao2018open] pointed out that existing video question answering tasks used only a single static image and text, and treated the problem as multiple choice over short answers. They emphasize that this approach cannot exploit the sequential and temporal information of the video, and that its usability is limited because the answer must be chosen from the given candidates. These papers use the sequential and temporal information of the video and generate answers with a decoder, obtaining better results than traditional methods (Mean-VQA, SS-VQA, etc.). However, the tasks addressed by these papers, although open-ended, are limited in that they are short and the format of the questions and answers is simple.
In [li2020bridging], the authors study the AVSD task [alamri2019audiovisual] (given a video and ten turns of question-answer text, the task is to generate a natural-language answer to the last question) using a Transformer (GPT2 [radford2019language]). The paper extracts video and audio features with I3D [carreira2017quo] and VGGish [hershey2017], applies positional encoding and beam search, and obtains good results on several metrics (BLEU, METEOR, CIDEr, etc.). However, the model is not much different from B, and the position and video feature information was not used properly.
The purpose of our model is to integrate multimodal information (e.g., subtitles, video, audio, and the question) to generate an open-ended answer.
Our model takes a video and a question as input and produces an answer as output. The video is represented as a sequence of frames together with a sequence of subtitles. Each frame carries an image feature and visual metadata, that is, information such as the person, the person's emotion, and the person's behavior within each bounding box of that frame. The question and the answer are both given as natural-language text.
Each frame is obtained by sampling the video at 3 frames per second, and its feature vector is extracted by feeding the frames into a pre-trained I3D [carreira2017quo] model.
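As a rough illustration of the sampling step only, the sketch below computes which frame indices would be kept when a clip is subsampled to 3 frames per second. The function name `sample_frame_indices` and the native frame rate used in the usage note are assumptions made for this example; the I3D feature extraction itself is not shown.

```python
from typing import List


def sample_frame_indices(num_frames: int, native_fps: float,
                         target_fps: float = 3.0) -> List[int]:
    """Return indices of the frames kept when subsampling to target_fps.

    Hypothetical helper: the paper states only that 3 frames per second
    are extracted, not how the indices are chosen.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    duration = num_frames / native_fps        # clip length in seconds
    num_samples = int(duration * target_fps)  # number of frames to keep
    step = native_fps / target_fps            # native frames per kept frame
    # Clamp to the last valid index to guard against rounding at the end.
    return [min(int(i * step), num_frames - 1) for i in range(num_samples)]
```

For instance, under the assumption of a 24 fps source, a two-second clip of 48 frames would yield six evenly spaced indices.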
Each frame contains information about the characters that appear in it, and each character is described by four elements. The first is a feature representation of the character's bounding-box image, extracted with a pre-trained ResNet152 [he2016deep] model. The second is the character's name, represented as a word embedding from a pre-trained GPT2 model. The third is the character's behavior, and the fourth is a word embedding representation of the character's emotion.
Each subtitle consists of a speaker and a sentence, and the sentence is further divided into words using the GPT2 tokenizer. Both speakers and words are represented as word embeddings in the same way as above.
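The input structure described in this section can be summarized as a small set of record types. The sketch below is only one possible layout: all class and field names (`Character`, `Frame`, `Subtitle`, `VideoQAExample`, and their attributes) are hypothetical and chosen for readability, and the features are shown as plain lists of floats rather than tensors.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Character:
    box_feature: List[float]  # feature of the bounding-box crop (e.g. from ResNet152)
    name: str                 # person, later embedded as words
    behavior: str
    emotion: str


@dataclass
class Frame:
    image_feature: List[float]  # frame-level feature (e.g. from I3D)
    characters: List[Character] = field(default_factory=list)


@dataclass
class Subtitle:
    speaker: str
    sentence: str  # split into tokens later by the tokenizer


@dataclass
class VideoQAExample:
    frames: List[Frame]
    subtitles: List[Subtitle]
    question: str
    answer: str  # generation target in the open-ended setting
```

Keeping the visual metadata attached to each frame, and the speaker attached to each subtitle, mirrors the grouping used in the text and makes it straightforward to flatten an example into a single input sequence.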
We build on GPT2, a Transformer model that uses attention in place of earlier recurrence- and convolution-based architectures. Attention mechanisms allow the model to selectively focus on the segments of the input that it predicts to be most relevant.
GPT2 receives feature, segment, and position inputs. The feature is the embedding of the text input obtained through the GPT2 tokenizer, the segment indicates the token type of each word, such as <eos> and <sos>, and the position indicates the location of each word in the sentence.
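A minimal sketch of how the three inputs can be combined is given below. It assumes that the feature, segment, and position embeddings are summed element-wise at each position, which is the usual convention in GPT2-style models but is not stated explicitly in the text. The function name `embed_inputs` and the lookup tables are hypothetical.

```python
def embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table):
    """Sum feature, segment, and position embeddings for every position.

    Each table is a list of equally sized vectors indexed by id
    (or by position, for position_table).
    """
    if len(token_ids) != len(segment_ids):
        raise ValueError("token_ids and segment_ids must have the same length")
    embedded = []
    for pos, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[pos])
        ]
        embedded.append(vec)
    return embedded
```

In a real model the tables would be learned parameters; plain lists are used here only to keep the example self-contained.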
3.2.1 Feature Embedding
The feature embedding input first arranges all of the preceding frame-level inputs into a two-dimensional sequence over time. The subtitles that follow are arranged into a two-dimensional sequence over time in the same way. Finally, we attach the question, so the sequence length is the sum of the lengths of these parts. Because features extracted with I3D or ResNet have different dimensions from the GPT2 embeddings, their dimensions are adjusted through a learnable linear layer.
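The dimension adjustment and concatenation can be sketched as follows. This is an illustrative assumption rather than the actual implementation: `linear` and `build_sequence` are hypothetical names, the weights are passed in explicitly instead of being learned, and the visual features are simply placed before the text embeddings.

```python
def linear(x, weight, bias):
    """Apply y = W x + b, where weight has one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def build_sequence(visual_feats, text_embeds, weight, bias):
    """Project visual features to the text embedding size, then concatenate.

    visual_feats: list of vectors with the visual feature dimension
    text_embeds:  list of vectors already in the language-model dimension
    """
    projected = [linear(v, weight, bias) for v in visual_feats]
    return projected + text_embeds  # sequence length is the sum of both parts
```

After projection every element of the sequence has the same dimension, so the concatenated sequence can be fed to the language model as a single input.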
3.2.2 Segment Embedding
Segment embedding distinguishes the different kinds of input that make up the video. The inputs are divided into eight feature categories, as shown in Table 1. For each of these eight categories, segment embedding is performed using a special token in GPT2.
|Segment token||Description|
|[V]||I3D feature for each frame|
|[BBF]||2D ResNet feature for each bounding box|
|[PER]||Name of each character|
|[BEH]||Behavior of each character|
|[EMO]||Emotion of each character|
|[SPK]||Speaker of each subtitle|
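To make the role of the segment tokens concrete, the sketch below expands a list of input parts into one segment id per position. Only the six tokens listed in Table 1 are used; the integer ids and the function name `segment_ids` are assumptions for this example.

```python
# The six segment tokens that appear in Table 1; the ids are arbitrary.
SEGMENT_TOKENS = ["[V]", "[BBF]", "[PER]", "[BEH]", "[EMO]", "[SPK]"]
SEGMENT_TO_ID = {token: i for i, token in enumerate(SEGMENT_TOKENS)}


def segment_ids(parts):
    """Expand (segment_token, length) pairs into a flat list of segment ids.

    Each part contributes `length` positions, all labeled with the id of
    its segment token, so the result aligns with the feature sequence.
    """
    ids = []
    for token, length in parts:
        ids.extend([SEGMENT_TO_ID[token]] * length)
    return ids
```

The resulting list has the same length as the feature sequence and can be looked up in a segment embedding table.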
3.3 Decoding Method
To find an effective decoding method for multimodal answer generation, we compare beam search and nucleus sampling [holtzman2019curious], which samples text from the dynamic nucleus of the probability distribution. Although beam search showed slightly higher performance, it took about 16 times longer, which makes it impractical for real-time use, so nucleus sampling was used.
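For reference, a single step of nucleus (top-p) sampling can be sketched as below. The threshold `top_p=0.9` is an assumed default, since the value used in the experiments is not given, and `nucleus_sample` is a hypothetical helper operating on an already normalized next-token distribution.

```python
import random


def nucleus_sample(probs, top_p=0.9, rng=None):
    """Sample a token index from the smallest set of tokens whose
    cumulative probability reaches top_p."""
    rng = rng or random.Random()
    # Rank tokens from most to least probable.
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for idx, p in ranked:
        nucleus.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample within the nucleus in proportion to the original probabilities.
    total = sum(p for _, p in nucleus)
    r = rng.random() * total
    acc = 0.0
    for idx, p in nucleus:
        acc += p
        if r < acc:
            return idx
    return nucleus[-1][0]  # fallback for floating-point edge cases
```

Each decoding step needs only one pass over the distribution, whereas beam search keeps and rescores several hypotheses per step, which is consistent with the running-time gap reported above.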
3.4 Implementation Details
All experiments are run on NVIDIA TITAN Xp. Because of limited memory, we use a batch size of 1. We use the AdamW optimizer [loshchilov2017decoupled] with a learning rate of 1e-4 and a weight decay of 1e-5. Cross-entropy loss is used to train the model.
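The training objective and the optimizer update can be illustrated for a single scalar parameter as follows. The learning rate and weight decay match the values stated above, while `beta1`, `beta2`, and `eps` are assumed to take their commonly used defaults; the function names are hypothetical and this is a sketch of the update rule, not the training code.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of the target index under softmax(logits)."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def adamw_step(param, grad, state, lr=1e-4, weight_decay=1e-5,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar parameter with decoupled weight decay."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    param = param * (1 - lr * weight_decay)         # decay applied directly to the weight
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)
```

The decay term multiplies the parameter directly instead of being added to the gradient, which is what distinguishes AdamW from Adam with L2 regularization.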
As traditional metrics for evaluating generated text, we use BLEU [papineni2002bleu], which is based on n-gram overlap, and METEOR [banerjee2005meteor], which takes recall into account. In addition, we use BERTScore [zhang2019bertscore], which measures the similarity between token embeddings, and BLEURT [sellam2020bleurt], which uses a pre-trained model as the metric. In total, the generated answers are evaluated with four metrics.
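The n-gram component of BLEU can be illustrated with the clipped (modified) n-gram precision below. This is only the core quantity for a single candidate and a single reference; the full metric also combines several n-gram orders and applies a brevity penalty, which are omitted here. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total
```

Clipping prevents a candidate that repeats a single correct word from receiving a high score, which matters for generated answers that tend to be short.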
4.2 Quantitative Results
|S + V||0.65||0.1||0.3||0.59|
|S + B||0.697||0.202||0.35||0.6|
|S + M||0.733||0.281||0.378||0.62|
|S + M, V||0.726||0.263||0.38||0.61|
|S + M, B||0.733||0.276||0.38||0.62|
|S + M, V, B||0.724||0.258||0.37||0.61|
V stands for video features extracted from I3D, B stands for bounding-box features extracted from ResNet, and M stands for visual metadata composed of person, emotion, and behavior.
Table 3 shows that metadata plays a major role in improving performance. Because our model is based on GPT2, it has a language bias, so metadata expressed in language helps improve performance.
Bounding-box features also help answer questions, as seen by comparing S with S + B. However, comparing S + M with S + M, B shows that adding bounding-box features to the metadata did not improve performance.
Video features lower performance. We attribute this to the large language bias of a Transformer-based model: features of the entire video, much of which is irrelevant to the question, work even worse than bounding-box features.
In this paper, we challenge the existing multiple-choice video question answering task by converting it into an open-ended form. We construct a multimodal Transformer by adding video features and video metadata to an existing pre-trained language model. Ablation studies on the DramaQA dataset show that video metadata improves performance.
For future work, we plan to use dense-caption features, which transfer information from the video space into the language space, to circumvent the language bias problem.