AVSD is a challenging task because it involves complex dependencies among features of multiple modalities. First, the video input typically involves both visual and audio features, each of which carries information related to the current dialogue context and user utterance. For example, in Figure 1, some user questions concern visual information, some concern audio information, and some concern both. The two types of features can complement each other to help the dialogue agent generate responses. Second, dialogues involve complex semantic dependencies among dialogue turns, each consisting of a user utterance and a system response. For example, in the second turn of the second (lower) example in Figure 1, the dialogue agent needs to refer to the previous user utterance and system response to understand what the user is asking.
We address these two challenges by adopting the Multimodal Transformer Network (MTN) [le-etal-2019-multimodal]. The model adopts an attention mechanism that focuses on the interaction between each token position in the text sequences and each temporal step of the visual and audio features of the video. The multi-head structure allows the model to project feature representations into different feature spaces and detect different types of dependencies. In addition, while previous work has achieved promising results by using complex reasoning networks to select important video features [sanabria2019cmu] [hori2019end], we further investigate the generation capability of the model by using a pointer network that can point to tokens from multiple source sequences at each generation step. Pointer networks are widely used in summarization, where they copy tokens from the source text into the generated summary. Motivated by this strategy, we adapt it to the video dialogue task to improve the quality of the generated system responses. We experiment with several model variants and find that pointer-based generation boosts performance significantly. A likely explanation is that the model gains the ability to copy tokens from relevant text input. We present comprehensive experiments and report results on the validation and test sets that support these findings.
The AVSD benchmark in DSTC7 and DSTC8 can be seen as an extension of two major research directions: video QA and dialogues. Video QA models [xu2017video] [fan2019heterogeneous] [gao2018motion] aim to improve reasoning over text and vision in order to answer user questions about a given video. Compared to video QA, video dialogue benchmarks such as AVSD pose two major challenges. (1) In video dialogues, the model requires strong language understanding of the text input, which includes not only user queries but also past dialogue turns. A user query may refer to information mentioned in previous dialogue turns, and such references must be resolved to answer the query correctly. (2) Most video QA models [jang2017tgif] [lei-etal-2018-tvqa] [kim2019progressive] are better suited to the retrieval-based setting, in which the model is given a list of response candidates and has to select one of them as the output. This setting may not be appropriate for AVSD, as dialogue agents need to converse with human users by generating natural responses rather than selecting from a predefined list of sentences. Motivated by these two challenges, we propose to improve the language modeling component through pointer networks [nips2015pointer]. Adapting them to the video dialogue task, we enhance our generative network with pointer distributions over source sequences and construct multiple vocabulary distributions at each generation step.
The input includes a video, a dialogue history of previous turns, each consisting of a (human utterance, system response) pair, and the current human utterance. The output is a system response. The input video can contain features in different modalities, including vision, audio, and text (such as a video caption or subtitle). At the current dialogue turn, each text input is a sequence of tokens, and each token is represented by a unique index from a vocabulary set. The text sequences are the dialogue history, the user utterance, the text input of the video, and the output response.
Following similar video-based NLP tasks such as video captioning [aafaq2019spatio] and video QA [jang2017tgif], we assume access to a pretrained model to extract visual or audio features of the input video. We extract the visual features from a pretrained 3D-CNN based ResNext [hara2018can], similarly to [sanabria2019cmu]. The 3D-CNN model extracts video features from clips rather than frames. Clip-based information is expected to be more consistent and less noisy than frame-based information. To sample clips, we use a window size of 16 video frames and a stride of 16 frames. The extracted visual representation contains one 2048-dimensional vector per resulting video clip, where 2048 is the output dimension of ResNext. We use the ResNext101 model pretrained on the Kinetics Human Action Video benchmark. For audio features, following [hori2019end], we use a pretrained VGGish model [hershey2017cnn]. This model is based on the image CNN model VGG and extracts the temporal variation of the video sound. Its output is a 128-dimensional representation.
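As a rough illustration of the clip sampling described above, the following sketch enumerates the frame ranges that would be fed to the 3D-CNN. The function name `clip_windows` is hypothetical, and dropping trailing frames that do not fill a whole window is an assumption; the text does not say how incomplete clips are handled.

```python
def clip_windows(num_frames, window=16, stride=16):
    """Return (start, end) frame ranges, end exclusive, one per full clip.

    Assumption: trailing frames that do not fill a whole window are dropped.
    """
    if num_frames < window:
        return []
    last_start = num_frames - window
    return [(start, start + window) for start in range(0, last_start + 1, stride)]


# A 100-frame video yields 6 non-overlapping clips of 16 frames each,
# so the visual representation would have 6 rows of 2048 values.
clips = clip_windows(100)
print(len(clips), clips[0], clips[-1])  # 6 (0, 16) (80, 96)
```

With the window equal to the stride, clips do not overlap, so the number of clips is simply the frame count divided by 16, rounded down.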
The baseline for the AVSD benchmark [alamri2018audio] [hori2019end] was provided by the organizers and is based on the feature fusion approach proposed by [yu2016video]. Video features of multiple modalities, including visual and audio, are combined by passing them through a linear transformation to a common target dimension. The projected representation is used as input to a softmax layer that combines the scores of each temporal step of the visual or audio features.
Multimodal Transformer Network (MTN)
We adopt the MTN model proposed by [le-etal-2019-multimodal] for the AVSD benchmark in DSTC8. To improve performance, we enhance the generation capability of the model and investigate an ensemble approach. We summarize the MTN model and our changes below.
Multi-head Attention. The MTN model adopts the multi-head dot-product attention mechanism proposed by [vaswani17attention] to capture dependencies between each token in the text sequences and the temporal variation of the video features. Specifically, the model applies attention from the user query to each video feature modality, i.e., visual and audio features. The output of this attention network is used as input to the decoder. The decoder adopts a similar attention mechanism, but the attention direction is from the target system response to the other inputs. The attention operation from one sequence representation to another follows the scaled dot-product attention defined by [vaswani17attention], in which the dot-product scores are scaled by the square root of the embedding dimension.
The attention operation is combined with a feed-forward network and a skip connection, which merges the original representation with the attended one. Attention is performed over multiple rounds, and the output of each round is used as input to the next. This allows progressive feature learning to detect complex dependencies between different sources of information. MTN uses Equations 1 and 2 in the query-guided and target-response-guided attention layers to obtain dependencies between user queries or target responses and the other inputs. First, the user query is used to select important visual and audio features of the video. For each feature type, the embedding of the user query is passed to a self-attention layer and then to another attention layer that attends to the video information.
The query features first attend to the temporal visual information; each output has the same dimension as the query representation. Similarly, the query features attend to the temporal audio information.
The self-attention is applied separately for each feature type so that the model can independently select different information from the user query for each type of video feature. The two resulting representations contain temporally attended visual and audio features of the video. They are passed to the decoder network, which processes information from the text input (user queries, dialogue history) as well as the video input. Specifically, the target response is embedded and passed through four text-to-text attention layers: self-attention, response-to-dialogue-history attention, response-to-query attention, and response-to-caption attention.
The last output is then used to attend to the attended video features obtained from the query-guided attention layers.
The MTN architecture allows information from the different text inputs and from the different video modalities to be incorporated sequentially into the target response representation. With the skip connection technique, the MTN network can progressively learn and refine the signals obtained in each attention step. In the query-guided attention layers, progressive learning is done by using the attended visual or audio output in place of the query representation in the next round of attention. Similarly, in the decoder layers, the output of the last attention step replaces the target response representation in the next round of attention.
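The query-guided attention described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than the MTN implementation: learned projections, multiple heads, the feed-forward network, and normalization are omitted, both inputs are assumed to be already projected to the same dimension, and the names `attend`, `add`, and `query_guided_attention` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(source, target):
    """Scaled dot-product attention from `source` to `target`.

    Both are lists of d-dimensional vectors; `target` acts as keys and values.
    Assumption: no learned projections and a single head.
    """
    d = len(source[0])
    output = []
    for query in source:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in target]
        weights = softmax(scores)
        output.append([sum(w * key[j] for w, key in zip(weights, target))
                       for j in range(d)])
    return output


def add(a, b):
    """Skip connection: element-wise sum of two equally shaped sequences."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def query_guided_attention(z_query, z_video, rounds=2):
    """Self-attention on the query, then query-to-video attention, repeated."""
    z = z_query
    for _ in range(rounds):
        z = add(z, attend(z, z))        # self-attention + skip connection
        z = add(z, attend(z, z_video))  # attend to video features + skip connection
    return z


z_query = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 query tokens, d = 2
z_video = [[0.5, 0.5], [1.0, -1.0]]             # 2 video clips, d = 2
attended = query_guided_attention(z_query, z_video)
print(len(attended), len(attended[0]))  # 3 2
```

The output keeps the shape of the query sequence, which is what lets each round feed its result back in as the next round's input.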
Pointer Generator. We examine an extension of MTN that adopts the pointer network [nips2015pointer] to generate system responses. We propose to use pointer networks to point to tokens in the different input text sequences and to construct a separate vocabulary distribution for each sequence, defined over a predefined vocabulary set built from the words in the training set. Given an input text sequence with its embedding representation and the output of the last attention layer of the decoder, we construct the pointer distribution by dot-product attention. For each position in the target response, the pointer distribution is used to construct a distribution over the vocabulary set, where the probability of each token is accumulated from the pointer probabilities of the source positions holding that token. We concatenate the probabilities of all positions to obtain the vocabulary distribution of the whole response. For each text input sequence, we obtain a pointer distribution and the corresponding vocabulary distribution: one for the dialogue history, one for the user query, and one for the text input of the video. Besides these pointer-based vocabulary distributions, we adopt a linear transformation layer with softmax to allow the model to generate tokens not included in any text input sequence.
To combine the vocabulary distributions, we compute importance scores from a context vector formed by concatenating the component input text representations and the output of the decoder. Representations are expanded where needed so that their dimensions match before concatenation. The final vocabulary distribution is the sum of the pointer-based distributions and the generation-based distribution, weighted by the score matrix.
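To make the pointer mechanism concrete, here is a minimal sketch for a single target position. It assumes unscaled dot-product scores and takes the importance weights as given, already normalized to sum to 1; the function names and toy values are hypothetical and not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_distribution(decoder_state, source_embeddings):
    """Attention of one decoder state over the positions of a source sequence."""
    scores = [sum(s * e for s, e in zip(decoder_state, emb))
              for emb in source_embeddings]
    return softmax(scores)


def pointer_to_vocab(pointer_probs, source_token_ids, vocab_size):
    """Accumulate pointer probabilities onto the vocabulary ids they point to."""
    vocab_probs = [0.0] * vocab_size
    for prob, token_id in zip(pointer_probs, source_token_ids):
        vocab_probs[token_id] += prob
    return vocab_probs


def combine(distributions, weights):
    """Weighted sum of vocabulary distributions (weights assumed to sum to 1)."""
    vocab_size = len(distributions[0])
    return [sum(w * dist[i] for w, dist in zip(weights, distributions))
            for i in range(vocab_size)]


vocab_size = 6
query_ids = [2, 5, 2]                             # token 2 occurs twice in the source
query_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # toy source embeddings
decoder_state = [0.3, -0.2]

p_ptr = pointer_to_vocab(pointer_distribution(decoder_state, query_emb),
                         query_ids, vocab_size)
p_gen = [1.0 / vocab_size] * vocab_size           # stand-in for the linear + softmax layer
p_final = combine([p_ptr, p_gen], [0.7, 0.3])
print(round(sum(p_final), 6))  # 1.0
```

Because token 2 appears at two source positions, its vocabulary probability is the sum of both pointer probabilities, while tokens absent from the source receive mass only from the generation-based distribution.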
Optimization. We optimize the model by training it to minimize the generation loss on the target system responses.
In addition, a key component of the MTN model is the auxiliary loss function applied to the output of the query-guided attention. This technique was proposed by [le-etal-2019-multimodal] to make training more stable by using the attended features as representations for re-generating the user query. This auto-encoder technique was motivated by the multi-task learning approach in neural machine translation (NMT) [luong2016iclr_multi]. The difference is that MTN extracts the intermediate representations from the (auto-)encoder as video signals for decoding responses, rather than just the hidden states of an LSTM encoder of the source sequence as in the NMT setting. To re-generate user queries, the output of the query-guided attention is passed to a linear transformation whose weights are shared with those used in Equation 5.
An auto-encoding loss is defined on the re-generated user queries.
The model is jointly trained with all losses, and we simply set the weight of each loss to 1.
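A minimal sketch of the joint objective, assuming both the generation loss and the auto-encoding loss are token-level negative log-likelihoods (the standard choice, not confirmed by the text). The function names and the parameters `w_gen` and `w_ae` are hypothetical.

```python
import math


def nll(gold_token_probs):
    """Negative log-likelihood given the predicted probability of each gold token."""
    return -sum(math.log(p) for p in gold_token_probs)


def joint_loss(response_probs, query_probs, w_gen=1.0, w_ae=1.0):
    """Weighted sum of the generation loss and the auto-encoding loss.

    Assumption: both weights default to 1, matching the joint training setup.
    """
    return w_gen * nll(response_probs) + w_ae * nll(query_probs)


# Probabilities the model assigned to the gold response and query tokens.
loss = joint_loss([0.9, 0.8, 0.95], [0.7, 0.6])
print(loss > 0.0)  # True
```

With equal weights, neither objective is favored, so the auxiliary query re-generation acts as a regularizer of the same scale as the main task.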
Ensemble Models. A popular technique to improve performance is to ensemble models trained in different settings. In our submission, we ensemble models trained independently with different video feature types and different pretrained feature extractors. From each component model, we obtain the output vocabulary distribution. The ensembled vocabulary distribution is simply the sum of the vocabulary distributions of all component models, passed through a normalization layer so that all values lie between 0 and 1.
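The ensemble step can be sketched as follows. Dividing by the total mass is one plausible reading of the normalization layer and is an assumption; the function name `ensemble` is hypothetical.

```python
def ensemble(distributions):
    """Sum per-model vocabulary distributions and renormalize.

    Assumption: normalization divides by the total so the result sums to 1.
    """
    vocab_size = len(distributions[0])
    summed = [sum(dist[i] for dist in distributions) for i in range(vocab_size)]
    total = sum(summed)
    return [value / total for value in summed]


# Two component models disagree on a 2-token vocabulary.
print(ensemble([[0.5, 0.5], [1.0, 0.0]]))  # [0.75, 0.25]
```

Summing distributions gives every component model an equal vote at each generation step, which is why no extra training is needed to build the ensemble.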
We use the AVSD dataset provided in DSTC8 [alamri2018audio] [hori2019end], which contains dialogues grounded in the Charades videos [sigurdsson2016hollywood]. Following the same track in the DSTC7 challenge, the DSTC8 organizers provided crowd-sourced data of video-based dialogues, including user questions and system responses constructed as dialogues, video captions, and video summaries. We present a summary of the training, validation, and test sets in Table 1. The statistics of the official test set for the DSTC8 challenge are comparable to those of the DSTC7 challenge: 1,710 dialogues and more than 6,700 dialogue turns. Please refer to [alamri2018audio] for more details on data collection. We construct the vocabulary set from the unique tokens in the training set. In our experiments, we use the provided video summary annotation as the video-dependent text input.
We adopt the Adam optimizer [kingma2014adam] and a learning rate strategy similar to that of [vaswani17attention], with a learning rate warm-up of 13,000 training steps, and train the models for a fixed maximum number of epochs. We initialize all models with a uniform distribution [glorot2010understanding]. We select the best models based on the average loss per epoch on the validation set. We experiment with the following model hyper-parameters: embedding dimension, number of rounds of attention, and number of attention heads. We tune the hyper-parameters by grid search on the validation set. We allow the pointer generator to point to tokens of the video summary and the last user query. Experiment results with other combinations of input text sequences for the pointer generator are reported in the Ablation Analysis. In all experiments with more than one feature type, we adopt the ensemble strategy described above. We select a batch size of 32 and a dropout rate of 0.5. Dropout is applied to all layers except the generator network layers. We train our models with label smoothing [szegedy2016rethinking] on the target system responses. During inference, we adopt beam search with a fixed beam size and a length penalty.
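For reference, the learning rate schedule of [vaswani17attention] with the 13,000-step warm-up can be sketched as below. The value `d_model=512` is purely illustrative, and whether the schedule is applied without further scaling is an assumption; the text only states that the strategy is similar.

```python
def learning_rate(step, d_model=512, warmup=13000):
    """Linear warm-up followed by inverse square-root decay.

    Assumptions: `d_model=512` is a placeholder embedding dimension, and no
    additional scaling factor is applied to the schedule.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


# The rate rises linearly until the warm-up step, then decays.
for step in (1000, 13000, 50000):
    print(step, learning_rate(step))
```

The two terms inside `min` cross exactly at the warm-up step, so the peak learning rate is reached after 13,000 updates.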
We report the objective scores, including BLEU [papineni2002bleu], METEOR [banerjee2005meteor], ROUGE-L [lin2004rouge], and CIDEr [vedantam2015cider]. These metrics measure the word overlap between predicted responses and ground-truth responses.
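As an illustration of what word overlap means here, the sketch below computes the clipped n-gram precision that underlies BLEU. It is not the official evaluation script: a single reference is assumed, and the brevity penalty and the geometric mean over n-gram orders are left out. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams matched in the reference, with clipping.

    Each candidate n-gram is credited at most as often as it occurs in the
    reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / sum(cand_counts.values())


predicted = "a man is walking".split()
truth = "a man is sitting".split()
print(clipped_precision(predicted, truth, n=1))  # 0.75
```

Because such metrics reward surface overlap with the reference, copying tokens from the query or the video summary can raise scores directly.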
Results on DSTC8 Test
We first report the results on the DSTC8 test set. The results were released by the competition organizers, as the ground-truth labels are not publicly accessible. We submitted different model variants based on the input setting: (1) text only and (2) text and video. In the text-only setting, we remove all visual and audio features and use only text (including the video caption) as input to our model. In the text-and-video setting, we submitted different versions of our models that use visual features, audio features, or both. For visual features, besides ResNext101 as our main visual features, we also use the I3D features provided by the organizers. These features are extracted from an I3D [carreira2017quo] model pretrained on the Kinetics dataset and have a dimension of 2048, the same as the ResNext101 features.
From Table 2, we note that performance is comparable across the visual features, i.e., ResNext, I3D(RGB), and I3D(Flow); between I3D(RGB) and I3D(Flow) in particular, the differences in objective metrics are minor. When using only audio features extracted from VGGish, performance improves slightly, but not significantly, compared to using only visual features. Compared to the original MTN approach [le-etal-2019-multimodal], the difference in performance between models that use visual features and models that use audio features is substantially reduced. We make a similar observation when comparing models that use only text features with models that use visual features. We expect these performance gains to come from the pointer generators, which can point to tokens in the source sequences, i.e., user queries and video summaries. Since AVSD is formulated as a generation task with evaluation metrics based on the similarity between generated sentences and the ground truth, we could substantially improve performance by focusing on the language component of the model. We also observe that a simple ensemble technique can improve performance, mainly in BLEU-based metrics. In this case, the ensemble strategy acts as a regularization factor on the vocabulary distribution of the output, resulting in more semantically correct output sentences. However, the other metrics do not improve, or even decrease, with model ensembling. We obtained human evaluation scores from the organizers for two of our models. Our models achieve human scores of more than 3.6 on a scale of 4 and were ranked 5th and 6th among all submissions in the AVSD track.
Results on DSTC7 Test
We also report the results of the submitted models described above on the DSTC7 test set. The observations are similar to those made on the DSTC8 test data. The overall performance is, however, higher on DSTC7 than on DSTC8. This suggests that the new test data in DSTC8 is more challenging and that the current approach could be further improved.
We evaluated our models with different variants of pointer networks by allowing the models to point to tokens from different combinations of the text input sequences. In these experiments, we choose the text-and-video setting and only use visual features extracted from the pretrained ResNext101. From Table 4, we make the following observations. First, most of the MTN models with our proposed pointer network show an improvement over the model that only uses a linear transformation to generate tokens. The performance gain is substantial when we allow the models to point to the source sequences of video summaries and user queries. However, the performance is slightly affected when we use the pointer network to point to tokens in the dialogue history, because user queries and video summaries typically contain more useful information than the dialogue history for generating system responses. Second, when combining different input sequences with multiple pointer networks, the best-performing model is the one with pointers to both video summaries and user queries. By extending MTN with the pointer network and adopting a dynamic combination of vocabulary distributions among pointers, we can boost the language generation capability of the models and generate better responses.
In this paper, we present our submission to the AVSD track of the DSTC8 challenge. Our submissions achieve competitive performance in both human evaluation and automatic metrics, including BLEU, ROUGE, METEOR, and CIDEr. The task is challenging because it involves video information of multiple modalities, including visual and audio information, and because it requires strong language modeling capability to generate natural dialogue responses. In this work, we focus on the second aspect by adopting pointer networks in the generative components. Our experiment results show that this technique can improve the quality of responses in video dialogues. In the future, we will extend our work on the first aspect by improving the multimodal reasoning capability across language, visual, and audio features.