We are witnessing a confluence of vision, speech, and dialog system technologies that is enabling Intelligent Virtual Assistants (IVAs) to learn audio-visual groundings of utterances and to converse with users about the objects, activities, and events surrounding them. Recent progress in visual grounding techniques [3, 7] and audio understanding is enabling machines to grasp shared semantic concepts and to listen to the various sensory events in the environment. With audio and visual grounding methods [22, 12], end-to-end multimodal Spoken Dialog Systems (SDS) are now being trained to communicate meaningfully in natural language about the real, dynamic audio-visual sensory world around us. In this work, we explore the role of the ‘topics’ of the dialog as the context of the conversation, along with multimodal attention, in an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset [2, 1] released as part of the 7th Dialog System Technology Challenges (DSTC7) task, showing that some of our model variations outperform the AVSD baseline model.
2 Model Description
In this section, we describe the main architectural explorations of our work as shown in Figure 1.
Adding Topics of Conversations: Topics form a very important source of context in a dialog. The Charades dataset contains videos of common household activities such as watching TV, eating, cleaning, using a laptop, sleeping, and so on. We train Latent Dirichlet Allocation (LDA) and Guided LDA models on questions, answers, QA pairs, captions, and dialog history. Since we are interested in identifying domain-specific topics such as entertainment, cooking, cleaning, resting, etc., we use Guided LDA to generate topics via seed words. A detailed list of sample seed words provided to Guided LDA for the 9-topics configuration is presented in Table 1. These seed words are constructed by identifying the most common nouns (objects), verbs, scenes, and actions from the Charades dataset analysis. The generated topic distributions are incorporated as features into our models or used to learn topic embeddings.
| Topic | Sample seed words |
|---|---|
| Entertainment/LivingRoom | living, room, recreation, garage, basement, entryway, television, tv, phone, laptop, sofa, chair, couch, armchair, seat, picture, sit … |
| Cooking/Kitchen | kitchen, pantry, food, water, dish, sink, refrigerator, fridge, stove, microwave, toaster, kettle, oven, stewpot, saucepan, cook, wash … |
| Eating/Dining | dining, room, table, chair, plate, fork, knife, spoon, bowl, glass, cup, mug, coffee, tea, sandwich, meal, breakfast, lunch, dinner … |
| Cleaning/Bath | bathroom, hallway, entryway, stairs, restroom, toilet, towel, broom, vacuum, floor, sink, water, mirror, cabinet, hairdryer, clean … |
| Dressing/Closet | walk-in, closet, clothes, wardrobe, shoes, shirt, pants, trousers, skirt, jacket, t-shirt, underwear, sweatshirt, coat, rack, dress, wear … |
| Laundry | laundry, room, basement, clothes, clothing, cloth, basket, bag, box, towel, shelf, dryer, washer, washing, machine, do, wash, hold … |
| Rest/Bedroom | bedroom, room, bed, pillow, blanket, mattress, bedstand, nightstand, commode, dresser, bedside, lamp, nightlight, night, light, lie … |
| Work/Study | home, office, den, workroom, garage, basement, laptop, computer, pc, screen, mouse, keyboard, phone, desk, chair, light, work, study … |
| Sports/Exercise | recreation, room, garage, basement, hallway, stairs, gym, fitness, floor, bag, towel, ball, treadmill, bike, rope, mat, run, walk, exercise … |
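As an illustration of the seeding mechanism, the sketch below implements a toy collapsed-Gibbs LDA sampler in which tokens listed as seed words are initialized into their designated topic with high probability. The function name `guided_lda`, the hyperparameters, and the `seed_conf` knob are our own illustrative choices; the experiments above used a full Guided LDA implementation, not this toy sampler.

```python
import numpy as np

def guided_lda(docs, n_topics, vocab_size, seed_topics, seed_conf=0.9,
               alpha=0.1, beta=0.01, n_iter=100, seed=0):
    """Collapsed-Gibbs LDA with seed-biased initialization (the Guided LDA idea).

    docs: list of token-id lists; seed_topics: {token_id: topic_id} priors.
    Returns the per-document topic distribution theta, shape (n_docs, n_topics).
    """
    rng = np.random.default_rng(seed)
    nd = np.zeros((len(docs), n_topics))      # doc-topic counts
    nw = np.zeros((n_topics, vocab_size))     # topic-word counts
    z = []                                    # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            # Seed words start in their designated topic with prob. seed_conf;
            # all other tokens are initialized uniformly at random.
            if w in seed_topics and rng.random() < seed_conf:
                t = seed_topics[w]
            else:
                t = int(rng.integers(n_topics))
            zd.append(t)
            nd[d, t] += 1
            nw[t, w] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                nd[d, t] -= 1
                nw[t, w] -= 1
                # Standard collapsed-Gibbs conditional for token w in doc d
                p = (nd[d] + alpha) * (nw[:, w] + beta) \
                    / (nw.sum(axis=1) + vocab_size * beta)
                t = int(rng.choice(n_topics, p=p / p.sum()))
                z[d][i] = t
                nd[d, t] += 1
                nw[t, w] += 1
    theta = (nd + alpha) / (nd + alpha).sum(axis=1, keepdims=True)
    return theta
```

Because seed words pin topic indices at initialization, the learned topics stay aligned with the intended labels (e.g., topic 0 = Entertainment/LivingRoom), which is exactly what makes the generated distributions usable as interpretable features.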
Attention Explorations: We explore several configurations of an attention-based model in which, at every step, the decoder attends to the dialog history representations and the audio/video (AV) features to selectively focus on relevant parts of the dialog history and the AV input. We calculate the attention weights [4, 21] between the decoder representation and every dialog history turn and multimodal feature, and apply these weights to the history and multimodal features to compute the relevant representations. This yields a combination of dialog history and multimodal context that is richer than the single context vectors of the individual modalities. We append the input encoding to the AV multimodal feature encodings and pass the result to the decoder LSTM to learn the output encodings.
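The per-turn weighting described above can be sketched as additive attention in the style of [4]; the `tanh` scoring form and the weight shapes below are illustrative rather than the exact parameterization of our models:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, memory, W_q, W_m, v):
    """Additive attention: score each memory row against the decoder query.

    query:  (d_q,)  current decoder state
    memory: (T, d_m) one encoding per dialog-history turn (or per A/V feature)
    W_q, W_m, v are learned projections; their sizes here are placeholders.
    Returns the context vector (a weight-averaged memory row) and the weights.
    """
    scores = np.tanh(memory @ W_m.T + W_q @ query) @ v   # (T,)
    weights = softmax(scores)                            # sum to 1 over turns
    context = weights @ memory                           # (d_m,)
    return context, weights
```

In the models above, a context computed this way over the history (and, in one configuration, over the A/V features) is combined with the input encoding before the decoder LSTM step.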
Audio Feature Explorations: We use an end-to-end audio classification ConvNet called AclNet. AclNet takes raw, amplitude-normalized 44.1 kHz audio samples as input and produces classification output without the need to compute spectral features. AclNet is trained on the ESC-50 corpus, a dataset of 50 classes of environmental sounds organized into 5 semantic categories (animals, interior/domestic, exterior/urban, human, natural landscapes).
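To make the raw-waveform pipeline concrete, here is a minimal sketch of a strided 1-D convolution front end followed by global average pooling and a linear head. The filter count, kernel width, stride, and single-layer depth are placeholder choices for illustration and do not reproduce AclNet's actual architecture:

```python
import numpy as np

def conv1d_relu(x, kernels, stride):
    """Strided 1-D convolution over a raw waveform, followed by ReLU.

    x: (n_samples,) amplitude-normalized audio; kernels: (n_filters, k).
    Returns a (n_filters, n_frames) feature map learned directly from samples,
    with no spectral (e.g., mel) front end.
    """
    n_filters, k = kernels.shape
    n_frames = (len(x) - k) // stride + 1
    out = np.empty((n_filters, n_frames))
    for i in range(n_frames):
        out[:, i] = kernels @ x[i * stride : i * stride + k]
    return np.maximum(out, 0.0)

def classify(x, kernels, W_out, stride=160):
    """Conv front end + global average pooling + linear classifier head."""
    feats = conv1d_relu(x, kernels, stride)   # (n_filters, n_frames)
    pooled = feats.mean(axis=1)               # (n_filters,)
    return W_out @ pooled                     # (n_classes,) logits
```

For the dialog models above, the pooled activations of such a network (rather than its class logits) serve as the audio feature vector fed to the decoder.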
We use the dialog dataset consisting of conversations between two parties about short videos (from the Charades human action dataset), which was released as part of the AVSD challenge track of DSTC7. The two parties discuss events in the video, with one playing the role of questioner and the other the answerer. For the results presented in this work, we use the official training and validation sets to train and optimize our models, which are evaluated on the official test set. Table 2 shows the distribution of the DSTC7 AVSD data across the different sets. Further details of our AVSD dataset analysis and previous results on prototype sets can be found in our prior work.
| | Training | Validation | Test |
|---|---|---|---|
| # of Dialogs | 7,659 | 1,787 | 1,710 |
| # of Turns | 153,180 | 35,740 | 13,490 |
| # of Words | 1,450,754 | 339,006 | 110,252 |
4 Experiments and Results
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| GuidedLDA (Q,QA,C) + GloVe | 0.629 | 0.491 | 0.390 | 0.315 | 0.219 | 0.484 | 0.731 |
| StandardLDA (All topics) | 0.621 | 0.480 | 0.380 | 0.306 | 0.221 | 0.483 | 0.753 |
| GuidedLDA (All topics) | 0.619 | 0.480 | 0.378 | 0.303 | 0.217 | 0.476 | 0.701 |
| GuidedLDA (All topics) + GloVe | 0.631 | 0.493 | 0.390 | 0.315 | 0.224 | 0.492 | 0.773 |
| HLSTM with topics | 0.627 | 0.489 | 0.387 | 0.311 | 0.218 | 0.480 | 0.723 |
| Topic Embeddings + GloVe | 0.632 | 0.499 | 0.402 | 0.329 | 0.223 | 0.488 | 0.762 |
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| GuidedLDA (Q,QA,C) + GloVe | 0.616 | 0.476 | 0.374 | 0.301 | 0.215 | 0.474 | 0.673 |
| GuidedLDA (All topics) + GloVe | 0.629 | 0.486 | 0.381 | 0.306 | 0.223 | 0.488 | 0.728 |
| HLSTM with topics | 0.623 | 0.480 | 0.375 | 0.297 | 0.214 | 0.473 | 0.696 |
| Topic Embeddings + GloVe | 0.635 | 0.497 | 0.398 | 0.325 | 0.224 | 0.491 | 0.746 |
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| GuidedLDA (Q,QA,C) + GloVe | 0.633 | 0.497 | 0.396 | 0.320 | 0.220 | 0.487 | 0.759 |
| GuidedLDA (All topics) + GloVe | 0.632 | 0.495 | 0.394 | 0.318 | 0.225 | 0.494 | 0.796 |
| HLSTM with topics | 0.629 | 0.492 | 0.392 | 0.316 | 0.220 | 0.483 | 0.740 |
| Topic Embeddings + GloVe | 0.630 | 0.499 | 0.403 | 0.330 | 0.223 | 0.487 | 0.776 |
Topic Modeling Experiments: We use separate topic models trained on questions (Q), answers (A), QA pairs, captions (C), history, and history+captions to generate topics for samples from each category. The generated topic vectors are incorporated as features for the questions and the dialog history. The question topics are added directly to the decoder state. In one variation, the dialog history topics (QA and C, or all topics) are copied directly to all the decoder states. In another, the dialog history topics are added as features to the history encoder LSTM (HLSTM). We also learn topic embeddings from the topics generated for the questions, QA pairs, and captions. In addition, GloVe embeddings (200-dim) are incorporated with fine-tuning for the questions and history.
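A minimal sketch of how a topic distribution can enter the decoder, assuming (for illustration only) that the question's topic posterior is appended to the decoder input both as a raw feature and through a topic-embedding table; the dimensions and the random initialization below are placeholders, with the embedding table learned jointly in the actual 'Topic Embeddings' variant:

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, d_topic, d_word = 9, 16, 200    # 9 topics as in Table 1; 200-dim GloVe

E = rng.normal(size=(n_topics, d_topic))  # topic-embedding table (placeholder
                                          # for a jointly learned parameter)
theta_q = rng.dirichlet(np.ones(n_topics))  # question topic distribution
word_emb = rng.normal(size=d_word)          # stand-in for a GloVe word vector

# Decoder input = word embedding + raw topic feature + dense topic embedding
dec_in = np.concatenate([word_emb, theta_q, theta_q @ E])
```

The same construction applies to history topics, which can instead be fed into the HLSTM as per-turn features.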
Table 3 compares the baseline model with the topic-based model variations. GuidedLDA (All topics) + GloVe outperforms the baseline on all metrics. Adding topics to the HLSTM also slightly improves performance over the baseline. Learning topic embeddings along with the word embeddings (+GloVe fine-tuning) achieves the best performance on most of the metrics (the BLEU scores), while GuidedLDA (All topics) + GloVe leads on the remaining metrics. We also evaluated the topic-based models on subsets with binary and non-binary answers. As shown in Table 4, on the non-binary subset all topic-based models outperform the baseline on all metrics, which shows that these models can generate better responses for the more complex, non-binary answers.
Attention Experiments: The baseline architecture leverages only the last hidden state of the sentence LSTM in the dialog history encoder. In our experiments, we modified the baseline architecture and added an attention layer that lets the answer decoder draw information directly from the dialog history LSTMs and the multimodal audio/video features, in the four configurations described below. To evaluate the performance of attention specifically on questions that could benefit from dialog history, we isolate the questions containing coreferences. Table 5 shows the performance of our models on this coreference subset. To compare the results at a more semantic level, we further performed a quantitative analysis on dialogs containing binary answers. We evaluate our models on their ability to predict these binary answers correctly (using precision, recall, and F1-scores), as presented in Figure 2. The results show that the configuration in which the decoder attends to all of the sentence-LSTM output states performs better than the baseline.
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| Word LSTM (all output states) | 0.594 | 0.447 | 0.336 | 0.262 | 0.190 | 0.432 | 0.553 |
| Word LSTM (last hidden states) | 0.627 | 0.485 | 0.379 | 0.297 | 0.208 | 0.468 | 0.701 |
| Sentence LSTM (all output states) | 0.619 | 0.484 | 0.384 | 0.307 | 0.213 | 0.472 | 0.749 |
| Sentence LSTM (all outputs) + AV | 0.598 | 0.464 | 0.360 | 0.284 | 0.209 | 0.458 | 0.685 |
Attention on Dialog History Word LSTMs, all output states: In this configuration, we remove the sentence-level dialog history LSTM, and the decoder computes attention scores directly between the decoder state and the word-level output states of the entire dialog history. We first pad the word LSTM outputs from the dialog history LSTMs (see Word LSTM in History in Figure 1) to the maximum sentence length over all sentences. We then sum the attention scores from each of the sentence context vectors against the query decoder state. With this kind of attention, we had hoped that the system could remember answers that were already given (directly or indirectly) in earlier turns of the dialog. Directly attending to the output states of the word LSTMs in the dialog history encoder did not perform well compared to the baseline; this attention mechanism likely attended to far more information than necessary.
Attention on Dialog History Word LSTMs, last hidden states: This configuration is similar to the previous one, except that we use only the last hidden-state representations of the word LSTMs corresponding to the different turns of the dialog. Simpler than the previous setup, it stacks the hidden states of the history sentences for the attention computation. This configuration performed better than the baseline on the coreference subset in most of the evaluation metrics.
Attention on Sentence LSTM, all output states: The baseline architecture leverages only the last hidden state of the sentence LSTM in the dialog history encoder. Instead, we extract the output states from all timesteps of the LSTM corresponding to the turns of the dialog history. This variation lets the decoder consider the compressed sentence representations of all dialog turns via the attention mechanism. This model performed better than the baseline in all metrics on both the coreference subset (Table 5) and binary answers (Figure 2).
Attention on Sentence LSTM, all output states and Multimodal Audio/Video Features: This configuration is similar to the previous one, except that we add the multimodal audio/video features as an additional state in the attention module. This mechanism allows the decoder to selectively focus on the multimodal features along with the dialog history sentences. This configuration did not improve the evaluation metrics compared to the baseline.
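The fourth configuration's attention memory can be sketched as stacking the per-turn sentence encodings with a projected A/V feature vector; the dot-product scoring below is a simplification of the learned attention weights used in our models, and the dimensions are placeholders:

```python
import numpy as np

def dot_attention(query, memory):
    """Score each memory row against the query and return the weighted sum.

    A simplified (dot-product) stand-in for the learned attention scoring;
    memory rows can be dialog-turn encodings or multimodal feature vectors.
    """
    scores = memory @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ memory

rng = np.random.default_rng(0)
d = 128
sentence_states = rng.normal(size=(10, d))  # one encoding per dialog turn
av_feature = rng.normal(size=(1, d))        # projected audio/video feature
memory = np.vstack([sentence_states, av_feature])  # configuration-4 memory
context = dot_attention(rng.normal(size=d), memory)
```

The first three configurations correspond to swapping in a different `memory` (word-level states, last word-LSTM hidden states, or sentence-LSTM output states alone).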
Audio Experiments: Table 6 compares the baseline (B) model without audio features, B+VGGish (features provided as part of the AVSD task), and B+AclNet. We investigate the effect of audio features on the overall dataset as well as on the subset of audio-related questions. We observe that B+AclNet improves over both the baseline and B+VGGish, on the overall dataset and on the audio-related subset. Table 7 presents a qualitative analysis of adding the VGGish and AclNet features to the baseline model. For these audio-related examples (e.g., ’oscillating’, ’eating’, ’sneeze’), the baseline and B+VGGish models generate irrelevant responses, whereas the answers generated by B+AclNet accord with the ground truth.
| Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|---|---|
| B + VGGish | 0.622 | 0.487 | 0.389 | 0.315 | 0.216 | 0.481 | 0.732 |
| B + AclNet | 0.625 | 0.491 | 0.391 | 0.316 | 0.218 | 0.484 | 0.736 |
| B + VGGish | 0.657 | 0.519 | 0.408 | 0.324 | 0.230 | 0.500 | 0.754 |
| B + AclNet | 0.659 | 0.527 | 0.424 | 0.348 | 0.236 | 0.507 | 0.796 |
| | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Question | is the fan oscillating ? | is he eating something ? | how many times does she sneeze ? |
| Ground Truth | the fan is on but is still . | yes he appears to be eating something | she sneezes a few times in the video . |
| Baseline | yes it is very well lit | no he is not drinking anything | can only see her face |
| Baseline + VGGish | no don ’t see any music | no he is not drinking anything | she laughs at the end of the video |
| Baseline + AclNet | no it is hard to tell | yes he is eating sandwich | she sneezes at the end of the video |
In this paper, we presented our explorations of architectural extensions for a contextual, multimodal, end-to-end audio-visual scene-aware dialog system. We incorporate the context of the dialog in the form of topics, investigate various attention mechanisms that enable the decoder to focus on relevant parts of the dialog history and the audio/video features, and incorporate audio features from an end-to-end audio classification architecture, AclNet. We validate our approaches on the AVSD dataset and show that several of the explored techniques yield improved performance over the baseline system for the AVSD task.
- (2019) Audio-visual scene-aware dialog.
- (2018) Audio visual scene-aware dialog (AVSD) challenge at DSTC7. CoRR abs/1806.00525.
- (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433.
- (2014) Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
- (2001) Latent Dirichlet allocation. In Advances in Neural Information Processing Systems 14 (NIPS 2001), Dietterich et al. (Eds.).
- W. Daelemans, M. Lapata, and L. Màrquez (Eds.) (2012) EACL 2012: 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France. The Association for Computational Linguistics.
- (2017) Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.) (2001) Advances in Neural Information Processing Systems 14 (NIPS 2001). MIT Press.
- I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.) (2017) Advances in Neural Information Processing Systems 30 (NIPS 2017).
- (2017) CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2018) End-to-end audio visual scene-aware dialog using multimodal attention-based video features. CoRR abs/1806.08409.
- (2017) Attention-based multimodal fusion for video description. In 2017 IEEE International Conference on Computer Vision (ICCV).
- (2018) AclNet: efficient end-to-end audio classification CNN. arXiv preprint arXiv:1811.06669.
- (2012) Incorporating lexical priors into topic models. In EACL 2012, Daelemans et al. (Eds.).
- (2019) Context, attention and audio feature explorations for audio visual scene-aware dialog. In DSTC7 Workshop at AAAI 2019, arXiv:1812.08407.
- B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.) (2016) Computer Vision – ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, Proceedings, Part I.
- GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018.
- Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784.
- (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In Computer Vision – ECCV 2016, Leibe et al. (Eds.).
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Guyon et al. (Eds.).
- Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4584–4593.