We are witnessing a confluence of vision, speech, and dialog system technologies that enables intelligent virtual assistants (IVAs) to learn audio-visual groundings of utterances and hold conversations with users about the objects, activities, and events surrounding them. Recent progress in visual grounding techniques [2, 5] and audio understanding enables machines to recognize shared semantic concepts and attend to the various sensory events in their environment. With audio and visual grounding methods [17, 8], end-to-end multimodal spoken dialog systems (SDS) can be trained to communicate meaningfully with us in natural language about the dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as conversational context, together with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset released as part of DSTC7. We present an analysis of our experiments and show that several of our model variations outperform the baseline system released for AVSD.
2 Model Description
Topic Model Explorations: Topics form a very important source of context in a dialog. We train Latent Dirichlet Allocation (LDA) and Guided LDA models on questions, answers, QA pairs, captions, and history, and incorporate the resulting topic distributions as features or use them to learn topic embeddings. Since we are interested in identifying specific topics (e.g., entertainment, cooking, cleaning), we use Guided LDA to steer topic discovery toward seed words.
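As a minimal sketch of the "topic distributions as features" idea, the toy example below fits a standard LDA model with scikit-learn on a few question-like strings and extracts per-document topic distributions; the corpus and topic count are illustrative, and Guided LDA additionally biases selected topics toward seed words (e.g., "cook", "stir" for a cooking topic), which plain scikit-learn LDA does not support.

```python
# Sketch: topic distributions as fixed-size context features (toy corpus).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

questions = [
    "is the person cooking something in the kitchen",
    "does he stir the pot on the stove",
    "is she cleaning the table with a cloth",
    "does the man vacuum the living room floor",
]

# Bag-of-words counts over the question corpus.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(questions)

# Two topics for this toy setting (e.g., cooking vs. cleaning).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Each row is a normalized topic distribution that can be concatenated
# to the decoder input as a context feature.
assert doc_topics.shape == (4, 2)
assert np.allclose(doc_topics.sum(axis=1), 1.0)
```

Each row of `doc_topics` is a probability vector over topics; appending it to the decoder's input is one simple way to expose topical context to generation.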
Attention Explorations: We explore several configurations of the model in which, at every decoding step, the decoder attends to both the dialog history representations and the audio/video (AV) features, selectively focusing on the relevant parts of each. This produces a combined dialog-history and multimodal context that is richer than the single context vectors of the individual modalities.
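The attention step described above can be sketched as follows; the shapes, dimensions, and dot-product scoring are illustrative assumptions, not the paper's exact architecture. The decoder state attends separately over history states and AV features, and the two context vectors are concatenated into one multimodal context.

```python
# Sketch (hypothetical dimensions): per-step attention over two modalities.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, keys):
    """Dot-product attention: weight each key row by similarity to query."""
    scores = keys @ query          # (n,)
    weights = softmax(scores)      # attention distribution over keys
    return weights @ keys          # context vector, shape (d,)

rng = np.random.default_rng(0)
d = 8
decoder_state = rng.normal(size=d)
history_states = rng.normal(size=(5, d))   # sentence-level history reps
av_features = rng.normal(size=(10, d))     # audio/video feature frames

history_ctx = attend(decoder_state, history_states)
av_ctx = attend(decoder_state, av_features)

# Combined context fed to the decoder at this step.
multimodal_ctx = np.concatenate([history_ctx, av_ctx])
assert multimodal_ctx.shape == (2 * d,)
```

Concatenating the per-modality context vectors, rather than summing them, keeps the history and AV contributions separable for the decoder.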
Audio Feature Explorations: We use an end-to-end audio classification ConvNet called AclNet. AclNet takes raw, amplitude-normalized 44.1 kHz audio samples as input and produces classification output without the need to compute spectral features. AclNet is trained on the ESC-50 corpus, a dataset of 50 classes of environmental sounds organized into 5 semantic categories (animals, interior/domestic, exterior/urban, human, natural landscapes).
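To illustrate the input convention above (raw, amplitude-normalized waveforms rather than spectral features), here is a small sketch using peak normalization to [-1, 1]; this particular normalization recipe is an assumption for illustration, not necessarily AclNet's exact preprocessing.

```python
# Sketch: amplitude normalization of a raw 44.1 kHz waveform.
import numpy as np

def peak_normalize(waveform, eps=1e-8):
    """Scale a raw waveform so its maximum absolute amplitude is ~1."""
    return waveform / (np.abs(waveform).max() + eps)

sr = 44100                                  # 44.1 kHz sampling rate
t = np.linspace(0, 1, sr, endpoint=False)
clip = 0.3 * np.sin(2 * np.pi * 440 * t)    # quiet 440 Hz test tone

normalized = peak_normalize(clip)
# The normalized clip, not a spectrogram, is what the network consumes.
assert abs(np.abs(normalized).max() - 1.0) < 1e-6
```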
3 Dataset
We use the dialog dataset consisting of conversations between two people about a video (from the Charades human action dataset), which was released as part of the AVSD challenge at DSTC7. We train our models on the official training set (7,659 dialogs) and the prototype validation set (732 dialogs), and evaluate them on the prototype test set (733 dialogs).
4 Experiments and Results
Topic Experiments: We train separate topic models on questions, answers, QA pairs, captions, history, and history+captions to generate topics for samples from each category. Table 1 compares the baseline model against variants that add StandardLDA or GuidedLDA topic distributions as features to the decoder, as well as variants that learn topic embeddings. In general, the GuidedLDA models outperform StandardLDA, and GuidedLDA + GloVe is our best-performing model.
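One simple way to combine a topic model with pretrained word vectors, sketched below under the assumption (not stated above) that a topic is represented by averaging the embeddings of its top words; the tiny 4-d vectors stand in for real pretrained GloVe embeddings.

```python
# Sketch: representing a topic as the mean embedding of its top words.
import numpy as np

glove = {  # toy 4-d stand-ins; real GloVe vectors are 50-300 dimensional
    "cook": np.array([0.9, 0.1, 0.0, 0.2]),
    "stir": np.array([0.8, 0.2, 0.1, 0.1]),
    "clean": np.array([0.1, 0.9, 0.2, 0.0]),
    "wipe": np.array([0.2, 0.8, 0.1, 0.1]),
}

def topic_embedding(top_words, embeddings):
    """Average the embeddings of a topic's top words (skip OOV words)."""
    vecs = [embeddings[w] for w in top_words if w in embeddings]
    return np.mean(vecs, axis=0)

cooking_vec = topic_embedding(["cook", "stir"], glove)
cleaning_vec = topic_embedding(["clean", "wipe"], glove)
assert cooking_vec.shape == (4,)
```

Such a dense topic vector can be concatenated with the topic distribution itself to give the decoder both a categorical and a semantic view of the topic.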
Audio Experiments: Table 2 compares the baseline B (without audio features) with B+VGGish (audio features provided as part of the AVSD task) and B+AclNet. We analyse the effect of audio features on the overall dataset as well as specifically on audio-related dialogs. From Table 2, we observe that B+AclNet performs best on both the overall dataset and the audio-related dialogs.
Attention Experiments: Table 3 shows that the configuration in which the decoder attends to all of the sentence-LSTM output states performs better than the baseline. To compare the results in terms of semantic meaningfulness, we performed a quantitative analysis on dialogs whose ground-truth answers are binary (yes/no). We evaluate our models on their ability to predict these binary answers correctly and present this analysis in Figure 2, which again shows that the configuration where the decoder attends to all of the sentence-LSTM output states performs best on the binary-answer evaluation.
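The binary-answer check can be sketched as below; the exact matching protocol is an assumption (the text above does not specify it), here taken as comparing the yes/no polarity of the first word of the reference and generated answers.

```python
# Sketch (assumed protocol): accuracy on dialogs with yes/no ground truth.
def binary_label(answer):
    """Map an answer string to 'yes', 'no', or None if not binary."""
    words = answer.strip().lower().split()
    first = words[0].rstrip(",.") if words else ""
    return first if first in ("yes", "no") else None

def binary_accuracy(references, hypotheses):
    pairs = [
        (binary_label(r), binary_label(h))
        for r, h in zip(references, hypotheses)
        if binary_label(r) is not None   # keep only binary ground truth
    ]
    correct = sum(r == h for r, h in pairs)
    return correct / len(pairs) if pairs else 0.0

refs = ["yes, he is cooking", "no", "he walks to the door", "no, she does not"]
hyps = ["yes he is", "yes, she does", "he opens the door", "no"]
acc = binary_accuracy(refs, hyps)   # 2 of the 3 binary dialogs match
assert abs(acc - 2 / 3) < 1e-9
```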
5 Conclusion
In this paper, we present explorations and techniques for a contextual, multimodal, end-to-end audio-visual scene-aware dialog system. We incorporate the context of the dialog in the form of topics, use various attention mechanisms that enable the decoder to focus on relevant parts of the dialog history and the audio/video features, and incorporate audio features from an end-to-end audio classification architecture, AclNet. We validate our approaches on the AVSD dataset and show that these techniques improve performance over the baseline.
References
- (2018) Audio Visual Scene-Aware Dialog (AVSD) challenge at DSTC7. CoRR abs/1806.00525.
- VQA: Visual Question Answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433.
- Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
- W. Daelemans, M. Lapata, and L. Màrquez (Eds.) (2012) EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 23–27, 2012. The Association for Computational Linguistics.
- 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- (2018) End-to-end audio visual scene-aware dialog using multimodal attention-based video features. CoRR abs/1806.08409.
- (2017) Attention-based multimodal fusion for video description. In 2017 IEEE International Conference on Computer Vision (ICCV).
- (2018) AclNet: efficient end-to-end audio classification CNN. arXiv preprint arXiv:1811.06669.
- (2012) Incorporating lexical priors into topic models. In EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics.
- (2019) Context, attention and audio feature explorations for audio visual scene-aware dialog. In DSTC7 Workshop at AAAI 2019, arXiv:1812.08407.
- B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.) (2016) Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I.
- GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2015) ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018.
- Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784.
- (2016) Hollywood in Homes: crowdsourcing data collection for activity understanding. In Computer Vision – ECCV 2016.
- Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4584–4593.