Exploring Context, Attention and Audio Features for Audio Visual Scene-Aware Dialog

12/20/2019 ∙ by Shachi H Kumar, et al. ∙ Intel

We are witnessing a confluence of vision, speech, and dialog system technologies that is enabling intelligent virtual assistants (IVAs) to learn audio-visual groundings of utterances and to have conversations with users about the objects, activities, and events surrounding them. Recent progress in visual grounding techniques and audio understanding is enabling machines to understand shared semantic concepts and to listen to the various sensory events in the environment. With audio and visual grounding methods, end-to-end multimodal spoken dialog systems (SDS) are trained to meaningfully communicate with us in natural language about the real, dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as the context of the conversation, along with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset released as part of DSTC7. We present an analysis of our experiments and show that some of our model variations outperform the baseline system released for AVSD.


1 Introduction

We are witnessing a confluence of vision, speech, and dialog system technologies that is enabling intelligent virtual assistants (IVAs) to learn audio-visual groundings of utterances and to have conversations with users about the objects, activities, and events surrounding them. Recent progress in visual grounding techniques [2, 5] and audio understanding [6] is enabling machines to understand shared semantic concepts and to listen to the various sensory events in the environment. With audio and visual grounding methods [17, 8], end-to-end multimodal spoken dialog systems (SDS) [15] are trained to meaningfully communicate with us in natural language about the real, dynamic audio-visual sensory world around us. In this work, we explore the role of 'topics' as the context of the conversation, along with multimodal attention, in such an end-to-end audio-visual scene-aware dialog system architecture. We also incorporate an end-to-end audio classification ConvNet, AclNet, into our models. We develop and test our approaches on the Audio Visual Scene-Aware Dialog (AVSD) dataset [1] released as part of DSTC7. We present an analysis of our experiments and show that some of our model variations outperform the baseline system [7] released for AVSD.

Figure 1: System Architecture

2 Model Description

In this section, we describe the main architecture explorations of our work, as shown in Figure 1; further details can be found in the full-paper version of this work [11].

Topic Model Explorations: Topics form a very important source of context in a dialog. We train Latent Dirichlet Allocation (LDA [3]) and Guided LDA [10] models on questions, answers, QA pairs, captions, and history, and either incorporate the topic distributions as features or use them to learn topic embeddings. Since we are interested in identifying specific topics (e.g., entertainment, cooking, cleaning), we use Guided LDA to generate topics based on seed words.
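
As a rough sketch of how such seeded topic distributions could be produced (a minimal sketch, not the authors' code; it assumes the open-source guidedlda package, sklearn's CountVectorizer, and purely illustrative documents and seed words):

```python
# Minimal sketch (assumptions noted above): seed-word-guided topic modeling with the
# open-source `guidedlda` package over a bag-of-words matrix from CountVectorizer.
import guidedlda
from sklearn.feature_extraction.text import CountVectorizer

docs = ["someone is in the kitchen stirring a pot",            # illustrative dialog text
        "a man wipes the table and cleans the floor",
        "she watches tv and listens to music"]
seed_word_lists = [["kitchen", "stir", "pot"],                  # topic 0: cooking
                   ["wipes", "cleans", "vacuum"],               # topic 1: cleaning
                   ["tv", "music", "laugh"]]                    # topic 2: entertainment

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()
word2id = vectorizer.vocabulary_

# Map each seed word to its intended topic id, skipping words missing from the vocab.
seed_topics = {word2id[w]: t
               for t, words in enumerate(seed_word_lists)
               for w in words if w in word2id}

model = guidedlda.GuidedLDA(n_topics=3, n_iter=100, random_state=7, refresh=20)
model.fit(X, seed_topics=seed_topics, seed_confidence=0.15)

topic_distributions = model.doc_topic_   # per-document topic distribution (docs x topics)
```

The resulting per-document topic distributions can then be appended to the decoder input as context features, or used to learn topic embeddings.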

Attention Explorations: We explore several configurations of the model in which, at every step, the decoder attends to the dialog history representations and the audio-visual (AV) features to selectively focus on relevant parts of the dialog history and the audio/video. This creates a combination of dialog history and multimodal context that is richer than the single context vectors of the individual modalities.
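
A minimal sketch of this kind of decoder-side attention (not the authors' exact architecture; PyTorch is assumed, and the function names and shapes are illustrative):

```python
# Minimal sketch: at each decoder step, dot-product attention over the dialog-history
# sentence-LSTM states produces a context vector concatenated with the decoder state.
import torch
import torch.nn.functional as F

def attend(decoder_state, history_states):
    """decoder_state: (B, H); history_states: (B, T, H) sentence-LSTM outputs."""
    scores = torch.bmm(history_states, decoder_state.unsqueeze(2)).squeeze(2)   # (B, T)
    weights = F.softmax(scores, dim=1)                                          # (B, T)
    context = torch.bmm(weights.unsqueeze(1), history_states).squeeze(1)        # (B, H)
    return torch.cat([decoder_state, context], dim=1)                           # (B, 2H)

# Example: batch of 4 dialogs, 10 history turns, hidden size 128.
fused = attend(torch.randn(4, 128), torch.randn(4, 10, 128))
```

The same mechanism can be applied over the AV feature sequence, with the resulting context vectors concatenated before predicting the next word.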

Audio Feature Explorations: We use an end-to-end audio classification ConvNet called AclNet [9]. AclNet takes raw, amplitude-normalized 44.1 kHz audio samples as input and produces classification output without the need to compute spectral features. AclNet is trained on the ESC-50 corpus [14], a dataset of 50 classes of environmental sounds organized into 5 semantic categories (animals, interior/domestic, exterior/urban, human, natural landscapes).
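
For illustration, a minimal sketch of feeding raw, amplitude-normalized audio to a 1-D ConvNet (the network below is a placeholder, not the released AclNet; librosa is assumed for loading and resampling, and clip.wav is a hypothetical file):

```python
# Minimal sketch (placeholder model): load audio at 44.1 kHz, peak-normalize the
# amplitude, and run a small 1-D ConvNet directly on the raw waveform.
import librosa
import torch
import torch.nn as nn

waveform, sr = librosa.load("clip.wav", sr=44100, mono=True)    # hypothetical file
waveform = waveform / (abs(waveform).max() + 1e-9)              # amplitude normalization
x = torch.tensor(waveform, dtype=torch.float32).view(1, 1, -1)  # (batch, channel, time)

# Placeholder raw-waveform classifier: strided 1-D convolutions, global pooling,
# and a linear layer over the 50 ESC-50 classes.
model = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=9, stride=4), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=9, stride=4), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 50),
)
logits = model(x)            # (1, 50) class scores
features = model[:-1](x)     # (1, 64) penultimate activations, usable as audio features
```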

3 Dataset

We use the dialog dataset [1] consisting of conversations between two people about a video (from the Charades human action dataset [16]), released as part of the AVSD challenge at DSTC7. We use the official training set (7,659 dialogs) and the prototype validation set (732 dialogs) to train, and the prototype test set (733 dialogs) to evaluate our models.

4 Experiments and Results

Model               Bleu1  Bleu2  Bleu3  Bleu4  Meteor  Rouge  CIDEr
Baseline            0.273  0.173  0.118  0.084  0.117   0.291  0.766
StandardLDA         0.255  0.164  0.113  0.082  0.114   0.285  0.772
GuidedLDA           0.265  0.170  0.117  0.084  0.118   0.293  0.812
GuidedLDA-all       0.272  0.173  0.118  0.085  0.119   0.293  0.793
GuidedLDA+GloVe     0.275  0.175  0.119  0.085  0.121   0.293  0.797
Topic-embeddings    0.257  0.165  0.114  0.083  0.115   0.287  0.772
HLSTM-with-topics   0.260  0.166  0.115  0.084  0.117   0.290  0.797
Table 1: Topic Experiments

Topic Experiments: We use separate topic models trained on questions, answers, QA pairs, captions, history, and history+captions to generate topics for samples from each category. Table 1 compares the baseline model against variants that add StandardLDA and GuidedLDA topic distributions as features to the decoder, as well as variants that learn topic embeddings. In general, the GuidedLDA models perform better than StandardLDA, and GuidedLDA+GloVe [13] is our best-performing model.

Model          Bleu1  Bleu2  Bleu3  Bleu4  Meteor  Rouge  CIDEr
Overall
Baseline (B)   0.273  0.173  0.118  0.084  0.117   0.291  0.766
B+VGGish       0.271  0.172  0.118  0.085  0.116   0.292  0.791
B+AclNet       0.274  0.175  0.121  0.087  0.117   0.294  0.789
Audio-related
Baseline (B)   0.267  0.179  0.128  0.096  0.120   0.285  0.919
B+VGGish       0.266  0.181  0.131  0.099  0.118   0.285  0.907
B+AclNet       0.266  0.183  0.132  0.100  0.120   0.287  0.944
Table 2: Audio Feature Experiments

Audio Experiments: Table 2 compares the baseline B (without audio features) with B+VGGish (audio features provided as part of the AVSD task) and B+AclNet. We analyze the effect of audio features on the overall dataset as well as specifically on audio-related dialogs. From Table 2, we observe that B+AclNet performs best on both the overall dataset and the audio-related dialogs.

Figure 2: Precision, Recall, F1 Scores for Attention Experiments on Questions Containing Coreferences

Attention Experiments: Table 3 shows that the configuration in which the decoder attends to all of the sentence-LSTM output states performs better than the baseline. To compare the results in terms of semantic meaningfulness, we also performed a quantitative analysis on dialogs whose ground-truth answers are binary (yes/no). We evaluate our models on their ability to predict these binary answers correctly and present this analysis in Figure 2, which again shows that the configuration in which the decoder attends to all of the sentence-LSTM output states performs best on the binary-answer evaluation.

Model               Bleu1  Bleu2  Bleu3  Bleu4  Meteor  Rouge  CIDEr
Baseline            0.248  0.151  0.101  0.071  0.110   0.256  0.664
Word-LSTM all       0.223  0.138  0.092  0.065  0.103   0.262  0.591
Word-LSTM last      0.229  0.139  0.093  0.065  0.105   0.250  0.661
Sent-LSTM all       0.242  0.151  0.103  0.073  0.110   0.261  0.707
Sent-LSTM all + AV  0.234  0.146  0.099  0.070  0.109   0.254  0.690
Table 3: Decoder Attention over Dialog History and AV Features
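
As a concrete illustration of the binary-answer evaluation described above (a minimal sketch, assuming answers are mapped to yes/no labels by simple keyword matching; the strings and the mapping rule are illustrative, not the exact evaluation script):

```python
# Minimal sketch of the binary (yes/no) answer evaluation: map ground-truth and
# generated answers to labels by keyword matching, then score with sklearn.
from sklearn.metrics import precision_recall_fscore_support

def to_label(answer):
    """Map an answer string to 1 (yes), 0 (no), or None if it is not binary."""
    first = answer.strip().lower().split(",")[0].split()[0] if answer.strip() else ""
    return {"yes": 1, "yeah": 1, "no": 0, "nope": 0}.get(first)

references = ["yes, he is cooking", "no", "yes she does"]      # illustrative only
hypotheses = ["yes he is", "yes, there is a dog", "no"]        # illustrative only

pairs = [(to_label(r), to_label(h)) for r, h in zip(references, hypotheses)]
pairs = [(r, h) for r, h in pairs if r is not None and h is not None]
y_true, y_pred = zip(*pairs)

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary")
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```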

5 Conclusion

In this paper, we present explorations and techniques for a contextual and multimodal end-to-end audio-visual scene-aware dialog system. We incorporate the context of the dialog in the form of topics, use attention mechanisms that enable the decoder to focus on relevant parts of the dialog history and the audio/video features, and incorporate audio features from an end-to-end audio classification architecture, AclNet. We validate our approaches on the AVSD dataset and show that these techniques outperform the baseline.

References

  • [1] H. AlAmri, V. Cartillier, R. G. Lopes, A. Das, J. Wang, I. Essa, D. Batra, D. Parikh, A. Cherian, T. K. Marks, and C. Hori (2018) Audio Visual Scene-Aware Dialog (AVSD) challenge at DSTC7. CoRR abs/1806.00525.
  • [2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh (2015) VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 2425–2433.
  • [3] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
  • [4] W. Daelemans, M. Lapata, and L. Màrquez (Eds.) (2012) EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, April 23–27, 2012. The Association for Computational Linguistics.
  • [5] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. F. Moura, D. Parikh, and D. Batra (2017) Visual dialog. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [6] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. (2017) CNN architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [7] C. Hori, H. AlAmri, J. Wang, G. Wichern, T. Hori, A. Cherian, T. K. Marks, V. Cartillier, R. G. Lopes, A. Das, I. Essa, D. Batra, and D. Parikh (2018) End-to-end audio visual scene-aware dialog using multimodal attention-based video features. CoRR abs/1806.08409.
  • [8] C. Hori, T. Hori, T. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi (2017) Attention-based multimodal fusion for video description. In 2017 IEEE International Conference on Computer Vision (ICCV).
  • [9] J. J. Huang and J. J. A. Leanos (2018) AclNet: efficient end-to-end audio classification CNN. arXiv preprint arXiv:1811.06669.
  • [10] J. Jagarlamudi, H. Daumé III, and R. Udupa (2012) Incorporating lexical priors into topic models. In EACL 2012, 13th Conference of the European Chapter of the Association for Computational Linguistics [4].
  • [11] S. H. Kumar, E. Okur, S. Sahay, J. J. A. Leanos, J. Huang, and L. Nachman (2019) Context, attention and audio feature explorations for audio visual scene-aware dialog. In DSTC7 Workshop at AAAI 2019, arXiv:1812.08407.
  • [12] B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.) (2016) Computer Vision – ECCV 2016, 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I.
  • [13] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [14] K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018.
  • [15] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau (2016) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Vol. 16, pp. 3776–3784.
  • [16] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In Computer Vision – ECCV 2016 [12].
  • [17] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu (2016) Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4584–4593.