
Audio Visual Emotion Recognition with Temporal Alignment and Perception Attention

by Linlin Chao, et al.

This paper focuses on two key problems in audio-visual emotion recognition for video. The first is temporal alignment of the audio and visual streams for feature-level fusion. The second is locating and re-weighting perception attention across the whole audio-visual stream to improve recognition. The Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is employed as the main classification architecture. First, a soft attention mechanism aligns the audio and visual streams. Second, seven emotion embedding vectors, each corresponding to one of the emotion classes, are added to locate the perception attention. This locating and re-weighting process is also based on the soft attention mechanism. Experimental results on the EmotiW2015 dataset, together with a qualitative analysis, demonstrate the effectiveness of the two proposed techniques.
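The alignment step described above can be sketched as follows. This is a minimal, illustrative implementation of soft attention alignment, not the paper's exact architecture: it assumes dot-product scoring between frame features (the abstract does not specify the scoring function) and represents features as plain Python lists for clarity.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_align(visual_frames, audio_frames):
    """For each visual frame, attend over all audio frames and return an
    audio context vector temporally aligned with that visual frame.

    Dot-product scoring is an illustrative assumption; the paper's actual
    scoring function is not given in the abstract.
    """
    aligned = []
    for v in visual_frames:
        # Attention score of each audio frame with respect to this visual frame.
        scores = [sum(vi * ai for vi, ai in zip(v, a)) for a in audio_frames]
        weights = softmax(scores)
        # Expected (weighted-average) audio feature, aligned to the visual frame.
        context = [sum(w * a[d] for w, a in zip(weights, audio_frames))
                   for d in range(len(audio_frames[0]))]
        aligned.append(context)
    return aligned
```

In a fused setup of the kind the abstract describes, the concatenation of each visual frame with its aligned audio context would then be fed to the LSTM-RNN at the corresponding time step.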

