Sentiment analysis provides a useful mechanism for understanding an individual’s attitudes, behaviors, and preferences. Understanding and analyzing context-related sentiment is an innate human ability and an important distinction between machines and human beings. Sentiment analysis has therefore become a crucial research topic in the field of artificial intelligence.
In recent years, sentiment analysis has mainly focused on textual data, and text-based sentiment analysis has consequently become mature. With the popularity of social media such as Facebook and YouTube, many users are now more inclined to express their views on these platforms through audio or video. Audio reviews are a growing source of consumer information and attract increasing interest from companies, researchers, and consumers. They also provide a more natural experience than traditional text comments, because they allow viewers to better perceive the commentator’s sentiment, belief, and intention through richer channels such as intonation. Hence, it is important to mine opinions and analyze sentiment from multiple modalities.
Modalities other than text can often be used to express sentiment. Combining multiple modalities brings significant advantages over using text alone, including language disambiguation (audio features can help resolve ambiguous language meanings) and mitigation of language sparsity (audio features can supply additional emotional information). Basic audio patterns can also strengthen links to the real-world environment. Indeed, people often associate information with learning and interact with the external environment through multiple modalities such as audio and text. Consequently, multimodal learning has become a new and effective method for sentiment analysis. Its main challenge lies in inferring joint representations that can process and connect information from multiple modalities.
In this paper, we propose a novel fusion strategy, comprising multi-feature fusion and multi-modality fusion, to improve the accuracy of multimodal sentiment analysis based on audio and text. We call the resulting model DFF-ATMF; the features it learns have strong complementarity and robustness. We conduct experiments on the CMU Multimodal Opinion-level Sentiment Intensity (CMU-MOSI) dataset and the recently released CMU Multimodal Opinion Sentiment and Emotion Intensity (CMU-MOSEI) dataset, both collected from YouTube, and compare against other state-of-the-art models to show the competitive performance of our proposed model. DFF-ATMF also achieves state-of-the-art results on the IEMOCAP dataset in the generalization experiments, indicating that it generalizes well to multimodal emotion recognition.
The major contributions of this paper are as follows:
- We propose the DFF-ATMF model for audio-text sentiment analysis, which combines multi-feature fusion with multi-modality fusion to learn more comprehensive sentiment information.
- The features learned by the DFF-ATMF model show good complementarity and robustness, and they perform strongly even when generalized to emotion recognition tasks.
The rest of this paper is structured as follows. In the following section, we review related work. We exhibit the details of our proposed methodology in Section 3. Then, in Section 4, experimental results and further discussions are presented. Finally, we conclude this paper in Section 5.
2 Related Work
2.1 Audio Sentiment Analysis
Audio features are usually extracted from the channel, excitation, and prosody characteristics of the audio signal. Prosody parameters extracted from segments, sub-segments, and hyper-segments have been used for sentiment analysis. In the past several years, classical machine learning algorithms, such as the Hidden Markov Model (HMM), Support Vector Machine (SVM), and decision tree-based methods, have been utilized for audio sentiment analysis [12, 13, 14]. Recently, researchers have proposed various neural network-based architectures to improve the performance of audio sentiment analysis. In 2014, an initial study employed deep neural networks (DNNs) to extract high-level features from raw audio data and demonstrated their effectiveness in audio sentiment analysis. With the development of deep learning, more complex neural architectures have been proposed. For example, convolutional neural network (CNN)-based models have been trained on spectrograms or on audio features derived from the original audio signals, such as Mel Frequency Cepstral Coefficients (MFCCs) and Low Level Descriptors (LLDs) [16, 17, 18].
2.2 Text Sentiment Analysis
After decades of development, text sentiment analysis has become mature in recent years. The most commonly used classification techniques, such as SVM, maximum entropy, and naive Bayes, are based on the bag-of-words model, which ignores word order. This may lead to inefficient extraction of sentiment from the input, because word order affects the sentiment expressed. Later research has overcome this problem by using deep learning for sentiment analysis. For instance, a DNN model has been proposed that uses word-level, character-level, and sentence-level representations for sentiment analysis. To better capture temporal information, Dai et al. propose a novel neural architecture, called Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme, which not only enables capturing longer-term dependency but also resolves the context fragmentation problem.
2.3 Multimodal Learning
Multimodal learning is an emerging field of research. Learning from multiple modalities requires capturing the correlations among them. The data from different modalities may have different predictive power and noise topology, and information from at least one of the modalities may be missing. Majumder et al. present a novel feature fusion strategy that proceeds in a hierarchical fashion for multimodal sentiment analysis. Ghosal et al. propose a recurrent neural network based multimodal attention framework that leverages contextual information for utterance-level sentiment prediction and achieves state-of-the-art results on the CMU-MOSI and CMU-MOSEI datasets.
3 Proposed Methodology
In this section, we describe the proposed DFF-ATMF model for audio-text sentiment analysis in detail. We first give an overview of the whole neural network architecture, illustrating how the audio and text modalities are fused. We then explain the two separate branches of DFF-ATMF to show how the audio feature vectors and the text feature vectors are fused. Finally, we present the fusion mechanism used in the DFF-ATMF model.
3.1 The DFF-ATMF Model
The overall architecture of the proposed DFF-ATMF model is shown in Figure 1. The model fuses the audio and text modalities through two parallel branches, an audio-modality branch and a text-modality branch. Its core mechanisms are feature vector fusion and multimodal-attention fusion. The audio branch uses Bi-LSTM to extract audio sentiment information between adjacent utterances (U1, U2, U3), while the text branch uses the same network architecture to extract text features. The audio feature vector of each utterance is used as the input of our proposed neural network, which is based on audio feature fusion, yielding a new feature vector before the softmax layer, called the audio sentiment vector (ASV). The text sentiment vector (TSV) is obtained similarly. Finally, after the multimodal-attention fusion, the output of the softmax layer produces the final sentiment analysis results, as shown in Figure 1.
3.2 Audio Sentiment Vector (ASV) from Audio Feature Fusion (AFF)
|1 Chromagram from spectrogram (chroma_stft)||LSTM||43.24||20.23||13.96|
|2 Chroma Energy Normalized (chroma_cens)||LSTM||42.98||20.87||13.31|
|3 Mel-frequency cepstral coefficients (MFCC)||LSTM||55.12||23.64||16.99|
|4 Root-Mean-Square Energy (RMSE)||LSTM||52.30||21.14||15.33|
|7 Tonal Centroid Features (tonnetz)||LSTM||53.78||22.67||15.83|
Based on prior work on audio feature combination, we reproduce and extend the corresponding experiments on the CMU-MOSI dataset; the results are shown in Table 1. In addition, we implement an improved serial neural network of Bi-LSTM and CNN, combined with the attention mechanism, to learn the deep features of different sound representations. The multi-feature fusion procedure is described for the LSTM branch and the CNN branch in Algorithm 1. The features are learned from raw waveforms and from acoustic features, which are complementary to each other. Audio sentiment analysis can therefore be improved by applying our feature fusion technique, that is, ASV from AFF, whose architecture is shown in Figure 2.
For raw audio waveforms, we take the CMU-MOSI dataset as an example and illustrate the sampling distribution in Figure 3. The inputs to the network are raw audio waveforms sampled at 22 kHz. We scale the waveforms to the range [-256, 256]; because the data are already naturally centered near zero, we do not need to subtract the mean. To improve sentiment analysis accuracy, batch normalization (BN) and the ReLU function are applied after each convolutional layer. Dropout regularization is also applied to the proposed serial network architecture.
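As a minimal sketch of the waveform scaling step, the following pure-Python snippet rescales a waveform so that its largest absolute sample maps to 256. Peak-based scaling and the function name `scale_waveform` are assumptions made for illustration; the text only states the target range [-256, 256].

```python
def scale_waveform(samples, limit=256.0):
    """Rescale samples so the largest absolute value equals `limit`.

    Peak-based scaling is an assumption; the text only gives the target range.
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # A silent (or empty) clip has nothing to scale.
        return [0.0 for _ in samples]
    factor = limit / peak
    return [s * factor for s in samples]


# Toy waveform in the usual floating-point range [-1, 1] (hypothetical values).
waveform = [0.0, 0.25, -0.5, 0.125]
scaled = scale_waveform(waveform)
print(scaled)
```

Because the scaling is a pure multiplication, a waveform that is centered near zero stays centered near zero, which is consistent with the remark that no mean subtraction is needed.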
For acoustic features, we use the Librosa toolkit for extraction and select the four most effective kinds of features for representing sentiment information: MFCCs, spectral_centroid, chroma_stft, and spectral_contrast. Taking log-mel spectrogram extraction as an example, we use a 44.1 kHz sampling rate without downsampling and extract spectrograms with 64 mel bins. The window size for the short-time Fourier transform is 1,024 samples with a hop size of 512 samples. The resulting mel-spectrograms are then converted to a log scale and standardized by subtracting the mean and dividing by the standard deviation.
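To make the spectrogram settings concrete, the sketch below works out the time resolution implied by a 1,024-sample window and a 512-sample hop at 44.1 kHz, and shows the standardization step on toy values. It deliberately omits the STFT and the mel filterbank; the no-padding frame count and the helper names are assumptions for illustration, not the authors' pipeline.

```python
import math

SAMPLE_RATE = 44_100   # Hz, as stated in the text
WINDOW_SIZE = 1_024    # STFT window length in samples
HOP_SIZE = 512         # STFT hop length in samples


def frame_count(num_samples, window=WINDOW_SIZE, hop=HOP_SIZE):
    """Number of full windows that fit without padding (an assumption)."""
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    if std == 0.0:
        # Constant input: return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


window_ms = 1000 * WINDOW_SIZE / SAMPLE_RATE   # about 23.2 ms per window
hop_ms = 1000 * HOP_SIZE / SAMPLE_RATE         # about 11.6 ms between frames
frames_per_second = frame_count(SAMPLE_RATE)   # frames in a one-second clip

log_mel_values = [-2.0, 0.0, 2.0, 4.0]         # toy log-scaled energies
normalized = standardize(log_mel_values)
print(window_ms, hop_ms, frames_per_second, normalized)
```

With 64 mel bins, each one-second clip would thus yield a feature map of roughly 64 rows by 85 columns under these assumptions.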
Finally, we feed the feature vectors of raw waveforms and acoustic features into our improved serial neural network of Bi-LSTM and CNN, combined with the attention mechanism, to learn the deep features of different sound representations, that is, the ASV.
3.3 Text Sentiment Vector (TSV) from Text Feature Fusion (TFF)
The architecture of TSV from TFF is shown in Figure 4. BERT (Bidirectional Encoder Representations from Transformers) is a recent language representation model. To the best of our knowledge, no previous study has leveraged BERT to pre-train text feature representations on a multimodal dataset such as CMU-MOSI, and we are the first to use BERT embeddings for this dataset. Next, the Bi-LSTM layer takes the concatenated word embeddings and POS tags as its inputs and outputs a hidden state at each time step. Let $h_t$ be the output hidden state at time $t$. Its attention weight is computed from a linear transformation of $h_t$, and the output representation at time $t$ is obtained by applying this attention weight to $h_t$.
Based on such text representations, the sequence of features will be assigned with different attention weights. Thus, crucial information such as emotional words can be identified more easily. The convolutional layer takes the text representation as its input, and the output CNN feature maps are concatenated together. Finally, text sentiment analysis can be improved by using TSV from TFF.
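The attention step over the Bi-LSTM hidden states can be sketched as follows. This is one common formulation, assumed here for illustration: each hidden state receives a score from a linear transformation, the scores are normalized with a softmax, and each hidden state is scaled by its weight so that the sequence can still be passed to a convolutional layer. The exact equations of the model may differ, and `attend`, `weight`, and `bias` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(hidden_states, weight, bias=0.0):
    """Return attention weights and re-weighted hidden states.

    Assumed form: score_t = weight . h_t + bias, alpha = softmax(score),
    r_t = alpha_t * h_t. The paper's exact equations may differ.
    """
    scores = [sum(w * h for w, h in zip(weight, state)) + bias
              for state in hidden_states]
    alphas = softmax(scores)
    weighted = [[a * h for h in state]
                for a, state in zip(alphas, hidden_states)]
    return alphas, weighted


# Three toy hidden states of dimension 2 (hypothetical values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alphas, weighted = attend(hidden_states, weight=[1.0, 1.0])
print(alphas)
```

In this toy example the third state has the largest score and therefore the largest weight, which mirrors how emotionally salient words would be emphasized.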
3.4 Audio and Text Modal Fusion with the Multimodal-Attention Mechanism
Inspired by human visual attention, the attention mechanism was introduced into the encoder-decoder framework for neural machine translation to select the reference words in the source language for the words in the target language. Building on the existing attention mechanism and inspired by prior work, we improve the multimodal-attention method with the multi-feature fusion strategy described above, focusing on the fusion of comprehensive and complementary sentiment information from audio and text. We leverage the multimodal-attention mechanism to preserve the intermediate outputs of the input sequences by retaining the Bi-LSTM encoder; a model is then trained to selectively learn these inputs and to correlate the output sequences with the model’s output.
More specifically, ASV and TSV are first encoded with Audio-BiLSTM and Text-BiLSTM, respectively. For the audio modality, an LSTM function with its own weight parameters computes the hidden state at each time step from the hidden states at the neighboring time steps and the input features at the corresponding time steps. The text modality is encoded in the same way.
We then consider the final ASV as an intermediate vector, as shown in Figure 1. At each time step, the dot product of the intermediate vector and the hidden state is evaluated to calculate a similarity score. Using this score as a weight, the weighted sum is calculated to generate a multi-feature fusion vector. The multi-feature fusion vector of the text modality is calculated similarly. We thus obtain two multi-feature fusion vectors, one for the audio modality and one for the text modality, as shown in Equation (4). These vectors are concatenated with the final intermediate vectors of ASV and TSV, respectively, and the results are passed through the softmax function to perform sentiment analysis, as shown in Equation (5).
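The fusion step described above (dot-product similarity, weighted sum, concatenation, softmax classification) can be illustrated with a small pure-Python sketch. Several details are assumptions rather than facts from the paper: the similarity scores are softmax-normalized, the hidden states and the intermediate vector share one dimension, the two modality representations are joined by simple concatenation, and the classifier weights are arbitrary toy values. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fusion_vector(hidden_states, intermediate):
    """Attention-weighted sum of hidden states, scored against `intermediate`."""
    # Softmax normalization of the dot-product scores is an assumption.
    weights = softmax([dot(state, intermediate) for state in hidden_states])
    return [sum(w * state[d] for w, state in zip(weights, hidden_states))
            for d in range(len(intermediate))]


def modality_representation(hidden_states, intermediate):
    """Concatenate the fusion vector with the intermediate vector."""
    return fusion_vector(hidden_states, intermediate) + list(intermediate)


def classify(features, class_weights, class_biases):
    """Linear layer followed by softmax over the classes."""
    logits = [dot(row, features) + b
              for row, b in zip(class_weights, class_biases)]
    return softmax(logits)


# Toy encoder outputs and sentiment vectors (hypothetical values).
audio_hidden, asv = [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0]
text_hidden, tsv = [[0.5, 0.5], [1.0, -1.0]], [0.0, 1.0]

audio_repr = modality_representation(audio_hidden, asv)
text_repr = modality_representation(text_hidden, tsv)
joint = audio_repr + text_repr  # joining both modalities is an assumption

class_weights = [[0.1] * len(joint), [-0.1] * len(joint)]
probs = classify(joint, class_weights, [0.0, 0.0])
print(probs)
```

In a trained model the classifier weights would be learned jointly with the encoders; here they only demonstrate the data flow from hidden states to class probabilities.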
4 Empirical Evaluation
In this section, we first introduce the datasets, the evaluation metrics, and the network structure parameters used in our experiments. We then present the experimental results and compare DFF-ATMF with other state-of-the-art models to show its advantages. Finally, we provide further discussion to better understand the learning behavior of DFF-ATMF.
4.1 Experiment Settings
|Dataset||Training utterances||Training videos||Test utterances||Test videos|
|CMU-MOSEI||18,051||1,550||4,625||679|
|IEMOCAP||4,290||120||1,208||31|
The datasets used for training and testing in our experiments are summarized in Table 2. The CMU-MOSI dataset is rich in sentiment expression and consists of 2,199 utterances from 93 videos by 89 speakers. The videos cover a wide range of topics, such as movies, books, and products. They were crawled from YouTube and segmented into utterances, each of which is annotated by five annotators with a score between -3 (strongly negative) and +3 (strongly positive). We take the average of these five annotations as the sentiment polarity and consider only two classes, “positive” and “negative”. Our training and test splits are completely disjoint with respect to speakers. To allow a fair comparison with previous work, we divide the dataset approximately 7:3, resulting in 1,616 utterances for training and 583 for testing.
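The label construction and the speaker-disjoint split can be sketched as follows. The threshold (a mean of exactly 0 counted as positive), the tuple layout, and the function names are assumptions for illustration; the text only states that the five annotations are averaged and that speakers do not overlap between the two splits.

```python
def polarity(annotations):
    """Average annotator scores in [-3, 3] and map them to a binary label."""
    mean = sum(annotations) / len(annotations)
    # Treating a mean of exactly 0 as positive is an assumption.
    return "positive" if mean >= 0 else "negative"


def speaker_disjoint_split(utterances, test_speakers):
    """Split (speaker_id, utterance_id) pairs so no speaker is in both sets."""
    train = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return train, test


# Hypothetical annotations from five annotators for two utterances.
label_a = polarity([3, 2, 1, -1, 0])     # mean 1.0
label_b = polarity([-3, -2, -1, 0, 1])   # mean -1.0

# Hypothetical (speaker_id, utterance_id) pairs.
utterances = [("spk1", "u1"), ("spk1", "u2"), ("spk2", "u3"), ("spk3", "u4")]
train, test = speaker_disjoint_split(utterances, test_speakers={"spk3"})
print(label_a, label_b, len(train), len(test))
```

Splitting by speaker rather than by utterance prevents the model from benefiting from a speaker's voice or vocabulary seen during training.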
The CMU-MOSEI dataset is an upgraded version of the CMU-MOSI dataset. It contains 3,229 videos, that is, 22,676 utterances, from more than 1,000 online YouTube speakers. Following previous work, the training and test sets include 18,051 and 4,625 utterances, respectively.
The IEMOCAP dataset was collected following theatrical theory in order to simulate natural dyadic interactions between actors. We use categorical evaluations with majority agreement and only four emotional categories, “happy”, “sad”, “angry”, and “neutral”, so that the performance of our model can be compared with other studies using the same categories.
4.1.2 Evaluation Metrics
We evaluate the performance of our proposed model by the weighted accuracy on 2-class or multi-class classification.
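Additionally, the $F_1$-score is used to evaluate 2-class classification. It is a special case of the general $F_\beta$-score:

$$F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}} \tag{7}$$

In Equation (7), $\beta$ represents the weight between precision and recall. During our evaluation process, we set $\beta = 1$ since we consider precision and recall to have the same weight, and thus the $F_1$-score is adopted.
However, in emotion recognition, we use the Macro $F_1$-score to evaluate the performance:

$$\text{Macro } F_1 = \frac{1}{n} \sum_{i=1}^{n} F_{1i} \tag{8}$$

In Equation (8), $n$ represents the number of classes and $F_{1i}$ is the $F_1$ score on the $i$-th category.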
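The F-score with equal weight on precision and recall, and its macro average over classes, can be computed as in the sketch below. It uses the standard definitions of precision, recall, and the F-score; the function names and the toy labels are hypothetical, and returning 0 for an undefined score is a common convention assumed here.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; returns 0.0 when precision and recall are both zero."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator


def precision_recall(y_true, y_pred, label):
    """One-vs-rest precision and recall for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f_beta(*precision_recall(y_true, y_pred, label))
              for label in labels]
    return sum(scores) / len(scores)


# Toy predictions over the four emotion categories (hypothetical values).
labels = ["happy", "sad", "angry", "neutral"]
y_true = ["happy", "sad", "angry", "neutral", "happy", "sad"]
y_pred = ["happy", "sad", "angry", "happy", "happy", "neutral"]
score = macro_f1(y_true, y_pred, labels)
print(round(score, 4))
```

Because the macro average weights every class equally, a class that is never predicted correctly (here "neutral") pulls the score down regardless of how rare it is.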
4.1.3 Network Structure Parameters
Our proposed architecture is implemented with the open-source deep learning framework TensorFlow. For the proposed audio and text multi-modality fusion framework, we use Bi-LSTM layers, each followed by a dense layer. With the dense layers, we project the input features of audio and text to the same dimension and then combine them with the multimodal-attention mechanism. Dropout is used as a measure of regularization, with one rate for CMU-MOSI and another for CMU-MOSEI and IEMOCAP, and the same dropout rates are used for the Bi-LSTM layers. We employ the ReLU function in the dense layers and softmax in the final classification layer. The network is trained with the Adam optimizer and the cross-entropy loss function. During data processing, we put each utterance in one-to-one correspondence with its label and rename the utterances accordingly.
The network structures of the proposed audio and text multi-feature fusion frameworks are similar. Taking the audio multi-feature fusion framework as an example, it consists of a Bi-LSTM followed by a CNN with several kernel sizes, and the dropout rate is drawn at random from a fixed range. The loss function is MAE. We combine the training set and the development set in our experiments, using 90% for training and reserving 10% for cross-validation. To train the feature encoder, we follow the fine-tuning training strategy.
To reduce randomness and improve credibility, we report the average over multiple runs for all experiments.
4.2 Experimental Results
4.2.1 Comparison with Other Models
|Acc(%)||F1||Acc(%)||F1||Overall Acc(%)||Macro F1|
Poria et al. propose an LSTM-based model that enables utterances to capture contextual information from their surroundings in the video, thus aiding the classification.
In a second study, Poria et al. introduce attention-based networks to improve both context learning and dynamic feature fusion.
Zadeh et al. propose a novel multimodal fusion technique called the Dynamic Fusion Graph (DFG).
A further baseline explores three different deep-learning based architectures, each improving upon the previous one; it is currently the state-of-the-art method on the IEMOCAP dataset.
Ghosal et al. propose a recurrent neural network based multimodal-attention framework that leverages contextual information; it is currently the state-of-the-art model on the CMU-MOSI dataset.
Table 3 shows the comparison of DFF-ATMF with other state-of-the-art models. From Table 3, we can see that DFF-ATMF outperforms the other models on the CMU-MOSI dataset and the IEMOCAP dataset. At the same time, the experimental results on the CMU-MOSEI dataset also show DFF-ATMF’s competitive performance.
4.2.2 Generalization Ability Analysis
To verify the feature complementarity and robustness of our proposed fusion strategy, we conduct experiments on the IEMOCAP dataset to examine DFF-ATMF’s generalization capability. Our proposed fusion strategy proves effective on IEMOCAP and outperforms the current state-of-the-art method, improving the overall accuracy by 3.17%, as can be seen from Table 3. More detailed experimental results on the IEMOCAP dataset are given in Table 4.
4.3 Further Discussions
The above experimental results show that DFF-ATMF can improve the performance of audio-text sentiment analysis. We now analyze the attention values to better understand the learning behavior of the proposed architecture.
We take a video from the CMU-MOSI test set as an example. The attention heatmap in Figure 5 shows that, by applying different weights across contextual utterances and modalities, the model is able to predict the labels of all the utterances correctly. This indicates that our proposed fusion strategy with multi-feature and multi-modality fusion is effective and offers good feature complementarity and robustness. However, one open issue remains in the multi-feature fusion. When the raw waveform of the audio is fused with the vector of acoustic features, their dimensions are inconsistent, and reducing the dimension with existing methods may lose some audio information. We intend to address this problem with mathematical tools such as the angle between two vectors.
Similarly, the attention weight distribution heatmaps on the CMU-MOSEI and IEMOCAP test sets are shown in Figures 6 and 7, respectively. We also compare the softmax attention weights on the CMU-MOSI, CMU-MOSEI, and IEMOCAP test sets in Figure 8.
5 Conclusion
In this paper, we propose a novel fusion strategy, comprising multi-feature fusion and multi-modality fusion. The learned features have strong complementarity and robustness, leading to state-of-the-art experimental results on audio-text multimodal sentiment analysis tasks. Experiments on both the CMU-MOSI and CMU-MOSEI datasets show that our proposed model is very competitive. The experiments on the IEMOCAP dataset also achieve state-of-the-art results, indicating that DFF-ATMF generalizes to multimodal emotion recognition. We did not consider the video modality because we aim to use only the audio and text information derived from videos. To the best of our knowledge, this is the first such attempt in the multimodal domain. In the future, we will consider more fusion strategies supported by basic mathematical theories for multimodal sentiment analysis.
This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 2016JX06); and the World-Class Discipline Construction and Characteristic Development Guidance Funds for Beijing Forestry University (Grant No. 2019XKJS0310).
-  Lei Zhang, Shuai Wang, and Bing Liu. Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1253, 2018.
-  Robert V Kozinets, Daiane Scaraboto, and Marie-Agnès Parmentier. Evolving netnography: how brand auto-netnography, a netnographic sensibility, and more-than-human netnography can transform your research. Journal of Marketing Management, 34(3-4):231–242, 2018.
-  Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information Fusion, 37:98–125, 2017.
-  Soujanya Poria, Amir Hussain, and Erik Cambria. Combining textual clues with audio-visual information for multimodal sentiment analysis. In Multimodal Sentiment Analysis, pages 153–178. Springer, 2018.
-  Navonil Majumder, Devamanyu Hazarika, A Gelbukh, Erik Cambria, and Soujanya Poria. Multimodal sentiment analysis using hierarchical fusion with context modeling. Knowledge-Based Systems, 161:124–133, 2018.
-  AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2236–2246, 2018.
-  Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2019.
-  Soujanya Poria, Navonil Majumder, Devamanyu Hazarika, Erik Cambria, Alexander Gelbukh, and Amir Hussain. Multimodal sentiment analysis: Addressing key issues and setting up the baselines. IEEE Intelligent Systems, 33(6):17–25, 2018.
-  Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. CoRR, abs/1606.06259, 2016.
-  Deepanway Ghosal, Md Shad Akhtar, Dushyant Chauhan, Soujanya Poria, Asif Ekbal, and Pushpak Bhattacharyya. Contextual inter-modal attention for multi-modal sentiment analysis. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3454–3466, 2018.
-  Zhen-Tao Liu, Min Wu, Wei-Hua Cao, Jun-Wei Mao, Jian-Ping Xu, and Guan-Zheng. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing, 273:271–280, 2018.
-  Björn Schuller, Gerhard Rigoll, and Manfred Lang. Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages I–577. IEEE, 2004.
-  Björn Schuller, Gerhard Rigoll, and Manfred Lang. Hidden markov model-based speech emotion recognition. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 2, pages II–1. IEEE, 2003.
-  Chi-Chun Lee, Emily Mower, Carlos Busso, Sungbok Lee, and Shrikanth Narayanan. Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53(9-10):1162–1171, 2011.
-  Kun Han, Dong Yu, and Ivan Tashev. Speech emotion recognition using deep neural network and extreme learning machine. In The fifteenth annual conference of the international speech communication association (INTERSPEECH), pages 223–227, 2014.
-  Dario Bertero and Pascale Fung. A first look into a convolutional neural network for speech emotion detection. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5115–5119. IEEE, 2017.
-  Srinivas Parthasarathy and Ivan Tashev. Convolutional neural network techniques for speech emotion recognition. In The 16th International Workshop on Acoustic Signal Enhancement (IWAENC), pages 121–125. IEEE, 2018.
-  Shervin Minaee and Amirali Abdolrashidi. Deep-emotion: Facial expression recognition using attentional convolutional network. arXiv preprint arXiv:1902.01019, 2019.
-  Doaa Mohey El-Din Mohamed Hussein. A survey on sentiment analysis challenges. Journal of King Saud University-Engineering Sciences, 30(4):330–338, 2018.
-  Iti Chaturvedi, Erik Cambria, Roy E Welsch, and Francisco Herrera. Distinguishing between facts and opinions for sentiment analysis: Survey and challenges. Information Fusion, 44:65–77, 2018.
-  Jianqiang Zhao, Xiaolin Gui, and Xuejun Zhang. Deep convolution neural networks for twitter sentiment analysis. IEEE Access, 6:23253–23260, 2018.
-  Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019.
-  Xiaofeng Cai and Zhifeng Hao. Multi-view and attention-based bi-lstm for weibo emotion recognition. In 2018 International Conference on Network, Communication, Computer Engineering (NCCE). Atlantis Press, 2018.
-  Ziqian Luo, Hua Xu, and Feiyang Chen. Audio sentiment analysis by heterogeneous signal features learned from utterance-based parallel neural network. In Proceedings of the AAAI-19 Workshop on Affective Content Analysis, Honolulu, USA, AAAI, 2019.
-  Chuhan Wu, Fangzhao Wu, Junxin Liu, Zhigang Yuan, Sixing Wu, and Yongfeng Huang. Thu_ngn at semeval-2018 task 1: Fine-grained tweet sentiment intensity analysis with attention cnn-lstm. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 186–192, 2018.
-  Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in python. In Proceedings of the 14th python in science conference, pages 18–25, 2015.
-  Yifang Yin, Rajiv Ratn Shah, and Roger Zimmermann. Learning and fusing multimodal deep features for acoustic scene categorization. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 1892–1900. ACM, 2018.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
-  Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung. Multimodal speech emotion recognition using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 112–118. IEEE, 2018.
-  Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 873–883, 2017.
-  Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Mazumder, Amir Zadeh, and Louis-Philippe Morency. Multi-level multiple attentions for contextual multimodal sentiment analysis. In 2017 IEEE International Conference on Data Mining (ICDM), pages 1033–1038. IEEE, 2017.
-  Chan Woo Lee, Kyu Ye Song, Jihoon Jeong, and Woo Yong Choi. Convolutional attention networks for multimodal emotion recognition from speech and text data. arXiv preprint arXiv:1805.06606, 2018.