Multimodal processing aims to model interactions between inputs that come from different sources in real-world tasks. Multimodality can open ways to develop novel applications (e.g. Image Captioning, Visual Question Answering [17, 15]) or boost performance in traditionally unimodal applications (e.g. Machine Translation, Speech Recognition [36, 43]). Moreover, modern advances in neuroscience and psychology hint that multi-sensory inputs are crucial for cognitive functions, even since infancy. Thus, modeling and understanding multimodal interactions can open avenues to develop smarter agents, inspired by the human brain.
Feedback loops have been shown to exist in the human brain, e.g. in the case of vocal production or visual-motor coordination. Human perception has traditionally been modelled as a linear (bottom-up) process (e.g. reflected light is captured by the eye, processed in the prefrontal visual cortex, then the posterior visual cortex etc.). Recent studies have highlighted that this model may be too simplistic and that high-level cognition may affect low-level visual [3, 45] or audio perception. For example, studies state that perception may be affected by an individual's long-term memory, emotions and physical state. Researchers have also tried to identify brain circuits that allow for this interplay. While scientists still debate this subject, such works offer strong motivation to explore whether artificial neural networks can benefit from multimodal top-down modeling.
and ensembles of Support Vector Machines. Modeling contextual information is addressed in [6, 13, 40] using Recurrent Neural Networks (RNNs), while Poria et al. use Convolutional Neural Networks (CNNs). For a detailed review we refer to Baltrušaitis et al. Later works use the Kronecker product between late representations [8, 25], while others investigate architectures with neural memory-like modules [9, 10]. Hierarchical attention mechanisms, as well as hierarchical fusion, have also been proposed. Pham et al. learn cyclic cross-modal mappings, and Sun et al. apply Deep Canonical Correlation Analysis (DCCA) for jointly learning representations. Multitask learning has also been investigated in the multimodal context. Transformers have been applied to and extended for multimodal tasks [22, 34, 5, 20]. Wang et al. shift word representations based on non-verbal information. A fusion gating mechanism has also been proposed, and capsule networks have been used to weight input modalities and create distinct representations for input samples.
In this work we propose MMLatch, a neural network module that uses representations from higher levels of the architecture to create top-down masks for the low-level input features. The masks are created by a set of feedback connections. The module is integrated in a strong late-fusion baseline based on LSTM encoders and cross-modal attention. Our key contribution is the modeling of interactions between high-level representations extracted by the network and low-level input features, using an end-to-end framework. We integrate MMLatch with RNNs, but it can be adapted for other architectures (e.g. Transformers). Incorporating top-down modeling shows consistent improvements over our strong baseline, yielding state-of-the-art results for sentiment analysis on CMU-MOSEI. Qualitative analysis of the learned top-down masks can add interpretability to multimodal architectures. Our code will be made available as open source.
2 Proposed Method
Fig. 1 illustrates an overview of the system architecture. The baseline system consists of a set of unimodal encoders and a cross-modal attention fusion network that extracts fused feature vectors for regression on the sentiment values. We integrate top-down information by augmenting the baseline system with a set of feedback connections that create cross-modal, top-down feature masks.
Unimodal Encoders: Input features for each modality are encoded using three LSTMs, one per modality. The hidden states of each LSTM are then passed through a dot-product self-attention mechanism to produce one unimodal representation for each of the audio, text and visual modalities.
Cross-modal Fusion: The encoded unimodal representations are fed into a cross-modal fusion network, that uses a set of attention mechanisms to capture cross-modal interactions. The core component of this subsystem is the symmetric attention mechanism, inspired by Lu et al. . If we consider modality indicators , the input modality representations, we can construct keys , queries and values using learnable projection matrices , and we can define a cross-modal attention layer as:
where is the softmax operation and are the batch size, sequence length and hidden size respectively. For the symmetric attention we sum the two cross-modal attentions:
In the fusion subsystem we use three symmetric attention mechanisms to produce , and . Additionally we create using a cross-modal attention mechanism (Eq. (1)) with inputs and . These crossmodal representations are concatenated (), along with the unimodal representations to produce the fused feature vector in Eq. (3).
We then feed the fused feature vector into an LSTM and use the last hidden state for regression. The baseline system consists of the unimodal encoders followed by the cross-modal fusion network.
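The cross-modal and symmetric attention layers described above can be illustrated with a minimal sketch. It assumes a standard scaled dot-product formulation in which one modality supplies the queries and the other supplies the keys and values; which modality plays which role, the function names, and the omission of the batch dimension are assumptions made for illustration and are not taken from the original equations. Both sequences are assumed to have the same length, as is the case for word-aligned inputs.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax_rows(a):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_modal_attention(x_q, x_kv, w_q, w_k, w_v):
    """Queries come from x_q; keys and values come from x_kv (assumed roles)."""
    q, k, v = matmul(x_q, w_q), matmul(x_kv, w_k), matmul(x_kv, w_v)
    d = len(q[0])  # projection size used to scale the dot products
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    return matmul(softmax_rows(scores), v)

def symmetric_attention(x_i, x_j, params_ij, params_ji):
    """Sum of the two directed cross-modal attentions between x_i and x_j."""
    a_ij = cross_modal_attention(x_i, x_j, *params_ij)
    a_ji = cross_modal_attention(x_j, x_i, *params_ji)
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(a_ij, a_ji)]

# Toy usage: two time steps, two features, identity projections (hypothetical values)
eye = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0]]
fused = symmetric_attention(text, audio, (eye, eye, eye), (eye, eye, eye))
```

In a trained model the projection matrices would be learned and each direction would typically have its own parameters; the identity matrices above only keep the example small.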
|Model / Metric||Acc@7||Acc@2||F1@2||MAE||Corr|
|Multimodal Routing ||-||-|
|Baseline + MMLatch average (ours)|
|Baseline + MMLatch best (ours)|
Table 1: Results for sentiment analysis on CMU-MOSEI. In row “MMLatch average” we include results averaged over five runs. Since other works do not report standard deviation, we also include row “MMLatch best”, where we report the best of the five runs (lowest error).
Top-down fusion: We integrate top-down information by augmenting the baseline system with MMLatch, i.e. a set of feedback connections consisting of three LSTMs followed by sigmoid activations. The inputs to these LSTMs are the unimodal representations as they come out of the unimodal encoders. The feedback LSTMs produce hidden states, and the feedback masks are obtained by applying a sigmoid activation to these hidden states. The masks are then applied to the input features using element-wise multiplication, as:
Eq. (4) describes how the feedback masks for two modalities are applied to the input features of the third. For example, consider the case where we mask visual input features using the (halved) sum of the text and audio feedback masks. If a visual feature is important for both audio and text, the value of the resulting mask will be close to 1. If it is important for only one other modality, the value will be close to 0.5, while if it is irrelevant for both text and audio, the value will be close to 0. Thus, a feature is enhanced or attenuated based on its overall importance for cross-modal representations.
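The masking arithmetic above can be checked with a small worked example. This is a sketch under stated assumptions: the hidden-state values are made up, and the function names `top_down_mask` and `apply_mask` are hypothetical. It only shows how the halved sum of two sigmoid masks behaves for a single time step.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_down_mask(h_i, h_j):
    """Halved sum of two sigmoid feedback masks, one value per feature."""
    return [0.5 * (sigmoid(a) + sigmoid(b)) for a, b in zip(h_i, h_j)]

def apply_mask(features, mask):
    """Element-wise multiplication of input features with the mask."""
    return [x * m for x, m in zip(features, mask)]

# Hypothetical feedback hidden states for three visual features at one time step:
# feature 0 matters to both text and audio, feature 1 to text only, feature 2 to neither.
h_text = [8.0, 8.0, -8.0]
h_audio = [8.0, -8.0, -8.0]

mask = top_down_mask(h_text, h_audio)
masked_visual = apply_mask([1.0, 1.0, 1.0], mask)
```

With these values the three mask entries come out at roughly 1, 0.5 and 0, matching the three cases described in the text.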
This pipeline is implemented as a two-stage computation. During the first stage we use the unimodal encoders and MMLatch to produce the feedback masks and apply them to the input features using Eq. (4). During the second stage we pass the masked features through the unimodal encoders and the cross-modal fusion module and use the fused representations for regression.
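The two-stage computation can be sketched schematically as follows. Everything here is a stand-in: `encoders`, `feedback` and `fuse` are hypothetical callables rather than the LSTM and attention modules of the actual system, each input is a single feature vector instead of a sequence, and all modalities are assumed to share one feature dimension so that masks and features line up.

```python
import math

def two_stage_forward(inputs, encoders, feedback, fuse):
    """inputs, encoders, feedback: dicts keyed by modality name; fuse: callable."""
    # Stage I: encode the raw features and turn the encodings into feedback masks.
    stage1 = {m: encoders[m](x) for m, x in inputs.items()}
    masks = {m: feedback[m](stage1[m]) for m in inputs}

    # Mask each modality with the averaged masks of the other modalities
    # (with three modalities this is the halved sum of two masks).
    masked = {}
    for target, x in inputs.items():
        others = [m for m in inputs if m != target]
        combined = [sum(masks[m][k] for m in others) / len(others) for k in range(len(x))]
        masked[target] = [v * c for v, c in zip(x, combined)]

    # Stage II: re-encode the masked features and fuse them for regression.
    stage2 = {m: encoders[m](x) for m, x in masked.items()}
    return fuse(stage2)

# Toy usage with identity encoders and an element-wise sigmoid as the feedback map
sig = lambda h: [1.0 / (1.0 + math.exp(-v)) for v in h]
identity = lambda x: list(x)
inputs = {"text": [1.0, -2.0], "audio": [0.5, 3.0], "visual": [4.0, 4.0]}
fused = two_stage_forward(
    inputs,
    encoders={m: identity for m in inputs},
    feedback={m: sig for m in inputs},
    fuse=lambda reps: [v for m in sorted(reps) for v in reps[m]],
)
```

The same encoders are called in both stages, which mirrors the description of passing the masked features back through the unimodal encoders.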
3 Experimental setup
We use the CMU-MOSEI sentiment analysis dataset for our experiments. The dataset contains YouTube video clips of movie reviews accompanied by human annotations for sentiment scores from -3 (strongly negative) to 3 (strongly positive) and emotion annotations. Audio sequences are sampled at and then COVAREP features are extracted. Visual sequences are sampled at and represented using Facet features. Video transcriptions are segmented into words and represented using GloVe. All sequences are word-aligned using P2FA. Standard train, validation and test splits are provided.
For all our experiments we use bidirectional LSTMs with hidden size . The forward and backward passes are summed. All projection sizes for the attention modules are set to . We use dropout . We use Adam with learning rate and halve the learning rate if the validation loss does not decrease for epochs. We use early stopping on the validation loss (patience epochs). During Stage I of each training step we disable gradients for the unimodal encoders. Models are trained for regression on sentiment values using Mean Absolute Error (MAE) loss. We use standard evaluation metrics: -class, -class accuracy (i.e. classification in , ), binary accuracy and F1-score (negative in , positive in ), MAE and correlation between model predictions and human annotations. For fair comparison we compare with methods in the literature that use GloVe text features, COVAREP audio features and Facet visual features.
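A minimal sketch of how such metrics are commonly computed from regression outputs is given below. The conventions are assumptions rather than the paper's exact protocol: multi-class labels are obtained by rounding and clipping scores to integer classes, and the binary metric treats scores above zero as positive, scores below zero as negative, and skips samples whose label is exactly zero. F1 is omitted for brevity and all function names are hypothetical.

```python
import math

def mae(preds, labels):
    """Mean absolute error between predicted and annotated scores."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

def pearson(preds, labels):
    """Pearson correlation coefficient between predictions and labels."""
    n = len(labels)
    mp, ml = sum(preds) / n, sum(labels) / n
    cov = sum((p - mp) * (y - ml) for p, y in zip(preds, labels))
    sp = math.sqrt(sum((p - mp) ** 2 for p in preds))
    sl = math.sqrt(sum((y - ml) ** 2 for y in labels))
    return cov / (sp * sl)

def to_class(score, bound):
    """Round to the nearest integer class and clip to [-bound, bound].
    Note: Python's round() rounds exact halves to the nearest even integer."""
    return max(-bound, min(bound, round(score)))

def multiclass_accuracy(preds, labels, bound=3):
    """bound=3 gives seven classes (-3..3)."""
    hits = sum(to_class(p, bound) == to_class(y, bound) for p, y in zip(preds, labels))
    return hits / len(labels)

def binary_accuracy(preds, labels):
    """Assumed convention: positive if > 0, negative if < 0, zero labels skipped."""
    pairs = [(p, y) for p, y in zip(preds, labels) if y != 0]
    return sum((p > 0) == (y > 0) for p, y in pairs) / len(pairs)

# Toy usage with made-up predictions and labels
preds = [2.6, -1.2, 0.4, -2.9]
labels = [3.0, -1.0, 1.0, -2.0]
scores = {
    "acc7": multiclass_accuracy(preds, labels, bound=3),
    "acc2": binary_accuracy(preds, labels),
    "mae": mae(preds, labels),
    "corr": pearson(preds, labels),
}
```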
|Multimodal Encoder||Feedback Type||Acc@7||Acc@2||F1@2||MAE||Corr|
|Baseline||MMlatch (no LSTM)|
Table 1 shows the results for sentiment analysis on CMU-MOSEI. The Baseline row refers to our late-fusion baseline described in Section 2, which achieves performance competitive with the state of the art. Incorporating MMLatch into the baseline consistently improves performance, specifically in binary accuracy and seven-class accuracy. Moreover, we observe lower deviation across experiments with respect to the baseline, indicating that top-down feedback can stabilize training. Compared to the state of the art, we achieve better 7-class accuracy and binary F1 when averaging over five runs. Since prior works do not report average results over multiple runs, we also report results for the best run (lowest mean absolute error) out of five in the last row of Table 1, showing improvements across metrics over the best runs of the other methods.
In Table 2 we evaluate MMLatch with different multimodal encoders and different feedback types. The first three rows show the effect of using different feedback types. Specifically, the first row shows our baseline performance (no feedback). For the second row we add feedback connections, but instead of using LSTMs in the feedback loop (Stage I in Fig. 1), we use a simple feed-forward layer. The third row shows performance when we include LSTMs in the feedback loop. We observe that, while top-down feedback with a simple projection layer results in a small performance boost, including an LSTM in the feedback loop yields significant improvements. This shows that choosing an appropriate mapping from high-level representations to low-level features in the feedback loop is important.
For the last two rows of Table 2 we integrate MMLatch with the MulT architecture, using the original MulT code from GitHub. Specifically, we use MMLatch as shown in Fig. 1 and swap the baseline architecture (unimodal encoders and cross-modal fusion) with MulT. We use a 4-layer Transformer model with the same hyperparameter set and feature set described in the original paper. The output of the fourth (final) layer is used by MMLatch to mask the input features. First, we notice a performance gap between our reproduced results and the ones reported in the original paper (fourth row of Table 2). Other works [46, 42] have reported similar observations. We observe that the integration of MMLatch with MulT yields significant performance improvements across metrics. Furthermore, similarly to Table 1, we observe that the inclusion of MMLatch reduces standard deviation across metrics. Overall, the inclusion of MMLatch results in performance improvements for both our baseline model and MulT with no additional tuning, indicating that top-down feedback can provide stronger multimodal representations.
Fig. 2 shows a heatmap of the average mask values applied to the input visual features, i.e. Facet features. The average mask values range from to and are depicted across sentiment classes. Some features are attenuated or enhanced across all classes (e.g. features or ). Interestingly, some features are attenuated for some classes and enhanced for others (e.g. feature ). More importantly, this transition is smooth, i.e. mask values change almost monotonically as the sentiment value increases from -3 to 3, indicating well-behaved training of MMLatch. We observe the same for the COVAREP masks.
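A per-class average of this kind could be computed along the following lines. This is a sketch under assumptions that are not confirmed by the source: each sample is taken to contribute one mask vector (for instance, already averaged over time steps), and samples are grouped by rounding their sentiment score to the nearest integer class in [-3, 3]. The function name and the sample values are hypothetical.

```python
def average_masks_by_class(masks, sentiments):
    """masks: one per-feature mask vector per sample.
    sentiments: one sentiment score per sample, expected in [-3, 3].
    Returns {class: average mask vector}, one heatmap column per class."""
    sums, counts = {}, {}
    for mask, score in zip(masks, sentiments):
        cls = max(-3, min(3, round(score)))
        if cls not in sums:
            sums[cls] = [0.0] * len(mask)
            counts[cls] = 0
        sums[cls] = [s + m for s, m in zip(sums[cls], mask)]
        counts[cls] += 1
    return {cls: [s / counts[cls] for s in sums[cls]] for cls in sorted(sums)}

# Toy usage: three samples with two mask features each (made-up values)
sample_masks = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
sample_sentiments = [-2.7, -3.0, 2.2]
heatmap = average_masks_by_class(sample_masks, sample_sentiments)
```

Plotting the resulting columns side by side, ordered by class, would give a feature-by-class heatmap in which smooth or monotonic trends are easy to spot.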
We introduce MMLatch, a feedback module that allows modeling top-down cross-modal interactions between higher and lower levels of the architecture. MMLatch is motivated by recent advances in cognitive science that analyze how cognition affects perception, and is implemented as a plug-and-play framework that can be adapted for modern neural architectures. MMLatch improves model performance over our proposed baseline and over MulT. The combination of MMLatch with our baseline achieves state-of-the-art results. We believe top-down cross-modal modeling can augment traditional bottom-up pipelines, improve performance in multimodal tasks and inspire novel multimodal architectures.
In this work, we implement top-down cross-modal modeling as an adaptive feature masking mechanism. In the future, we plan to explore more elaborate implementations that directly affect the state of the network modules from different levels in the network. Furthermore, we aim to extend MMLatch to more tasks, diverse architectures (e.g. Transformers) and unimodal architectures. Finally, we want to explore the applications of top-down masks for model interpretability.
- (2010) Wishful seeing: more desired objects are seen as closer. Psychological Science.
- (2018) Multimodal machine learning: a survey and taxonomy. Transactions on Pattern Analysis and Machine Intelligence.
- (2013) Top-down effects in visual. The Oxford Handbook of Cognitive Neuroscience, Volume 2.
- (2019) Probing the need for visual context in multimodal machine translation. In Proc. NAACL.
- (2020) A transformer-based joint-encoding for emotion recognition and sentiment analysis. In 2nd Challenge-HML.
- (2012) Context-sensitive learning for enhanced audiovisual emotion classification. Transactions on Affective Computing.
- (2017) Attention is all you need. In Proc. 31st NeurIPS.
- (2017) Tensor fusion network for multimodal sentiment analysis. In Proc. EMNLP.
- (2018) Multi-attention recurrent network for human communication comprehension. In Proc. AAAI.
- (2018) Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proc. 56th ACL.
- (2011) Emotion recognition using a hierarchical binary decision tree approach. Speech Communication.
- (2019) Found in translation: learning robust joint representations by cyclic translations between modalities. In Proc. AAAI.
- (2013) LSTM-modeling of continuous emotions in an audiovisual affect recognition framework. Image and Vision Computing.
- (2006) Development of multisensory spatial integration and perception in humans. Developmental Science.
- (2016) Image captioning with semantic attention. In Proc. CVPR.
- (2019) Visual feedback during motor performance is associated with increased complexity and adaptability of motor and neural output. Behavioural Brain Research.
- (2015) VQA: visual question answering. In Proc. CVPR.
- (2015) A top-down cortical circuit for accurate sensory perception. Neuron.
- (2012) Ensemble of SVM trees for multimodal emotion recognition. In Proc. APSIPA.
- (2020) Integrating multimodal information in large pretrained transformers. In Proc. 58th ACL.
- (2018) Multimodal affective analysis using hierarchical attention strategy with word-level alignment. In Proc. ACL.
- (2019) Multimodal transformer for unaligned multimodal language sequences. In Proc. 57th ACL.
- (2020) Multimodal routing: improving local and global interpretability of multimodal language analysis. In Proc. EMNLP.
- (2019) Words can shift: dynamically adjusting word representations using nonverbal behaviors. In Proc. AAAI.
- (2018) Efficient low-rank multimodal fusion with modality-specific factors. In Proc. 56th ACL.
- (2014) “Top-down” effects where none should be found: the El Greco fallacy in perception research. Psychological Science.
- (2019) Deep hierarchical fusion with application in sentiment analysis. In Proc. Interspeech.
- (1997) Long short-term memory. Neural Computation.
- (2015) The cortical computations underlying feedback control in vocal production. Current Opinion in Neurobiology.
- (2015) Adam: a method for stochastic optimization. In 3rd ICLR.
- (2020) Multi-modal embeddings using multi-task learning for emotion recognition. In Proc. Interspeech.
- (2012) Current perspectives and methods in studying neural mechanisms of multisensory interactions. Neuroscience & Biobehavioral Reviews.
- (2020) Gated mechanism for attention based multi modal sentiment analysis. In ICASSP.
- (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proc. 32nd NeurIPS.
- (2017) Objective effects of knowledge on visual perception. Journal of Experimental Psychology: Human Perception and Performance.
- (2020) Multimodal and multiresolution speech recognition with transformers. In Proc. 58th ACL.
- (2016) Convolutional MKL based multimodal emotion recognition and sentiment analysis. In Proc. ICDM.
- (1995) Perceiving geographical slant. Psychonomic Bulletin & Review.
- (2017) Dynamic routing between capsules. In Proc. 30th NeurIPS.
- (2020) Multilogue-net: a context-aware RNN for multi-modal emotion detection and sentiment analysis in conversation. In Proc. 2nd Challenge-HML.
- (2012) Predictive top-down integration of prior knowledge during speech perception. Journal of Neuroscience.
- (2021) Lightweight models for multimodal sequential data. In Proc. 11th WASSA.
- (2020) Multimodal speech recognition with unstructured audio masking. In Proc. 1st Workshop on NLPBT.
- (2019) Multi-modal sentiment analysis using deep canonical correlation analysis. In Proc. Interspeech.
- (2017) How to (and how not to) think about top-down influences on visual perception. Consciousness and Cognition.
- (2021) Cross-modal context-gated convolution for multi-modal sentiment analysis. Pattern Recognition Letters.