Multimodal Categorization of Crisis Events in Social Media

by Mahdi Abavisani, et al.
University of California-Davis

Recent developments in image classification and natural language processing, coupled with the rapid growth in social media usage, have enabled fundamental advances in detecting breaking events around the world in real time. Emergency response is one such area that stands to gain from these advances. By processing billions of texts and images a minute, events can be automatically detected to enable emergency response workers to better assess rapidly evolving situations and deploy resources accordingly. To date, most event detection techniques in this area have focused on image-only or text-only approaches, limiting detection performance and impacting the quality of information delivered to crisis response teams. In this paper, we present a new multimodal fusion method that leverages both images and texts as input. In particular, we introduce a cross-attention module that can filter uninformative and misleading components from weak modalities on a sample-by-sample basis. In addition, we employ a multimodal graph-based approach that stochastically transitions between embeddings of different multimodal pairs during training, both to better regularize the learning process and to deal with limited training data by constructing new matched pairs from different samples. We show that our method outperforms the unimodal approaches and strong multimodal baselines by a large margin on three crisis-related tasks.



1 Introduction

Each second, billions of images and texts that capture a wide range of events happening around us are uploaded to social media platforms from all over the world. At the same time, the fields of Computer Vision (CV) and Natural Language Processing (NLP) are rapidly advancing [24, 22, 14] and are being deployed at scale. With large-scale visual recognition and textual understanding available as fundamental tools, it is now possible to identify and classify events across the world in real-time. This is possible, to some extent, in images and text separately, and in limited cases, using a combination. A major difficulty in crisis events¹, in particular, is that as events surface and evolve, users post fragmented, sometimes conflicting information in the form of image-text pairs. This makes the automatic identification of notable events significantly more challenging.

¹ An event that is going (or is expected) to lead to an unstable and dangerous situation affecting an individual, group, community, or whole society (from Wikipedia); typically requiring an emergency response.

Figure 1: A Crisis-related Image-text Pair from Social Media

Unfortunately, in the middle of a crisis, the information that is valuable for first responders and the general public often comes in the form of image-text pairs. So while traditional CV and NLP methods that treat visual and textual information separately can help, a big gap exists in current approaches. Despite the general consensus on the importance of using AI for Social Good [21, 19, 4], the power of social media, and a long history of interdisciplinary research on humanitarian crisis efforts, there has been very little work on automatically detecting crisis events jointly using visual and textual information.

Prior approaches that tackle the detection of crisis events have focused on either image-only or text-only approaches. As shown in Figure 1, however, an image alone can be ambiguous in terms of its urgency whereas the text alone may lack details.

To address these issues, we propose a framework to detect crisis events using a combination of image and text information. In particular, we present an approach to automatically label images, text, and image-text pairs based on the following criteria/tasks: 1) Informativeness: whether the social media post is useful for providing humanitarian aid in an emergency event, 2) Event Classification: identifying the type of emergency (in Figure 2, we show some of the categories that different image-text pairs belong to in our event classification task), and 3) Severity: rating how severe the emergency is based on the damage indicated in the image and text. Our framework consists of several steps in which, given an image-text pair, we create a feature map for the image, generate word embeddings for the text, and propose a cross-attention mechanism to fuse information from the two modalities. It differs from previous multimodal classification in how it deals with fusing that information.

In short, we present a novel multimodal framework for classification of multimodal data in the crisis domain. This approach, “Cross Attention”, avoids transferring negative knowledge between modalities and makes use of stochastic shared embeddings both to mitigate overfitting on small datasets and to deal with training data whose labels are inconsistent across modalities. Our model outperforms strong unimodal and multimodal baselines by up to 3 F-score points across three crisis tasks.

Figure 2: Samples from Task 2; Event Classification with Texts and Images.

2 Related Work

AI for Emergency Response:

Recent years have seen an explosion in the use of Artificial Intelligence for Social Good [21, 19, 4]. Social media has proven to be one of the most relevant and diverse resources and testbeds, whether it be for identifying risky mental states of users [10, 15, 20], recognizing emergent health hazards [16], filtering for and detecting natural disasters [49, 40, 48], or surfacing violence and aggression in social media [9].

Most prior work on detecting crisis events in social media has focused on text signals. For instance, Kumar et al. [32] propose a real-time tweet-tracking system to help first responders gain situational awareness once a disaster happens. Shekhar et al. [51] introduce a crisis analysis system to estimate the damage level of properties and the distress level of victims. At a large scale, filtering (e.g., by anomaly or burst detection), identifying (e.g., by clustering), and categorizing (e.g., by classifying) disaster-related texts on social media have been the foci of multiple research groups [54, 58, 63], with the best reported accuracy obtained on small annotated datasets collected from Twitter.

Disaster detection in images has been an active front, whether it be user-generated content or satellite images (for a detailed survey, refer to Said et al. [49]). For instance, Ahmad et al. [5] introduce a pipeline method to effectively link remote sensor data with social media to better assess damage and obtain detailed information about a disaster. Li et al. [37] use convolutional neural networks and visualization methods to locate and quantify damage in disaster images. Nalluru et al. [42] combine semantic textual and image features to classify the relevancy of social media posts in emergency situations.

Our framework focuses on combining images and text, yielding performance improvements on three disaster classification tasks.

Deep Multimodal Learning: In deep multimodal learning, neural networks are used to integrate the complementary information from multiple representations (modalities) of the same phenomena [60, 43, 3, 12, 2, 44]. In many applications, including image captioning [8, 46], visual question answering [7, 18], and text-image matching [52, 17, 35], combining image and text signals is of interest. Thus many recent works study image-text fusion [39, 36, 56, 55].

Existing multimodal learning frameworks applied to the crisis domain are relatively limited. Lan et al. [33] combine early fusion and late fusion methods to incorporate their advantages. Ilyas [27] introduces a disaster classification system based on naive Bayes classifiers and support vector machines. Kelly et al. [29] introduce a system for real-time extraction of information from text and image content in Twitter messages that exploits spatio-temporal metadata for filtering, visualizing, and monitoring flooding events. Mouzannar et al. [41] propose a multimodal deep learning framework to identify damage-related information in social media posts with texts, images, and video.

In the application of crisis tweets categorization, one modality may contain uninformative or even misleading information. The attention module in our model passes information based on the confidence in the usefulness of different modalities. The more confident modality blocks weak or misleading features from the other modality through their cross-attention link. The partially blocked results of both modalities are later judged by a self-attention layer to decide which information should be passed to the next layer. While our attention module is closely related to co-attention and self-attention mechanisms [59, 23, 38, 18, 26, 46], unlike them, it does not need the input features to be homogeneous. In contrast, self-attention and co-attention layers can be sensitive to heterogeneous inputs. The details of the model are described in the next section.

3 Methodology

The architecture we propose is designed for classification problems that take as input image-text pairs such as user-generated tweets in social media, as illustrated in Figure 3, where the DenseNet and BERT graphs are from [25] and [14]. Our methodology consists of four parts: the first two parts extract feature maps from the image and extract embeddings from the text, respectively; the third part comprises our cross-attention approach to fuse projected image and text embeddings; and the fourth part uses Stochastic Shared Embeddings (SSE) [61] as our regularization technique to prevent over-fitting and deal with training data with inconsistent labels for image and text pairs.

Figure 3: Illustration of Our Framework. Embedding features are extracted from images and texts by DenseNet and BERT networks, respectively, and are integrated by the cross-attention module. In the training process, the embeddings of different samples are stochastically transitioned between each other to provide a robust regularization.

We describe each module in the sub-sections that follow.

3.1 Image Model for Feature Map Extraction:

We extract feature maps from images using Convolutional Neural Networks (CNNs). In our model we select DenseNet [25], which reduces module sizes and increases connections between layers to address parameter redundancy and improve accuracy (other approaches, such as EfficientNet [57], could also be used, but DenseNet is efficient and commonly used for this task).

For each image $v_i$, we therefore have:

$$f_i = \mathrm{DenseNet}(v_i),$$

where $v_i$ is the input image and $f_i$ is the vectorized form of a deep feature map in the DenseNet with dimension $H \times W \times C$, where $H$, $W$, and $C$ are the feature map’s height, width, and number of channels, respectively.

3.2 Text Model for Embedding Extraction:

Full-network pre-training [45, 14] has led to a series of breakthroughs in language representation learning. Specifically, deep-bidirectional Transformer models such as BERT [14] and its variants [62, 34] have achieved state-of-the-art results on various natural language processing tasks by leveraging cloze and next-sentence prediction tasks as weakly-supervised pre-training.

Therefore, we use BERT as our core model for extracting embeddings from text (variants such as XLNet [62] and ALBERT [34] could also be used). We apply the BERT model pre-trained on Wiki and Books data [28] to crisis-related tweets $t_i$. For each text input $t_i$, we have

$$e_i = \mathrm{BERT}(t_i),$$

where $t_i$ is a sequence of word-piece tokens and $e_i$ is the sentence embedding. Similar to the BERT paper [14], we take the embedding associated with [CLS] to represent the whole sentence.

In the next subsection we detail how DenseNet and BERT are fused.

3.3 Cross-attention module for avoiding negative knowledge in fusion:

After we obtain the image feature map (DenseNet) and the sentence embedding (BERT), we use a new cross-attention mechanism to fuse the information they represent. In many text-vision tasks, the input pair can contain noise. In particular, in classification of tweets, one modality may contain non-informative or even misleading information. In such a case, negative information transfer can occur. Our model can mitigate the effects of one modality over another on a case by case basis.

To address this issue, in our cross-attention module, we use a combination of cross-attention layers and a self-attention layer. In this module, each modality can block the features of the other modality based on its confidence in the usefulness of its input. This happens with the cross-attention layer. The result of partially blocked features from both modalities is later fed to a self-attention layer to decide which information should be passed to the next layer.

The self-attention layer exploits a fully-connected layer to project the image feature map $f_i$ into a fixed dimensionality $K$, and similarly projects the sentence embedding $e_i$, so that:

$$\tilde{f}_i = F(W_v^\top f_i + b_v), \qquad \tilde{e}_i = F(W_e^\top e_i + b_e),$$

where $F$ represents an activation function such as ReLU (used in our experiments) and both $\tilde{f}_i$ and $\tilde{e}_i$ are of dimension $K$.

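In the case of misleading information in one modality, without an attention mechanism (such as co-attention [39]), the resulting $\tilde{f}_i$ and $\tilde{e}_i$ cannot be easily combined without hurting performance. Here, we propose a new attention mechanism called cross-attention (Figure 3), which differs from standard co-attention mechanisms: the attention mask $\alpha_{v_i}$ for the image is completely dependent on the text embedding $\tilde{e}_i$, while the attention mask $\alpha_{e_i}$ for the text is completely dependent on the image embedding $\tilde{f}_i$. Mathematically, this can be expressed as follows:

$$\alpha_{v_i} = \sigma(W_v'^\top \tilde{e}_i + b_v'), \qquad \alpha_{e_i} = \sigma(W_e'^\top \tilde{f}_i + b_e'),$$

where $\sigma$ is the Sigmoid function. Co-attention, in contrast, can be expressed as follows:

$$\alpha_{v_i} = \sigma(W_v'^\top [\tilde{f}_i \,|\, \tilde{e}_i] + b_v'), \qquad \alpha_{e_i} = \sigma(W_e'^\top [\tilde{f}_i \,|\, \tilde{e}_i] + b_e'),$$

where $|$ means concatenation.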

After we have the attention masks $\alpha_{v_i}$ and $\alpha_{e_i}$ for image and text, respectively, we can augment the projected image and text embeddings $\tilde{f}_i$ and $\tilde{e}_i$ with $\alpha_{v_i} \cdot \tilde{f}_i$ and $\alpha_{e_i} \cdot \tilde{e}_i$ before performing concatenation or addition. In our experiments, we use concatenation but obtained similar performance using addition.

The last step of this module takes the concatenated embedding, which jointly represents the image and text tuple in $\mathbb{R}^{2K}$, and feeds it into a two-layer fully-connected network. We add self-attention in the fully-connected network and use the standard softmax cross-entropy loss for the classification.

In Section 4, we show that the combination of cross-attention layers and the self-attention layer on their concatenation works better than co-attention and self-attention mechanisms for the tasks we address in this paper.
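The cross-attention fusion described in this section can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the parameter names (`W_v`, `A_v`, and so on), the random placeholder weights, the toy dimensions, and the choice of element-wise masks of the same size as the projected embeddings are all assumptions made for the example. The self-attention and classification layers that follow the concatenation are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(dim_f, dim_e, K, seed=0):
    """Random placeholder weights; a trained model would learn these."""
    rng = np.random.default_rng(seed)
    return {
        "W_v": rng.normal(0.0, 0.1, (dim_f, K)), "b_v": np.zeros(K),  # image projection
        "W_e": rng.normal(0.0, 0.1, (dim_e, K)), "b_e": np.zeros(K),  # text projection
        "A_v": rng.normal(0.0, 0.1, (K, K)), "c_v": np.zeros(K),      # image-mask layer
        "A_e": rng.normal(0.0, 0.1, (K, K)), "c_e": np.zeros(K),      # text-mask layer
    }

def cross_attention_fuse(f, e, p):
    """Fuse an image feature vector f and a sentence embedding e.

    Each modality's mask is computed only from the *other* modality.
    """
    # Project both modalities to the same dimensionality K with ReLU.
    f_proj = np.maximum(0.0, p["W_v"].T @ f + p["b_v"])
    e_proj = np.maximum(0.0, p["W_e"].T @ e + p["b_e"])
    # The image mask depends on the text; the text mask depends on the image.
    alpha_v = sigmoid(p["A_v"].T @ e_proj + p["c_v"])
    alpha_e = sigmoid(p["A_e"].T @ f_proj + p["c_e"])
    # Mask each modality, then concatenate into a vector of length 2K.
    fused = np.concatenate([alpha_v * f_proj, alpha_e * e_proj])
    return fused, alpha_v, alpha_e
```

Because the image mask is a function of the text alone, swapping the image while keeping the tweet fixed leaves the image mask unchanged. A co-attention variant would instead compute both masks from the concatenation of the two projected embeddings.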

3.4 SSE for Better Regularization

Due to the unforeseeable and unpredictable nature of disasters, and also because they require fast processing and reaction, one often has to deal with limited annotations for user-generated content during crises. Using regularization techniques to mitigate this issue becomes especially important. In this section, we extend the Stochastic Shared Embeddings (SSE) technique [61] to a multimodal version that takes full advantage of the annotated data by 1) generating new artificial multimodal pairs, and 2) including the annotated data with inconsistent labels for text and image in the training process.

SSE-Graph [61], a variation of SSE, is a data-driven approach for regularizing embedding layers which uses a knowledge graph to stochastically make transitions between embeddings of different samples during stochastic gradient descent (SGD). That means that, during training, there is a chance that the embeddings of different samples are swapped, with the chance determined by the knowledge graph. We use the text and image labels to construct knowledge graphs that can be used to create stochastic multimodal training samples with consistent labels for both the image and text.

We treat feature maps of images as embeddings and use class labels to construct knowledge graphs. The feature maps of two images are connected by an edge in the graph, if and only if they belong to the same class (e.g. they are both labeled “affected individuals”). We follow the same procedure for text embeddings and construct a knowledge graph for text embeddings as well. Finally, we connect the nodes associated with the knowledge graph of image feature maps with an edge to nodes in text’s knowledge graph if and only if they belong to the same class.
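The graph construction above can be illustrated with a small sketch. It assumes that nodes are identified by a modality tag and a sample index, and that class labels are plain strings; the function name and data layout are hypothetical and chosen only for the example.

```python
from itertools import combinations

def build_label_graph(image_labels, text_labels):
    """Return the edge set of a label-based knowledge graph.

    Nodes are ("img", i) for image feature maps and ("txt", j) for text
    embeddings. Two nodes share an edge if and only if their class labels
    match, whether they come from the same modality or from different ones.
    """
    nodes = [("img", i, c) for i, c in enumerate(image_labels)]
    nodes += [("txt", j, c) for j, c in enumerate(text_labels)]
    edges = set()
    for a, b in combinations(nodes, 2):
        if a[2] == b[2]:  # same class label
            edges.add(((a[0], a[1]), (b[0], b[1])))
    return edges

edges = build_label_graph(
    ["affected individuals", "vehicle damage", "affected individuals"],
    ["vehicle damage", "affected individuals"],
)
```

For larger datasets one would store only the class label per node and test connectivity on the fly, since the explicit edge set grows quadratically with the number of samples in a class.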

Let $\Phi$ and $\Psi$ be sets of parameters. We define the transition probability $p_v(i, j \mid \Phi)$ as the probability of transition from $i$ to $j$, where $i$ and $j$ are nodes in the image knowledge graph that correspond to image features $f_i$ and $f_j$. Similarly, we define $p_e(k, l \mid \Psi)$ as the probability of transition from $k$ to $l$ (nodes corresponding to text embeddings $e_k$ and $e_l$, respectively).

Taking image feature maps as an example, if $i$ is connected to $j$ but not connected to $l$ in the knowledge graph, one simple and effective way to generate more multimodal pairs is to use a random walk (with random restart and self-loops) on the knowledge graph. Since we are more interested in transitions within embeddings of consistent labels, in each transition probability, we set the ratio of $p_v(i, j \mid \Phi)$ and $p_v(i, l \mid \Phi)$ to be a constant greater than $1$. In more formal notation, we have

$$i \sim j,\; i \not\sim l \;\Rightarrow\; \frac{p_v(i, j \mid \Phi)}{p_v(i, l \mid \Phi)} = \rho_v, \tag{7}$$

where $\rho_v$ is a tuning parameter and $\rho_v > 1$, and $\sim$ and $\not\sim$ denote connected and not connected nodes in the knowledge graph. We also have:

$$p_v(i, i \mid \Phi) = 1 - p_0^v, \tag{8}$$

where $p_0^v$ is called the SSE probability for image features.

We similarly define $\rho_e$ and $p_0^e$ in $\Psi$ for text embeddings. Note that $\sim$ is defined with respect to the image features’ label.

Both the $\Phi$ and $\Psi$ parameter sets are treated as tuning hyper-parameters in experiments and can be tuned fairly easily. With Eq. (8), Eq. (7), and the normalization $\sum_j p_v(i, j \mid \Phi) = 1$, we can derive transition probabilities between any two sets of feature maps in images and texts to fill out the transition probability table.

With the right parameter selection, each multimodal pair in the training set can be transitioned to many more multimodal pairs that are highly likely to have consistent labels for the image and the text. This mitigates both the limited number of training samples and the inconsistency in the annotations of image-text pairs.
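As a rough illustration of how one row of the transition probability table can be filled out, the sketch below combines the three constraints discussed in this section: a node keeps its own embedding with probability one minus the SSE probability, the remaining mass is spread over the other nodes, and a connected (same-label) node is a fixed ratio more likely than an unconnected one. The function names and the variables `rho` and `p0` are hypothetical, and the sketch covers a single modality only.

```python
import random

def transition_row(i, labels, rho, p0):
    """Transition probabilities from node i to every node in the graph.

    labels[j] is the class of node j; nodes with equal labels are connected.
    rho > 1 is the connected/unconnected probability ratio and p0 is the
    SSE probability (the chance of leaving node i).
    """
    n = len(labels)
    weights = [0.0 if j == i else (rho if labels[j] == labels[i] else 1.0)
               for j in range(n)]
    total = sum(weights)
    if total == 0.0:  # a single node can only stay where it is
        return [1.0]
    row = [p0 * w / total for w in weights]
    row[i] = 1.0 - p0  # self-loop probability
    return row

def sample_transition(i, labels, rho, p0, rng):
    """Draw the index of the embedding that replaces node i in this step."""
    row = transition_row(i, labels, rho, p0)
    return rng.choices(range(len(labels)), weights=row, k=1)[0]

labels = ["a", "a", "b", "b", "a"]
row = transition_row(0, labels, rho=10.0, p0=0.2)
```

Setting `p0` to 1 forces every sample to be replaced, which corresponds to always transitioning an embedding; setting it to 0 disables the regularizer.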

4 Experimental Setup

The image-text classification problem we consider can be formulated as follows: we have as input $\{(v_i, t_i)\}_{i=1}^{N}$, where $N$ is the number of training tuples and the $i$-th tuple consists of both image $v_i$ and text $t_i$. The respective labels for the $v_i$’s and $t_i$’s are also given in the training data. Our goal is to predict the correct label for any unseen pair. To simplify the evaluation, we assume there is only one correct label associated with the unseen pairs. As a result, this paper targets a multi-class classification problem instead of a multi-label problem.

4.1 Dataset

There are very few crisis datasets, and to the best of our knowledge there is only one multimodal crisis dataset, CrisisMMD [6]. It consists of annotated image-tweet pairs where images and tweets are independently labeled as described below. We use this dataset for our experiments. The dataset was collected using event-specific keywords and hashtags during seven natural disasters in 2017: Hurricane Irma, Hurricane Harvey, Hurricane Maria, the Mexico earthquake, California wildfires, Iraq-Iran earthquakes, and Sri Lanka floods. The corpus is comprised of three types of manual annotations:

Task 1: Informative vs. Not Informative: whether a given tweet text or image is useful for humanitarian aid purposes, defined as providing assistance to people in need.

Task 2: Humanitarian Categories: given an image, or tweet, or both, categorize it into one of the five following categories:

  • Infrastructure and utility damage

  • Vehicle damage

  • Rescue, volunteering, or donation efforts

  • Affected individuals (injury, dead, missing, found, etc.)

  • Other relevant information

Note that we merge the data that are labeled as injured or dead people and missing or found people in the CrisisMMD with those that are labeled as affected individuals and view all of them as one class of data.

Task 3: Damage Severity: assess the severity of damage reported in a tweet image and classify it into Severe, Mild, and Little/None.

It is important to note that while the annotations for the last task are only on images, our experiments reveal that using tweet texts along with the images can boost performance. In addition, our paper is the first one to perform all three tasks on this dataset (text-only, image-only, combined).

4.2 Settings

Images and text from tweets in this dataset were annotated independently. Thus, in many cases, images and text in the same pairs may not share the same labels for either Task 1 or Task 2 (labels for Task 3 were only created by annotating the images). Given the different evaluation conditions, we carry out three evaluation settings for the sake of being comprehensive in our model assessment but also to establish best practices for the community: Setting A: we exclude the image-text pairs with differing labels for image and text; Setting B: we include the image-text pairs with different labels in the training set but keep the test set the same as in A.

In addition, we introduce Setting C to mimic a realistic crisis tweet classification task where we only train on events that have transpired before the event(s) in the test set.

Table 1 shows the number of samples in each set for the different settings and tasks.

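| Setting | Task / Experiment | # of Training samples | # of Dev samples | # of Test samples |
|---|---|---|---|---|
| Setting A | Task 1 | 7876 | 553 | 2821 |
| Setting A | Task 2 | 1352 | 540 | 1467 |
| Setting A | Task 3 | 2590 | 340 | 358 |
| Setting B | Task 1 | 12680 | 553 | 2821 |
| Setting B | Task 2 | 5433 | 540 | 1467 |
| Setting C | Experiment 1 | 174 | - | 217 |
| Setting C | Experiment 2 | 4037 | - | 217 |
| Setting C | Experiment 3 | 4761 | - | 217 |

Table 1: Number of samples in different splits of our settings.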

Setting A: In this setting our train and test data are sampled from tweets in which the text and image pairs have the same label. That is:

$$C(v_i) = C(t_i), \tag{14}$$

where $C(x)$ denotes the class of data point $x$. This results in a small, yet potentially more reliable training set. We mix the data from all seven crisis events and split the data into training, dev and test sets.

| Model | Informativeness Acc | Informativeness Macro F1 | Informativeness Weighted F1 | Humanitarian Acc | Humanitarian Macro F1 | Humanitarian Weighted F1 | Severity Acc | Severity Macro F1 | Severity Weighted F1 |
|---|---|---|---|---|---|---|---|---|---|
| DenseNet [25] | 81.57 | 79.12 | 81.22 | 83.44 | 60.45 | 86.96 | 62.85 | 52.34 | 66.10 |
| BERT [14] | 84.90 | 81.19 | 83.30 | 86.09 | 66.83 | 87.83 | 68.16 | 45.04 | 61.09 |
| Compact Bilinear Pooling [18] | 88.12 | 86.18 | 87.61 | 89.30 | 67.18 | 90.33 | 66.48 | 61.03 | 70.58 |
| Compact Bilinear Gated Pooling [31] | 88.76 | 87.50 | 88.80 | 85.34 | 65.95 | 89.42 | 68.72 | 51.46 | 65.34 |
| MMBT [30] | 82.48 | 81.27 | 82.15 | 85.82 | 64.78 | 88.66 | 65.36 | 52.12 | 69.34 |
| Score Fusion | 88.16 | 83.46 | 85.26 | 86.98 | 54.01 | 88.96 | 71.23 | 53.48 | 66.26 |
| Feature Fusion | 87.56 | 85.20 | 86.55 | 89.17 | 67.28 | 91.40 | 67.60 | 40.62 | 56.47 |
| Attention Variant 1 (Ours) | 89.29 | 85.68 | 87.04 | 88.41 | 64.60 | 90.71 | 71.51 | 55.41 | 69.71 |
| Attention Variant 2 (Ours) | 88.34 | 86.12 | 87.42 | 89.23 | 67.63 | 91.56 | 63.13 | 58.03 | 69.39 |
| Attention Variant 3 (Ours) | 88.20 | 86.22 | 87.47 | 87.18 | 64.67 | 90.24 | 68.99 | 57.42 | 69.16 |
| SSE-Cross-BERT-DenseNet (Ours) | 89.33 | 88.09 | 89.35 | 91.14 | 68.41 | 91.82 | 72.65 | 59.76 | 70.41 |

Table 2: Setting A: Informativeness Task, Humanitarian Categorization Task and Damage Severity Task Evaluations.

Setting B: We relax the assumption in Equation 14 and allow $C(v_i) \neq C(t_i)$ in training, i.e. the image and the text of a training pair may belong to different classes.

As the training set of this setting contains samples with inconsistent labels for image and text, multimodal fusion methods such as late feature fusion cannot deal with the training data. Our method, on the other hand, with the use of the proposed multimodal SSE, can transition a training instance with inconsistent labels to a new training pair with consistent labels. We do this by manually setting the SSE probability for text embeddings to $p_0^e = 1$ for the training cases with inconsistent image-text labels (i.e. all the text samples are transitioned). Since unimodal models only receive one of the modalities, it is also possible to train them separately on images and texts and use an average of their predictions in the testing stage (also known as score level fusion).

However, we maintain the assumption of Eq. (14) for the test data. This helps to directly compare the two settings with the same test samples. In fact, in practice, the data is most valuable when the class labels match for both image and text. The rationale is that detecting an event is more valuable to crisis managers than the categorization of different parts of that event. Our dev and test sets for this setting are similar to the previous setting. However, the training set contains a larger number of samples where their image-text pairs are not necessarily labeled as the same class.
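The difference between the training pools of Settings A and B, and the shared test-set rule, can be sketched as a simple filter. The record layout (dictionaries with `image_label` and `text_label` keys) and the function names are assumptions made for this illustration.

```python
def build_training_sets(pairs):
    """Split annotated image-text pairs into the two training pools.

    Setting A keeps only pairs whose image and text labels agree;
    Setting B keeps every annotated pair, consistent or not.
    """
    consistent = [p for p in pairs if p["image_label"] == p["text_label"]]
    return {"setting_a": consistent, "setting_b": list(pairs)}

def build_test_set(pairs):
    """Both settings evaluate only on pairs with matching labels."""
    return [p for p in pairs if p["image_label"] == p["text_label"]]

# Hypothetical toy records, not taken from the dataset.
pairs = [
    {"id": 1, "image_label": "vehicle damage", "text_label": "vehicle damage"},
    {"id": 2, "image_label": "vehicle damage", "text_label": "other relevant information"},
    {"id": 3, "image_label": "affected individuals", "text_label": "affected individuals"},
]
train = build_training_sets(pairs)
```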

Setting C: This setting is closest to the real-world scenario where we analyze a new crisis event with a model trained on previous crisis events. First, we require the training and test sets to be from crisis events of a different nature (i.e., wildfire vs. flood). Second, we maintain the temporal component and only train on events that have happened before the tweets of the testing set. Since collecting annotated data on an urgent ongoing event is not possible, and also because a crisis event may not have a similar annotated event in the past, these two restrictions often simulate a real-world scenario. For the experiments of this setting, there is no dev set. Instead, we use a random portion of the training data to tune the hyper-parameters.

We test on the tweets that are related to the California Wildfire (Oct 10 - 27, 2017), and train on the following three sets:

  1. Sri Lanka Floods tweets (May 31 - Jul. 3, 2017)

  2. Sri Lanka Floods, and Hurricane Harvey and Hurricane Irma tweets (May 31 - Sept. 21, 2017)

  3. Sri Lanka Floods, Hurricanes Harvey and Irma, and Mexico Earthquake tweets (May 31 - Oct. 5, 2017)

Similar to setting B, for the test set (i.e. California Wildfire) we only consider the samples with consistent labels for image and text, but for the training sets, we use all the available samples.
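The two restrictions of Setting C (a different type of crisis, and only events that ended before the test event began) can be expressed as a short filter. In this sketch the end dates of the training events are read off the cumulative date ranges listed above, the event `kind` tags and the record layout are assumptions, and the last training record is a hypothetical event added only to show the temporal cut-off.

```python
from datetime import date

def eligible_training_events(events, test_event):
    """Names of events that ended before the test event started and are of
    a different crisis type than the test event."""
    return [ev["name"] for ev in events
            if ev["end"] < test_event["start"] and ev["kind"] != test_event["kind"]]

test_event = {"name": "California Wildfire", "kind": "wildfire",
              "start": date(2017, 10, 10)}
events = [
    {"name": "Sri Lanka Floods", "kind": "flood", "end": date(2017, 7, 3)},
    {"name": "Hurricanes Harvey and Irma", "kind": "hurricane", "end": date(2017, 9, 21)},
    {"name": "Mexico Earthquake", "kind": "earthquake", "end": date(2017, 10, 5)},
    # Hypothetical event after the test period; it must be excluded.
    {"name": "Hypothetical later flood", "kind": "flood", "end": date(2017, 12, 1)},
]
selected = eligible_training_events(events, test_event)
```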

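| Model | Informativeness Accuracy | Informativeness Macro F1 | Informativeness Weighted F1 | Humanitarian Accuracy | Humanitarian Macro F1 | Humanitarian Weighted F1 |
|---|---|---|---|---|---|---|
| DenseNet [25] | 83.36 | 80.95 | 82.95 | 82.89 | 66.68 | 83.13 |
| BERT [14] | 86.26 | 84.44 | 86.01 | 87.73 | 83.72 | 87.57 |
| Score Fusion | 87.03 | 85.19 | 86.90 | 91.41 | 83.26 | 91.36 |
| SSE-Cross-BERT-DenseNet (Ours) | 90.05 | 88.88 | 89.90 | 93.46 | 84.16 | 93.35 |
| Best from Table 2 | 89.33 | 88.09 | 89.35 | 91.48 | 67.87 | 91.34 |

Table 3: Setting B: Informativeness Task and Humanitarian Categorization Task Evaluations

| Training data | Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| Sri Lanka Floods | DenseNet [25] | 55.71 | 35.77 | 56.85 |
| Sri Lanka Floods | BERT [14] | 31.96 | 20.90 | 27.21 |
| Sri Lanka Floods | Score Fusion | 56.62 | 36.77 | 57.96 |
| Sri Lanka Floods | SSE-Cross-BERT-DenseNet (Ours) | 62.56 | 39.82 | 62.08 |
| Sri Lanka Floods + Hurricanes Harvey & Irma | DenseNet [25] | 70.32 | 52.23 | 68.55 |
| Sri Lanka Floods + Hurricanes Harvey & Irma | BERT [14] | 73.97 | 53.90 | 73.51 |
| Sri Lanka Floods + Hurricanes Harvey & Irma | Score Fusion | 81.74 | 56.54 | 81.03 |
| Sri Lanka Floods + Hurricanes Harvey & Irma | SSE-Cross-BERT-DenseNet (Ours) | 84.02 | 63.12 | 83.55 |
| Sri Lanka Floods + Hurricanes Harvey & Irma + Mexico earthquake | DenseNet [25] | 70.32 | 44.80 | 68.79 |
| Sri Lanka Floods + Hurricanes Harvey & Irma + Mexico earthquake | BERT [14] | 74.43 | 56.98 | 74.21 |
| Sri Lanka Floods + Hurricanes Harvey & Irma + Mexico earthquake | Score Fusion | 81.28 | 55.90 | 80.54 |
| Sri Lanka Floods + Hurricanes Harvey & Irma + Mexico earthquake | SSE-Cross-BERT-DenseNet (Ours) | 86.30 | 65.55 | 85.93 |

Table 4: Comparing our proposed method with baselines for Humanitarian Categorization Task in Setting C. We fix the last occurred crisis, namely ‘California wildfires’, as test data and vary the training data, which is specified in the first column.

| Model | Test Accuracy | Test Macro F1 | Test Weighted F1 |
|---|---|---|---|
| SSE-Cross-BERT-DenseNet (Ours) | 91.14 | 68.41 | 91.82 |
| − Self-Attention | 89.23 | 56.50 | 87.70 |
| − Cross-Attention | 88.48 | 56.38 | 87.10 |
| Cross-Attention → Co-Attention | 88.41 | 64.60 | 90.71 |
| Cross-Attention → Self-Attention | 86.30 | 58.33 | 85.27 |
| − Dropout | 83.37 | 54.83 | 82.46 |
| − SSE | 88.41 | 64.60 | 90.71 |
| SSE → Shuffling Within Class | 88.68 | 62.91 | 88.33 |
| SSE → Mix-up [64] | 89.16 | 54.63 | 87.37 |

Table 5: Ablation Study of our proposed method for Humanitarian Categorization Task in Setting A.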

4.3 Baselines

We compare our method against several state-of-the-art methods for text and/or image classification, which fall into three categories. In the first category, we compare to DenseNet and BERT, which are among the most commonly used unimodal classification networks for images and texts, respectively. We use BERT pre-trained on Wikipedia and DenseNet pre-trained on ImageNet [13], and fine-tune them on the training sets.

The second category of baseline methods includes several recently proposed multimodal fusion methods for classification:

  • Compact Bilinear Pooling [18]: multimodal compact bilinear pooling is a fusion technique first used in the visual question answering task, but it can be easily modified to perform a standard classification task.

  • Compact Bilinear Gated Pooling [31]: this fusion method is an adaptation of the compact bilinear pooling method in which an extra attention gate is added on top of the compact bilinear pooling module.

  • MMBT [30]: a recently proposed supervised multimodal bitransformer model for classifying images and text.

The third category consists of score-level fusion (Score Fusion) and late feature fusion (Feature Fusion) of the DenseNet and BERT networks. Score-level fusion is one of the most common fusion techniques. It averages the predictions of separate networks trained on the different modalities. Feature Fusion is one of the most effective methods for integrating two modalities [47]. It concatenates deep layers from the modality networks to predict a shared output. We also provide three variations of our attention modules and report their performance: the first variant replaces the cross-attention of Section 3.3 with the co-attention defined in the same section; the second variant removes self-attention; the third variant replaces the cross-attention with self-attention modules.
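The two fusion baselines in the third category differ in where the modalities meet, which the following NumPy sketch makes concrete. It is an illustration only: the function names, the single linear classification layer in `feature_fusion`, and the weight arguments `W` and `b` are assumptions, not the baselines' actual configuration.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # shift for numerical stability
    ez = np.exp(z)
    return ez / ez.sum()

def score_fusion(probs_image, probs_text):
    """Average the class probabilities predicted by two separately trained
    unimodal networks."""
    return (np.asarray(probs_image) + np.asarray(probs_text)) / 2.0

def feature_fusion(feat_image, feat_text, W, b):
    """Concatenate deep features from both networks and classify them with
    one shared layer (W has shape (len(image)+len(text), num_classes))."""
    joint = np.concatenate([feat_image, feat_text])
    return softmax(joint @ W + b)

fused_scores = score_fusion([0.6, 0.4], [0.1, 0.9])
```

Score fusion needs no joint training, so it remains usable when image and text labels of a training pair disagree, whereas feature fusion needs a single label per pair.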

We compare our model, SSE-Cross-BERT-DenseNet, to the baseline models above.

4.4 Evaluation Metrics

We evaluate the models in this paper using classification accuracy², Macro F1-score and weighted F1-score. Note that while in the event of a crisis the number of samples from different categories often varies significantly, it is important to detect all of them. Macro F1-score and weighted F1-score take both false positives and false negatives into account, and therefore, along with accuracy as an intuitive measure, are proper evaluation metrics for our datasets.

² In the settings in which our experiments are defined, classification accuracy is equivalent to Micro F1-score.
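For reference, the metrics can be computed from scratch as in the sketch below, which also shows why accuracy coincides with the Micro F1-score in the single-label multi-class case: every misclassification counts once as a false positive and once as a false negative. The function name and the toy labels are chosen for illustration only.

```python
def f1_scores(y_true, y_pred):
    """Accuracy, macro, weighted and micro F1 for single-label predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class, support = {}, {}
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom else 0.0
        support[c] = tp + fn  # number of true samples of class c
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    n = len(y_true)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        # Macro: every class counts equally, however rare it is.
        "macro_f1": sum(per_class.values()) / len(classes),
        # Weighted: classes count in proportion to their support.
        "weighted_f1": sum(per_class[c] * support[c] for c in classes) / n,
        # Micro: pooled counts over all classes.
        "micro_f1": 2 * tp_all / (2 * tp_all + fp_all + fn_all),
    }

scores = f1_scores([0, 0, 0, 0, 1, 1, 2], [0, 0, 0, 1, 1, 1, 0])
```

In this toy example the rare class 2 is never predicted, which pulls the macro score well below the weighted score and the accuracy.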

4.5 Training Details

We use pre-trained DenseNet and BERT as our image and text backbone networks, and fine-tune them separately on image-only and text-only training samples, respectively. The details of their implementations can be found in [25] and [14], respectively. We do not freeze the pre-trained weights and train all the layers of both backbone networks.

We use the standard SGD optimizer. We start from a base learning rate and reduce it when the dev loss saturates. The models were implemented in Keras and TensorFlow 1.4 [1]. In all the applicable experiments, we select hyper-parameters with cross-validation on the accuracy of the dev set. For the experiments in Setting C, where we do not have an evaluation set, we tune hyper-parameters on a held-out portion of the training samples. The two SSE hyper-parameters (the SSE probability and the ratio parameter) are each selected from a fixed range.

We employ the following data augmentations on the images during the training stage. Images are resized such that the smallest side is 228 pixels, and then randomly cropped to a fixed-size patch. In addition, we produce more images by randomly flipping the resulting image horizontally.
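A minimal NumPy sketch of this augmentation pipeline is given below. It assumes images are arrays of shape (height, width, channels), uses nearest-neighbour resizing for simplicity, and leaves the crop size as a parameter because it is not stated here; the function names and the flip probability of 0.5 are likewise assumptions.

```python
import numpy as np

def resize_smallest_side(img, target=228):
    """Nearest-neighbour resize so that the shorter side equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h = max(target, round(h * scale))
    new_w = max(target, round(w * scale))
    # Map each output row/column back to its nearest source index.
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return img[rows][:, cols]

def random_crop_flip(img, crop, rng):
    """Take a random crop x crop patch and flip it horizontally half the time."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]
    return patch
```

A training loop would call `random_crop_flip(resize_smallest_side(image), crop, rng)` once per sample and epoch, with `crop` no larger than 228, so that the network sees a slightly different view each time.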

For tweet normalization, we remove double spaces and lowercase all characters. In addition, we replace any hyperlink in the tweet with the sentinel word “link”.
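A minimal sketch of this normalization is shown below. The URL pattern is an assumption (anything starting with `http://`, `https://`, or `www.` up to the next whitespace), and runs of two or more spaces are collapsed into one; the exact rules used for the experiments may differ.

```python
import re

# Assumed hyperlink pattern: scheme or "www." followed by non-space characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MULTI_SPACE_RE = re.compile(r" {2,}")


def normalize_tweet(text):
    """Replace hyperlinks with the sentinel word, collapse spaces, lowercase."""
    text = URL_RE.sub("link", text)
    text = MULTI_SPACE_RE.sub(" ", text)
    return text.lower()


example = normalize_tweet("Flooding  on Main St  https://t.co/AbC123 STAY SAFE")
# example == "flooding on main st link stay safe"
```

Replacing the URL before lowercasing keeps the pattern independent of the casing step, although the order does not change the result here.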

5 Experimental Results

5.1 Setting A: Excluding The Training Pairs with Inconsistent Labels

As shown in Table 2, our proposed framework, SSE-Cross-BERT-DenseNet, clearly outperforms the standalone DenseNet and BERT models. Compared with the baseline methods Compact Bilinear Pooling [18], Compact Bilinear Gated Pooling [31], and MMBT [30], our proposed cross-attention fusion method has an edge over previously known fusion methods, including the standard score fusion and feature fusion. This edge holds across Settings A, B, and C. In Section 5.4, we conduct an ablation study to investigate which components (SSE, cross-attention, and self-attention) have the most impact on model performance.

One important observation across the three tasks is that, although the accuracy of the simple feature fusion method is reasonably good, the macro F1 scores improve much more once we add attention mechanisms.

5.2 Setting B: Including The Training Pairs with Inconsistent Labels

In this setting, we investigate whether our models perform better when we make use of more labelled data from unmatched images and texts. Note that this involves training on noisier data than in the prior setting. In Table 3, our proposed framework SSE-Cross-BERT-DenseNet beats the best results from Setting A for both the Informativeness Task (89.90 versus 89.35 Weighted F1) and the Humanitarian Categorization Task (93.35 versus 91.34). The gap between our method and the standalone BERT and DenseNet models also widens.

Note that the test sets are the same for setting A and setting B while only the training data differs.

5.3 Setting C: Temporal

This setting is designed to resemble a realistic scenario in which (1) the available data is only from the past (i.e., the train and test sets are split in the order in which the events occurred in the real world) and (2) the train and test sets are not from the same crisis. We find that our proposed model consistently performs better than the standalone image and text models (see Table 4). Additionally, performance increases for all models, including ours, when more crisis data is included in training. This emphasizes the importance of collecting and labelling more crisis data, even if there is no guarantee that the crises we collected data from will be similar to a future one. In the experiments, the training crises contain floods, hurricanes, and earthquakes, but the test crisis is fixed at wildfires.
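The two constraints of this setting can be expressed as a simple split rule. The sketch below is only an illustration under assumed data: the record fields (`crisis`, `date`), the crisis names, and the dates are all hypothetical and do not come from the dataset.

```python
from datetime import date


def temporal_split(posts, test_crisis):
    """Hold out one crisis for testing and train only on other, earlier crises."""
    test = [p for p in posts if p["crisis"] == test_crisis]
    if not test:
        raise ValueError("no posts found for the test crisis")
    cutoff = min(p["date"] for p in test)
    # Condition (2): different crisis. Condition (1): strictly from the past.
    train = [p for p in posts
             if p["crisis"] != test_crisis and p["date"] < cutoff]
    return train, test


# Hypothetical records; names and dates are placeholders for illustration.
posts = [
    {"crisis": "flood_x", "date": date(2017, 5, 1)},
    {"crisis": "hurricane_y", "date": date(2017, 8, 20)},
    {"crisis": "earthquake_z", "date": date(2017, 9, 10)},
    {"crisis": "wildfire_w", "date": date(2017, 10, 5)},
    {"crisis": "wildfire_w", "date": date(2017, 10, 9)},
    {"crisis": "flood_late", "date": date(2017, 11, 2)},
]
train_posts, test_posts = temporal_split(posts, "wildfire_w")
```

Any crisis that starts after the test crisis (here `flood_late`) is excluded from training, so the model never sees data from the future of the event it is evaluated on.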

5.4 Ablation Study

In our ablation study, we examine each component of the model in Figure 3, namely self-attention on the concatenated embedding, cross-attention for fusing the image feature map and the sentence embedding, dropout, and SSE regularization. All the experiments in this section are conducted in Setting A. First, we find that self-attention plays an important role in the final performance: accuracy drops from 91.14 to 89.23 if self-attention is removed. Second, the choice of cross-attention over co-attention and self-attention is well justified: accuracy drops to around 88 when the cross-attention is replaced. Third, dropout regularization [53] plays an important role in regularizing the hidden units: if we remove dropout completely, performance suffers a large drop from 91.14 to 83.37. Fourth, we justify the use of SSE [61] over Mixup [64] or within-class shuffling data augmentation. SSE performs better than Mixup in terms of accuracy (91.14% versus 89.16%) and much better in terms of F1 scores: 68.41 versus 54.63 for macro F1 and 91.82 versus 87.37 for weighted F1.
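For readers unfamiliar with the Mixup baseline [64] used in this comparison, the following sketch shows its core operation: a convex combination of two inputs and their one-hot labels with a coefficient drawn from a Beta distribution. The value `alpha=0.2` and the toy vectors are assumptions for illustration, not the configuration used in the ablation.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Mix two samples: lam * (x_i, y_i) + (1 - lam) * (x_j, y_j),
    with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


# Toy feature vectors with one-hot labels for a two-class task.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([1.0, 0.0, 2.0], [1.0, 0.0],
                              [0.0, 4.0, 2.0], [0.0, 1.0],
                              alpha=0.2, rng=rng)
```

Mixup therefore produces interpolated samples with soft labels, whereas SSE, as described in this paper, transitions between embeddings of existing samples to form new pairs.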

6 Conclusions and Future Work

In this paper, we presented a novel multimodal framework for fusing image and textual inputs. We introduced a new cross-attention module that can filter uninformative or misleading information from the modalities and fuse only the useful information. We also presented a multimodal version of Stochastic Shared Embeddings (SSE) to regularize the training process and deal with limited training data. We evaluated this approach on three crisis tasks involving social media posts with images and text captions. We showed that our approach outperforms not only the image-only and text-only approaches that have been the mainstay in the field, but also other multimodal combination approaches.

For future work we plan to test how our approach generalizes to other multimodal problems such as sarcasm detection in social media posts [11, 50], as well as experiment with different image and text feature extractors. Given that the CrisisMMD corpus is the only dataset available for this task and it is limited in size, we also aim to construct a larger set, which is a major effort.


References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI16), pp. 265–283. Cited by: §4.5.
  • [2] M. Abavisani, H. R. V. Joze, and V. M. Patel (2019) Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1165–1174. Cited by: §2.
  • [3] M. Abavisani and V. M. Patel (2018) Deep multimodal subspace clustering networks. IEEE Journal of Selected Topics in Signal Processing 12 (6), pp. 1601–1614. Cited by: §2.
  • [4] R. Abebe, S. Hill, J. W. Vaughan, P. M. Small, and H. A. Schwartz (2019) Using search queries to understand health information needs in africa. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13, pp. 3–14. Cited by: §1, §2.
  • [5] K. Ahmad, M. Riegler, K. Pogorelov, N. Conci, P. Halvorsen, and F. De Natale (2017) Jord: a system for collecting information and monitoring natural disasters by linking social media with satellite imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, pp. 12. Cited by: §2.
  • [6] F. Alam, F. Ofli, and M. Imran (2018) Crisismmd: multimodal twitter datasets from natural disasters. In Twelfth International AAAI Conference on Web and Social Media, Cited by: §4.1.
  • [7] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §2.
  • [8] R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. Ikizler-Cinbis, F. Keller, A. Muscat, and B. Plank (2016) Automatic description generation from images: a survey of models, datasets, and evaluation measures. Journal of Artificial Intelligence Research 55, pp. 409–442. Cited by: §2.
  • [9] T. Blevins, R. Kwiatkowski, J. MacBeth, K. McKeown, D. Patton, and O. Rambow (2016) Automatically processing tweets from gang-involved youth: towards detecting loss and aggression. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 2196–2206. Cited by: §2.
  • [10] S. Buechel, A. Buffone, B. Slaff, L. Ungar, and J. Sedoc (2018) Modeling empathy and distress in reaction to news stories. arXiv preprint arXiv:1808.10399. Cited by: §2.
  • [11] S. Castro, D. Hazarika, V. Pérez-Rosas, R. Zimmermann, R. Mihalcea, and S. Poria (2019) Towards multimodal sarcasm detection (an _Obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4619–4629. Cited by: §6.
  • [12] S. Chen and Q. Jin (2015) Multi-modal dimensional emotion recognition using recurrent neural networks. In Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge, pp. 49–56. Cited by: §2.
  • [13] J. Deng, R. Socher, L. Fei-Fei, W. Dong, K. Li, and L. Li (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.3.
  • [14] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.2, §3.2, §3, §4.5, Table 2, Table 3, Table 4, Table 6.
  • [15] J. C. Eichstaedt, R. J. Smith, R. M. Merchant, L. H. Ungar, P. Crutchley, D. Preoţiuc-Pietro, D. A. Asch, and H. A. Schwartz (2018) Facebook language predicts depression in medical records. Proceedings of the National Academy of Sciences 115 (44), pp. 11203–11208. Cited by: §2.
  • [16] J. C. Eichstaedt, H. A. Schwartz, S. Giorgi, M. L. Kern, G. Park, M. Sap, D. R. Labarthe, E. E. Larson, M. Seligman, L. H. Ungar, et al. (2018) More evidence that twitter language predicts heart disease: a response and replication. Cited by: §2.
  • [17] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129. Cited by: §2.
  • [18] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847. Cited by: §2, §2, 1st item, Table 2, §5.1, Appendix: Setting D Multi-Label Multi-class Categorization.
  • [19] T. Gebru, J. Krause, Y. Wang, D. Chen, J. Deng, E. L. Aiden, and L. Fei-Fei (2017) Using deep learning and google street view to estimate the demographic makeup of neighborhoods across the united states. Proceedings of the National Academy of Sciences 114 (50), pp. 13108–13113. Cited by: §1, §2.
  • [20] S. C. Guntuku, D. Preotiuc-Pietro, J. C. Eichstaedt, and L. H. Ungar (2019) What twitter profile and posted images reveal about depression and anxiety. arXiv preprint arXiv:1904.02670. Cited by: §2.
  • [21] C. Harding, F. Pompei, D. Burmistrov, H. G. Welch, R. Abebe, and R. Wilson (2015) Breast cancer screening, incidence, and mortality across us counties. JAMA internal medicine 175 (9), pp. 1483–1489. Cited by: §1, §2.
  • [22] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [23] J. Hessel, B. Pang, Z. Zhu, and R. Soricut (2019) A case study on combining asr and visual features for generating instructional video captions. arXiv preprint arXiv:1910.02930. Cited by: §2.
  • [24] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
  • [25] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §3.1, §3, §4.5, Table 2, Table 3, Table 4, Table 6.
  • [26] I. Ilievski and J. Feng (2017) Multimodal learning and reasoning for visual question answering. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 551–562. Cited by: §2, Appendix: Setting D Multi-Label Multi-class Categorization.
  • [27] A. Ilyas (2014) MicroFilters: harnessing twitter for disaster management. In IEEE Global Humanitarian Technology Conference (GHTC 2014), pp. 417–424. Cited by: §2.
  • [28] W. Inc.(Website) Note: -Wikipedia-DataAccessed: 2020-03-30 Cited by: §3.2.
  • [29] S. Kelly, X. Zhang, and K. Ahmad (2017) Mining multimodal information on social media for increased situational awareness. Cited by: §2.
  • [30] D. Kiela, S. Bhooshan, H. Firooz, and D. Testuggine (2019) Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950. Cited by: 3rd item, Table 2, §5.1.
  • [31] D. Kiela, E. Grave, A. Joulin, and T. Mikolov (2018) Efficient large-scale multi-modal classification. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: 2nd item, Table 2, §5.1.
  • [32] S. Kumar, G. Barbier, M. A. Abbasi, and H. Liu (2011) Tweettracker: an analysis tool for humanitarian and disaster relief. In Fifth international AAAI conference on weblogs and social media, Cited by: §2.
  • [33] Z. Lan, L. Bao, S. Yu, W. Liu, and A. G. Hauptmann (2014) Multimedia classification and event detection using double fusion. Multimedia tools and applications 71 (1), pp. 333–347. Cited by: §2.
  • [34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §3.2, §3.2.
  • [35] K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216. Cited by: §2.
  • [36] L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019) Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557. Cited by: §2.
  • [37] X. Li, D. Caragea, H. Zhang, and M. Imran (2018) Localizing and quantifying damage in social media images. In 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), pp. 194–201. Cited by: §2.
  • [38] K. Liu, Y. Li, N. Xu, and P. Natarajan (2018) Learn to combine modalities in multimodal deep learning. arXiv preprint arXiv:1805.11730. Cited by: §2.
  • [39] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §2, §3.3.
  • [40] S. Madichetty and M. Sridevi (2019) Detecting informative tweets during disaster using deep neural networks. In 2019 11th International Conference on Communication Systems & Networks (COMSNETS), pp. 709–713. Cited by: §2.
  • [41] H. Mouzannar, Y. Rizk, and M. Awad (2018) Damage identification in social media posts using multimodal deep learning.. In ISCRAM, Cited by: §2.
  • [42] G. Nalluru, R. Pandey, and H. Purohit (2019) Relevancy classification of multimodal social media streams for emergency services. In 2019 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 121–125. Cited by: §2.
  • [43] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696. Cited by: §2.
  • [44] P. Perera, M. Abavisani, and V. M. Patel (2018) In2i: unsupervised multi-image-to-image translation using generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 140–146. Cited by: §2.
  • [45] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. URL https://s3-us-west-2. amazonaws. com/openai-assets/researchcovers/languageunsupervised/language understanding paper. pdf. Cited by: §3.2.
  • [46] T. Rahman, B. Xu, and L. Sigal (2019) Watch, listen and tell: multi-modal weakly supervised dense event captioning. In CVPR, pp. 8908–8917. Cited by: §2, §2.
  • [47] D. Ramachandram and G. W. Taylor (2017) Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Processing Magazine 34 (6), pp. 96–108. Cited by: §4.3.
  • [48] T. G. Rudner, M. Rußwurm, J. Fil, R. Pelich, B. Bischke, and V. Kopacková Rapid computer vision-aided disaster response via fusion of multiresolution, multisensor, and multitemporal satellite imagery. Cited by: §2.
  • [49] N. Said, K. Ahmad, M. Regular, K. Pogorelov, L. Hassan, N. Ahmad, and N. Conci (2019) Natural disasters detection in social media and satellite imagery: a survey. arXiv preprint arXiv:1901.04277. Cited by: §2, §2.
  • [50] R. Schifanella, P. de Juan, J. Tetreault, and L. Cao (2016) Detecting sarcasm in multimodal social platforms. In Proceedings of the 2016 ACM on Multimedia Conference (MM ’16). ISBN 9781450336031. Cited by: §6.
  • [51] H. Shekhar and S. Setty (2015) Disaster analysis through tweets. In 2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 1719–1723. Cited by: §2.
  • [52] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: §2.
  • [53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.4.
  • [54] K. Stowe, M. J. Paul, M. Palmer, L. Palen, and K. Anderson (2016) Identifying and categorizing disaster-related tweets. In Proceedings of The Fourth International Workshop on Natural Language Processing for Social Media, pp. 1–6. Cited by: §2.
  • [55] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
  • [56] H. Tan and M. Bansal (2019) Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §2.
  • [57] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: §3.1.
  • [58] H. To, S. Agrawal, S. H. Kim, and C. Shahabi (2017) On identifying disaster-related tweets: matching-based or learning-based?. In 2017 IEEE Third International Conference on Multimedia Big Data (BigMM), pp. 330–337. Cited by: §2.
  • [59] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.
  • [60] M. Wöllmer, A. Metallinou, F. Eyben, B. Schuller, and S. Narayanan (2010) Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling. In Proc. INTERSPEECH 2010, Makuhari, Japan, pp. 2362–2365. Cited by: §2.
  • [61] L. Wu, S. Li, C. Hsieh, and J. L. Sharpnack (2019) Stochastic shared embeddings: data-driven regularization of embedding layers. In Advances in Neural Information Processing Systems, pp. 24–34. Cited by: §3.4, §3.4, §3, §5.4.
  • [62] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237. Cited by: §3.2, §3.2.
  • [63] J. Yin, S. Karimi, A. Lampert, M. Cameron, B. Robinson, and R. Power (2015) Using social media to enhance emergency situation awareness. In Twenty-fourth international joint conference on artificial intelligence, Cited by: §2.
  • [64] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations. Cited by: Table 5, §5.4.

Appendix: Setting D Multi-Label Multi-class Categorization

In previous experiments of this paper, we followed prior research in crisis event categorization and viewed the task as a multi-class single-label task. In this section, we provide three simple modifications to our model for extending it to a multi-label multi-class classifier.

In a multimodal single-label classification system, representations of different modalities are often fused to construct a joint representation from which a common label is reasoned for the multimodal-pair. Our classifiers in settings A, B, and C are multimodal multi-class single-label models. However, in setting D, we are interested in using both image and text information to predict separate labels for them. Figure 4 (a) and (b) show examples of these settings.

In Figure 4 (a), the image and the text of the multimodal pair are both labeled as Vehicle Damage. In contrast, in Figure 4 (b), while the image shows damaged vehicles, the text only contains information about the location of the event and therefore does not fall into the Vehicle Damage category. In setting D, we want to use the information in both image and text to classify the image of this example into the Vehicle Damage class and the text into the Other Relevant Information class.

Cross-Attention: A straightforward way to capture these properties is by attaching two classifier heads to the output of the cross-attention module in our proposed model. We refer to this version as Cross-Attention classifier.

Self-Attention: The cross-attention mechanism in Eq. (4) uses text embeddings (image feature maps) to block misleading information from image feature maps (text embeddings). However, in setting D, since image and text may have different labels, they both can be informative but contain different information. Thus, we replace this module by separate self-attention blocks [18, 26] in each modality. That is, we still filter the uninformative features, but we do that based on the information in the modality itself.

Self-Cross-Attention: In the Self-Attention extension, the features of different modalities do not interact directly with each other. With a few modifications to the self-attention extension, and by combining it with our cross-attention model, one can develop a version of our method that is specifically designed for multi-label multi-class classification tasks. We use a self-attention block to learn a mask that filters the uninformative features from each modality. At the same time, we invert this mask and use the inverted mask to attend to the other modality and select useful features. This way, we not only develop modality-specific features, but do so by exploiting useful information from both modalities.

We first calculate a self-attention mask for each of the image and text modalities and then derive the corresponding inverse masks from them. Given the attention masks and their inverses, we calculate the augmented image features and the augmented text features, where the remaining terms are the same as in Eq. (3) in the paper. We feed the augmented image and text features to the classifier heads of images and texts, respectively.
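The Self-Cross-Attention idea can be illustrated with a small sketch. It is one possible reading of the description above, not the authors' formulation: the sigmoid mask, the inverse mask taken as one minus the mask, the concatenation of the two masked parts, and the assumption that image and text features share the same dimension are all assumptions made for this example, as are the names `Wv`, `bv`, `We`, and `be`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(W, b, x):
    """Dense layer: W is a list of rows, b a list of biases."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def self_cross_attention(v, e, Wv, bv, We, be):
    """Return augmented image and text features (illustrative assumptions).

    v, e: image and text feature vectors of equal length d (assumed).
    """
    # Self-attention masks: each modality gates itself.
    alpha_v = [sigmoid(z) for z in linear(Wv, bv, v)]
    alpha_e = [sigmoid(z) for z in linear(We, be, e)]
    # Assumed inverse masks: 1 - mask.
    inv_v = [1.0 - a for a in alpha_v]
    inv_e = [1.0 - a for a in alpha_e]
    # Own features filtered by the mask, other modality attended by the inverse.
    v_aug = [a * x for a, x in zip(alpha_v, v)] + [a * x for a, x in zip(inv_v, e)]
    e_aug = [a * x for a, x in zip(alpha_e, e)] + [a * x for a, x in zip(inv_e, v)]
    return v_aug, e_aug


# Toy 3-d features; zero weights give masks of exactly 0.5 everywhere.
d = 3
v = [1.0, -2.0, 4.0]
e = [0.5, 8.0, -1.0]
zeros_W = [[0.0] * d for _ in range(d)]
zeros_b = [0.0] * d
v_aug, e_aug = self_cross_attention(v, e, zeros_W, zeros_b, zeros_W, zeros_b)
```

Each augmented vector would then be passed to its own classifier head, so the image and text labels can differ while both heads still see information from the other modality.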

Figure 4: The behavior of our classifiers in different settings. (a) Our classifiers in settings A, B, and C view the task as a multi-class single-label task. (b) Our classifiers in setting D view the task as a multi-class multi-label task.

6.1 Experiments:

We evaluate the multi-label extensions in Task 1. In this experiment, both the training and test sets contain inconsistent labels. That is, in both training and testing, the image and the text of a pair may have different labels.

Model                  Modality   Acc     Macro F1   Weighted F1
DenseNet [25]          Images     78.30   78.30      78.31
BERT [14]              Texts      82.63   74.93      80.87
Feature Fusion         Images     78.37   78.15      78.21
                       Texts      83.63   79.01      83.22
Cross-Attention        Images     77.17   77.51      77.51
                       Texts      83.35   79.60      83.41
Self-Attention         Images     82.56   82.54      82.56
                       Texts      83.63   76.79      82.17
Self-Cross-Attention   Images     81.64   81.51      81.55
                       Texts      83.45   78.22      82.78

Table 6: Setting D: Informativeness Evaluation

As the test set of this setting contains samples with inconsistent labels for image and text, we configure the training cases so that inconsistent image-text labels are included in training as well. Benchmarks for this setting include unimodal models as well as a version of the feature fusion model with two classification heads.

We evaluate our method on Task 1. We keep the ratio between the number of samples in the train and test sets similar to setting B in Table 2. However, we sample randomly while relaxing the assumption of Eq. (9) in the paper for both the train and test sets.

In Table 6, the results of the different methods are compared in terms of Accuracy, Macro F1, and Weighted F1. By comparing the unimodal DenseNet and BERT results with Table 4, we observe that the test set in setting D, with inconsistent labels for images and texts, is more challenging than the test sets in the previous settings. As can be seen, most methods have an advantage over unimodal DenseNet and BERT. The Cross-Attention method provides better results for text, and the Self-Attention method provides better results for images. The Self-Cross-Attention method, on average, provides results comparable to the Self-Attention and Cross-Attention methods for both modalities. Note that all three attention methods use the multimodal SSE technique, which provides additional training data (with both consistent and inconsistent labels).