Information from different modalities often provides complementary signals about a concept, an object, or an event. Learning from multiple modalities therefore leads to more robust inference than learning from a single modality. Multimodal learning is a well-researched area and has been applied in many fields, including audio-visual analysis (e.g., videos) Poria2016,Pereira2016, cross-modal studies Nagrani2018, and speech processing (e.g., audio and transcriptions) Chowdhury2019. Despite these successes in other areas, limited attention had been given to multimodal social media data analysis until recently gautam2019multimodal. In particular, using social media data for social good requires time-critical analysis of the multimedia content (e.g., textual messages, images, videos) posted during a disaster situation to help humanitarian organizations in preparedness, mitigation, response, and recovery efforts.
Figure 1 shows a few tweets with associated images collected from three recent major disasters. We observe that relying on a single modality may often miss important insights. For instance, although the tweet text in Figure 1(f) reports a 6.1-magnitude earthquake in Southern Mexico, the scale of the damage incurred by this earthquake cannot be inferred from the text alone. However, if we analyze the image attached to this tweet, we can easily understand the immense destruction caused by the earthquake.
|Hurricane Maria||California Wildfires||Mexico Earthquake|
|(a) Hurricane Maria turns Dominica into ‘giant debris field’ https://t.co/rAISiAhMUy by #AJEnglish via <USER>||(b) A friend’s text message saved Sarasota man from deadly California wildfire https://t.co/0TNMFgL885||(c) Earthquake leaves hundreds dead, crews combing through rubble in #Mexico https://t.co/XPbAEIBcKw|
|(d) Corporate donations for Hurricane Maria relief top $24 million https://t.co/w34ZZziu88||(e) California Wildfires Threaten Significant Losses for P/C Insurers, Moody’s Says https://t.co/ELUaTkYbzZ||(f) Southern Mexico rocked by 6.1-magnitude earthquake CLICK BELOW FOR FULL STORY… https://t.co/Vkz6fNVe5s…|
Most of the previous studies that rely on social media for disaster response have mainly focused on textual content analysis, and little focus has been given to images shared on social media imran2015processing,CarlosCastillo2016. Many past research works have demonstrated that images shared on social media during a disaster event can help humanitarian organizations in a number of ways. For example, [nguyen17damage] use images shared on Twitter to assess the severity of infrastructure damage. [Mouzannar2018] also focus on identifying damages in infrastructure and environmental elements. Taking a step further, [gautam2019multimodal] have recently presented a work on multimodal analysis of crisis-related social media data for identifying informative tweet text and image pairs.
In this study, we also aim to use both text and image modalities of Twitter data to learn (i) whether a tweet is informative for humanitarian aid or not, and (ii) whether it contains some useful information such as a report of injured or deceased people, infrastructure damage, etc. We tackle this problem in two separate classification tasks and solve them using multimodal deep learning techniques.
The typical approaches to dealing with multimodality include feature-level and decision-level fusion, also termed early and late fusion kuncheva2004combining,alam2014. In deep learning architectures, modalities are combined at the hidden layers using different variants of network architectures such as static, dynamic, and N-way classification, as can be seen in ngiam2011multimodal,Nagrani2018,Chowdhury2019. Specifically, [ngiam2011multimodal] explore different architectures for audio-visual data. Their study includes unimodal as well as cross-modal learning (i.e., learning one modality while multiple modalities are given during feature learning), multimodal fusion, and shared representation learning. [Nagrani2018] also study audio-visual data for a biometric matching task, investigating different deep neural network architectures for developing a multimodal network, whereas [Chowdhury2019] analyze audio and transcriptions while concatenating both modalities in a hidden layer.
In this work, we propose to learn a joint representation from two parallel deep learning architectures where one architecture represents the text modality and the other architecture represents the image modality. For the image modality, we use the well-known VGG16 network architecture and extract high-level features of an input image using the penultimate fully-connected (i.e., fc2) layer of the network. For the text modality, we define a Convolutional Neural Network (CNN) with five hidden layers and different filters. Two feature vectors obtained from both modalities are then fed into a shared representation followed by a dense layer before performing a prediction using softmax. In the literature, this type of joint representation is also known as early fusion. The proposed multimodal architecture is trained using three different settings as follows: (i) train a network using input data from both modalities, (ii) train a network using only the text modality, and (iii) train a network using only the image modality.
We perform extensive experiments on a real-world disaster-related dataset collected from Twitter, i.e., CrisisMMD alam2018crisismmd, using the aforementioned three training settings for two different classification tasks: informativeness and humanitarian classification. The test data for the evaluation of all three settings are fixed. The experimental results show that the proposed approach (i.e., multimodal learning) outperforms our baseline models trained on a single modality (i.e., either text or image). For the informativeness classification task, our best model obtained an F1-score of 84.2, and for the humanitarian classification task, our best model achieved an F1-score of 78.3. Although this model outperforms its counterpart unimodal baseline models (i.e., trained on a single modality), we remark that there is substantial room for improvement, which we leave as future work. To the best of our knowledge, this is the first study that presents baseline results on CrisisMMD using state-of-the-art deep learning-based unimodal and multimodal approaches for both informativeness and humanitarian tasks, all in one place. We hope that the experimental analyses presented in this study will provide guidance for future research using the CrisisMMD dataset.
The rest of the paper is organized as follows. In the Related Work section, we provide a review of the literature. Then, in the Dataset section, we present details of the dataset used in this study. Next, in the Experiments section, we describe the methodology and discuss experimental results. We then present possible applications and future directions in the Discussion section. Finally, we conclude the paper with the Conclusions section.
2 Related Work
Many past studies have analyzed social media data, especially textual content, and demonstrated its usefulness for humanitarian aid purposes imran2015processing,CarlosCastillo2016. With the recent successes of deep learning, research works have started to use social media images for humanitarian aid. For instance, the importance of imagery content on social media has been reported in many studies petersinvestigating,daly2016mining,chen2013understanding,nguyen2017automatic,nguyen17damage,alam17demo,alam2019SocialMedia. [petersinvestigating] analyzed data collected from Flickr and Instagram for the 2013 flood event in Saxony. Their findings suggested that images accompanying on-topic textual content were more relevant to the disaster event, and that the imagery content also provided important event-related information. Similarly, [daly2016mining] analyzed images extracted from social media data focused on a fire event. They analyzed the spatio-temporal meta-data associated with the images and suggested that geotagged information is useful for locating fire-affected areas. The analysis of imagery content shared on social media has been explored using deep learning techniques in several studies nguyen2017automatic,nguyen17damage,alam17demo. Furthermore, [alam2019SocialMedia]
presented an image processing pipeline, developed using deep learning-based techniques, to extract meaningful information from social media images during a crisis situation. Their pipeline includes collecting images, removing duplicates, filtering irrelevant images, and finally classifying them by damage severity.
Combining textual and visual content can provide highly relevant information, as discussed by [bica2017visual], who explored social media images posted during two major earthquakes in Nepal during April-May 2015. Their study focused on identifying geo-tagged images and their associated damage. [chen2013understanding] studied the association between tweets and images, and their use in classifying visually relevant and irrelevant tweets. They designed classifiers by combining features from the text, images, and socially relevant contextual features (e.g., posting time, follower ratio, the number of comments, re-tweets), and reported an F1-score in a binary classification task higher than that of text-only classification. Recently, [Mouzannar2018] explored damage detection by focusing on human and environmental damages. Their study explores unimodal as well as different multimodal modeling setups based on a collection of multimodal social media posts labeled with six categories: infrastructural damage (e.g., damaged buildings, wrecked cars, and destroyed bridges), damage to natural landscape (e.g., landslides, avalanches, and falling trees), fires (e.g., wildfires and building fires), floods (e.g., city, urban, and rural), human injuries and deaths, and no damage. Similarly, [gautam2019multimodal] presented a comparison of unimodal and multimodal methods on crisis-related social media data using a decision-fusion approach for classifying tweet text and image pairs into informative and non-informative categories.
For the tweet classification task, deep learning-based techniques such as Convolutional Neural Networks (CNN) nguyen2017robust and Long Short-Term Memory networks (LSTM) rosenthal2017semeval have been widely used. For the image classification task, state-of-the-art works also utilize deep neural network techniques such as CNNs with deep architectures. Among the different CNN architectures, the most popular are VGG simonyan2014very, AlexNet krizhevsky2012imagenet, and GoogLeNet szegedy2015going. VGG is designed using an architecture with very small (3×3) convolution filters and a depth of 16 or 19 layers. The 16-layer network is referred to as the VGG16 network, which we use in this study.
For combining multiple modalities, early and late fusion have been the traditional approaches kuncheva2004combining. The early-fusion approaches combine features from different modalities, whereas the late-fusion approaches make classification decisions either by majority voting or by stacking (i.e., combining classifiers’ outputs and making a decision by training another classifier). In the deep learning paradigm, hidden layers are typically combined using approaches such as static or dynamic concatenation, as discussed in Nagrani2018,Chowdhury2019. In this study, we follow CNN-based deep learning architectures for both unimodal and multimodal experiments. We extract high-level features from two independent modality-specific networks (i.e., text and image) and concatenate them to form a shared representation for our classification tasks.
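The contrast between the two fusion strategies can be sketched as follows. This is a minimal toy illustration with made-up feature vectors, and probability averaging standing in for majority voting or stacking; it is not the exact setup used in any of the cited works.

```python
import numpy as np

def early_fusion(text_feats, image_feats):
    # Early (feature-level) fusion: concatenate modality-specific
    # feature vectors into one joint representation for a classifier.
    return np.concatenate([text_feats, image_feats])

def late_fusion(text_probs, image_probs):
    # Late (decision-level) fusion: combine per-modality class
    # probabilities (here by simple averaging, a stand-in for
    # majority voting or stacking).
    return (np.asarray(text_probs) + np.asarray(image_probs)) / 2.0

# Toy example: a 4-dim text feature vector and a 3-dim image feature vector.
joint = early_fusion(np.array([0.1, 0.2, 0.3, 0.4]),
                     np.array([0.5, 0.6, 0.7]))
print(joint.shape)  # (7,)

# Toy example: the two unimodal classifiers disagree; averaging decides.
fused = late_fusion([0.7, 0.3], [0.4, 0.6])
print(fused)  # [0.55 0.45]
```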
Our study differs from previous studies in a number of ways Mouzannar2018,gautam2019multimodal. Specifically, we experiment with one of the largest publicly available datasets (i.e., CrisisMMD) to provide baseline results for two popular crisis response tasks, i.e., informativeness and humanitarian categorization, using multimodal deep learning with a feature-fusion approach. In contrast, [Mouzannar2018] focus only on human and environmental damages using a home-grown dataset, which limits the generalization of their findings. As for [gautam2019multimodal], although they also use a subset of the CrisisMMD dataset, they focus only on the informativeness classification task and employ a decision-fusion approach in their experiments. Unfortunately, due to potential differences in data organization (i.e., training, validation, and test splits), our experimental results are not directly comparable with theirs.
3 Dataset
We use the CrisisMMD dataset (http://crisisnlp.qcri.org/), a multimodal dataset consisting of tweets and associated images collected during seven different natural disasters that took place in 2017 alam2018crisismmd. The dataset is annotated for three tasks: (i) informative vs. not-informative, (ii) humanitarian categories (eight classes), and (iii) damage severity (three classes). Because the third task, i.e., damage severity, was applied only to images, we do not consider it in the current study and focus only on the first two tasks, described below.
Informative vs. Not-informative. The purpose of this task is to determine whether a given tweet text or image, collected during a disaster event, is useful for humanitarian aid purposes (a detailed definition of humanitarian aid is provided in alam2018crisismmd). If the given text (image) is useful for humanitarian aid, it is considered an “informative” tweet (image); otherwise, a “not-informative” tweet (image).
Humanitarian Categories. The purpose of this task is to understand the type of information shared in a tweet text/image, which was collected during a disaster event. Given a tweet text/image, the task is to categorize it into one of the categories listed in Table 1.
An important property of CrisisMMD is that some of the co-occurring tweet text and image pairs have different labels for the same task, because the text and image modalities were annotated separately and independently. Therefore, in this study, we consider only the subset of the original dataset where text and image pairs have the same label for a given task. As a result of this filtering, some of the categories in the humanitarian task are left with only a few pairs of tweet text and image. This situation skews the overall label distribution and creates a challenging setup for model training. To deal with this issue, we combine the minority categories that are semantically similar or relevant. Specifically, we merge the “injured or dead people” and “missing or found people” categories into the “affected individuals” category. Similarly, we merge the “vehicle damage” category into the “infrastructure and utility damage” category. As a result, we are left with five categories for the humanitarian task, as shown in Table 1.
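The category-merging step described above amounts to a simple label mapping; a minimal sketch follows (the category strings are taken from the paper, while the dictionary and function are our own illustration):

```python
# Merge semantically similar minority categories into broader ones,
# as described in the text. Unlisted categories pass through unchanged.
MERGE_MAP = {
    "injured or dead people": "affected individuals",
    "missing or found people": "affected individuals",
    "vehicle damage": "infrastructure and utility damage",
}

def relabel(category):
    """Map a minority category to its merged label."""
    return MERGE_MAP.get(category, category)

labels = ["injured or dead people", "vehicle damage",
          "rescue volunteering or donation effort"]
print([relabel(c) for c in labels])
# ['affected individuals', 'infrastructure and utility damage',
#  'rescue volunteering or donation effort']
```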
Twitter allows attaching up to four images to a tweet; hence, CrisisMMD contains some tweets that have more than one image, i.e., the same tweet text is associated with multiple images. While splitting the data into training, development, and test sets, we need to make sure that no duplicate tweet text exists across these splits. To achieve this, we simply assign all tweets with multiple images only to the training set. This also ensures that there are no repeated data points (i.e., tweet texts) in the development and test sets for the unimodal experiments on the text modality. While doing so, we maintain a 70%:15%:15% data split ratio for the training, development, and test sets, respectively. Table 1 provides the final set of categories and the total number of tweet texts and images in each category, as well as their split into training, development, and test sets (the data split used in the experiments can be found online at http://crisisnlp.qcri.org/). Note that the numbers of tweet texts and images in the table differ only for the training sets, as per the strategy explained above.
|Category||Train (70%) Text / Image||Dev (15%) Text / Image||Test (15%) Text / Image||Total Text / Image|
|Rescue volunteering or donation effort||762 / 912||149 / 149||126 / 126||1,037 / 1,187|
|Infrastructure and utility damage||496 / 612||80 / 80||81 / 81||657 / 773|
|Other relevant information||1,192 / 1,279||239 / 239||235 / 235||1,666 / 1,753|
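The split strategy described above (multi-image tweets go to training, and the remaining single-image tweets are divided to reach an overall 70%:15%:15% ratio) can be sketched as follows. The `(tweet_id, num_images)` representation and the fixed random seed are our own simplifications:

```python
import random

def split_dataset(tweets, seed=42):
    """Assign all multi-image tweets to training, then split the
    remaining single-image tweets so the overall ratio is ~70/15/15.
    `tweets` is a list of (tweet_id, num_images) pairs."""
    multi = [t for t in tweets if t[1] > 1]
    single = [t for t in tweets if t[1] == 1]
    rng = random.Random(seed)
    rng.shuffle(single)
    total = len(tweets)
    n_dev = n_test = int(0.15 * total)
    dev = single[:n_dev]
    test = single[n_dev:n_dev + n_test]
    train = multi + single[n_dev + n_test:]
    return train, dev, test

# Toy dataset: 90 single-image tweets and 10 multi-image tweets.
data = [(i, 1) for i in range(90)] + [(i, 3) for i in range(90, 100)]
train, dev, test = split_dataset(data)
print(len(train), len(dev), len(test))  # 70 15 15
```

Because development and test sets draw only from single-image tweets, no tweet text can repeat across splits.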
4 Experiments
As explained in the previous section, we have two sets of annotations for two separate classification tasks, i.e., informativeness and humanitarian. For each of these tasks, we perform three classification experiments where we train models using (i) only the tweet text, (ii) only the tweet image, and (iii) the tweet text and image together.
In the following subsections, we first describe the data preprocessing steps and then describe the deep learning architecture used for each modality as well as their training details. To measure the performance of the trained models, we use several well-known metrics such as accuracy, precision, recall, and F1-score.
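For reference, the reported metrics can be computed from scratch for a single class as follows (a minimal sketch; in practice a library such as scikit-learn would be used, and multi-class scores would be averaged across classes):

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 1 = informative, 0 = not-informative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```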
4.1 Data Preprocessing
The textual content of tweets is often noisy, usually consisting of many symbols, emoticons, and invisible characters. Therefore, we preprocess them by removing stop words, non-ASCII characters, numbers, URLs, and hashtag signs. We also replace all punctuation marks with white-spaces.
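A minimal sketch of this preprocessing pipeline is shown below. The stop-word list and regular expressions are our own illustrative choices; the paper does not specify the exact implementation.

```python
import re
import string

# A small illustrative stop-word list (a real inventory would be larger).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "for", "to", "and"}

def preprocess_tweet(text):
    """Remove URLs, non-ASCII characters, numbers, hashtag signs,
    and stop words; replace punctuation with white-space."""
    text = re.sub(r"http\S+", " ", text)             # URLs
    text = text.encode("ascii", "ignore").decode()    # non-ASCII characters
    text = re.sub(r"\d+", " ", text)                  # numbers
    text = text.replace("#", " ")                     # hashtag signs
    text = text.translate(str.maketrans(string.punctuation,
                                        " " * len(string.punctuation)))
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess_tweet(
    "Earthquake leaves hundreds dead in #Mexico https://t.co/XPbAEIBcKw"))
# earthquake leaves hundreds dead mexico
```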
On the image side, we follow the typical preprocessing steps of scaling the pixels of an image between 0 and 1 and then normalizing each channel with respect to the ImageNet dataset deng2009imagenet.
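This preprocessing can be sketched as follows, using the ImageNet channel statistics commonly used for this purpose:

```python
import numpy as np

# ImageNet per-channel statistics (RGB) commonly used for normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess_image(img_uint8):
    """Scale pixel values to [0, 1], then normalize each channel
    with ImageNet statistics. `img_uint8` has shape (H, W, 3)."""
    img = img_uint8.astype(np.float32) / 255.0
    return (img - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy gray image
out = preprocess_image(img)
print(out.shape)  # (224, 224, 3)
```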
4.2 CNN: Text Modality
For the text modality, we use a Convolutional Neural Network (CNN) due to its strong performance in crisis-related tweet classification tasks reported in nguyen2017robust. Specifically, we create a CNN consisting of 5 hidden layers. As input to the network, we zero-pad the tweets to equal length and then convert them into a word-level matrix where each row represents a word in the tweet, extracted using a pre-trained word2vec model discussed in Alam2018GraphBS. This word2vec model is trained using the Continuous Bag-of-Words (CBOW) approach of [mikolov2013efficient] on a large disaster-related dataset of millions of tweets, with a context window size of 5 and negative sampling.
The input then goes through a series of sequential layers including the convolutional layer, followed by the max-pooling layer, to obtain a higher-level fixed-size feature representation for each tweet. These fixed-size feature vectors are then passed through one or more fully connected hidden layers, followed by an output layer. In the convolutional and fully-connected layers, we use rectified linear units (ReLU) krizhevsky2012imagenet as the activation function, and in the output layer, we use the softmax activation function.
We train the CNN models using the Adam optimizer zeiler2012adadelta, with the learning rate tuned by optimizing the classification loss on the development set. The maximum number of epochs is set to 50, and dropout srivastava2014dropout is used to avoid overfitting. We set an early-stopping criterion based on the accuracy on the development set with a patience of 10. We use 100, 150, and 200 filters with corresponding window sizes of 2, 3, and 4, respectively, and use the same pooling length as the filter window size. We also apply batch normalization due to its success reported in the literature ioffe2015batch.
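The convolution and pooling configuration above can be sketched with a plain numpy forward pass (random, untrained weights; the embedding dimension and tweet length are illustrative, and for simplicity we pool over all positions rather than using the window-sized pooling length described above):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, max_len = 300, 25  # illustrative word2vec dim and padded length

def conv_maxpool(X, n_filters, window):
    """1-D convolution over the word matrix followed by max-pooling
    over time, as in the text CNN (random weights for illustration)."""
    W = rng.standard_normal((n_filters, window * emb_dim)) * 0.01
    # Slide a window of `window` consecutive word vectors over the tweet.
    windows = np.stack([X[i:i + window].ravel()
                        for i in range(X.shape[0] - window + 1)])
    feats = np.maximum(windows @ W.T, 0.0)  # ReLU activation
    return feats.max(axis=0)                # max over all positions

X = rng.standard_normal((max_len, emb_dim))  # one zero-padded tweet
# Filter configuration from the paper: 100/150/200 filters with
# window sizes 2/3/4, concatenated into one representation.
rep = np.concatenate([conv_maxpool(X, n, w)
                      for n, w in [(100, 2), (150, 3), (200, 4)]])
print(rep.shape)  # (450,)
```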
4.3 VGG16: Image Modality
For the image modality, we employ transfer learning, an effective approach for visual recognition tasks yosinski2014transferable,ozbulak2016transferable. The idea of transfer learning is to reuse the weights of a pre-trained model. We use the weights of a VGG16 model pre-trained on ImageNet to initialize our model, and we adapt the last layer (i.e., the softmax layer) of the network to the particular classification task at hand instead of the original 1,000-way classification. Transfer learning allows us to transfer the features and parameters of the network from a broad domain (i.e., large-scale image classification) to a specific one, in our case the informativeness and humanitarian classification tasks.
We train the image models using the Adam optimizer zeiler2012adadelta, with an initial learning rate that is reduced when accuracy on the development set stops improving for 100 epochs. We set the maximum number of epochs to 1,000 with an early-stopping criterion.
4.4 Multimodal: Text and Image
In Figure 2, we present the architecture of the multimodal deep neural network that we use for the experiment. As can be seen in the figure, for the image modality we use the VGG16 network, and for the text modality we use a CNN-based architecture. Before forming the shared representation from both modalities, we add another hidden layer of size 1,000 on each side. We choose the same size to ensure an equal contribution from both modalities. There is no specific reason for choosing a size of 1,000 in the current experimental setting; it could be optimized empirically. After concatenating both modalities, we add one hidden layer before the softmax layer.
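The forward pass of this multimodal network can be sketched as follows. The weights are random and untrained; the 4,096-dim image features match the size of VGG16's fc2 layer, while the 450-dim text representation and the 500-unit post-fusion hidden layer are our own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, in_dim, out_dim):
    """A randomly initialized fully connected layer with ReLU."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.01
    return np.maximum(x @ W, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

img_feats = rng.standard_normal(4096)  # VGG16 fc2 features
txt_feats = rng.standard_normal(450)   # text-CNN representation (assumed)

# Project each modality to 1,000 units for equal contribution, then fuse.
img_hidden = dense(img_feats, 4096, 1000)
txt_hidden = dense(txt_feats, 450, 1000)
shared = np.concatenate([img_hidden, txt_hidden])  # joint representation

hidden = dense(shared, 2000, 500)  # hidden layer before softmax (size assumed)
probs = softmax(hidden @ (rng.standard_normal((500, 2)) * 0.01))
print(probs.shape, round(float(probs.sum()), 6))  # (2,) 1.0
```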
We use the Adam optimizer for training the model. To avoid overfitting, we use an early-stopping condition, and as the activation function, we choose ReLU. For this experiment, we do not tune any hyper-parameters (e.g., the sizes of hidden layers, filter sizes, dropout rate); hence, there is room for further improvement in future studies.
In Tables 2 and 3, we present the performance results for the different tasks and modalities. In the unimodal experiments, the image-only models perform better than the text-only models in both the informativeness and humanitarian tasks. Specifically, the improvement is more than 2% on average in the informativeness task and more than 6% on average in the humanitarian task. In the multimodal experiments, we observe additional performance improvements in both tasks. Specifically, the multimodal model performs about 1% better than the image-only model on all measures for the informativeness task and about 2% better on all measures for the humanitarian task. These results confirm that the multimodal learning approach provides further performance improvement over the unimodal learning approach.
Overall, the performance of the humanitarian classification models is lower than that of the informativeness classification models due to the relatively more complex nature of the former task. However, it is important to note that the results presented in this study are obtained using basic network architectures and should be considered a baseline.
|Task||Model||Modality||Accuracy||Precision||Recall||F1|
|Informativeness||Multimodal||Text + Image||84.4||84.1||84.0||84.2|
|Humanitarian||Multimodal||Text + Image||78.4||78.5||78.0||78.3|
5 Discussion
Deep learning models are data-hungry. Given our initial condition to consider only the tweet text and image pairs that have the same labels for a given task, we are left with a limited subset of the original CrisisMMD dataset. Even so, the proposed multimodal joint learning showed improved performance over the unimodal models for both text and image modalities. Furthermore, while designing a model for multiple modalities, an important challenge is to find concatenation strategies that better capture important information from both modalities. Toward this direction, we design the model by concatenating hidden layers into another layer to form a joint shared representation.
We further analyzed the performance of the three models (i.e., text-only, image-only, and text + image) by examining their confusion matrices. Table 4 shows the confusion matrices of the three models for the first task (i.e., informative vs. not-informative). From the emergency managers’ point of view, it is important that the machine does not miss any useful and relevant message/tweet. The three confusion matrices (a, b, & c) reveal that our text-only and image-only models missed 155 and 114 instances, respectively, whereas the multimodal model missed only 101 instances. These are the instances where the machine says “not informative” but the ground-truth labels (i.e., human annotators) say “informative” (a.k.a. false negatives). The image-only model made significant improvements over the text-only model; moreover, when the text and image modalities are combined in the multimodal case, the error rate dropped significantly (i.e., from 155 to 101).
Another important aspect relates to information overload on emergency managers during a disaster situation. Specifically, this happens when the machine says a message is informative but, according to the ground-truth labels, it is not (a.k.a. false positives). The confusion matrices in Table 4 show these mistakes as 139 for the text-only model, 145 for the image-only model, and 139 in the multimodal case. Here, we do not observe any improvement from the multimodal approach as in the false-negative case.
Table 5 shows the confusion matrices of the three models for the humanitarian categorization task. One prominent and important column to observe is “N”, which corresponds to the “not-humanitarian” category and shows all instances where the model prediction is “not-humanitarian”. In particular, if we look at the number of instances where the actual label is “infrastructure and utility damage” (denoted as “I”) but the model prediction is “not-humanitarian” (i.e., the value of the cell at the intersection of row “I” and column “N”), we see that the text-only model has 41 false negative instances in Table 5(a), whereas the image-only and multimodal models have 13 and 10 instances in Tables 5(b) and 5(c), respectively. This indicates that the image modality helps models better understand the “infrastructure and utility damage” category and, hence, significantly reduces the errors in the predictions. A similar phenomenon can be observed in favor of the text modality for the case where the actual label is “rescue, volunteering or donation effort” (denoted as “R”) but the predicted label is “not-humanitarian” (i.e., the value of the cell at the intersection of row “R” and column “N”). Specifically, the image-only model has 43 false negative instances in Table 5(b), while the text-only and multimodal models have 37 and 33 instances in Tables 5(a) and 5(c), respectively. In general, we see that the number of such errors is minimized by the third model, which uses both text and image modalities together. However, there are still some cases where significant improvements can be achieved. For instance, the “other relevant information” category (denoted as “O”) seems to create confusion for all the models, which needs to be investigated in a more detailed study.
|(a) <USER> Hurricane Lady #Maria It’ll rain burning blood. I hope Puerto Rico knows how to do Visceral Attacks.||(b) Hurricane Irma: Rapid response team ‘rescues’ fine wines - https://t.co/pUEeOixSdc #finewine #HurricaneIrma||(c) RT <USER>: Hurricane Irma nearly ruins a wedding day here in northeast Ohio! Social meeting comes to the rescue|
|Unimodal: informative (✗)||Unimodal: not-informative (✗)||Unimodal: informative (✗)|
|Multimodal: not-informative (✓)||Multimodal: informative (✓)||Multimodal: not-informative (✓)|
|(d) 6th grade Maryland student collects 3,000 cases of drinking water for Puerto Rico https://t.co/x57AeLHeaC||(e) Hurricane Harvey’s impact on the US oil industry https://t.co/zxVWR3u0fU||(f) #Breaking Tornado warning for Lantana Rd south to Boca Raton. #BeSafe <USER>|
|Text-only: not-humanitarian (✗)||Text-only: other relevant information (✗)||Text-only: not-humanitarian (✗)|
|Image-only: not-humanitarian (✗)||Image-only: not-humanitarian (✗)||Image-only: not-humanitarian (✗)|
|Multimodal: rescue, volunteering or donation effort (✓)||Multimodal: infrastructure and utility damage (✓)||Multimodal: other relevant information (✓)|
Figure 3 shows example image and text pairs that illustrate how joint modeling of image and text modalities can yield better predictions and, hence, lead to performance improvements over unimodal models. For instance, in (a), we reckon that the image-only model thinks the image is informative because it shows fire-like patterns, whereas the text-only model thinks the text is informative because it mentions rain burning blood. However, when these two modalities are evaluated together, they no longer provide any consistent evidence for the model to label this image-text pair as informative. Similarly, in (d), evaluating the image alone or the text alone does not result in correct predictions, whereas joint evaluation of image and text yields the correct label, i.e., “rescue, volunteering or donation effort”. Furthermore, we observe another interesting case in (e): the text-only model thinks there is potentially useful information for humanitarian purposes by predicting “other relevant information”, whereas the image-only model thinks there is nothing related to humanitarian purposes by predicting “not-humanitarian”. However, the multimodal model effectively fuses the information coming from both modalities to make the correct prediction, i.e., “infrastructure and utility damage”. We believe these examples provide further insights into the success of the proposed multimodal approach for modeling crisis-related social media data.
5.1 Challenges and Future Work
In contrast to other popular multimodal tasks such as image captioning or visual question answering, where there is strong alignment or coupling between the text and image modalities, social media data are not guaranteed to have such strong alignment or coupling between co-occurring text and image pairs. In some cases, each modality conveys a different type of information, which may even contradict the other modality. Therefore, it is important not to assume the existence of strong correspondences between social media text and images. To date, this is a relatively less explored phenomenon that needs more attention from the research community, since all of the existing multimodal classification approaches assume that there always exists a common label for data coming from different modalities. As such, a challenging future direction is to design a multimodal learning algorithm that can also be trained on heterogeneous input, i.e., tweet text and image pairs with disagreeing labels, in which case CrisisMMD can be used for model training in its full form.
6 Conclusions
Informative signals gathered from different data modalities on social media can be highly useful for humanitarian organizations in disaster response. Although images shared on social media contain useful information, past studies have largely focused on text analysis and have rarely combined both modalities for better performance. In this work, we proposed to learn a joint representation using both the text and image modalities of social media data. Specifically, we use state-of-the-art deep learning architectures to learn high-level feature representations from text and images to perform two classification tasks. Several experiments performed on a real-world disaster-related dataset reveal the usefulness of the proposed approach. In summary, our study has two main contributions: (i) it provides baseline results, all in one place, using unimodal and multimodal approaches for both informativeness and humanitarian tasks on the CrisisMMD dataset, and (ii) it shows that a feature-fusion-based multimodal deep neural network can outperform unimodal models on the challenging CrisisMMD dataset for both tasks, which underlines the importance of multimodal analysis of crisis-related social media data. Although our multimodal classifiers achieve better performance than the unimodal classifiers, we note that there is still substantial room for improvement, which we leave for future work.