During natural or human-induced disasters, social media has been widely used to quickly disseminate information and learn useful insights. People post content (through different modalities such as text, image, and video) on social media to get help and support, identify urgent needs, or share their personal feelings. Such information is useful for humanitarian organizations to plan and launch relief operations. As the volume and velocity of the content are significantly high, it is crucial to have real-time systems that automatically process social media content to facilitate rapid response. There has been a surge of research in this domain in the past couple of years, focused on analyzing the usefulness of social media data and developing computational models using different modalities to extract actionable information. Among the different modalities (e.g., text and image), more attention has been given to textual content than to imagery content (see Imran et al. (2015); Said et al. (2019) for a comprehensive survey). Nevertheless, many past studies have demonstrated that images shared on social media during a disaster event can help humanitarian organizations in a number of ways. For example, Nguyen et al. (2017) use images shared on Twitter to assess the severity of infrastructure damage, and Mouzannar et al. (2018) focus on identifying damage to infrastructure as well as environmental elements.
For a clearer understanding, we report an example in Figure 1(a), which demonstrates how different disaster-related classification models can be used for real-time image categorization. As presented in the figure, four different classification tasks, namely (i) disaster type, (ii) informativeness, (iii) humanitarian, and (iv) damage severity assessment, can significantly help crisis responders during disaster events. For example, the disaster type model can be used for real-time event detection, as shown in Figure 1(b). Similarly, the informativeness model can be used to filter out non-informative images, the humanitarian model can be used to identify fine-grained categories, and the damage severity model can be used to assess the severity of damage. The current literature reports either one or two of these tasks using one or two network architectures. Another limitation has been the scarcity of datasets for disaster-related image classification. Very recently, Alam et al. (2020) developed a benchmark dataset (referred to as the Crisis Benchmark Dataset throughout this paper), which is consolidated from existing publicly available resources. The development of this dataset involved curating data from different existing sources, creating new data for new tasks, and constructing non-overlapping training, development, and test sets (duplicate images identified across test and train sets were moved from the test set to the train set). The reported benchmark dataset targets the four tasks mentioned earlier.
Our work is inspired by Alam et al. (2020), and in this study we utilize their dataset. We extend their work and address the above-mentioned limitations by posing the following Research Questions (RQs):
RQ1: Can data consolidation help?
RQ2: Among different neural network architectures with pretrained models, which one is more suitable for different downstream disaster-related image classification tasks?
RQ3: Can augmentation or semi-supervised learning help to improve performance or yield more generalized models?
RQ4: Can multitask learning be an ideal solution in terms of speed and computational complexity?
In order to understand the benefits of data consolidation (RQ1), we extended the work of Alam et al. (2020) with a more in-depth analysis.
Our motivation for RQ2 is that there has been significant progress in neural network architectures for image processing in the past few years; however, they have not been widely explored in crisis informatics (https://en.wikipedia.org/wiki/Disaster_informatics) for disaster response tasks. Hence, we investigated the ten most popular neural network architectures for different disaster-related image classification tasks. Since augmentation and self-training-based techniques Cubuk et al. (2020); Lee and others (2013) have been shown to produce more generalized models and sometimes to improve performance, we posed RQ3 and investigated them for the mentioned tasks.
For real-time social media image classification, as shown in Figure 1, it is necessary to run the mentioned models sequentially or in parallel on the same input image. Running multiple models is computationally expensive given that a large number of social media images need to be classified in real time. The time and computational complexity can be reduced if a single model can be developed to handle multiple tasks. We address this with RQ4 and shed light on directions for future work. Note that the Crisis Benchmark Dataset was not developed for a multitask learning setup. However, the related metadata (e.g., image id) is available, and we utilized it to create data splits for multitask learning while trying to maintain the same train/dev/test splits. This also poses a great challenge due to incomplete/missing labels (see more details in Section 4.6).
To summarize, our contributions in this study are as follows:
We present more detailed results demonstrating the benefit of data consolidation.
We address four tasks using several state-of-the-art neural network architectures on different data splits.
We investigate data augmentation and show that models generalize better with augmentation.
We explore semi-supervised learning and multitask learning to have a single model while addressing multiple tasks. Based on the findings we provide research directions for future studies.
We also provide insights into network activations using Gradient-weighted Class Activation Mapping Selvaraju et al. (2017) to demonstrate which class-specific discriminative properties the network learns.
The rest of the paper is organized as follows. Section 2 provides a brief overview of existing work. Section 3 introduces the tasks and describes the datasets used in this study. Section 4 explains the experiments, Section 5 presents the results, and Section 6 provides a discussion. Finally, we conclude the paper in Section 7.
2 Related Work
2.1 Social Media Images for Disaster Response
The studies on image processing in the crisis informatics domain are relatively few compared to the studies analyzing textual content for humanitarian aid (https://en.wikipedia.org/wiki/Humanitarian_aid).
With the recent successes of deep learning for image classification, research has started to use social media images for humanitarian aid. The importance of imagery content on social media for disaster response tasks has been reported in many studies Peters and Porto de Albuqerque (2015); Daly and Thom (2016); Chen et al. (2013); Nguyen et al. (2017, 2017); Alam et al. (2017, 2018b). For instance, the analysis of flood images was studied in Peters and Porto de Albuqerque (2015), in which the authors reported that images accompanied by relevant textual content are more informative. Similarly, Daly and Thom (2016) analyzed images of fire events extracted from social media. Their findings suggest that images with geotagged information are useful for locating fire-affected areas.
The analysis of imagery content shared on social media has recently been explored using deep learning techniques for damage assessment purposes. Most of these studies categorize the severity of damage into discrete levels Nguyen et al. (2017, 2017); Alam et al. (2017), whereas others quantify the damage severity as a continuous-valued index Nia and Mori (2017); Li et al. (2018). Other related work includes addressing the data scarcity issue with more sophisticated models such as adversarial networks Li et al. (2019); Pouyanfar et al. (2019), disaster image retrieval Ahmad et al. (2017b), image classification in the context of bushfire emergencies Lagerstrom et al. (2016), a flooding photo screening system Ning et al. (2020), sentiment analysis from disaster images Hassan et al. (2019), monitoring natural disasters using satellite images Ahmad et al. (2017a), and flood detection using visual features Jony et al. (2019).
2.2 Real-time Systems
Recently, Alam et al. (2018b) presented an image processing pipeline, developed using deep learning techniques, to extract meaningful information from social media images during a crisis situation. The pipeline includes collecting images, removing duplicates, filtering irrelevant images, and finally classifying them by damage severity. The system has been used during several disaster events; one such example is the deployment during Hurricane Dorian, reported in Imran et al. (2020). The system was deployed for 13 days and collected around 280K images, which were automatically classified and then used by a volunteer response organization, the Montgomery County, Maryland Community Emergency Response Team (MCCERT). Another use case is the early detection of disaster-related damage to cultural heritage Kumar et al. (2020).
2.3 Multimodality (Image and Text)
Multimodality has also received attention in the research community Agarwal et al. (2020); Abavisani et al. (2020). In Agarwal et al. (2020), the authors explore different fusion strategies for multimodal learning. Similarly, in Abavisani et al. (2020), a cross-attention-based network is exploited for multimodal fusion. The study in Huang et al. (2019) reports a multimodal system for flood image detection, which achieves a precision of 87.4% on a balanced test set. In another study, the authors propose a similar multimodal system for on-topic vs. off-topic social media post classification and report an accuracy of 92.94% with imagery content. The study in Feng and Sester (2018) explores different classical machine learning algorithms to classify relevant vs. irrelevant tweets using both textual and imagery information. On the imagery content, they achieved an F1 score of 87.74% using XGBoost Chen and Guestrin (2016). The study in Rizk et al. (2019) proposes a simple, computationally inexpensive, multimodal two-stage framework to classify tweets (text and image) as built-infrastructure damage vs. nature damage. The authors investigated their approach using a home-grown dataset and the SUN dataset Xiao et al. (2010). Mouzannar et al. (2018) proposed a multimodal dataset developed for training a damage detection model. Similarly, Ofli et al. (2020) explore unimodal as well as different multimodal modeling approaches based on a collection of multimodal social media posts.
2.4 Transfer Learning for Image Classification
For the image classification task, transfer learning has been a popular approach, where a pre-trained neural network is used to train a new model for a new task Yosinski et al. (2014); Sharif Razavian et al. (2014); Ozbulak et al. (2016); Oquab et al. (2014); Ofli et al. (2020); Mouzannar et al. (2018). For this study, we follow the same approach using different deep learning architectures.
Currently, publicly available datasets include the damage severity assessment dataset Nguyen et al. (2017), CrisisMMD Alam et al. (2018), and the damage identification multimodal dataset Mouzannar et al. (2018). The first dataset is annotated only for images, whereas the latter two are annotated for both text and images. Other relevant datasets are Disaster Image Retrieval from Social Media (DIRSM) Bischke et al. (2017) and MediaEval 2018 Benjamin et al. (2018). The dataset reported in Gupta et al. (2019) is constructed for detecting damage as anomalies using pre- and post-disaster images; it consists of 700,000 building annotations. A similar and relevant effort is the incidents dataset Weber et al. (2020), which consists of 446,684 manually labeled images with 43 incident categories. The Crisis Benchmark Dataset reported in Alam et al. (2020) is the largest so far for social media disaster image classification.
For this study, we use the Crisis Benchmark Dataset, and our study differs from Alam et al. (2020) in a number of ways. We provide more detailed experimental results on dataset comparison (i.e., individual vs. consolidated), compare different network architectures with statistical significance tests, and report on the capability of data augmentation. We have also utilized a large unlabeled dataset to enhance the capability of the current model. Finally, we created multitask data splits from the Crisis Benchmark Dataset and report experimental results using both missing/incomplete and complete labels, which can serve as baselines for future work.
3 Tasks and Datasets
For this study, we addressed four different disaster-related tasks that are important for humanitarian aid. Below, we provide details of each task and the associated class labels.
3.1 Tasks
3.1.1 Disaster type detection
When ingesting images from unfiltered social media streams, it is important to automatically detect the different disaster types that those images show. For instance, an image can depict a wildfire, flood, earthquake, hurricane, or other type of disaster. In the literature, disaster types have been organized into hierarchical categories such as natural, man-made, and hybrid Shaluf (2007). Natural disasters are events that result from natural phenomena (e.g., fire, flood, earthquake). Man-made disasters are events that result from human actions (e.g., terrorist attacks, accidents, war, and conflicts). Hybrid disasters are events that result from human actions affecting natural phenomena (e.g., deforestation resulting in soil erosion and climate change). The class labels include (i) earthquake, (ii) fire, (iii) flood, (iv) hurricane, (v) landslide, (vi) other disaster – to cover all other disaster types (e.g., plane crash), and (vii) not disaster – for images that do not show any identifiable disasters.
3.1.2 Informativeness
Images posted on social media during disasters do not always contain informative content (e.g., images showing infrastructure damaged by flood, fire, or other disaster events) or content useful for humanitarian aid. It is necessary to remove irrelevant or redundant content so that crisis responders can work more effectively. Therefore, the purpose of this classification task is to filter out irrelevant images. The class labels for this task are (i) informative and (ii) not informative.
3.1.3 Humanitarian
An important aspect of crisis response is to assist people based on their needs, which requires classifying information into more fine-grained categories so that specific actions can be taken. In the literature, humanitarian categories often include affected individuals; injured or dead people; infrastructure and utility damage; missing or found people; rescue, volunteering, or donation efforts; and vehicle damage Alam et al. (2018). In this study, we focus on the four categories deemed most prominent and important for crisis responders: (i) affected, injured, or dead people, (ii) infrastructure and utility damage, (iii) rescue volunteering or donation effort, and (iv) not humanitarian.
3.1.4 Damage severity
Assessing the severity of the damage is important to help the affected community during disaster events. The severity of damage can be assessed based on the physical destruction to a built-structure visible in an image (e.g., destruction of bridges, roads, buildings, burned houses, and forests). Following the work reported in Nguyen et al. (2017), we define the categories for this classification task as (i) severe damage, (ii) mild damage, and (iii) little or none.
Figure 2 shows an example image that illustrates the labels for all four tasks.
3.2 Datasets
As mentioned earlier, we used the dataset reported in Alam et al. (2020) (https://crisisnlp.qcri.org/crisis-image-datasets-asonam20). The dataset was developed by curating existing publicly available sources and creating non-overlapping train/dev/test splits, which were made publicly available. For the sake of clarity and completeness, we provide a brief overview of the dataset; more details of the curation and consolidation process can be found in Alam et al. (2020).
3.2.1 Damage Assessment Dataset (DAD)
The damage assessment dataset consists of labeled imagery data with damage severity levels: severe, mild, and little-to-no damage Nguyen et al. (2017). The images were collected from two sources: AIDR Imran et al. (2014) and Google. To crawl data from Google, the authors used the keywords damage building, damage bridge, and damage road. The images from AIDR were collected from Twitter during different disaster events such as Typhoon Ruby, the Nepal Earthquake, the Ecuador Earthquake, and Hurricane Matthew. The dataset contains images annotated by paid workers as well as volunteers. In this study, we use this dataset for the informativeness and damage severity tasks. For the informativeness task, the study in Alam et al. (2020) mapped the mild and severe images into the informative class and manually categorized the little-to-no damage images into informative and not informative. For the damage severity task, the label little-to-no damage was mapped to little or none to align with the other datasets.
3.2.2 CrisisMMD
CrisisMMD is a multimodal (i.e., text and image) dataset consisting of images collected from tweets during seven disaster events crawled by the AIDR system Alam et al. (2018). The data was annotated by crowd-workers using the Figure Eight platform (since acquired by Appen, https://appen.com/) for three different tasks: (i) informativeness with binary labels (informative vs. not informative); (ii) humanitarian with seven class labels (infrastructure and utility damage; vehicle damage; rescue, volunteering, or donation effort; injured or dead people; affected individuals; missing or found people; other relevant information; and not relevant); and (iii) damage severity assessment with three labels (severe, mild, and little or no damage). For the humanitarian task, similar class labels were grouped together: images labeled injured or dead people and affected individuals were mapped into affected, injured, or dead people; infrastructure and utility damage and vehicle damage were mapped into infrastructure and utility damage; and other relevant information and not relevant were mapped into not humanitarian. Images labeled missing or found people were removed, as this category is difficult to identify. This results in four class labels for the humanitarian task.
3.2.3 AIDR Disaster Type Dataset (AIDR-DT)
The AIDR-DT dataset consists of tweets collected from 17 disaster events and 3 general collections. The tweets were collected by the AIDR system Imran et al. (2014). The 17 disaster events include floods, earthquakes, fires, hurricanes, terrorist attacks, and armed conflicts. The tweets in the general collections contain keywords related to natural disasters, human-induced disasters, and security incidents. Images were crawled from these collections for disaster type annotation. The labeling of these images was performed in two steps. First, a set of images was labeled as earthquake, fire, flood, hurricane, or none of these categories. Then, a sample of 2,200 images labeled as none of these categories in the first step was selected and further annotated with the not disaster and other disaster categories.
For the landslide category, images were crawled from Google, Bing, and Flickr using the keywords landslide, mudslide, “mud slides”, landslip, “rock slides”, rockfall, “land slide”, earthslip, rockslide, and “land collapse”. As images were collected from different sources, duplicates were present; therefore, duplicate filtering was applied to remove exact and near-duplicate images. The remaining images were then manually labeled as landslide or not landslide. The resulting annotated dataset consists of labeled images with the seven categories mentioned in Section 3.1.1.
3.2.4 Damage Multimodal Dataset (DMD)
The multimodal damage identification dataset consists of 5,878 images collected from Instagram and Google Mouzannar et al. (2018). The authors crawled the images using more than 100 hashtags proposed in the crisis lexicon Olteanu et al. (2014). The manually labeled data consists of six damage class labels: fires, floods, natural landscape, infrastructural, human, and non-damage. The non-damage class includes cartoons, advertisements, and images that are not relevant or useful for humanitarian tasks. The study by Alam et al. (2020) re-labeled the images for all four tasks (disaster type, informativeness, humanitarian, and damage severity) using the same class labels discussed in the previous section.
| Class label | Train | Dev | Test | Total |
| Affected, injured, or dead people | 521 | 51 | 100 | 672 |
| Infrastructure and utility damage | 3040 | 299 | 589 | 3928 |
| Rescue volunteering or donation effort | 1682 | 174 | 375 | 2231 |

| Affected, injured, or dead people | 242 | 28 | 63 | 333 |
| Infrastructure and utility damage | 933 | 125 | 242 | 1300 |
| Rescue volunteering or donation effort | 74 | 9 | 18 | 101 |

| Little or none | 7881 | 1101 | 1566 | 10548 |
| Little or none | 317 | 35 | 67 | 419 |
| Little or none | 2874 | 331 | 778 | 3983 |

| Affected, injured, or dead people | 772 | 73 | 160 | 1005 |
| Infrastructure and utility damage | 4001 | 406 | 821 | 5228 |
| Rescue volunteering or donation effort | 1769 | 172 | 391 | 2332 |
| Little or none | 11437 | 1378 | 2135 | 14950 |
3.3 Data Split
Before consolidating the datasets, each dataset was divided into train, dev, and test sets with a 70:10:20 ratio. The purpose was threefold: (i) train and evaluate each task on individual datasets, (ii) achieve a close-to-equal distribution from each dataset in the final consolidated dataset, and (iii) provide the research community an opportunity to use the splits independently. After the split, duplicate images were identified across sets and moved into the training set to create a non-overlapping test set.
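The split-then-deduplicate procedure can be sketched as follows. This is a minimal illustration: the helper names, the seed, and exact-key matching are assumptions; the original work uses image-level near-duplicate detection rather than simple key equality.

```python
import random

def split_dataset(items, ratios=(0.7, 0.1, 0.2), seed=42):
    """Split a list of items into train/dev/test with the given ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_dev = int(ratios[0] * n), int(ratios[1] * n)
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]

def move_duplicates_to_train(train, dev, test, key=lambda x: x):
    """Move dev/test items whose key already appears in train (or an earlier
    set) into the training set, yielding non-overlapping dev/test sets."""
    seen = {key(x) for x in train}
    clean_dev, clean_test = [], []
    for split, clean in ((dev, clean_dev), (test, clean_test)):
        for x in split:
            if key(x) in seen:
                train.append(x)      # duplicate: keep it in train only
            else:
                seen.add(key(x))
                clean.append(x)
    return train, clean_dev, clean_test

train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))  # 70 10 20
```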
3.4 Data Consolidation
One of the important reasons for data consolidation is to develop robust deep learning models with large amounts of data. For this purpose, all train, dev, and test sets were merged into consolidated train, dev, and test sets, respectively. Duplicate images identified in the dev and test sets during this process were moved into the train set to create non-overlapping sets for the different tasks. More details of the duplicate identification process can be found in Alam et al. (2020).
3.5 Data Statistics
Tables 1–4 show the label distribution of all datasets for all four tasks. Some class labels are skewed in the individual datasets. For example, among the disaster type datasets (Table 1), the distribution of the “other disaster” label is low in the AIDR-DT dataset, whereas the distribution of the “landslide” label is low in the DMD dataset. For the informativeness task, a low distribution is observed for the “informative” label. Moreover, for the humanitarian task, we have a low distribution for the “rescue volunteering or donation effort” label in the DMD dataset, and for the damage severity task, for the “mild” label in the CrisisMMD and DMD datasets. However, the consolidated dataset provides a fairer balance across class labels for the different tasks, as shown in Table 5.
4 Experiments
Our experiments consist of (i) a comparison of individual vs. consolidated datasets (RQ1), (ii) a comparison of network architectures on the consolidated datasets (RQ2), (iii) data augmentation (RQ3), (iv) a semi-supervised approach (RQ3), and (v) multitask learning (RQ4). Below, we first describe the experimental settings and then discuss the different experiments conducted in this study.
4.1 Experimental Settings
We employ a transfer learning approach, which has shown promising results for various visual recognition tasks in the literature Yosinski et al. (2014); Sharif Razavian et al. (2014); Ozbulak et al. (2016); Oquab et al. (2014). The idea of transfer learning is to reuse the weights of a pre-trained model. For this study, we used several neural network architectures implemented in the PyTorch library (https://pytorch.org/): ResNet18, ResNet50, and ResNet101 He et al. (2016), AlexNet Krizhevsky et al. (2012), VGG16 Simonyan and Zisserman (2014), DenseNet Huang et al. (2017), SqueezeNet Iandola et al. (2016), InceptionNet Szegedy et al. (2016), MobileNet Howard et al. (2017), and EfficientNet Tan and Le (2019).
We use the weights of networks trained on ImageNet Deng et al. (2009) to initialize our models. We adapt the last (softmax) layer of each network to the particular classification task at hand instead of the original 1,000-way classification. The transfer learning approach allows us to transfer the features and the parameters of the network from the broad domain (i.e., large-scale image classification) to the specific one, in our case four different classification tasks. We train the models using the Adam optimizer Kingma and Ba (2015), with an initial learning rate that is decreased by a factor of 10 when accuracy on the dev set stops improving for 10 epochs. The models were trained for 150 epochs.
We designed a binary classifier for the informativeness task and multiclass classifiers for the other tasks.
To measure the performance of each classifier, we use weighted average precision (P), recall (R), and F1-measure (F1). Due to limited space, we report only the F1-measure.
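As an illustration, the weighted averages can be computed with scikit-learn; the labels below are hypothetical, and this is not the authors' evaluation code.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical gold labels and predictions for a 3-class task.
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 0]

# 'weighted' averages per-class scores weighted by class support,
# which accounts for label imbalance.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```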
4.2 Datasets Comparison
To determine whether consolidated data helps achieve better performance, we train models on the training sets of the individual and consolidated datasets. However, we always test the models on the consolidated test set; since the test data is the same across experiments, the results are comparable. Because the four tasks span fifteen different datasets, we experimented only with the ResNet18 He et al. (2016) architecture to manage the computational load.
4.3 Network Architectures
Currently available neural network architectures come with different computational complexities. As one of our goals is to deploy the models in real-time applications, we evaluate the architectures to understand their performance differences. Another motivation is that the current crisis informatics literature reports results using only one or two network architectures (e.g., VGG16 in Ofli et al. (2020), InceptionNet in Mouzannar et al. (2018)), which we extend in this study.
4.4 Data Augmentation
Data augmentation is a commonly used technique to improve the generalization of deep neural networks in the absence of large-scale datasets. We experiment with the recently proposed RandAugment Cubuk et al. (2020) method for image augmentation, which was proposed as a fast alternative to learned augmentation strategies. We used the PyTorch implementation (https://github.com/ildoonet/pytorch-randaugment) in our experiments. To increase the diversity of generated examples, we used 16 different transformations. The augmentation strength is controlled with two tunable parameters: N, the number of augmentation transformations applied sequentially, and M, the magnitude of all transformations. Each magnitude resides on an integer scale from 0 to 30, with 30 being the maximum strength. In our experiments, we use a constant magnitude for all augmentations. The augmentation method then boils down to randomly selecting N transformations and applying each sequentially with a strength corresponding to magnitude M.
In addition, we used weight decay, one of the most commonly used techniques for regularizing parametric machine learning models Moody et al. (1995). It helps reduce overfitting and avoids exploding gradients. We conducted the data augmentation experiments using all nine neural network architectures, with weight decay applied and the other hyper-parameters kept the same as discussed in Section 4.1.
4.5 Semi-supervised Learning
State-of-the-art image classification models are often trained with large amounts of labeled data, which is prohibitively expensive to collect in many applications. Semi-supervised learning is a powerful approach to mitigate this issue by leveraging unlabeled data to improve the performance of machine learning models. Since unlabeled data can be obtained without significant human labor, the performance boost gained from semi-supervised learning comes at low cost and can be scaled easily. In the literature, many semi-supervised techniques have been proposed for deep learning Xie et al. (2020); Sohn et al. (2020); Berthelot et al. (2019a, b); Laine and Aila (2016); Lee and others (2013); McLachlan (1975); Sajjadi et al. (2016); Tarvainen and Valpola (2017); Verma et al. (2019); Xie et al. (2019); Alam et al. (2018a). Among them, self-training is one of the earliest Scudder (1965) and has been adapted to deep neural networks. The self-training approach, also called pseudo-labeling Lee and others (2013), uses the model’s predictions as labels and retrains the model on them.
For this study, we use Noisy Student training (a simple self-training approach), which was proposed in Xie et al. (2020) as a semi-supervised learning approach to improve the accuracy and robustness of state-of-the-art image classification models. The algorithm consists of three main steps:
Train a teacher model on labeled images
Use the teacher model to generate pseudo labels on unlabeled images
Train a student model on combined labeled and pseudo labeled images
The algorithm can be iterated multiple times by treating the student as the new teacher and using it to re-label the unlabeled images. During the training of the student, different kinds of noise can be injected, such as dropout Srivastava et al. (2014) and data augmentation via RandAugment Cubuk et al. (2020). The student model is made larger than or equal to the teacher. The presence of noise and the larger model capacity help the student generalize better than the teacher.
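The pseudo-label generation step (Step 2) can be sketched as follows, with a stand-in linear teacher and random features in place of the actual trained network and images:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
NUM_CLASSES = 4
teacher = torch.nn.Linear(16, NUM_CLASSES)  # stand-in for the trained teacher

unlabeled = torch.randn(8, 16)              # stand-in for unlabeled image features

# Step 2: run the teacher on unlabeled data and keep the most likely class
# together with its confidence (the maximum softmax probability).
with torch.no_grad():
    probs = F.softmax(teacher(unlabeled), dim=1)
    confidence, pseudo_labels = probs.max(dim=1)

print(pseudo_labels.tolist())
```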
As the labeled dataset, we used our consolidated datasets and ran the experiments for all tasks. To obtain unlabeled images, we crawled images from the tweets of the 20 disaster collections mentioned in Section 3.2.3. We removed duplicates and ensured that the same images were not in our labeled dataset by matching their ids and applying duplicate filtering. The resulting unlabeled dataset consists of 1,514,497 images, which we used in our experiments.
We ran our experiments using the EfficientNet (b1) architecture, as it performed better than the other models; in addition, it is one of the models used in the Noisy Student experiments reported in Xie et al. (2020). One significant difference between their work and ours is that we initialize the student model’s weights with ImageNet pre-trained weights, whereas in Xie et al. (2020) the weights are initialized from scratch. Our labeled dataset is significantly smaller than the ImageNet dataset, and in our experiments, training from scratch substantially degraded performance.
We first trained a model with the EfficientNet (b1) architecture on the labeled dataset (Step 1), which serves as the teacher model. We then predicted outputs for the unlabeled images (Step 2). Finally, we trained the student EfficientNet (b1) model on the combination of labeled and pseudo-labeled images (Step 3). In this step, we applied filtering and balancing to the unlabeled data: we selected the images whose confidence score exceeds a task-specific threshold, and then balanced the training data so that each class has the same number of images as the smallest class, keeping for each class the images with the highest confidence scores.
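The confidence filtering and class balancing can be illustrated with a small, hypothetical helper (not the authors' code):

```python
from collections import defaultdict

def filter_and_balance(pseudo, threshold):
    """pseudo: list of (image_id, label, confidence) triples.
    Keep predictions above the confidence threshold, then downsample every
    class to the size of the smallest surviving class, preferring the
    highest-confidence images."""
    by_class = defaultdict(list)
    for image_id, label, conf in pseudo:
        if conf > threshold:
            by_class[label].append((conf, image_id))
    if not by_class:
        return []
    k = min(len(v) for v in by_class.values())  # smallest class size
    balanced = []
    for label, items in by_class.items():
        items.sort(reverse=True)                # highest confidence first
        balanced.extend((img, label) for _, img in items[:k])
    return balanced

sample = [("a", 0, 0.9), ("b", 0, 0.6), ("c", 0, 0.3),
          ("d", 1, 0.8), ("e", 1, 0.5)]
print(filter_and_balance(sample, threshold=0.45))
# [('a', 0), ('b', 0), ('d', 1), ('e', 1)]
```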
For the experiments, we used a batch size of 16 for labeled images and 48 for unlabeled images. Labeled and unlabeled images are concatenated together to compute the average cross-entropy loss. We used RandAugment with fixed settings for the number of augmentations and the augmentation strength. We optimized the confidence thresholds separately for each task using the dev sets; the thresholds for the disaster types, informativeness, humanitarian, and damage severity tasks were 0.7, 0.8, 0.45, and 0.45, respectively. As in the data augmentation experiments, we applied weight decay and kept the other hyper-parameters the same as discussed in Section 4.1.
(Columns: Train | Dev | Test | Total)

Affected injured or dead people | 537 | 115 | 353 | 1005
Infrastructure and utility damage | 2397 | 736 | 2095 | 5228
Rescue volunteering or donation effort | 1312 | 268 | 752 | 2332
Little or none | 9124 | 1677 | 4149 | 14950

Two tasks: Info and Hum
Affected injured or dead people | 426 | 72 | 166 | 664
Infrastructure and utility damage | 410 | 81 | 210 | 701
Rescue volunteering or donation effort | 1274 | 246 | 688 | 2208

Two tasks: Info and damage severity
Little or none | 7085 | 1094 | 2369 | 10548
Affected injured or dead people | 85 | 34 | 164 | 283
Infrastructure and utility damage | 398 | 230 | 764 | 1392
Rescue volunteering or donation effort | 26 | 14 | 53 | 93
Little or none | 1805 | 494 | 1571 | 3870
4.6 Multi-task Learning
Since the tasks share similar properties, we also consider training the model in a multi-task setting with shared parameters. The benefits can be twofold: (i) learning a shared representation can help the model generalize better and improve performance on the individual tasks, and (ii) training a single model instead of four different models yields a significant speed-up and reduces computational load during training and inference. It is important to mention that the Crisis Benchmark Dataset was not designed for multi-task learning; rather, it was prepared for each task separately. Hence, we needed to prepare the data for the multi-task setup. Creating multi-task learning datasets from the Crisis Benchmark Dataset introduced a challenge: there is an overlap between train and test set images across tasks. Hence, we prepared the datasets for the multi-task setting using the following strategy:
We merge the test sets from different tasks into a combined test set. If an image in the combined test set is present in the train or dev set of some tasks, we remove it from that split and add the label in the test set.
We merge the dev sets of the four tasks into the combined dev set. If an image in the combined dev set is present in the train set of some tasks, we remove it from that train split and add the label in the dev set.
We merge the train sets of the four tasks into the combined train set. Since we have removed images that overlap with the dev set and test set in the previous steps, this guarantees no image from the train set will be present in the other splits.
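The three merging steps above can be expressed as one pass over the splits in priority order (test, then dev, then train); the function name and data layout below are assumptions for illustration, not the authors' implementation.

```python
def merge_multitask_splits(tasks):
    """tasks: {task: {'train'|'dev'|'test': {image_id: label}}}.
    Merge per-task splits into combined splits, promoting any image that
    already appears in a higher-priority split (test > dev > train) so no
    image leaks across the combined train/dev/test boundaries."""
    combined = {"test": {}, "dev": {}, "train": {}}
    for split in ("test", "dev", "train"):          # highest priority first
        for task, splits in tasks.items():
            for img, label in splits[split].items():
                for higher in ("test", "dev", "train"):
                    if higher == split or img in combined[higher]:
                        combined[higher].setdefault(img, {})[task] = label
                        break
    return combined

# toy example with two tasks; image "d" is in info's test set but hum's
# train set, so its hum label moves to the combined test set
tasks = {
    "info": {"train": {"a": 1, "b": 0}, "dev": {"c": 1}, "test": {"d": 0}},
    "hum":  {"train": {"d": 2, "e": 1}, "dev": {"b": 0}, "test": {"c": 2}},
}
combined = merge_multitask_splits(tasks)
```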
Since not all images have annotations for all four tasks, there is a discrepancy in the number of images available per task. We report the distribution of the data splits for the multi-task setting in Table 6. Overall, there are 49,353 images in the train set, 6,157 images in the dev set, and 15,688 images in the test set. Due to the overlap of images in different splits for different tasks, there is also a discrepancy between the number of images available in the multi-task and single-task settings. As an example, for the disaster types task, there are 12,846 images in the train set, 1,470 images in the dev set, and 3,195 images in the test set in the single-task setting; in the multi-task setting, these numbers are 10,996, 1,797, and 4,718, respectively. As a consequence of our merging procedure, there are more images in the test and dev sets and fewer images in the train set.
A few approaches have been proposed in the literature to address the issue of incomplete/missing labels in multi-task settings. They usually work by generating the missing task labels using different methods, including Bayesian networks Kapoor et al. (2012), a rule-based approach Kollias and Zafeiriou (2019), and knowledge distillation from another model Deng et al. (2020). In our experiments, we opt for a simpler alternative: we do not compute the loss for a task if its label is missing. Since the tasks have varying numbers of training images, we calculate the loss for each task separately and aggregate them within a batch, which ensures that the loss of each task is weighted equally. The process is detailed in Algorithm 1.
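The masking idea can be illustrated on toy numbers: skip a task's loss wherever the label is missing, average each task's loss over its own labeled examples, and then average across tasks so every task is weighted equally. This is a simplified stand-in for Algorithm 1, with cross-entropy reduced to the negative log-probability of the true class and all names hypothetical.

```python
import math

def masked_multitask_loss(batch_outputs, batch_labels, tasks):
    """batch_outputs: per-example {task: probability of the true class};
    batch_labels: per-example {task: label or None when missing}.
    Loss is computed only where a label exists, averaged per task, then
    averaged across the tasks present in the batch."""
    total, counted = 0.0, 0
    for task in tasks:
        losses = [-math.log(out[task])
                  for out, lab in zip(batch_outputs, batch_labels)
                  if lab[task] is not None]
        if losses:                      # skip tasks absent from this batch
            total += sum(losses) / len(losses)
            counted += 1
    return total / max(counted, 1)

# toy batch: the first example has no humanitarian label, so it
# contributes nothing to that task's loss
outs = [{"info": 0.9, "hum": 0.5}, {"info": 0.8, "hum": 0.25}]
labs = [{"info": 1, "hum": None}, {"info": 0, "hum": 2}]
loss = masked_multitask_loss(outs, labs, ["info", "hum"])
```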
We also experiment with images having completely aligned labels across tasks. We identified three such combinations with a substantial number of images per class. Two of them involve pairs of tasks: informativeness and humanitarian, with 7,960 aligned images in total, and informativeness and damage severity, with 25,830 images in total. Data distributions for these two settings are reported in Table 7. The final subset contains images with labels for all four tasks and consists of 5,558 images; its data distribution is reported in Table 8.
Our experimental results cover several settings. Below we discuss each of them in detail.
Tasks: Disaster types (7 classes) | Informativeness (2 classes) | Humanitarian (4 classes) | Damage severity (3 classes)
5.1 Dataset Comparison
In Table 9, we report classification results for the different tasks and datasets using the ResNet18 network architecture. The performance across tasks is not directly comparable, as the tasks differ in complexity (e.g., number of class labels, class imbalance). For example, informativeness classification is a binary task, computationally simpler than classification with more labels (e.g., seven labels in disaster types); hence its performance is comparatively higher. An example of the class imbalance issue can be seen in Table 5 for the damage severity task: the mild class is comparatively rare, which is reflected in its class-wise and overall performance. The mild class label is also less distinctive than the other class labels, and we noticed that classifiers often confuse it with the other two classes; similar findings have been reported in Nguyen et al. (2017). For the disaster types task, the performance of the AIDR-DT model is higher than that of the DMD model: the DMD dataset is comparatively small, and the resulting model does not perform well on the consolidated dataset. This characteristic is observed for the other tasks as well. For the damage severity task, CrisisMMD performs worst, which is also reflected in its dataset size, i.e., 2,493 images in the training set, as can be seen in Table 4. As expected, for all tasks the models trained on the consolidated datasets outperform those trained on the individual datasets.
(Table 11 columns: Model | # Layers | # Params (M) | Memory (MB))
5.2 Network Architectures Comparison
In Table 10, we report results using different network architectures on the consolidated datasets, i.e., trained and tested using the consolidated dataset for each task. Across tasks, EfficientNet (b1) overall performs better than the other models, as shown in Figure 3, except for the humanitarian task, for which VGG16 outperforms the other models. The second-best models are VGG16, ResNet50, ResNet101, and DenseNet (121). From the results across tasks, we observe that InceptionNet (v3) is the worst-performing model.
The performance differences among models such as EfficientNet (b1), VGG16, ResNet50, ResNet101, and DenseNet (121) are small; hence, we performed statistical tests to understand whether such small differences are significant. We used McNemar's test for the binary classification task (i.e., informativeness) and Bowker's test for the multiclass classification tasks; more details on these tests can be found in Hoffman (2019). We ran each test between pairs of models to assess pair-wise differences. In Figures 4 and 5, we report the results of the significance tests: the value in each cell represents the p-value, and light yellow cells indicate statistically significant differences. From Figure 4, we see that for the disaster types task the p-value exceeds the significance threshold when comparing EfficientNet (b1) against ResNet50, ResNet101, and DenseNet (121), which shows that those differences in Table 10 are not significant. Similarly, the difference between EfficientNet (b1) and VGG16 is very small. For the humanitarian and damage severity tasks, we observed similar behavior. Analyzing all four tasks, VGG16 appears to be the second-best performing model.
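For the binary task, McNemar's test compares the two models' discordant predictions on the same test examples. A stdlib sketch is below; it uses the continuity-corrected chi-square variant, which is one common form of the test (the paper does not state which variant was used), and the toy counts are illustrative.

```python
import math

def mcnemar(correct_a, correct_b):
    """Continuity-corrected McNemar's test on paired predictions.
    correct_a / correct_b: per-example booleans, True when that model got
    the example right. Only the discordant pairs (b, c) matter."""
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # chi-square (df = 1) tail probability via the complementary error function
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# toy paired predictions: model A right / model B wrong on 15 images,
# the reverse on 5, both right on 80
a_right = [True] * 15 + [False] * 5 + [True] * 80
b_right = [False] * 15 + [True] * 5 + [True] * 80
stat, p = mcnemar(a_right, b_right)  # stat = 4.05, p ≈ 0.044
```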
In Table 11, we also report the different neural network models with their number of layers, number of parameters, and memory consumption during inference for the informativeness task. There is always a trade-off between performance and computational complexity (i.e., number of layers, parameters, and memory consumption). In terms of memory consumption and number of parameters, VGG16 is more expensive than the others. Based on performance and computational complexity, we conclude that EfficientNet is the best option for real-time applications. We computed throughput for EfficientNet using a batch size of 128: it can process 260 images per second on an NVIDIA Tesla P100 GPU. Among the ResNet models, ResNet18 is a reasonable choice given that its computational complexity is significantly lower than that of the other ResNet models.
5.3 Data Augmentation
To reduce overfitting and obtain more generalized models, we used data augmentation and weight decay. In Table 12, we report the results for all tasks using all network architectures. The column Diff. reports the difference from the results presented in Table 10, where no RandAugment or weight decay was applied. Improved results are highlighted in light blue for all tasks. Out of 40 experiments (10 network architectures × 4 tasks), augmentation with weight decay improved performance in 26 cases.
For the improved cases, we also computed a statistical significance test between the models without RandAugment and those with RandAugment and weight decay. We found that the improvements for InceptionNet (v3) are statistically significant for all tasks. For EfficientNet (b1), only the improvement on the damage severity task is statistically significant; for the other tasks, the improvements are not. We also investigated training and validation losses over the number of epochs. In Figures 6 and 7, we report training and validation losses and accuracies for the EfficientNet (b1) model on the informativeness and humanitarian tasks, respectively. From Figures 6(a) and 7(a), we clearly see that the models are overfitting, whereas Figures 6(b) and 7(b) show that the models are more generalized. These findings demonstrate the benefits of augmentation and weight decay.
5.4 Semi-supervised Learning
In Table 13, we present the results of the Noisy Student based self-training approach alongside the results with and without RandAugment. We observe an improvement for the informativeness task. For the humanitarian task, the performance is similar to RandAugment. For the damage severity task, the performance of Noisy Student matches the setting without RandAugment but is lower than with RandAugment.
We postulate the following possible reasons for the lack of improvement in the semi-supervised learning experiments:
Semi-supervised learning usually performs better when trained from scratch instead of fine-tuning a pretrained model. This phenomenon is explored in Zhou et al. (2018), where the authors report that the performance gained from semi-supervised learning methods is usually smaller when training from a pretrained model. We could not train the student model from scratch, as our labeled datasets are small and doing so degrades performance even further.
We had to use a much smaller labeled batch size of 16 compared to those used in Xie et al. (2020) (512 or higher) due to GPU constraints. Having a larger labeled batch size and, consequently, more unlabeled images in each batch may yield a better result.
(Settings: Two tasks: Info and DS | Two tasks: Info and Hum | Four tasks: DT, Info, Hum and DS)
5.5 Multi-task Learning
Since the Crisis Benchmark Dataset was not designed for multi-task learning, we needed to re-split it as discussed in Section 4.6. This resulted in two different settings: (i) incomplete/missing labels, and (ii) complete aligned labels. Incomplete/missing labels are a challenging problem in multi-task learning, which we addressed using masking, i.e., we do not compute a task's loss for an output whose label is missing. In Table 14, we report the results of multi-task learning with missing labels, where we address all tasks. We also investigated different task combinations where all labels are present; in Table 15, we report the results for task combinations with complete aligned labels. Performance differs across task combinations due to their data sizes, label distributions, and task settings. The multi-task results are not exactly comparable with our single-task setup, but they can serve as a baseline for future studies.
5.6 Visual Explanation using Grad-CAM
We explore how the neural networks arrive at their decisions by utilizing Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017). Grad-CAM uses the gradient of a target class flowing into the final convolutional layer to produce a localization map highlighting the important regions in the image for that class. We use the implementation available at https://github.com/FrancescoSaverioZuppichini/A-journey-into-Convolutional-Neural-Network-visualization-. We display results for two candidate networks, VGG16 and EfficientNet, on two tasks, informativeness and disaster types, using the models trained with RandAugment.
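The core Grad-CAM computation can be illustrated on toy arrays: the class gradients are globally average-pooled into one weight per channel, and the localization map is the ReLU-clipped weighted sum of the final conv layer's activation maps. This is a minimal sketch of the formula, not the linked implementation; in practice, activations and gradients come from framework hooks on a real network.

```python
def grad_cam(activations, gradients):
    """activations, gradients: K feature maps of size H x W (nested lists);
    gradients are those of the target class score w.r.t. each map."""
    K, H, W = len(activations), len(activations[0]), len(activations[0][0])
    # per-channel weight: global average pool of the class gradients
    weights = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]
    # ReLU-clipped weighted sum of the activation maps
    return [[max(0.0, sum(weights[k] * activations[k][i][j] for k in range(K)))
             for j in range(W)] for i in range(H)]

# toy example: two 2x2 feature maps, one with positive and one with
# negative class gradients
acts = [[[1, 0], [0, 1]], [[0, 2], [2, 0]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam = grad_cam(acts, grads)  # -> [[1.0, 0.0], [0.0, 1.0]]
```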
In Figure 8, we show the activation map for the predicted class for some images from the informativeness test set. From these images, it seems that EfficientNet performs better for localizing important regions in the image for the class of interest. VGG16 tends to depend on smaller regions for decision making. The last row shows an image where VGG16 misclassified an informative image as not informative.
We show the activation maps for some images from the test set of the disaster types task in Figure 9. Here, the difference in localization quality between the two models is even more pronounced. The activation maps from VGG16 are difficult to interpret for the first and third images, even though the model classifies them correctly. The second image shows that VGG16 may focus on smoke regions when classifying fire images, which explains why it identifies the last image as fire, mistaking the clouds for smoke.
Overall, these results suggest that EfficientNet not only outperforms the other models on the numeric measures but also produces results that are easier to interpret.
6 Discussions and Future Works
6.1 Our Findings
Real-time event detection from social media content is an important problem. Our proposed pipeline and models are suitable for deployment in real-time applications. The proposed models can also be used independently; for example, the disaster types model can be used to monitor real-time disaster events.
Our experiments were based on the research questions discussed in Section 1; below we report our findings for each of them.
RQ1: Our dataset comparison suggests that data consolidation helps, which answers our first research question.
RQ2: We also explored several deep learning models, which vary in performance and complexity. Among them, EfficientNet (b1) appears to be a reasonable option. Note that EfficientNet is a family of network architectures (b0-b7), and for this study we only reported results with EfficientNet (b1); we aim to explore the other variants in future work. A small, low-latency model is desired for deploying mobile and embedded computer vision applications, and the development of MobileNet Howard et al. (2017) sheds light in that direction: our experimental results suggest that it is computationally simpler and provides reasonable accuracy, only 2-3% lower than the best models across tasks. These findings answer our second research question.
RQ3: We observe that strong data augmentation can improve performance, although not consistently across tasks and models. Semi-supervised learning does not usually yield gains when training from pretrained models and can sometimes even degrade performance.
RQ4: Multi-task learning can be an ideal solution for the real-time system as it can potentially provide speed-ups of multiple factors during inference. However, some tasks may perform worse than their single task settings in the presence of incomplete labels. Having aligned complete labels for different tasks can mitigate this issue.
Ref. | Dataset | # image | # C | Cls. | Task | Models | Data Split | Acc | P | R | F1
Ofli et al. (2020) | CrisisMMD | 12,708 | 2 | B | Info | VGG16 | Train/dev/test | 0.833 | 0.831 | 0.833 | 0.832
Ofli et al. (2020) | CrisisMMD | 8,079 | 5 | M | Hum | VGG16 | Train/dev/test | 0.768 | 0.764 | 0.768 | 0.763
Mouzannar et al. (2018) | DMD | 5,879 | 6 | M | Event | Incep | 4 folds CV | 0.840 | - | - | -
Agarwal et al. (2020) | CrisisMMD | 18,126 | 2 | B | Info | Incep | 5 folds CV | - | 0.820 | 0.820 | 0.820
Agarwal et al. (2020) | CrisisMMD | 18,126 | 2 | B | Infra. | Incep | 5 folds CV | - | 0.920 | 0.920 | 0.920
Agarwal et al. (2020) | CrisisMMD | 18,126 | 3 | B | Severity | Incep | 5 folds CV | - | 0.950 | 0.940 | 0.940
Abavisani et al. (2020) | CrisisMMD | 11,250 | 2 | B | Info | DenseNet | Train/dev/test | 0.816 | - | - | 0.812
Abavisani et al. (2020) | CrisisMMD | 3,359 | 5 | B | Hum | DenseNet | Train/dev/test | 0.834 | - | - | 0.870
Abavisani et al. (2020) | CrisisMMD | 3,288 | 3 | B | Severity | DenseNet | Train/dev/test | 0.629 | - | - | 0.661
6.2 Comparison with Previous State-of-the-art
We compared our results with recent, related state-of-the-art results, reported in Table 16. However, an end-to-end comparison is not possible for a few reasons: (i) different datasets and sizes (see the second and third columns in Table 16); (ii) different data splits (train/dev/test vs. cross-validation (CV) folds), even when using the same dataset (see the Data Split column in the same table); and (iii) different evaluation measures, such as weighted P/R/F1 (first two rows) Ofli et al. (2020) vs. accuracy (third row) Mouzannar et al. (2018) vs. CV-fold results (fourth to sixth rows; Agarwal et al. (2020) do not specify whether the measures are macro, micro, or weighted).
Although the results are not exactly comparable, we observe that on the informativeness and humanitarian tasks, previously reported results (weighted F1) are 0.832 and 0.763, respectively, using the CrisisMMD dataset Ofli et al. (2020). The authors in Mouzannar et al. (2018) reported a test accuracy of 0.84 for the six-class disaster types task using the DMD dataset with cross-validation. The study in Agarwal et al. (2020) reports an F1 of 0.820 for informativeness, 0.920 for infrastructure damage, and 0.940 for damage severity. In another study using the CrisisMMD dataset, the authors report weighted F1 of 0.812 and 0.870 for the informativeness and humanitarian tasks, respectively Abavisani et al. (2020); they used a small subset of the whole CrisisMMD dataset. From Table 16, we observe that F1 ranges across studies from 0.812 to 0.832 for the informativeness task, from 0.763 to 0.870 for the humanitarian task, and from 0.661 to 0.940 for damage severity. In comparison, our best results (weighted F1) for disaster types, informativeness, humanitarian, and damage severity are 0.835, 0.876, 0.784, and 0.765, respectively, on the consolidated single-task dataset.
6.3 Future Works
As future work, we foresee several interesting research avenues. (i) Exploring semi-supervised learning in more depth, to leverage the large amount of unlabeled social media data and to address the limitations we highlighted in Section 5.4; we believe addressing those limitations can help improve the performance of the current models. (ii) In the multi-task setup, one possible research direction is to address the problem of incomplete/missing labels; another is to manually label the Crisis Benchmark Dataset to complete the labels for all tasks. Both approaches would give the community a ground for exploring multi-task learning for real-time social media image classification.
The imagery and textual content available on social media have been used by humanitarian organizations during disaster events, yet there has been limited work on disaster response image classification compared to text. In this study, we addressed four tasks needed for real-time disaster response: disaster types, informativeness, humanitarian, and damage severity. Our experimental results on individual and consolidated datasets suggest that data consolidation helps. We investigated the four tasks using different state-of-the-art neural network architectures and reported the best models. The findings on data augmentation suggest that a more generalized model can be obtained with such approaches. Our investigation of semi-supervised and multi-task learning points to new research directions for the community. We also provide insights from activation maps to demonstrate what class-specific information a network is learning.
Compliance with ethical standards
Conflict of interest
We have no conflicts of interest or competing interests to declare.
Availability of data and material
The data used in this study are available at https://crisisnlp.qcri.org/crisis-image-datasets-asonam20.
- Multimodal categorization of crisis events in social media. In Proc. of CVPR, pp. 14679–14689.
- Crisis-DIAS: towards multimodal damage analysis - deployment, challenges and assessment. Proceedings of the AAAI Conference on Artificial Intelligence 34 (01), pp. 346–353.
- JORD: a system for collecting information and monitoring natural disasters by linking social media with satellite imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, pp. 1–6.
- Convolutional neural networks for disaster images retrieval. In MediaEval.
- Image4Act: online social media image processing for disaster response. In Proc. of ASONAM, pp. 1–4.
- Graph based semi-supervised learning with convolution neural networks to classify crisis related tweets. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 12.
- Deep learning benchmarks and datasets for social media image classification for disaster response. arXiv:2011.08916.
- CrisisMMD: multimodal Twitter datasets from natural disasters. In Proc. of ICWSM, pp. 465–473.
- Processing social media images by combining human and machine computing during crises. International Journal of Human–Computer Interaction 34 (4), pp. 311–327.
- The multimedia satellite task at MediaEval 2018: emergency response for flooding events. In MediaEval.
- ReMixMatch: semi-supervised learning with distribution matching and augmentation anchoring. arXiv:1911.09785.
- MixMatch: a holistic approach to semi-supervised learning. arXiv:1905.02249.
- The multimedia satellite task at MediaEval 2017. In Proceedings of the MediaEval 2017 Benchmark Workshop.
- Understanding and classifying image tweets. In ACM Multimedia, pp. 781–784.
- XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
- RandAugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703.
- Mining and classifying image posts on social media to analyse fires. In Proc. of ISCRAM, pp. 1–14.
- Multitask emotion recognition with incomplete labels. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 828–835.
- ImageNet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
- Extraction of pluvial flood relevant volunteered geographic information (VGI) by deep learning from user generated texts and photos. ISPRS International Journal of Geo-Information 7 (2), pp. 39.
- Creating xBD: a dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- Sentiment analysis from images of natural disasters. In International Conference on Image Analysis and Processing, pp. 104–113.
- Deep residual learning for image recognition. In Proc. of CVPR, pp. 770–778.
- Chapter 15 - Categorical and cross-classified data: McNemar's and Bowker's tests, Kolmogorov-Smirnov tests, concordance. In Basic Biostatistics for Medical and Biomedical Practitioners (Second Edition), J. I.E. Hoffman (Ed.), pp. 233–247.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
- Densely connected convolutional networks. In Proc. of CVPR, pp. 4700–4708.
- A visual–textual fused approach to automated tagging of flood-related tweets during a flood event. International Journal of Digital Earth 12 (11), pp. 1248–1264.
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv:1602.07360.
- Rapid damage assessment using social media images by combining human and machine intelligence. arXiv:2004.06675.
- Processing social media messages in mass emergency: a survey. ACM Computing Surveys 47 (4), pp. 67.
- AIDR: artificial intelligence for disaster response. In Proc. of WWW, pp. 159–162.
- Flood detection in social media images using visual features and metadata. In 2019 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8.
- Multilabel classification using Bayesian compressed sensing. Advances in Neural Information Processing Systems 25, pp. 2645–2653.
- Adam: a method for stochastic optimization. In Proc. of ICLR.
- Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace. arXiv:1910.04855.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- Detection of disaster-affected cultural heritage sites from social media images using deep learning techniques. J. Comput. Cult. Herit. 13 (3).
- Image classification to support emergency situation awareness. Frontiers in Robotics and AI 3, pp. 54.
- Temporal ensembling for semi-supervised learning. arXiv:1610.02242.
- Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3.
- Identifying disaster damage images using a domain adaptation approach. In Proc. of ISCRAM, pp. 633–645.
- Localizing and quantifying damage in social media images. In Proc. of ASONAM, pp. 194–201.
- Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70 (350), pp. 365–369.
- A simple weight decay can improve generalization. Advances in Neural Information Processing Systems 4, pp. 950–957.
- Damage identification in social media posts using multimodal deep learning. In Proc. of ISCRAM, pp. 529–543.
- Automatic image filtering on social networks using deep learning and perceptual hashing during crises. In Proc. of ISCRAM.
- Damage assessment from social media imagery data during disasters. In Proc. of ASONAM, pp. 1–8.
- Building damage assessment using deep learning and ground-level image data. In 14th Conference on Computer and Robot Vision (CRV), pp. 95–102.
- Prototyping a social media flooding photo screening system based on deep learning. ISPRS International Journal of Geo-Information 9 (2), pp. 104.
- Analysis of social media data using multimodal deep learning for disaster response. In Proc. of ISCRAM.
- CrisisLex: a lexicon for collecting and filtering microblogged communications in crises. In Proc. of ICWSM.
- Learning and transferring mid-level image representations using convolutional neural networks. In Proc. of CVPR, pp. 1717–1724.
- How transferable are CNN-based features for age and gender classification? In International Conference of the Biometrics Special Interest Group, pp. 1–6.
- Investigating images as indicators for relevant social media messages in disaster management. In Proc. of ISCRAM.
- Unconstrained flood event detection using adversarial data augmentation. In IEEE International Conference on Image Processing (ICIP), pp. 155–159.
- A computationally efficient multi-modal classification approach of disaster-related Twitter images. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC '19, pp. 2050–2059.
- Natural disasters detection in social media and satellite imagery: a survey. Multimedia Tools and Applications 78 (22), pp. 31267–31302.
- Regularization with stochastic transformations and perturbations for deep semi-supervised learning. arXiv:1606.04586.
- Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory 11 (3), pp. 363–371.
- Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
- Disaster types. Disaster Prevention and Management: An International Journal.
- CNN features off-the-shelf: an astounding baseline for recognition. In Proc. of CVPR Workshops, pp. 806–813.
- Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
- FixMatch: simplifying semi-supervised learning with consistency and confidence. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Rethinking the Inception architecture for computer vision. In Proc. of CVPR, pp. 2818–2826.
- EfficientNet: rethinking model scaling for convolutional neural networks. arXiv:1905.11946.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv:1703.01780.
- Interpolation consistency training for semi-supervised learning. arXiv:1903.03825.
- Detecting natural disasters, damage, and incidents in the wild. In European Conference on Computer Vision, pp. 331–350.
- SUN database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492.
- Unsupervised data augmentation for consistency training. arXiv:1904.12848.
- Self-training with noisy student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
- How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pp. 3320–3328.
- When semi-supervised learning meets transfer learning: training strategies, models and datasets. arXiv:1812.05313.