Social Media Images Classification Models for Real-time Disaster Response

04/09/2021 ∙ by Firoj Alam, et al. ∙ Hamad Bin Khalifa University

Images shared on social media help crisis managers gain situational awareness and assess incurred damages, among other response tasks. Because the volume and velocity of such content are high, real-time image classification has become an urgent need for a faster response. Recent advances in computer vision and deep neural networks have enabled the development of models for real-time image classification for a number of tasks, including detecting crisis incidents, filtering irrelevant images, classifying images into specific humanitarian categories, and assessing the severity of the damage. To develop robust real-time models, it is necessary to understand the capability of the publicly available pretrained models for these tasks, which remains under-explored in the current state of the art of crisis informatics. In this study, we address this limitation. We investigate ten different architectures for four different tasks using the largest publicly available datasets for these tasks. We also explore data augmentation, semi-supervised techniques, and a multitask setup. In our extensive experiments, we achieve promising results.




1 Introduction

During natural or human-induced disasters, social media has been widely used to quickly disseminate information and learn useful insights. People post content (i.e., through different modalities such as text, image, and video) on social media to get help and support, identify urgent needs, or share their personal feelings. Such information is useful for humanitarian organizations to plan and launch relief operations. As the volume and velocity of the content are significantly high, it is crucial to have real-time systems that automatically process social media content to facilitate rapid response. There has been a surge of research in this domain in the past couple of years. The focus has been on analyzing the usefulness of social media data and developing computational models using different modalities to extract actionable information. Among different modalities (e.g., text and image), more focus has been given to textual content analysis than to imagery content (see Imran et al. (2015); Said et al. (2019) for a comprehensive survey), even though many past studies have demonstrated that images shared on social media during a disaster event can help humanitarian organizations in a number of ways. For example, Nguyen et al. (2017) use images shared on Twitter to assess the severity of infrastructure damage, and Mouzannar et al. (2018) focus on identifying damage in infrastructure as well as environmental elements.

For a clear understanding, we present an example in Figure 1(a). It demonstrates how different disaster-related classification models can be used for real-time image categorization. As presented in the figure, four classification tasks, namely (i) disaster types, (ii) informativeness, (iii) humanitarian, and (iv) damage severity assessment, can significantly help crisis responders during disaster events. For example, the disaster types model can be used for real-time event detection, as shown in Figure 1(b). Similarly, the informativeness model can be used to filter non-informative images, the humanitarian model can be used to identify fine-grained categories, and the damage severity model can be used to assess the severity of the damage. The current literature reports only one or two tasks using one or two network architectures. Another limitation has been the scarcity of datasets for disaster-related image classification. Very recently, Alam et al. (2020) developed a benchmark dataset (referred to as the Crisis Benchmark Dataset throughout this paper), which is consolidated from existing publicly available resources. The development process of this dataset consists of data curation from different existing sources, development of new data for new tasks, and creation of non-overlapping training, development, and test sets (duplicate images found across the test and train sets are moved from the test set to the train set). The benchmark dataset targets the four tasks mentioned earlier.

(a) Disaster image classification pipeline.
(b) Event detection use case showing landslide images.
Figure 1: Disaster image classification pipeline that demonstrates a real use case – landslide image classification.

Our work is inspired by the work of Alam et al. (2020), and in this study we use their dataset. We extend their work and address the above-mentioned limitations by posing the following Research Questions (RQs):

  • RQ1: Can data consolidation help?

  • RQ2: Among different neural network architectures with pretrained models, which one is more suitable for different downstream disaster-related image classification tasks?

  • RQ3: Can augmentation or semi-supervised learning help to improve performance or generalization?

  • RQ4: Can multitask learning be an ideal solution in terms of speed and computational complexity?

In order to understand the benefits of data consolidation (RQ1), we extended the work of Alam et al. (2020) with a more in-depth analysis.

Our motivation for RQ2 is that there has been significant progress in neural network architectures for image processing in the past few years; however, they have not been widely explored in crisis informatics for disaster response tasks. Hence, we investigated the ten most popular neural network architectures for different disaster-related image classification tasks. Since augmentation and self-training based techniques Cubuk et al. (2020); Lee and others (2013) have been shown to yield more generalized models and sometimes to improve performance, we posed RQ3 and investigated them for the mentioned tasks.

For the real-time social media image classification tasks shown in Figure 1, it is necessary to run the mentioned models sequentially or in parallel on the same input image. Running multiple models is computationally expensive, given that a large number of social media images need to be classified in real time. The time and computational complexity can be reduced if a single model can be developed to deal with multiple tasks. We posed this as RQ4 and provide directions for future work. Note that the Crisis Benchmark Dataset was not developed for a multitask learning setup. The related metadata (e.g., image id) is available, and we used it to create data splits for multitask learning while trying to maintain the same train/dev/test splits. This setup also poses a great challenge due to incomplete/missing labels (see more details in Section 4.6).

To summarize, our contributions in this study are as follows:

  • We present more detailed results demonstrating the benefit of data consolidation.

  • We address four tasks using several state-of-the-art neural network architectures on different data splits.

  • We investigate the augmentation technique and show that models are more generalized with augmentation.

  • We explore semi-supervised learning and multitask learning to have a single model while addressing multiple tasks. Based on the findings we provide research directions for future studies.

  • We also provide insights into network activations using Gradient-weighted Class Activation Mapping Selvaraju et al. (2017) to demonstrate what class-specific discriminative properties the network is learning.

The rest of the paper is organized as follows. Section 2 provides a brief overview of the existing work. Section 3 introduces the tasks and describes the datasets used in this study. Section 4 explains the experiments, Section 5 presents the results, and Section 6 provides the discussion. Finally, we conclude the paper in Section 7.

2 Related Work

2.1 Social Media Images for Disaster Response

Studies on image processing in the crisis informatics domain are relatively few compared to studies on analyzing textual content for humanitarian aid.

With recent successes of deep learning for image classification, research works have started to use social media images for humanitarian aid. The importance of imagery content on social media for disaster response tasks has been reported in many studies Peters and Porto de Albuqerque (2015); Daly and Thom (2016); Chen et al. (2013); Nguyen et al. (2017, 2017); Alam et al. (2017, 2018b). For instance, the analysis of flood images has been studied in Peters and Porto de Albuqerque (2015), in which the authors reported that the presence of images together with relevant textual content is more informative. Similarly, Daly and Thom (2016) analyzed images of fire events extracted from social media data. Their findings suggest that images with geotagged information are useful for locating fire-affected areas.

The analysis of imagery content shared on social media has recently been explored using deep learning techniques for damage assessment purposes. Most of these studies categorize the severity of damage into discrete levels Nguyen et al. (2017, 2017); Alam et al. (2017), whereas others quantify the damage severity as a continuous-valued index Nia and Mori (2017); Li et al. (2018). Other related work includes addressing the data scarcity issue by employing more sophisticated models such as adversarial networks Li et al. (2019); Pouyanfar et al. (2019), disaster image retrieval Ahmad et al. (2017b), image classification in the context of bush fire emergency Lagerstrom et al. (2016), a flooding photo screening system Ning et al. (2020), sentiment analysis from disaster images Hassan et al. (2019), monitoring natural disasters using satellite images Ahmad et al. (2017a), and flood detection using visual features Jony et al. (2019).

2.2 Real-time Systems

Recently, Alam et al. (2018b) presented an image processing pipeline, developed using deep learning-based techniques, to extract meaningful information from social media images during a crisis situation. Their image processing pipeline includes collecting images, removing duplicates, filtering irrelevant images, and finally classifying them by damage severity. Such a system has been used during several disaster events; one example is the deployment during Hurricane Dorian, reported in Imran et al. (2020). The system was deployed for 13 days and collected around 280K images, which were automatically classified and then used by a volunteer response organization, the Montgomery County, Maryland Community Emergency Response Team (MCCERT). Another use case is the early detection of disaster-related damage to cultural heritage Kumar et al. (2020).

2.3 Multimodality (Image and Text)

The exploration of multimodality has also received attention in the research community Agarwal et al. (2020); Abavisani et al. (2020). In Agarwal et al. (2020), the authors explore different fusion strategies for multimodal learning. Similarly, in Abavisani et al. (2020), a cross-attention based network is exploited for multimodal fusion. The study in Huang et al. (2019) reports a multimodal system for flood image detection, which achieves a precision of 87.4% on a balanced test set. In another study, the authors propose a similar multimodal system for on-topic vs. off-topic social media post classification and report an accuracy of 92.94% with imagery content. The study in Feng and Sester (2018) explores different classical machine learning algorithms to classify relevant vs. irrelevant tweets by using both textual and imagery information. On the imagery content, they achieved an F1 score of 87.74% using XGBoost Chen and Guestrin (2016). The study in Rizk et al. (2019) proposes a simple, computationally inexpensive, multimodal two-stage framework to classify tweets (text and image) as showing built-infrastructure damage vs. nature damage. The study evaluated its approach using a home-grown dataset and the SUN dataset Xiao et al. (2010). Mouzannar et al. (2018) proposed a multimodal dataset, which was developed for training a damage detection model. Similarly, Ofli et al. (2020) explore unimodal as well as different multimodal modeling approaches based on a collection of multimodal social media posts.

2.4 Transfer Learning for Image Classification

For the image classification task, transfer learning has been a popular approach, where a pre-trained neural network is used to train a new model for a new task Yosinski et al. (2014); Sharif Razavian et al. (2014); Ozbulak et al. (2016); Oquab et al. (2014); Ofli et al. (2020); Mouzannar et al. (2018). For this study, we follow the same approach using different deep learning architectures.

2.5 Datasets

Currently, publicly available datasets include the damage severity assessment dataset Nguyen et al. (2017), CrisisMMD Alam et al. (2018), and the damage identification multimodal dataset Mouzannar et al. (2018). The first dataset is annotated only for images, whereas the latter two are annotated for both text and images. Other relevant datasets are Disaster Image Retrieval from Social Media (DIRSM) Bischke et al. (2017) and MediaEval 2018 Benjamin et al. (2018). The dataset reported in Gupta et al. (2019) is constructed for detecting damage as an anomaly using pre- and post-disaster images. It consists of 700,000 building annotations. A similar and relevant work is the development of the incident dataset Weber et al. (2020), which consists of 446,684 manually labeled images with 43 incident categories. The Crisis Benchmark Dataset reported in Alam et al. (2020) is the largest so far for social media disaster image classification.

For this study, we use the Crisis Benchmark Dataset, and our study differs from Alam et al. (2020) in a number of ways. We provide more detailed experimental results on dataset comparison (i.e., individual vs. consolidated), compare different network architectures with statistical significance tests, and report the effect of data augmentation. We also use a large unlabeled dataset to enhance the capability of the current model. We created multitask data splits from the Crisis Benchmark Dataset and report experimental results using both missing/incomplete and complete labels, which can serve as a baseline for future work.

3 Tasks and Datasets

For this study, we addressed four different disaster-related tasks that are important for humanitarian aid. Below we provide details of each task and the associated class labels.

3.1 Tasks

3.1.1 Disaster type detection

When ingesting images from unfiltered social media streams, it is important to automatically detect the disaster types those images show. For instance, an image can depict a wildfire, flood, earthquake, hurricane, or another type of disaster. In the literature, disaster types have been defined in different hierarchical categories such as natural, man-made, and hybrid Shaluf (2007). Natural disasters are events that result from natural phenomena (e.g., fire, flood, earthquake). Man-made disasters are events that result from human actions (e.g., terrorist attack, accidents, war, and conflicts). Hybrid disasters are events that result from human actions that affect natural phenomena (e.g., deforestation resulting in soil erosion, and climate change). The class labels include (i) earthquake, (ii) fire, (iii) flood, (iv) hurricane, (v) landslide, (vi) other disaster – to cover all other disaster types (e.g., plane crash), and (vii) not disaster – for images that do not show any identifiable disasters.

3.1.2 Informativeness

Images posted on social media during disasters do not always contain informative (e.g., image showing damaged infrastructure due to flood, fire or any other disaster events) or useful content for humanitarian aid. It is necessary to remove any irrelevant or redundant content to facilitate crisis responders’ efforts more effectively. Therefore, the purpose of this classification task is to filter irrelevant images. The class labels for this task are (i) informative and (ii) not informative.

3.1.3 Humanitarian

An important aspect of crisis response is to assist people based on their needs, which requires information to be classified into more fine-grained categories so that specific actions can be taken. In the literature, humanitarian categories often include affected individuals; injured or dead people; infrastructure and utility damage; missing or found people; rescue, volunteering, or donation effort; and vehicle damage Alam et al. (2018). In this study, we focus on four categories that are deemed to be the most prominent and important for crisis responders: (i) affected, injured, or dead people, (ii) infrastructure and utility damage, (iii) rescue volunteering or donation effort, and (iv) not humanitarian.

3.1.4 Damage severity

Assessing the severity of the damage is important to help the affected community during disaster events. The severity of damage can be assessed based on the physical destruction to a built-structure visible in an image (e.g., destruction of bridges, roads, buildings, burned houses, and forests). Following the work reported in Nguyen et al. (2017), we define the categories for this classification task as (i) severe damage, (ii) mild damage, and (iii) little or none.

Figure 2 shows an example image that illustrates the labels for all four tasks.

Figure 2: An image annotated as (i) fire event, (ii) informative, (iii) infrastructure and utility damage, and (iv) severe damage.

3.2 Datasets

As mentioned earlier, we used the dataset reported in Alam et al. (2020). The dataset was developed by curating existing publicly available sources and creating non-overlapping train/dev/test splits, which were made publicly available. For the sake of clarity and completeness, we provide a brief overview of the dataset. More details of the dataset curation and consolidation process can be found in Alam et al. (2020).

3.2.1 Damage Assessment Dataset (DAD)

The damage assessment dataset consists of labeled imagery data with damage severity levels such as severe, mild, and little-to-no damage Nguyen et al. (2017). The images have been collected from two sources: AIDR Imran et al. (2014) and Google. To crawl data from Google, the authors used the following keywords: damage building, damage bridge, and damage road. The images from AIDR were collected from Twitter during different disaster events such as Typhoon Ruby, Nepal Earthquake, Ecuador Earthquake, and Hurricane Matthew. The dataset contains images annotated by paid workers as well as volunteers. In this study, we use this dataset for the informativeness and damage severity tasks. For the informativeness task, the study in Alam et al. (2020) mapped the mild and severe images into the informative class and manually categorized the little-to-no damage images into informative and not informative categories. For the damage severity task, the label little-to-no damage was mapped into little or none to align with other datasets.

3.2.2 CrisisMMD

This is a multimodal (i.e., text and image) dataset, which consists of images collected from tweets during seven disaster events crawled by the AIDR system Alam et al. (2018). The data is annotated by crowd-workers using the Figure-Eight platform for three different tasks: (i) informativeness with binary labels (i.e., informative vs. not informative), (ii) humanitarian with eight class labels (i.e., infrastructure and utility damage; vehicle damage; rescue, volunteering, or donation effort; injured or dead people; affected individuals; missing or found people; other relevant information; and not relevant), and (iii) damage severity assessment with three labels (i.e., severe, mild, and little or no damage). For the humanitarian task, similar class labels are grouped together. The images with labels injured or dead people and affected individuals are mapped into one class label, affected, injured, or dead people; infrastructure and utility damage and vehicle damage are mapped into infrastructure and utility damage; and other relevant information and not relevant are mapped into not humanitarian. The images with the label missing or found people are removed, as they are difficult to identify. This results in four class labels for the humanitarian task.
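The label grouping described above can be written as a simple lookup table. The following is a minimal sketch: the names `HUMANITARIAN_LABEL_MAP`, `DROPPED_LABELS`, `map_humanitarian_label`, and `remap_dataset` are illustrative and do not come from the original study, and the label strings simply follow the wording used in this section.

```python
# Hypothetical names; label strings follow the wording of Section 3.2.2.
HUMANITARIAN_LABEL_MAP = {
    "injured or dead people": "affected, injured, or dead people",
    "affected individuals": "affected, injured, or dead people",
    "infrastructure and utility damage": "infrastructure and utility damage",
    "vehicle damage": "infrastructure and utility damage",
    "rescue, volunteering, or donation effort": "rescue volunteering or donation effort",
    "other relevant information": "not humanitarian",
    "not relevant": "not humanitarian",
}

# Images carrying these labels are removed rather than remapped.
DROPPED_LABELS = {"missing or found people"}


def map_humanitarian_label(original_label):
    """Return the merged label, or None if the image should be removed."""
    if original_label in DROPPED_LABELS:
        return None
    return HUMANITARIAN_LABEL_MAP[original_label]


def remap_dataset(items):
    """items: iterable of (image_id, original_label) pairs.

    Returns a list of (image_id, merged_label), skipping removed images.
    """
    remapped = []
    for image_id, label in items:
        new_label = map_humanitarian_label(label)
        if new_label is not None:
            remapped.append((image_id, new_label))
    return remapped
```

Keeping the mapping in one table makes it easy to check that exactly four target labels remain after grouping.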

3.2.3 AIDR Disaster Type Dataset (AIDR-DT)

The AIDR-DT dataset consists of tweets collected from 17 disaster events and 3 general collections. The tweets were collected by the AIDR system Imran et al. (2014). The 17 disaster events include flood, earthquake, fire, hurricane, terrorist attack, and armed conflict. The tweets in the general collections contain keywords related to natural disasters, human-induced disasters, and security incidents. Images were crawled from these collections for disaster type annotation. The labeling of these images was performed in two steps. First, a set of images was labeled as earthquake, fire, flood, hurricane, or none of these categories. Then, a sample of 2200 images labeled as none of these categories in the first step was selected for annotating the not disaster and other disaster categories.

For the landslide category, images were crawled from Google, Bing, and Flickr using the keywords landslide, mudslide, “mud slides”, landslip, “rock slides”, rockfall, “land slide”, earthslip, rockslide, and “land collapse”. Because the images were collected from different sources, the collection contained duplicates. To account for this, duplicate filtering was applied to remove exact and near-duplicate images. Then, the remaining images were manually labeled as landslide or not landslide. The resulting annotated dataset consists of labeled images with seven categories as mentioned in Section 3.1.1.

3.2.4 Damage Multimodal Dataset (DMD)

The multimodal damage identification dataset consists of 5,878 images collected from Instagram and Google Mouzannar et al. (2018). The authors of the study crawled the images using more than 100 hashtags, which are proposed in the crisis lexicon Olteanu et al. (2014). The manually labeled data consist of six damage class labels: fires, floods, natural landscape, infrastructural, human, and non-damage. The non-damage images include cartoons, advertisements, and images that are not relevant or useful for humanitarian tasks. The study by Alam et al. (2020) re-labeled the images for all four tasks (disaster type, informativeness, humanitarian, and damage severity) using the same class labels discussed in the previous section.

Dataset  Class labels     Train   Dev  Test  Total
AIDR-DT  Earthquake        1910   201   376   2487
AIDR-DT  Fire               990   105   214   1309
AIDR-DT  Flood             2059   241   533   2833
AIDR-DT  Hurricane         1188   142   279   1609
AIDR-DT  Landslide          901   119   257   1277
AIDR-DT  Not disaster      1507   198   415   2120
AIDR-DT  Other disaster      65     6    17     88
AIDR-DT  Total             8620  1012  2091  11723
DMD      Earthquake         130    17    35    182
DMD      Fire               255    36    71    362
DMD      Flood              263    35    70    368
DMD      Hurricane          253    36    73    362
DMD      Landslide           38     5    11     54
DMD      Not disaster      2108   288   575   2971
DMD      Other disaster    1057   145   287   1489
DMD      Total             4152   506  1130   5788
Table 1: Data split for the disaster types task.
Dataset    Class labels      Train   Dev  Test  Total
DAD        Informative       15329   590  2266  18185
DAD        Not informative    5950   426  1259   7635
DAD        Total             21279  1016  3525  25820
CrisisMMD  Informative        7233   635  1507   9375
CrisisMMD  Not informative    6535   551  1621   8707
CrisisMMD  Total             13768  1186  3128  18082
DMD        Informative        2071   262   573   2906
DMD        Not informative    2152   240   580   2972
DMD        Total              4223   502  1153   5878
(unnamed)  Informative         627    66   172    865
(unnamed)  Not informative    6677   598  1796   9071
(unnamed)  Total              7304   664  1968   9936
Table 2: Data split for the informativeness task.
Dataset    Class labels                             Train   Dev  Test  Total
CrisisMMD  Affected, injured, or dead people          521    51   100    672
CrisisMMD  Infrastructure and utility damage         3040   299   589   3928
CrisisMMD  Not humanitarian                          3307   296   807   4410
CrisisMMD  Rescue volunteering or donation effort    1682   174   375   2231
CrisisMMD  Total                                     8550   820  1871  11241
DMD        Affected, injured, or dead people          242    28    63    333
DMD        Infrastructure and utility damage          933   125   242   1300
DMD        Not humanitarian                          2736   314   744   3794
DMD        Rescue volunteering or donation effort      74     9    18    101
DMD        Total                                     3985   476  1067   5528
Table 3: Data split for the humanitarian task.
Dataset    Class labels     Train   Dev  Test  Total
DAD        Little or none    7881  1101  1566  10548
DAD        Mild              2828   388   546   3762
DAD        Severe            9457   673  1380  11510
DAD        Total            20166  2162  3492  25820
CrisisMMD  Little or none     317    35    67    419
CrisisMMD  Mild               547    56   125    728
CrisisMMD  Severe            1629   144   278   2051
CrisisMMD  Total             2493   235   470   3198
DMD        Little or none    2874   331   778   3983
DMD        Mild               508    60   132    700
DMD        Severe             857   110   228   1195
DMD        Total             4239   501  1138   5878
Table 4: Data split for the damage severity task.
Task             Class labels                             Train   Dev  Test  Total
Disaster types   Earthquake                                2058   207   404   2669
Disaster types   Fire                                      1270   121   280   1671
Disaster types   Flood                                     2336   266   599   3201
Disaster types   Hurricane                                 1444   175   352   1971
Disaster types   Landslide                                  940   123   268   1331
Disaster types   Not disaster                              3666   435   990   5091
Disaster types   Other disaster                            1132   143   302   1577
Disaster types   Total                                    12846  1470  3195  17511
Informativeness  Informative                              26486  1432  3414  31332
Informativeness  Not informative                          21700  1622  5063  28385
Informativeness  Total                                    48186  3054  8477  59717
Humanitarian     Affected, injured, or dead people          772    73   160   1005
Humanitarian     Infrastructure and utility damage         4001   406   821   5228
Humanitarian     Not humanitarian                          6076   578  1550   8204
Humanitarian     Rescue volunteering or donation effort    1769   172   391   2332
Humanitarian     Total                                    12618  1229  2922  16769
Damage severity  Little or none                           11437  1378  2135  14950
Damage severity  Mild                                      4072   489   629   5190
Damage severity  Severe                                   12810   845  1101  14756
Damage severity  Total                                    28319  2712  3865  34896
Table 5: Data splits for the consolidated dataset for all tasks.

3.3 Data Split

Before consolidating the datasets, each dataset has been divided into train, dev, and test sets with a 70:10:20 ratio, respectively. The purpose was threefold: (i) train and evaluate individual datasets on each task, (ii) have a close-to-equal distribution from each dataset in the final consolidated dataset, and (iii) provide the research community an opportunity to use the splits independently. After the data split, duplicate images are identified across sets and moved into the training set to create a non-overlapping test set.
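The split-then-deduplicate procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the split is a plain random shuffle, and `dup_key` stands in for whatever duplicate detection is used (the actual procedure is described in Alam et al. (2020) and is not reproduced here).

```python
import random


def split_70_10_20(image_ids, seed=0):
    """Shuffle ids and split them into train/dev/test with a 70:10:20 ratio."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.7 * len(ids))
    n_dev = int(0.1 * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]


def move_duplicates_to_train(train, dev, test, dup_key):
    """Move dev/test items whose duplicate group spans several splits into train.

    dup_key(item) is assumed to return the same value for items that are
    duplicates of each other (e.g., a content hash); it is a placeholder.
    """
    # Record, for every duplicate key, the splits in which it occurs.
    splits_per_key = {}
    for name, items in (("train", train), ("dev", dev), ("test", test)):
        for item in items:
            splits_per_key.setdefault(dup_key(item), set()).add(name)

    def spans_splits(item):
        return len(splits_per_key[dup_key(item)]) > 1

    new_train = list(train)
    new_train += [item for item in dev if spans_splits(item)]
    new_train += [item for item in test if spans_splits(item)]
    new_dev = [item for item in dev if not spans_splits(item)]
    new_test = [item for item in test if not spans_splits(item)]
    return new_train, new_dev, new_test
```

Moving every cross-split duplicate into train shrinks the dev and test sets slightly, which is one reason the final ratios in the tables deviate from an exact 70:10:20.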

3.4 Data Consolidation

One of the important reasons to perform data consolidation is to develop robust deep learning models with large amounts of data. For this purpose, all train, dev, and test sets are merged into the consolidated train, dev, and test sets, respectively. While doing so, duplicate images are identified in the dev and test sets and then moved into the train set to create non-overlapping sets for the different tasks. More details of the duplicate identification process can be found in Alam et al. (2020).

3.5 Data Statistics

Tables 1, 2, 3, 4, and 5 show the label distribution of all datasets for all four tasks. Some class labels are skewed in individual datasets. For example, in the disaster type datasets (Table 1), the “other disaster” label is rare in the AIDR-DT dataset, whereas the “landslide” label is rare in the DMD dataset. For the informativeness task, a low distribution is observed for the “informative” label. Moreover, for the humanitarian task, the “rescue volunteering or donation effort” label is rare in the DMD dataset, and for the damage severity task the “mild” label is rare in the CrisisMMD and DMD datasets. However, the consolidated dataset creates a fair balance across class labels for different tasks, as shown in Table 5.

4 Experiments

Our experiments consist of (i) individual vs. consolidated datasets comparison (RQ1), (ii) network architectures comparison (RQ2) on the consolidated datasets, (iii) data augmentation (RQ3), (iv) semi-supervised approach (RQ3), and (v) multitask learning (RQ4). Below, we first describe the experimental settings and then discuss the different experiments that we conducted for this study.

4.1 Experimental Settings

We employ the transfer learning approach to perform experiments, which has shown promising results for various visual recognition tasks in the literature Yosinski et al. (2014); Sharif Razavian et al. (2014); Ozbulak et al. (2016); Oquab et al. (2014). The idea of the transfer learning approach is to use existing weights of a pre-trained model. For this study, we used several neural network architectures using the PyTorch library. The architectures include ResNet18, ResNet50, ResNet101 He et al. (2016), AlexNet Krizhevsky et al. (2012), VGG16 Simonyan and Zisserman (2014), DenseNet Huang et al. (2017), SqueezeNet Iandola et al. (2016), InceptionNet Szegedy et al. (2016), MobileNet Howard et al. (2017), and EfficientNet Tan and Le (2019).

We use the weights of the networks trained using ImageNet Deng et al. (2009) to initialize our model. We adapt the last layer (i.e., softmax layer) of the network according to the particular classification task at hand instead of the original 1,000-way classification. The transfer learning approach allows us to transfer the features and the parameters of the network from the broad domain (i.e., large-scale image classification) to the specific one, in our case four different classification tasks. We train the models using the Adam optimizer Kingma and Ba (2015) with an initial learning rate that is decreased by a factor of 10 when accuracy on the dev set stops improving for 10 epochs. The models were trained for 150 epochs.
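The learning-rate schedule described above (divide by 10 when dev accuracy has not improved for 10 epochs) can be sketched in plain Python. The class name `ReduceOnPlateau` and the initial value used in the usage note are hypothetical; the initial learning rate used in the study is not recoverable from this text. This sketch reduces the rate on the tenth consecutive non-improving epoch; library schedulers may count patience slightly differently.

```python
class ReduceOnPlateau:
    """Divide the learning rate by `factor` when the monitored dev accuracy
    has not improved for `patience` consecutive epochs."""

    def __init__(self, initial_lr, factor=10.0, patience=10):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")  # best dev accuracy seen so far
        self.bad_epochs = 0        # consecutive epochs without improvement

    def step(self, dev_accuracy):
        """Call once per epoch; returns the learning rate for the next epoch."""
        if dev_accuracy > self.best:
            self.best = dev_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr /= self.factor
                self.bad_epochs = 0
        return self.lr
```

A training loop would call `scheduler.step(dev_accuracy)` after each of the 150 epochs, starting from, say, `ReduceOnPlateau(initial_lr=1e-3)` (a placeholder value). PyTorch ships a comparable scheduler, `torch.optim.lr_scheduler.ReduceLROnPlateau`, which takes `mode`, `factor`, and `patience` arguments; whether the authors used it is not stated.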

We designed the binary classifier for the informativeness task and multiclass classifiers for other tasks.

To measure the performance of each classifier, we use weighted average precision (P), recall (R), and F1-measure (F1). We only report F1-measure due to limited space.
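For reference, a support-weighted average of per-class precision, recall, and F1 can be computed as in the sketch below. It assumes that "weighted average" means weighting each class by its number of true instances, which is the convention of scikit-learn's `average="weighted"` option; the function name `weighted_prf` is illustrative.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted average of per-class precision, recall, and F1."""
    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    p_sum = r_sum = f_sum = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        weight = n_true / total
        p_sum += weight * precision
        r_sum += weight * recall
        f_sum += weight * f1
    return p_sum, r_sum, f_sum
```

Note that with this weighting the averaged recall equals plain accuracy, while the averaged F1 reflects class imbalance through the per-class supports.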

4.2 Datasets Comparison

To determine whether consolidated data helps achieve better performance, we train the models using training sets from the individual and consolidated datasets. However, we always test the models on the consolidated test set. As our test data is the same across different experiments, results are ensured to be comparable. Since we have four different tasks, which consist of fifteen different datasets, we only experimented with the ResNet18 He et al. (2016) network architecture to manage the computational load.

4.3 Network Architectures

Currently available neural network architectures come with different computational complexity. As one of our goals is to deploy the models in real-time applications, we compare them to understand their performance differences. Another motivation is that the current literature in crisis informatics only reports results using one or two network architectures (e.g., VGG16 in Ofli et al. (2020), InceptionNet in Mouzannar et al. (2018)), which we wanted to extend in this study.

4.4 Data Augmentation

Data augmentation is a commonly used technique to improve the generalization of deep neural networks in the absence of large-scale datasets. We experiment with the recently proposed RandAugment Cubuk et al. (2020) method for image augmentation. In the literature, RandAugment was proposed as a fast alternative to learned augmentation strategies. We used the PyTorch implementation in our experiments. To increase the diversity of generated examples, we used the following 16 different transformations:

  1. AutoContrast

  2. Equalize

  3. Invert

  4. Rotate

  5. Color

  6. Posterize

  7. Solarize

  8. SolarizeAdd

  9. Contrast

  10. Brightness

  11. Sharpness

  12. ShearX

  13. ShearY

  14. CutoutAbs

  15. TranslateX

  16. TranslateY

The augmentation strength can be controlled with two tunable parameters:

  1. N: the number of augmentation transformations to apply sequentially

  2. M: the magnitude for all the transformations.

Each transformation resides on an integer scale from 0 to 30, with 30 being the maximum strength. In our experiments, we use a constant magnitude M for all augmentations. The augmentation method then boils down to randomly selecting N transformations and applying each transformation sequentially with strength corresponding to scale M.
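The selection step can be sketched as below, where `n` is the number of transformations and `m` the shared magnitude on the 0 to 30 scale. This is an illustration only: the function names and the `ops` mapping are hypothetical, the image operations themselves are supplied by the caller, sampling with replacement is an assumption, and the values of `n` and `m` used in the study are not recoverable from this text.

```python
import random

# The 16 transformation names listed above.
TRANSFORMS = [
    "AutoContrast", "Equalize", "Invert", "Rotate", "Color", "Posterize",
    "Solarize", "SolarizeAdd", "Contrast", "Brightness", "Sharpness",
    "ShearX", "ShearY", "CutoutAbs", "TranslateX", "TranslateY",
]
MAX_MAGNITUDE = 30


def sample_policy(n, m, rng=random):
    """Pick n transformations uniformly at random (with replacement, an
    assumption) and pair each with the constant magnitude m."""
    if not 0 <= m <= MAX_MAGNITUDE:
        raise ValueError("magnitude must lie in [0, 30]")
    return [(rng.choice(TRANSFORMS), m) for _ in range(n)]


def apply_policy(image, policy, ops):
    """Apply the sampled transformations sequentially.

    ops is a caller-supplied (hypothetical) mapping from transformation name
    to a function taking (image, strength), with strength scaled to [0, 1].
    """
    for name, magnitude in policy:
        image = ops[name](image, magnitude / MAX_MAGNITUDE)
    return image
```

Because a new policy is sampled for every image, only two scalars need tuning, which is what makes the method cheaper than learned augmentation strategies.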

In addition, we used weight decay, which is one of the most commonly used techniques for regularizing parametric machine learning models Moody et al. (1995). This helps to reduce overfitting of the models and to avoid exploding gradients.

We conducted the data augmentation experiments using all ten neural network architectures. We used a weight decay of and other hyper-parameters remain the same as discussed in Section 4.1.

4.5 Semi-supervised Learning

State-of-the-art image classification models are often trained with a large amount of labeled data, which is prohibitively expensive to collect in many applications. Semi-supervised learning is a powerful approach to mitigate this issue by leveraging unlabeled data to improve the performance of machine learning models. Since unlabeled data can be obtained without significant human labor, the performance boost gained from semi-supervised learning comes at a low cost and can be scaled easily. In the literature, many semi-supervised techniques have been proposed focusing on deep learning Xie et al. (2020); Sohn et al. (2020); Berthelot et al. (2019a, b); Laine and Aila (2016); Lee and others (2013); McLachlan (1975); Sajjadi et al. (2016); Tarvainen and Valpola (2017); Verma et al. (2019); Xie et al. (2019); Alam et al. (2018a). Among them, the self-training approach is one of the earliest Scudder (1965), and it has been adopted for deep neural networks. The self-training approach, also called pseudo-labeling Lee and others (2013), uses the model's predictions as labels and retrains the model against them.

For this study, we use Noisy student training (i.e., a simple self-training approach), which was proposed in Xie et al. (2020) as a semi-supervised learning approach to improve the accuracy and robustness of state-of-the-art image classification models. The algorithm consists of three main steps:

  1. Train a teacher model on labeled images

  2. Use the teacher model to generate pseudo labels on unlabeled images

  3. Train a student model on combined labeled and pseudo labeled images

The algorithm can be iterated multiple times by treating the student as the new teacher and labeling the unlabeled images with it. During the learning of the student, different noises can be injected, such as dropout Srivastava et al. (2014) and data augmentation via RandAugment Cubuk et al. (2020). The student model is made larger than or equal to the teacher. The presence of noise and larger model capacity help the student model generalize better than the teacher.

Labeled dataset:

As for the labeled dataset, we used our consolidated datasets and ran the experiments for all tasks.

Unlabeled dataset:

To obtain unlabeled images, we crawled images from the tweets of the 20 different disaster collections (as mentioned in Section 3.2.3). We removed duplicates and made sure the same images are not in our labeled dataset by matching their ids and applying duplicate filtering. The resulting unlabeled dataset consists of 1514497 images, which we used in our experiments.


We ran our experiments using the EfficientNet (b1) architecture as it was performing better compared to the other models. In addition, it is one of the models used in the Noisy student experiments reported in Xie et al. (2020). One significant difference between the work in Xie et al. (2020) and ours is that we initialize our student model's weights with ImageNet pretrained weights. In contrast, Xie et al. (2020) train the student model from scratch. Our labeled dataset is significantly smaller than the ImageNet dataset. As such, in our experiments, training from scratch substantially degrades performance.

Training details:

We first trained the model using the EfficientNet (b1) architecture on the labeled dataset (Step 1), which is referred to as the teacher model. We then predicted outputs for the unlabeled images (Step 2). Finally, we trained the student EfficientNet (b1) model by combining labeled and pseudo labeled images (Step 3). In this step, we filtered and balanced the unlabeled data. We selected the images whose predicted label has a confidence score greater than a certain task-specific threshold. After this, we balanced the training data so that each class has the same number of images as the class having the lowest number of images. To do this, for each class, we take the images having the highest confidence scores.
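The filtering and balancing of pseudo labeled images described above can be sketched as below. This is an illustrative reading of the procedure, not the authors' code: the function name `filter_and_balance`, the tuple layout of the input, and the choice to ignore classes that have no image above the threshold are assumptions.

```python
from collections import defaultdict


def filter_and_balance(pseudo, threshold):
    """Select confident pseudo labels and balance them across classes.

    pseudo: list of (image_id, predicted_class, confidence) tuples produced
    by the teacher model (hypothetical layout).
    Returns a list of (image_id, predicted_class) pairs.
    """
    by_class = defaultdict(list)
    for image_id, label, conf in pseudo:
        if conf > threshold:  # keep only confident teacher predictions
            by_class[label].append((conf, image_id))
    if not by_class:
        return []
    # Size of the smallest class among those that survived filtering
    # (assumption: classes with no confident image are simply absent).
    k = min(len(items) for items in by_class.values())
    selected = []
    for label in sorted(by_class):
        top = sorted(by_class[label], reverse=True)[:k]  # highest confidence first
        selected.extend((image_id, label) for _, image_id in top)
    return selected
```

With a threshold of 0.7, for instance, a class with three confident images and a class with one confident image would each contribute one image, the most confident one per class.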

For the experiments, we used a batch size of 16 for labeled images and 48 for unlabeled images. Labeled and unlabeled images are concatenated together to compute the average cross-entropy loss. We used RandAugment with the number of augmentation, , and the strength of augmentation, . We optimized the confidence thresholds separately for different tasks using the dev sets. The thresholds for disaster types, informativeness, humanitarian, and damage severity tasks were respectively 0.7, 0.8, 0.45, and 0.45. Similar to the data augmentation experiments we used a weight decay of and kept other hyper-parameters the same as discussed in Section 4.1.

Class labels Train Dev Test Total
Disaster types
Earthquake 1987 218 464 2669
Fire 1115 154 402 1671
Flood 2175 300 726 3201
Hurricane 1249 216 506 1971
Landslide 917 127 287 1331
Not disaster 3064 564 1463 5091
Other disaster 489 218 870 1577
Total 10996 1797 4718 17511
Informativeness
Informative 22018 2736 6578 31332
Not informative 18841 2460 7084 28385
Total 40859 5196 13662 59717
Humanitarian
Affected injured or dead people 537 115 353 1005
Infrastructure and utility damage 2397 736 2095 5228
Not humanitarian 4354 886 2964 8204
Rescue volunteering or donation effort 1312 268 752 2332
Total 8600 2005 6164 16769
Damage Severity
Little or none 9124 1677 4149 14950
Mild 3188 663 1339 5190
Severe 11102 1145 2509 14756
Total 23414 3485 7997 34896
Table 6: Data split for multi-task setting with incomplete/missing labels. DT: Disaster types, Info: Informative, Hum: Humanitarian, DS: Damage Severity
Two tasks: Info and Hum
Class labels Train Dev Test Total
Informativeness
Informative 2111 399 1064 3574
Not informative 2546 397 1443 4386
Total 4657 796 2507 7960
Humanitarian
Affected injured or dead people 426 72 166 664
Infrastructure and utility damage 410 81 210 701
Not humanitarian 2547 397 1443 4387
Rescue volunteering or donation effort 1274 246 688 2208
Total 4657 796 2507 7960
Two tasks: Info and damage severity
Informativeness
Informative 14683 1306 2206 18195
Not informative 4687 928 2020 7635
Total 19370 2234 4226 25830
Damage Severity
Little or none 7085 1094 2369 10548
Mild 2665 426 679 3770
Severe 9620 714 1178 11512
Total 19370 2234 4226 25830
Table 7: Data split for multi-task setting with complete aligned labels for the different combinations of two-tasks.
Class labels Train Dev Test Total
Disaster types
Earthquake 68 25 90 183
Fire 80 35 155 270
Flood 102 54 162 318
Hurricane 110 75 214 399
Landslide 8 6 24 38
Not disaster 1563 368 1043 2974
Other disaster 372 198 806 1376
Total 2303 761 2494 5558
Informativeness
Informative 740 393 1454 2587
Not informative 1563 368 1040 2971
Total 2303 761 2494 5558
Humanitarian
Affected injured or dead people 85 34 164 283
Infrastructure and utility damage 398 230 764 1392
Not humanitarian 1794 483 1513 3790
Rescue volunteering or donation effort 26 14 53 93
Total 2303 761 2494 5558
Damage Severity
Little or none 1805 494 1571 3870
Mild 174 102 337 613
Severe 324 165 586 1075
Total 2303 761 2494 5558
Table 8: Data split for multi-task setting with complete aligned labels for four tasks: DT, Info, Hum and DS.

4.6 Multi-task Learning

Since the tasks share similar properties, we also consider training the model in multi-task settings with shared parameters. The benefits of multi-task settings can be twofold: (i) learning a shared representation can help the model generalize better and improve performance on individual tasks, and (ii) training a single model instead of four different models will yield a significant speed-up and reduce the computational load during training and inference. It is important to mention that the Crisis Benchmark Dataset was not designed for multitask learning; rather, it was prepared for each task separately. Hence, we needed to prepare it for the multitask setup. Creating multitask learning datasets from the Crisis Benchmark Dataset introduced a challenge: train and test set images overlap among different tasks. Hence, we prepare the datasets for the multitask setting using the following strategy:

  1. We merge the test sets from different tasks into a combined test set. If an image in the combined test set is present in the train or dev set of some tasks, we remove it from that split and add the label in the test set.

  2. We merge the dev sets of the four tasks into the combined dev set. If an image in the combined dev set is present in the train set of some tasks, we remove it from that train split and add the label in the dev set.

  3. We merge the train sets of the four tasks into the combined train set. Since we have removed images that overlap with the dev set and test set in the previous steps, this guarantees no image from the train set will be present in the other splits.
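The three merging steps above amount to giving each image a single split, with test taking priority over dev and dev over train, and then collecting all of its task labels in that split. A minimal sketch under that reading is shown below; the function name `merge_splits` and the nested-dictionary layout are assumptions made for illustration.

```python
def merge_splits(task_splits):
    """Merge per-task splits so that every image lives in exactly one split.

    task_splits: {task: {"train" | "dev" | "test": {image_id: label}}}
    (hypothetical layout).
    Returns {split: {image_id: {task: label}}}.
    """
    priority = ["test", "dev", "train"]  # test wins over dev, dev over train
    # Pass 1: the final split of an image is the highest-priority split
    # in which it appears for any task.
    assigned = {}
    for splits in task_splits.values():
        for split in priority:
            for image_id in splits.get(split, {}):
                current = assigned.get(image_id)
                if current is None or priority.index(split) < priority.index(current):
                    assigned[image_id] = split
    # Pass 2: move every task label of an image to its final split.
    merged = {split: {} for split in priority}
    for task, splits in task_splits.items():
        for split_items in splits.values():
            for image_id, label in split_items.items():
                merged[assigned[image_id]].setdefault(image_id, {})[task] = label
    return merged
```

Because labels are moved rather than dropped, the merged dev and test sets can only grow and the merged train set can only shrink relative to the single-task splits.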

Since not all images have annotations for all four tasks, there is a discrepancy in the number of images available for different tasks. We report the distribution of the data splits for the multi-task setting in Table 6. Overall, there are 49353 images in the train set, 6157 images in the dev set, and 15688 images in the test set. Due to the overlap of images in different splits for different tasks, there is also a discrepancy in the number of images available between the multi-task and single-task settings. As an example, for the disaster types task, there are 12846 images in the train set, 1470 images in the dev set, and 3195 images in the test set in the single-task setting. However, in the multi-task setting, these numbers are respectively 10996, 1797, and 4718. As a consequence of our merging procedure, there are more images in the test and dev sets and fewer images in the train set.

A few approaches have been proposed in the literature to address the issue of incomplete/missing labels in multi-task settings. They usually work by generating missing task labels using different methods, including Bayesian networks Kapoor et al. (2012), a rule-based approach Kollias and Zafeiriou (2019), and knowledge distillation from another model Deng et al. (2020). In our experiments, we opt for a simpler alternative. Specifically, we do not compute the loss for a task if its label is missing. Since the tasks have varying numbers of training images, we calculate the loss for each task and aggregate them in a batch. This ensures that the loss of each task is weighted equally. The process is detailed in Algorithm 1.

Input: batch_input   // images in the batch
       batch_labels  // list of labels for each task
       num_classes   // number of classes for each task
       model         // outputs predictions for all tasks, combined
Output: batch_loss
num_tasks = len(num_classes)
prediction = model(batch_input)
batch_loss = 0
task_index = 0   // starting index for the output corresponding to this task
for i = 0 to num_tasks - 1 do
       prediction_task = prediction[:, task_index:task_index + num_classes[i]]
       label_task = batch_labels[i]
       /* if there is no label for a task it is marked as -1 in the label */
       valid_idx = (label_task != -1)
       task_loss = loss(prediction_task[valid_idx], label_task[valid_idx])
       batch_loss = batch_loss + task_loss
       task_index = task_index + num_classes[i]
Algorithm 1 Batch loss calculation in the multi-task setting
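Algorithm 1 can be illustrated with a small, dependency-free sketch. It assumes that the per-task loss is the mean softmax cross-entropy and that a task with no labeled image in the batch contributes zero; both choices, as well as the function names, are assumptions made for this example rather than details stated in the text.

```python
import math


def cross_entropy(logits_rows, labels):
    """Mean softmax cross-entropy over rows; 0.0 for an empty selection (assumption)."""
    if not labels:
        return 0.0
    total = 0.0
    for logits, label in zip(logits_rows, labels):
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[label]
    return total / len(labels)


def multitask_batch_loss(prediction, batch_labels, num_classes):
    """Sum of per-task losses, skipping images whose task label is missing.

    prediction: per-image list of concatenated outputs for all tasks.
    batch_labels[i]: per-image labels of task i, with -1 marking a missing label.
    num_classes[i]: number of output columns that belong to task i.
    """
    batch_loss = 0.0
    start = 0  # first output column belonging to the current task
    for i, n_cls in enumerate(num_classes):
        valid = [j for j, y in enumerate(batch_labels[i]) if y != -1]
        task_logits = [prediction[j][start:start + n_cls] for j in valid]
        task_labels = [batch_labels[i][j] for j in valid]
        batch_loss += cross_entropy(task_logits, task_labels)
        start += n_cls
    return batch_loss
```

Averaging within each task before summing is what gives every task the same weight in the batch loss, regardless of how many images in the batch carry a label for it.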

We also experiment with images having completely aligned labels for different tasks. We identified three such combinations that have a substantial number of images in different classes. Two of them are two-task subsets. The first one is informativeness and humanitarian, which has 7960 aligned images in total. The second one is informativeness and damage severity, which has 25830 images in total. The data distribution for these two settings is reported in Table 7. The final subset contains images having labels for all four tasks and consists of 5558 images. The data distribution for this set is reported in Table 8.

5 Results

Our experimental results cover different settings. Below we discuss each of them in detail.

Dataset Acc P R F1
Disaster types (7 classes)
AIDR-DT 0.76 0.72 0.76 0.73
DMD 0.58 0.73 0.58 0.59
Consolidated 0.79 0.78 0.79 0.79
Informativeness (2 classes)
DAD 0.80 0.80 0.80 0.80
CrisisMMD 0.79 0.79 0.79 0.79
DMD 0.80 0.80 0.80 0.80
AIDR-Info 0.75 0.79 0.75 0.73
Consolidated 0.85 0.85 0.85 0.85
Humanitarian (4 classes)
CrisisMMD 0.73 0.73 0.73 0.73
DMD 0.68 0.68 0.68 0.64
Consolidated 0.75 0.75 0.75 0.75
Damage severity (3 classes)
DAD 0.72 0.70 0.72 0.71
CrisisMMD 0.41 0.57 0.41 0.37
DMD 0.68 0.66 0.68 0.66
Consolidated 0.75 0.73 0.75 0.74
Table 9: Results on different classification tasks using the ResNet18 model. Trained on individual and consolidated datasets and tested on consolidated test sets.

5.1 Dataset Comparison

In Table 9, we report classification results for different tasks and different datasets using the ResNet18 network architecture. The performance on different tasks is not directly comparable as the tasks have different levels of complexity (e.g., varying numbers of class labels, class imbalance, etc.). For example, informativeness classification is a binary task, which is simpler than a classification task with more labels (e.g., seven labels in disaster types). Hence, the performance is comparatively higher for informativeness. An example of a class imbalance issue can be seen in Table 5 with the damage severity task. The mild class has comparatively few images, which is reflected in its own and the overall performance. The mild class label is also less distinctive than the other class labels, and we noticed that classifiers often confuse it with the other two class labels. Similar findings have also been reported in Nguyen et al. (2017). For the disaster types task, the performance of the AIDR-DT model is higher compared to the DMD model. We observe that the DMD dataset is comparatively small and the model trained on it does not perform well on the consolidated test set. This characteristic is observed in other tasks as well. For the damage severity task, the CrisisMMD model performs worst, which is also reflected in its dataset size, i.e., 2493 images in the training set, as can be seen in Table 4. As expected, for all tasks, the models trained on the consolidated datasets outperform those trained on individual datasets.

Arch Acc P R F1 Acc P R F1
Disaster types Informative
ResNet18 0.790 0.783 0.790 0.785 0.852 0.851 0.852 0.851
ResNet50 0.810 0.806 0.810 0.808 0.852 0.852 0.852 0.852
ResNet101 0.817 0.812 0.817 0.813 0.853 0.853 0.853 0.852
AlexNet 0.756 0.756 0.756 0.754 0.827 0.829 0.827 0.828
VGG16 0.800 0.796 0.800 0.798 0.859 0.858 0.859 0.858
DenseNet(121) 0.811 0.805 0.811 0.806 0.863 0.863 0.863 0.862
SqueezeNet 0.757 0.754 0.757 0.755 0.829 0.829 0.829 0.829
InceptionNet (v3) 0.562 0.609 0.562 0.528 0.663 0.723 0.663 0.593
MobileNet (v2) 0.785 0.781 0.785 0.782 0.850 0.849 0.850 0.849
EfficientNet (b1) 0.818 0.815 0.818 0.816 0.864 0.863 0.864 0.863
Humanitarian Damage severity
ResNet18 0.754 0.747 0.754 0.749 0.751 0.734 0.751 0.736
ResNet50 0.770 0.762 0.770 0.762 0.763 0.746 0.763 0.751
ResNet101 0.769 0.763 0.769 0.765 0.760 0.736 0.760 0.737
AlexNet 0.721 0.715 0.721 0.716 0.734 0.714 0.734 0.709
VGG16 0.778 0.773 0.778 0.773 0.769 0.750 0.769 0.753
DenseNet(121) 0.765 0.756 0.765 0.755 0.755 0.734 0.755 0.739
SqueezeNet 0.730 0.717 0.730 0.719 0.733 0.707 0.733 0.708
InceptionNet (v3) 0.598 0.637 0.598 0.509 0.660 0.623 0.660 0.615
MobileNet (v2) 0.751 0.745 0.751 0.746 0.746 0.727 0.746 0.730
EfficientNet (b1) 0.767 0.764 0.767 0.765 0.766 0.754 0.766 0.758
Table 10: Results using different neural network models on the consolidated dataset with four different tasks. Trained and tested using the consolidated dataset. Comparable results are shown in bold and the best results are underlined. IncepNet (InceptionNet), MobNet (MobileNet), EffiNet (EfficientNet)
Model # Layer # Param (M) Memory (MB)
ResNet18 18 11.18 74.61
ResNet50 50 23.51 233.54
ResNet101 101 42.50 377.58
AlexNet 8 57.01 222.24
VGG16 16 134.28 673.87
DenseNet (121) 121 6.96 174.2
SqueezeNet 18 0.74 47.99
InceptionNet (v3) 42 24.35 206.01
MobileNet (v2) 20 2.23 8.49
EfficientNet (b1) 25 7.79 177.82
Table 11: Different neural network models with their number of layers, number of parameters, and memory requirement during the inference of a binary (Informativeness) classification task.
Figure 3: Average F1 scores from all four tasks with different network architectures, which shows on average EfficientNet (b1) performs better than other architectures.
(a) Disaster types
(b) Informativeness
Figure 4: Statistical significance test among the different network architectures for Disaster types and Informativeness tasks. p-values are presented in cells. Light yellow color represents that they are statistically significant with
(a) Humanitarian
(b) Damage severity
Figure 5: Statistical significance test among the different network architectures for Humanitarian and Damage severity tasks. p-values are presented in cells. Light yellow color represents that they are statistically significant with

5.2 Network Architectures Comparison

In Table 10, we report results using different network architectures on the consolidated datasets for different tasks, i.e., trained and tested using a consolidated dataset. Across different tasks, EfficientNet (b1) performs better overall than the other models, as shown in Figure 3, except for the humanitarian task, for which VGG16 outperforms the other models. The second-best models are VGG16, ResNet50, ResNet101, and DenseNet (121). From the results of different tasks, we observe that InceptionNet (v3) is the worst performing model.

The performance differences among models such as EfficientNet (b1), VGG16, ResNet50, ResNet101, and DenseNet (121) are small; hence, we ran statistical tests to understand whether such small differences are significant. We used McNemar's test for the binary classification task (i.e., informativeness) and Bowker's test for the other, multiclass classification tasks. More details of these tests can be found in Hoffman (2019). We ran the tests between pairs of models to see pair-wise differences. In Figures 4 and 5, we report the results of the significance tests. The value in each cell represents the p-value, and the light yellow color indicates that the difference is statistically significant with . From Figure 4, we see that for the disaster types task the p-value is higher than in the comparisons between EfficientNet (b1) and ResNet50, ResNet101, and DenseNet (121), which is consistent with the results in Table 10. Similarly, the difference is very small between EfficientNet (b1) and VGG16 and DenseNet (121). For the humanitarian and damage severity tasks, we observed similar behavior. Analyzing all four tasks, VGG16 appears to be the second-best performing model.
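For the binary case, the pair-wise comparison can be sketched as below. The text does not state which variant of McNemar's test was used, so the continuity-corrected chi-square form here is an assumption, as is the function name `mcnemar`. The p-value uses the closed form of the chi-square survival function with one degree of freedom.

```python
import math


def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for two classifiers evaluated on the same test set."""
    # Only discordant pairs matter:
    # b = model A correct and model B wrong, c = model A wrong and model B correct.
    b = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a == t and p != t)
    c = sum(1 for t, a, p in zip(y_true, pred_a, pred_b) if a != t and p == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected chi-square statistic
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 d.o.f.
    return stat, p_value
```

Bowker's test generalizes the same idea to more than two classes by comparing the off-diagonal cells of the square table that cross-tabulates the two models' predictions.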

In Table 11, we also report different neural network models with their number of layers, number of parameters, and memory consumption during the inference of the informativeness task. There is always a trade-off between performance and computational complexity, i.e., number of layers, parameters, and memory consumption. In terms of memory consumption and the number of parameters, VGG16 is more expensive than the others. Based on the performance and computational complexity, we conclude that EfficientNet can be the best option to use in real-time applications. We computed the throughput for EfficientNet using a batch size of 128, and it can process 260 images per second on an NVIDIA Tesla P100 GPU. Among the ResNet models, ResNet18 is a reasonable choice given that its computational complexity is significantly lower than that of the other ResNet models.

Arch Acc P R F1 Diff. Acc P R F1 Diff.
Disaster types Informative
ResNet18 0.812 0.807 0.812 0.809 2.4 0.848 0.847 0.848 0.847 -0.4
ResNet50 0.817 0.81 0.817 0.812 0.4 0.863 0.863 0.863 0.862 1.0
ResNet101 0.819 0.815 0.819 0.816 0.3 0.857 0.858 0.857 0.858 0.6
AlexNet 0.755 0.753 0.755 0.753 -0.1 0.827 0.826 0.827 0.825 -0.3
VGG16 0.803 0.797 0.803 0.798 0.0 0.855 0.855 0.855 0.855 -0.3
DenseNet (121) 0.817 0.811 0.817 0.813 0.7 0.858 0.858 0.858 0.857 -0.5
SqueezeNet 0.726 0.719 0.726 0.717 -3.8 0.821 0.820 0.821 0.820 -0.9
InceptionNet (v3) 0.808 0.801 0.808 *0.802 25.4 0.860 0.859 0.860 *0.859 33.1
MobileNet (v2) 0.793 0.788 0.793 0.789 0.7 0.854 0.853 0.854 0.853 0.4
EfficientNet (b1) 0.838 0.834 0.838 0.835 1.9 0.869 0.868 0.869 0.868 0.5
Humanitarian Damage severity
ResNet18 0.745 0.738 0.745 0.741 -0.8 0.757 0.736 0.757 0.739 0.3
ResNet50 0.774 0.769 0.774 0.768 0.6 0.763 0.745 0.763 0.749 -0.2
ResNet101 0.774 0.778 0.774 0.775 1 0.766 0.753 0.766 0.757 2.0
AlexNet 0.718 0.709 0.718 0.709 -0.7 0.728 0.712 0.728 0.713 0.4
VGG16 0.772 0.766 0.772 0.767 -0.6 0.767 0.748 0.767 0.752 -0.1
DenseNet (121) 0.759 0.756 0.759 0.755 0 0.760 0.741 0.760 0.747 0.8
SqueezeNet 0.720 0.713 0.720 0.712 -0.7 0.729 0.708 0.729 0.702 -0.6
InceptionNet (v3) 0.762 0.753 0.762 *0.754 25.6 0.758 0.735 0.758 *0.739 11.5
MobileNet (v2) 0.759 0.749 0.759 0.751 0.5 0.758 0.737 0.758 0.738 0.8
EfficientNet (b1) 0.785 0.784 0.785 0.784 1.9 0.777 0.762 0.777 *0.765 0.7
Table 12: Results with data augmentation and weight decay using different neural network models on the consolidated dataset for all four tasks. Diff. represents the difference from the results without RandAugment presented in Table 10. * represents statistically significant (with ) compared to the results without RandAugment.
(a) Without RandAugment
(b) With RandAugment and weight decay
Figure 6: Training/validation losses and accuracies without and with augmentation for Informativeness task.
(a) Without RandAugment
(b) With RandAugment and weight decay
Figure 7: Training/validation losses and accuracies without and with augmentation for Humanitarian task.

5.3 Data Augmentation

To reduce overfitting and obtain more generalized models, we used data augmentation and weight decay. In Table 12, we report the results for all tasks using all network architectures. The column Diff. reports the difference from the results presented in Table 10, where no RandAugment or weight decay has been applied. The improved results are highlighted with light blue color for all tasks. Out of 40 experiments (10 network architectures ✕ 4 tasks), for 26 cases, the augmentation with weight decay improved the performance.

For the improved cases, we also computed a statistical significance test between the models without RandAugment and the models with RandAugment and weight decay. We found that the improvements for the models with InceptionNet (v3) are statistically significant in all tasks. For EfficientNet (b1), only the improvement for the damage severity task is statistically significant; for the other tasks, the improvements are not statistically significant. We investigated training and validation losses over the number of epochs. In Figures 6 and 7, we report training and validation losses and accuracies for the EfficientNet (b1) model for the Informativeness and Humanitarian tasks, respectively. From Figures 6(a) and 7(a), we clearly see that the models are overfitting, whereas Figures 6(b) and 7(b) show that the models are more generalized. These findings demonstrate the benefits of augmentation and weight decay.

5.4 Semi-supervised Learning

In Table 13, we present the results of the Noisy student based self-training approach along with the results without and with RandAugment. We have an improvement for the Informativeness task. For the Humanitarian task, the performance is similar to RandAugment. For the Damage severity task, the performance of Noisy student is the same as without RandAugment but lower than with RandAugment.

We postulate the following possible reasons for the lack of improvements in the semi-supervised learning experiments:

  1. Semi-supervised learning usually performs better when trained from scratch instead of fine-tuning from a pretrained model. This phenomenon is explored in Zhou et al. (2018), where the authors reported that the performance gains from semi-supervised learning methods are usually smaller when training starts from a pretrained model. We could not train the student model from scratch as our labeled datasets are small, and doing so degrades performance even more.

  2. We had to use a much smaller labeled batch size of 16 compared to those used in Xie et al. (2020) (512 or higher) due to GPU constraints. Having a larger labeled batch size and, consequently, more unlabeled images in each batch may yield a better result.

Exp. Acc P R F1
Disaster type
Without RandAugment 0.818 0.815 0.818 0.816
RandAugment 0.838 0.834 0.838 0.835
NS 0.793 0.812 0.793 0.794
Informativeness
Without RandAugment 0.864 0.863 0.864 0.863
RandAugment 0.869 0.868 0.869 0.868
NS 0.878 0.878 0.878 0.876
Humanitarian
Without RandAugment 0.767 0.764 0.767 0.765
RandAugment 0.785 0.784 0.785 0.784
NS 0.783 0.786 0.783 0.783
Damage severity
Without RandAugment 0.766 0.754 0.766 0.758
RandAugment 0.777 0.762 0.777 0.765
NS 0.773 0.753 0.773 0.759
Table 13: Results with the Noisy student self-training approach using the EfficientNet (b1) neural network model on the consolidated datasets for all four tasks. NS: Noisy student
Task Acc P R F1
Disaster type 0.647 0.657 0.647 0.637
Informativeness 0.727 0.735 0.727 0.726
Humanitarian 0.775 0.772 0.775 0.773
Damage severity 0.744 0.732 0.744 0.737
Table 14: Results of multitask learning with incomplete/missing labels.
Task Acc P R F1
Two tasks: Info and DS
Informative 0.855 0.856 0.855 0.855
Damage Severity 0.806 0.799 0.806 0.802
Two tasks: Info and Hum
Informative 0.817 0.816 0.817 0.816
Humanitarian 0.761 0.756 0.761 0.758
Four tasks: DT, Info, Hum and DS
Disaster types 0.781 0.768 0.781 0.772
Informative 0.920 0.921 0.920 0.920
Humanitarian 0.827 0.807 0.827 0.816
Damage Severity 0.772 0.750 0.772 0.759
Table 15: Results of multitask learning with different tasks combinations and complete labels. DT: Disaster Types, Info: Informative, Hum: Humanitarian, DS: Damage Severity.

5.5 Multi-task Learning

Since the Crisis Benchmark Dataset was not designed for multitask learning, we needed to re-split it as discussed in Section 4.6. This resulted in two different settings: (i) incomplete/missing labels, and (ii) complete aligned labels. Incomplete/missing labels in multitask learning are a challenging problem, which we addressed using masking, i.e., for an unlabeled output we do not compute the loss for that particular task. In Table 14, we report the results of multitask learning with missing labels, where we address all tasks. We also investigated different task combinations where all labels are present. In Table 15, we report the results of the task combinations that have complete aligned labels. Performance differs across task combinations due to their data sizes, label distributions, and task settings. The results with multitask learning are not exactly comparable with our single-task setup. They can serve as a baseline for future studies.

Figure 8: Grad-CAM visualization of some images for the informativeness task.
Figure 9: Grad-CAM visualization of some images for the disaster types task.

5.6 Visual Explanation using Grad-CAM

We explore how the neural networks arrive at their decisions by utilizing Gradient-weighted Class Activation Mapping (Grad-CAM) Selvaraju et al. (2017). Grad-CAM uses the gradient of a target class flowing into the final convolution layer to produce a localization map highlighting the important regions in the image for that specific class. We use an existing implementation. We display results for two candidate networks, VGG16 and EfficientNet, on two tasks: informativeness and disaster types. We use the models trained using RandAugment for this experiment.
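The core Grad-CAM computation can be sketched without any deep learning library, assuming the feature maps of the final convolution layer and the gradients of the target class score with respect to them have already been extracted. The function name `grad_cam` and the nested-list representation are illustrative choices, not part of the implementation used in the experiments.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM localization map.

    activations, gradients: K feature maps of size H x W as nested lists;
    gradients[k][y][x] is d(class score) / d(activations[k][y][x]).
    Returns an H x W map of non-negative importance values.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w, fmap in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w * fmap[y][x]
    # ReLU keeps only the regions with a positive influence on the class.
    return [[max(v, 0.0) for v in row] for row in cam]
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image as a heat map, which is what the visualizations in this section show.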

In Figure 8, we show the activation map for the predicted class for some images from the informativeness test set. From these images, it seems that EfficientNet performs better for localizing important regions in the image for the class of interest. VGG16 tends to depend on smaller regions for decision making. The last row shows an image where VGG16 misclassified an informative image as not informative.

We show the activation map for some images from the test set of the disaster types task in Figure 9. Here, the difference in localization quality between the two models is even more pronounced. The activation maps from VGG are difficult to interpret in the first and third images, even though the model classifies them correctly. The second image shows that VGG may focus on the smoke regions for classifying fire images. This explains why it identifies the last image as fire, mistaking the clouds as smoke.

Overall, these results suggest that EfficientNet not only outperforms other models in the numeric measures, it also produces results that are easier to interpret.

6 Discussions and Future Works

6.1 Our Findings

Real-time event detection from social media content is an important problem. Our proposed pipeline and models are suitable for deployment in real-time applications. The proposed models can also be used independently. For example, the disaster types model can be used to monitor real-time disaster events.

Our experiments were based on the research questions discussed in Section 1. Below we report our findings for each of them.

RQ1: Our dataset comparison suggests that data consolidation helps, which answers our first research question.

RQ2: We also explore several deep learning models, which vary in performance and complexity. Among them, EfficientNet (b1) appears to be a reasonable option. Note that EfficientNet has a series of network architectures (b0-b7) and for this study, we only reported results with EfficientNet (b1). We aim to further explore the other architectures. A small and low-latency model is desired for deployment in mobile and handheld embedded computer vision applications. The development of MobileNet Howard et al. (2017) sheds light in that direction. Our experimental results suggest that it is computationally simpler and provides a reasonable accuracy, only 2-3% lower than the best models for different tasks. These findings answer our second research question.

RQ3: We observe that strong data augmentation can improve performance, although this is not consistent across different tasks and models. Semi-supervised learning does not usually yield performance gains when trained using pretrained models and can sometimes even degrade performance.

RQ4: Multi-task learning can be an ideal solution for the real-time system as it can potentially provide speed-ups of multiple factors during inference. However, some tasks may perform worse than their single task settings in the presence of incomplete labels. Having aligned complete labels for different tasks can mitigate this issue.

Ref. Dataset # image # C Cls. Task Models Data Split Acc P R F1
Ofli et al. (2020) CrisisMMD 12,708 2 B Info VGG16 Train/dev/test 0.833 0.831 0.833 0.832
Ofli et al. (2020) CrisisMMD 8,079 5 M Hum VGG16 Train/dev/test 0.768 0.764 0.768 0.763
Mouzannar et al. (2018) DMD 5879 6 M Event Incep 4 folds CV 0.840 - - -
Agarwal et al. (2020) CrisisMMD 18,126 2 B Info Incep 5 folds CV - 0.820 0.820 0.820
Agarwal et al. (2020) CrisisMMD 18,126 2 B Infra. Incep 5 folds CV - 0.920 0.920 0.920
Agarwal et al. (2020) CrisisMMD 18,126 3 B Severity Incep 5 folds CV - 0.950 0.940 0.940
Abavisani et al. (2020) CrisisMMD 11,250 2 B Info DenseNet Train/dev/test 0.816 - - 0.812
Abavisani et al. (2020) CrisisMMD 3,359 5 B Hum DenseNet Train/dev/test 0.834 - - 0.870
Abavisani et al. (2020) CrisisMMD 3,288 3 B Severity DenseNet Train/dev/test 0.629 - - 0.661
Table 16: Recent relevant reported results in the literature. # C: Number of class labels, Cls: Classification task, B: Binary, M: Multiclass, Incep: InceptionNet (v4), Info: Informativeness, Hum: Humanitarian, Event: Disaster event types, Infra.: Infrastructural damage, Severity: Severity Assessment. We converted some numbers from percentage (reported in the different literature) to decimal for an easier comparison.

6.2 Comparison with Previous State-of-the-art

We compared our results with recent and related state-of-the-art results, reported in Table 16. However, an end-to-end comparison is not possible for several reasons: (i) different datasets and sizes (see the second and third columns of Table 16); (ii) different data splits (train/dev/test vs. cross-validation (CV) folds), even when the same dataset is used (see the Data Split column of the same table); and (iii) different evaluation measures, namely weighted P/R/F1 (first two rows) in Ofli et al. (2020) vs. accuracy (third row) in Mouzannar et al. (2018) vs. P/R/F1 over CV folds (fourth to sixth rows) in Agarwal et al. (2020), where it is unspecified whether the measures are macro, micro, or weighted.
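To see why the averaging scheme matters for such comparisons, the following minimal sketch computes macro, micro, and weighted F1 on the same predictions. The function name `f1_averages` and the toy labels are hypothetical and are not taken from any of the cited studies; the formulas are the standard single-label definitions.

```python
from collections import Counter

def f1_averages(y_true, y_pred):
    """Return macro, micro, and weighted F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    support = Counter(y_true)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = {c: f1(tp[c], fp[c], fn[c]) for c in labels}
    # Macro: unweighted mean over classes.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: mean over classes, weighted by class support.
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    # Micro: pool all counts first, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return {"macro": macro, "micro": micro, "weighted": weighted}

# Hypothetical imbalanced toy data: 8 "info" images and 2 "not" images.
y_true = ["info"] * 8 + ["not"] * 2
y_pred = ["info"] * 8 + ["info", "not"]
scores = f1_averages(y_true, y_pred)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On this toy input the three averages differ by several points even though the predictions are identical, so a reported F1 without its averaging scheme cannot be compared directly with a weighted F1.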

Although the results are not exactly comparable, we observe that on the informativeness and humanitarian tasks, previously reported results (weighted F1) are 0.832 and 0.763, respectively, using the CrisisMMD dataset Ofli et al. (2020). The authors in Mouzannar et al. (2018) reported a test accuracy of 0.840 (see Table 16) for the six disaster types task using the DMD dataset with a five-fold cross-validation run. The study in Agarwal et al. (2020) reports an F1 of 0.820 for informativeness, 0.920 for infrastructure damage, and 0.940 for damage severity. In another study, using the CrisisMMD dataset, the authors report a weighted F1 of 0.812 and 0.870 for the informativeness and humanitarian tasks, respectively Abavisani et al. (2020). They used a small subset of the whole CrisisMMD dataset in their study. From Table 16 we observe that the F1 for the informativeness task ranges from 0.812 to 0.832 across studies, for the humanitarian task from 0.763 to 0.870, and for damage severity from 0.661 to 0.940. In comparison, our best results (weighted F1) for disaster types, informativeness, humanitarian, and damage severity are 0.835, 0.876, 0.784, and 0.765, respectively, on the consolidated single-task dataset.

6.3 Future Works

For future work, we foresee several interesting research avenues. (i) Exploring semi-supervised learning in more depth to leverage the large amount of unlabelled social media data and to address the limitations that we highlighted in Section 5.4. We believe that addressing such limitations can help to improve the performance of the current models. (ii) In the multi-task setup, one possible research direction is to address the problem of incomplete/missing labels, and another is to manually label the Crisis Benchmark Dataset so that every image has labels for all tasks. Both approaches would give the community a basis for exploring multi-task learning for real-time social media image classification.

7 Conclusions

The imagery and textual content available on social media has been used by humanitarian organizations during disaster events. Compared to text, there has been limited work on image classification for disaster response. In this study, we addressed four tasks, namely disaster types, informativeness, humanitarian, and damage severity, that are needed for disaster response in real time. Our experimental results on individual and consolidated datasets suggest that data consolidation helps. We investigated the four tasks using different state-of-the-art neural network architectures and reported the best models. The findings on data augmentation suggest that a more generalized model can be obtained with such approaches. Our investigation of semi-supervised and multi-task learning points to new research directions for the community. We also provide some insights from activation maps to demonstrate what class-specific information a network is learning.


Not applicable.

Compliance with ethical standards

Conflict of interest

We have no conflicts of interest or competing interests to declare.

Availability of data and material

The data used in this study are available at


  • M. Abavisani, L. Wu, S. Hu, J. Tetreault, and A. Jaimes (2020) Multimodal categorization of crisis events in social media. In Proc. of CVPR, pp. 14679–14689. Cited by: §2.3, §6.2, Table 16.
  • M. Agarwal, M. Leekha, R. Sawhney, and R. R. Shah (2020) Crisis-dias: towards multimodal damage analysis - deployment, challenges and assessment. Proceedings of the AAAI Conference on Artificial Intelligence 34 (01), pp. 346–353. External Links: Link, Document Cited by: §2.3, §6.2, §6.2, Table 16.
  • K. Ahmad, M. Riegler, K. Pogorelov, N. Conci, P. Halvorsen, and F. De Natale (2017a) Jord: a system for collecting information and monitoring natural disasters by linking social media with satellite imagery. In Proceedings of the 15th International Workshop on Content-Based Multimedia Indexing, pp. 1–6. Cited by: §2.1.
  • S. Ahmad, K. Ahmad, N. Ahmad, and N. Conci (2017b) Convolutional neural networks for disaster images retrieval.. In MediaEval, Cited by: §2.1.
  • F. Alam, M. Imran, and F. Ofli (2017) Image4Act: online social media image processing for disaster response.. In Proc. of ASONAM, pp. 1–4. Cited by: §2.1, §2.1.
  • F. Alam, S. Joty, and M. Imran (2018a) Graph based semi-supervised learning with convolution neural networks to classify crisis related tweets. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 12. Cited by: §4.5.
  • F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi (2020) Deep Learning Benchmarks and Datasets for Social Media Image Classification for Disaster Response. arXiv e-prints, pp. arXiv:2011.08916. External Links: 2011.08916 Cited by: §1, §1, §1, §2.5, §2.5, §3.2.1, §3.2.4, §3.2, §3.4.
  • F. Alam, F. Ofli, and M. Imran (2018) CrisisMMD: multimodal twitter datasets from natural disasters. In Proc. of ICWSM, pp. 465–473 (English). Cited by: §2.5, §3.1.3, §3.2.2.
  • F. Alam, F. Ofli, and M. Imran (2018b) Processing social media images by combining human and machine computing during crises. International Journal of Human–Computer Interaction 34 (4), pp. 311–327. External Links: Document Cited by: §2.1, §2.2.
  • B. Benjamin, H. Patrick, Z. Zhengyu, B. J. de, and B. Damian (2018) The multimedia satellite task at MediaEval 2018: emergency response for flooding events. In MediaEval, Cited by: §2.5.
  • D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel (2019a) ReMixMatch: semi-supervised learning with distribution matching and augmentation anchoring. arXiv preprint arXiv:1911.09785. Cited by: §4.5.
  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel (2019b) MixMatch: a holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249. Cited by: §4.5.
  • B. Bischke, P. Helber, C. Schulze, V. Srinivasan, A. Dengel, and D. Borth (2017) The multimedia satellite task at MediaEval 2017. In Proceedings of the MediaEval 2017: MediaEval Benchmark Workshop, Cited by: §2.5.
  • T. Chen, D. Lu, M. Kan, and P. Cui (2013) Understanding and classifying image tweets. In ACM Multimedia, pp. 781–784. Cited by: §2.1.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §2.3.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703. Cited by: §1, §4.4, §4.5.
  • S. Daly and J. Thom (2016) Mining and classifying image posts on social media to analyse fires. In Proc. of ISCRAM, pp. 1–14. Cited by: §2.1.
  • D. Deng, Z. Chen, and B. E. Shi (2020) Multitask emotion recognition with incomplete labels. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 828–835. Cited by: §4.6.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.1.
  • Y. Feng and M. Sester (2018) Extraction of pluvial flood relevant volunteered geographic information (vgi) by deep learning from user generated texts and photos. ISPRS International Journal of Geo-Information 7 (2), pp. 39. Cited by: §2.3.
  • R. Gupta, B. Goodman, N. Patel, R. Hosfelt, S. Sajeev, E. Heim, J. Doshi, K. Lucas, H. Choset, and M. Gaston (2019) Creating xbd: a dataset for assessing building damage from satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §2.5.
  • S. Z. Hassan, K. Ahmad, A. Al-Fuqaha, and N. Conci (2019) Sentiment analysis from images of natural disasters. In International Conference on Image Analysis and Processing, pp. 104–113. Cited by: §2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. of CVPR, pp. 770–778. Cited by: §4.1, §4.2.
  • J. I.E. Hoffman (2019) Chapter 15 - categorical and cross-classified data: mcnemar’s and bowker’s tests, kolmogorov-smirnov tests, concordance. In Basic Biostatistics for Medical and Biomedical Practitioners (Second Edition), J. I.E. Hoffman (Ed.), pp. 233 – 247. External Links: ISBN 978-0-12-817084-7, Document, Link Cited by: §5.2.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861. Cited by: §4.1, §6.1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proc. of CVPR, pp. 4700–4708. Cited by: §4.1.
  • X. Huang, C. Wang, Z. Li, and H. Ning (2019) A visual–textual fused approach to automated tagging of flood-related tweets during a flood event. International Journal of Digital Earth 12 (11), pp. 1248–1264. Cited by: §2.3.
  • F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and 0.5 mb model size. arXiv:1602.07360. Cited by: §4.1.
  • M. Imran, F. Alam, U. Qazi, S. Peterson, and F. Ofli (2020) Rapid damage assessment using social media images by combining human and machine intelligence. arXiv preprint arXiv:2004.06675. Cited by: §2.2.
  • M. Imran, C. Castillo, F. Diaz, and S. Vieweg (2015) Processing social media messages in mass emergency: a survey. ACM Computing Surveys 47 (4), pp. 67. Cited by: §1.
  • M. Imran, C. Castillo, J. Lucas, P. Meier, and S. Vieweg (2014) AIDR: artificial intelligence for disaster response. In Proc. of WWW, pp. 159–162. Cited by: §3.2.1, §3.2.3.
  • R. I. Jony, A. Woodley, and D. Perrin (2019) Flood detection in social media images using visual features and metadata. 2019 Digital Image Computing: Techniques and Applications (DICTA), pp. 1–8. Cited by: §2.1.
  • A. Kapoor, R. Viswanathan, and P. Jain (2012) Multilabel classification using bayesian compressed sensing. Advances in neural information processing systems 25, pp. 2645–2653. Cited by: §4.6.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proc. of ICLR, Cited by: §4.1.
  • D. Kollias and S. Zafeiriou (2019) Expression, affect, action unit recognition: aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855. Cited by: §4.6.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.1.
  • P. Kumar, F. Ofli, M. Imran, and C. Castillo (2020) Detection of disaster-affected cultural heritage sites from social media images using deep learning techniques. J. Comput. Cult. Herit. 13 (3). External Links: ISSN 1556-4673, Link, Document Cited by: §2.2.
  • R. Lagerstrom, Y. Arzhaeva, P. Szul, O. Obst, R. Power, B. Robinson, and T. Bednarz (2016) Image classification to support emergency situation awareness. Frontiers in Robotics and AI 3, pp. 54. External Links: Link, Document, ISSN 2296-9144 Cited by: §2.1.
  • S. Laine and T. Aila (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242. Cited by: §4.5.
  • D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Vol. 3. Cited by: §1, §4.5.
  • X. Li, D. Caragea, C. Caragea, M. Imran, and F. Ofli (2019) Identifying disaster damage images using a domain adaptation approach. In Proc. of ISCRAM, pp. 633–645. Cited by: §2.1.
  • X. Li, D. Caragea, H. Zhang, and M. Imran (2018) Localizing and quantifying damage in social media images. In Proc. of ASONAM, pp. 194–201. Cited by: §2.1.
  • G. J. McLachlan (1975) Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70 (350), pp. 365–369. Cited by: §4.5.
  • J. Moody, S. Hanson, A. Krogh, and J. A. Hertz (1995) A simple weight decay can improve generalization. Advances in neural information processing systems 4 (1995), pp. 950–957. Cited by: §4.4.
  • H. Mouzannar, Y. Rizk, and M. Awad (2018) Damage Identification in Social Media Posts using Multimodal Deep Learning. In Proc. of ISCRAM, pp. 529–543. Cited by: §1, §2.3, §2.4, §2.5, §3.2.4, §4.3, §6.2, §6.2, Table 16.
  • D. T. Nguyen, F. Alam, F. Ofli, and M. Imran (2017) Automatic image filtering on social networks using deep learning and perceptual hashing during crises. In Proc. of ISCRAM, Cited by: §2.1, §2.1.
  • D. T. Nguyen, F. Ofli, M. Imran, and P. Mitra (2017) Damage assessment from social media imagery data during disasters. In Proc. of ASONAM, pp. 1–8. Cited by: §1, §2.1, §2.1, §2.5, §3.1.4, §3.2.1, §5.1.
  • K. R. Nia and G. Mori (2017) Building damage assessment using deep learning and ground-level image data. In 14th Conference on Computer and Robot Vision (CRV), pp. 95–102. Cited by: §2.1.
  • H. Ning, Z. Li, M. E. Hodgson, et al. (2020) Prototyping a social media flooding photo screening system based on deep learning. ISPRS International Journal of Geo-Information 9 (2), pp. 104. Cited by: §2.1.
  • F. Ofli, F. Alam, and M. Imran (2020) Analysis of social media data using multimodal deep learning for disaster response. In Proc. of ISCRAM, Cited by: §2.3, §2.4, §4.3, §6.2, §6.2, Table 16.
  • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg (2014) CrisisLex: a lexicon for collecting and filtering microblogged communications in crises.. In Proc. of ICWSM, Cited by: §3.2.4.
  • M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2014) Learning and transferring mid-level image representations using convolutional neural networks. In Proc. of CVPR, pp. 1717–1724. Cited by: §2.4, §4.1.
  • G. Ozbulak, Y. Aytar, and H. K. Ekenel (2016) How transferable are cnn-based features for age and gender classification?. In International Conference of the Biometrics Special Interest Group, pp. 1–6. External Links: Document Cited by: §2.4, §4.1.
  • R. Peters and J. Porto de Albuquerque (2015) Investigating images as indicators for relevant social media messages in disaster management. In Proc. of ISCRAM, Cited by: §2.1.
  • S. Pouyanfar, Y. Tao, S. Sadiq, H. Tian, Y. Tu, T. Wang, S. Chen, and M. Shyu (2019) Unconstrained flood event detection using adversarial data augmentation. In IEEE International Conference on Image Processing (ICIP), pp. 155–159. Cited by: §2.1.
  • Y. Rizk, H. S. Jomaa, M. Awad, and C. Castillo (2019) A computationally efficient multi-modal classification approach of disaster-related twitter images. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC ’19, New York, NY, USA, pp. 2050–2059. External Links: ISBN 9781450359337, Link, Document Cited by: §2.3.
  • N. Said, K. Ahmad, M. Riegler, K. Pogorelov, L. Hassan, N. Ahmad, and N. Conci (2019) Natural disasters detection in social media and satellite imagery: a survey. Multimedia Tools and Applications 78 (22), pp. 31267–31302. Cited by: §1.
  • M. Sajjadi, M. Javanmardi, and T. Tasdizen (2016) Regularization with stochastic transformations and perturbations for deep semi-supervised learning. arXiv preprint arXiv:1606.04586. Cited by: §4.5.
  • H. Scudder (1965) Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory 11 (3), pp. 363–371. Cited by: §4.5.
  • R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: 5th item, §5.6.
  • I. M. Shaluf (2007) Disaster types. Disaster Prevention and Management: An International Journal. Cited by: §3.1.1.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proc. of CVPR Workshops, pp. 806–813. Cited by: §2.4, §4.1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.
  • K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020) FixMatch: simplifying semi-supervised learning with consistency and confidence. In Proceedings of the Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020), Cited by: §4.5.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting.. Journal of MLR 15 (1), pp. 1929–1958. Cited by: §4.5.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proc. of CVPR, pp. 2818–2826. Cited by: §4.1.
  • M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. arXiv:1905.11946. Cited by: §4.1.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780. Cited by: §4.5.
  • V. Verma, A. Lamb, J. Kannala, Y. Bengio, and D. Lopez-Paz (2019) Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825. Cited by: §4.5.
  • E. Weber, N. Marzo, D. P. Papadopoulos, A. Biswas, A. Lapedriza, F. Ofli, M. Imran, and A. Torralba (2020) Detecting natural disasters, damage, and incidents in the wild. In European Conference on Computer Vision, pp. 331–350. Cited by: §2.5.
  • J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. External Links: Document Cited by: §2.3.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019) Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848. Cited by: §4.5.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020) Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698. Cited by: §4.5, §4.5, §4.5, item 2.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in Neural Information Processing Systems, pp. 3320–3328. Cited by: §2.4, §4.1.
  • H. Zhou, A. Oliver, J. Wu, and Y. Zheng (2018) When semi-supervised learning meets transfer learning: training strategies, models and datasets. arXiv preprint arXiv:1812.05313. Cited by: item 1.