In the current age of the internet, online social media sites have become accessible to everyone, and people tend to post their personal experiences, current events, and local and global news. As a result, daily social media usage keeps growing, producing a large volume of data that has become an important source for many types of research analysis. Moreover, social media data are generated in real time and are easy to monitor. Therefore, several research works have used social media data to perform different types of real-time predictions, such as stock movement prediction (Nguyen et al., 2015), relation extraction (Ritter et al., 2015), and natural disaster prediction (Yoo et al., 2018; Mai et al., 2020).
Twitter is one such social media site that can be accessed through people's laptops and smartphones. The rapid growth of smartphone and laptop usage enables people to share an emergency they observe in real time. For this reason, many disaster relief organizations and news agencies are interested in monitoring Twitter data programmatically. However, unlike long articles, tweets are short texts, and they pose additional challenges due to their shortness, sparsity (i.e., diverse word content) (Chen et al., 2011), velocity (the rapid growth of short texts such as SMS messages and tweets), and misspellings (Alsmadi and Gan, 2019). For these reasons, it is very challenging to determine whether a person's words are announcing a disaster or not. For example, a tweet like "!!" tells us about a person's experience at a concert, and we can tell that he enjoyed it because of the word "". Even though the tweet contains the word "", it does not signal any danger or emergency; rather, the word describes the colorful decoration of the stage. Now consider another tweet, "". Here, the word "" means disaster, and the tweet describes an emergency. The two examples show that one word can have multiple meanings depending on its context. Therefore, understanding the context of words is important for analyzing a tweet's sentiment.
Neural network-based methods such as Skip-gram (Mikolov et al., 2013) and FastText (Bojanowski et al., 2016) are popular for learning word embeddings from large corpora and have been used to solve different types of NLP tasks. These methods have also been used for sentiment analysis of Twitter data (Deho et al., 2018; Poornima and Priya, 2020). However, these embedding learning methods provide a static embedding for each word in a document. Hence, the meaning of the word "" would remain the same in the two examples above under these methods.
To handle this problem, the authors of (Devlin et al., 2018) proposed a contextual embedding learning model, Bidirectional Encoder Representations from Transformers (BERT), that provides embeddings of a word based on its context words. In different types of NLP tasks, such as text classification (Sun et al., 2019) and entity recognition (Hakala and Pyysalo, 2019), the BERT model outperformed traditional embedding learning models. However, it remains interesting to discover how contextual embeddings could help in understanding disaster-related texts. For this reason, we analyze the disaster prediction task on Twitter data using both context-free and contextual embeddings in this study. We use traditional machine learning methods and neural network models for the prediction task, with the word embeddings serving as input to the models. We show that contextual embeddings work better than the other word embeddings for predicting disaster-related tweets. Finally, we provide an extensive discussion to analyze the results.
The main contributions of this paper are summarized as follows.
We analyze a real-life natural language online social network dataset, Twitter data, to identify challenges in human sentiment analysis for disaster-type tweet prediction.
We apply both contextual and context-free embeddings in tweet representations for disaster prediction through machine learning methods and show that contextual embeddings (BERT) can improve the accuracy of disaster prediction compared with context-free embeddings.
We provide a detailed explanation of our method and results and publicly share our code, enabling researchers to run our experiments and reproduce our results for future research directions (https://github.com/ashischanda/sentiment-analysis).
The rest of the paper is organized as follows. In section 2, related works are introduced. The main methodology of this paper is elaborated in section 3. The dataset and the experiments are presented in sections 4 and 5, respectively. Finally, the conclusion is drawn in section 6.
2. Related Works
Many research works analyzed Twitter data to understand emergency situations and predict disasters (Karami et al., 2020; Zou et al., 2018; Ashktorab et al., 2014; Olteanu et al., 2014). One group of researchers used text mining and statistical approaches to understand crises (Karami et al., 2020; Zou et al., 2018), while another group focused on clustering text data to identify groups of tweets related to disasters (Ashktorab et al., 2014; Olteanu et al., 2014). Later, different traditional machine learning models were used to analyze Twitter data and predict disaster or emergency situations, with the words of a tweet represented as embeddings (Palshikar et al., 2018; Algur and Venugopal, 2021; Singh et al., 2019). For example, Palshikar et al. (Palshikar et al., 2018) proposed a weakly supervised model where words are represented with a bag-of-words (BOW) model. Moreover, frequency-based word representation was used in (Algur and Venugopal, 2021), and Singh et al. (Singh et al., 2019) used a Markov-based model to predict the location of tweets during a disaster. In a recent work (Pota et al., 2021), the authors proposed a pre-processing method for BERT-based sentiment analysis of tweets. However, it remains interesting to explore model performance with different word embeddings to observe how context words help to predict whether a tweet describes a disaster.
3. Methodology

In this section, we discuss our approach of leveraging word embeddings for disaster prediction from Twitter data using machine learning methods. We consider three types of word embeddings: 1) bag of words (BOW), 2) context-free embeddings, and 3) contextual embeddings. The word embeddings are used as input to both traditional machine learning methods and deep learning models for disaster prediction.
3.1. BOW embeddings
The bag-of-words (BOW) model is a common approach for representing the text of a document. If there are N words in the text vocabulary, then the BOW representation is a binary vector of length N, where each index of the vector corresponds to one word of the vocabulary. If a word exists in a document, then the corresponding index becomes one; otherwise, it remains zero. We use BOW embeddings of the Twitter data in three traditional machine learning methods, namely decision tree, random forest, and logistic regression, to predict the sentiment of a tweet.
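The BOW construction described above can be sketched in a few lines; the vocabulary and tweet below are illustrative toy values, not taken from the dataset.

```python
# Build a binary bag-of-words vector for a tweet (toy example).
vocabulary = ["fire", "storm", "love", "good", "earthquake"]

def bow_vector(tokens, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the tweet, else 0."""
    token_set = set(tokens)
    return [1 if word in token_set else 0 for word in vocabulary]

tweet = ["forest", "fire", "near", "la", "ronge"]
print(bow_vector(tweet, vocabulary))  # -> [1, 0, 0, 0, 0]
```

Note that the vector records only word presence, which is why the ordering information discussed below is lost.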
Even though BOW is good for representing the words of a document, it loses contextual information because the order of words is not recorded in the binary structure. However, contextual information is required to understand and analyze the sentiment of a text. For this reason, we also use context-based embeddings for this sentiment analysis task.
3.2. Context-free embeddings
Many existing research works proposed learning word embeddings based on the co-occurrences of word pairs in documents. GloVe (Pennington et al., 2014) is one common method for learning word embeddings from word co-occurrences in documents. More recently, neural network-based models such as Skip-gram (Mikolov et al., 2013) and FastText (Bojanowski et al., 2016) became popular for learning word representations from documents and have been used for sentiment analysis.
In our research study, we use the pre-trained embeddings of three context-free embedding models (GloVe, Skip-gram, FastText) in a neural network-based model to analyze the sentiment of tweet data and predict disaster-related tweets. To represent a tweet with context-free embeddings, we take the average of the word embeddings of the tweet, following the same strategy as (Kenter et al., 2016). For the resulting vector of a tweet, we use a softmax function to predict the sentiment of the tweet. Suppose the vector of a tweet is x, we have a set of labels Y = {"positive", "negative"}, and W is the weight matrix of the softmax function with one row w_y per label. Then, the probability of the tweet being positive (i.e., a disaster) is calculated as follows:

P(y | x) = exp(w_y^T x) / Σ_{y' ∈ Y} exp(w_{y'}^T x)
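As a toy sketch of this softmax computation (the tweet vector, weight values, and dimensions below are illustrative, not learned parameters from the paper):

```python
import math

def softmax_probs(x, W):
    """Return P(y|x) for each label y, given one weight row W[y] per label."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in W]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy 3-dimensional averaged tweet vector and weights for {"positive", "negative"}.
x = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.5],    # weight row for "positive" (disaster)
     [0.0, 1.0, -0.5]]   # weight row for "negative"
p_positive, p_negative = softmax_probs(x, W)
```

The two probabilities sum to one, and the label with the larger probability is taken as the prediction.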
Recently, deep neural networks have also been used for sentiment analysis. To observe how context-free embeddings work with deep neural networks, we use a bidirectional recurrent neural network with LSTM gates (Hochreiter and Schmidhuber, 1997). The Bi-LSTM model processes the input words of a tweet from left to right and in reverse. The Bi-LSTM block is followed by a fully connected layer with a sigmoid activation function to produce the output.
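The bidirectional idea can be illustrated with a minimal sketch: scan the token vectors left-to-right and right-to-left and concatenate the two final states. The recurrent step here is a toy stand-in for an LSTM cell (no gates), used only to show the data flow.

```python
# Toy bidirectional encoder: the recurrent update below is NOT a real LSTM cell,
# just a simple averaging step standing in for one.

def rnn_step(state, x):
    """Toy recurrent update mixing the previous state with the current token vector."""
    return [0.5 * s + 0.5 * v for s, v in zip(state, x)]

def bidirectional_encode(tokens):
    dim = len(tokens[0])
    fwd = [0.0] * dim
    for x in tokens:             # left-to-right pass
        fwd = rnn_step(fwd, x)
    bwd = [0.0] * dim
    for x in reversed(tokens):   # right-to-left pass
        bwd = rnn_step(bwd, x)
    return fwd + bwd             # concatenated final states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy word vectors of one tweet
encoded = bidirectional_encode(tokens)          # length 4: forward + backward state
```

In the real model, this concatenated representation is what the fully connected layer with sigmoid activation consumes.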
3.3. Contextual embeddings
Unlike the other word embeddings, BERT embeddings (Devlin et al., 2018) provide different vectors for the same word in different contexts. Recent advances in NLP have shown that the BERT model outperforms traditional embeddings in different NLP tasks, such as entity extraction and next sentence prediction. In our study, we investigate how much better contextual embeddings work than traditional embeddings in sentiment analysis. For this purpose, we use the pre-trained embeddings of BERT models in the same neural network models to predict disaster-related tweets.
Table 1. Examples of original tweets and the corresponding pre-processed tweets.
|Tweet (original)||Tweet (after preprocessing)|
|RockyFire Update = California Hwy. 20 closed in both directions due to Lake County fire - CAfire wildfires||rockyfire update california hwy 20 closed directions due lake county fire cafire wildfires|
|TheAtlantic That or they might be killed in an airplane accident in the night a car wreck||theatlantic might killed airplane accident night car wreck|
Table 2. Statistics of the training data after pre-processing.
|Total train data||7,613|
|Total positive data (or disaster tweets)||3,271|
|Total unique words||21,940|
|Total unique words with frequency > 1||6,816|
|Avg. length of tweets||12.5|
|Median length of tweets||13|
|Maximum length of tweets||29|
|Minimum length of tweets||1|
4. Dataset

For this study, we used a Twitter dataset from a recent Kaggle competition (Natural Language Processing with Disaster Tweets, https://www.kaggle.com/c/nlp-getting-started). Kaggle is a well-known platform for machine learning researchers where many research agencies share their data to pose different types of research problems. For example, many researchers have used data from Kaggle competitions to analyze real-life problems and propose models to solve them, in areas such as sentiment analysis, feature detection, and diagnosis prediction (Koumpouri et al., 2015; Tolkachev et al., 2020; Iglovikov et al., 2017; Yang et al., 2018; Yang and Ding, 2020).
In the selected Kaggle competition, a dataset of 10,876 tweets is given, and the task is to predict with a machine learning model which tweets are about real disasters and which are not. The dataset has two separate files, train (7,613 tweets) and test (3,263 tweets) data, where each row of the train data contains an id, the natural language text of the tweet, and a label. The labels are manually annotated by humans: a tweet is labeled as positive (one) if it is about a real disaster, and otherwise as negative (zero). The test data, on the other hand, has an id and the natural language text but no label. The competition site stores the labels of the test data privately, uses them to calculate test scores from users' model predictions, and maintains a leaderboard based on the test scores. This dataset was created by the Figure Eight company and was originally shared on their website (https://appen.com/open-source-datasets/).
We used the training data to train different machine learning models and predicted the test data labels using the trained models. We report both the train and test data scores in our experiments. Note that our purpose is not to achieve a high score in the competition but rather to use Twitter data to study our research goals.
4.1. Data pre-processing
Twitter data is natural language text, and it contains different types of typos, punctuation, abbreviations, and numbers. Therefore, before training machine learning models on the text, a pre-processing step is required to remove stop words and tokenize the text. We removed all stop words and punctuation from the training data and converted all words to lower-case letters. Table 1 shows some pre-processed tweets alongside the original tweets.
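The pre-processing steps above can be sketched as follows; the stop-word list here is a small illustrative subset, not the full list used in the experiments.

```python
import string

# Illustrative stop-word subset (the actual experiments use a full stop-word list).
STOPWORDS = {"in", "both", "due", "to", "the", "that", "or", "a", "an"}

def preprocess(tweet):
    """Lower-case, strip punctuation, and drop stop words from a tweet."""
    tweet = tweet.lower()
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in tweet.split() if tok not in STOPWORDS]

original = "RockyFire Update = California Hwy. 20 closed in both directions"
print(preprocess(original))
# -> ['rockyfire', 'update', 'california', 'hwy', '20', 'closed', 'directions']
```

Applied to the first example of Table 1, this reproduces the pre-processed form shown there.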
4.2. Data analysis
Before running any machine learning methods on our data, we analyzed the dataset to obtain some insights. Table 2 shows some statistics of the training data after pre-processing the text. From the table, we find that 43% of the tweets are annotated as real disasters and 57% are not. There are a total of 21,940 unique words, of which only 6,816 occur more than once. The average length of tweets is 12.5 words. However, it is important to check the lengths of positive and negative tweets separately to verify whether they share common characteristics. Figure 1 shows the word-length distribution for both positive and negative tweets. The figure shows many negative tweets with small word lengths (below 10), but most positive and negative tweets have a word length of 10 to 20.
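The class-balance figures follow directly from the counts in Table 2 (3,271 disaster tweets out of 7,613 training tweets):

```python
# Sanity check of the class balance from the Table 2 counts.
total, positive = 7613, 3271
pos_share = positive / total
neg_share = (total - positive) / total
print(f"positive: {pos_share:.0%}, negative: {neg_share:.0%}")  # positive: 43%, negative: 57%
```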
We also analyzed the word frequencies of positive and negative tweets. Figure 2 shows the most frequent words in word clouds, where a larger font indicates a higher frequency. We can find some common words in both types of tweets (i.e., https, t, co, people). However, Figure 2(a) highlights many disaster-related words like storm, fire, bomber, death, and earthquake. On the other hand, Figure 2(b) highlights everyday words such as think, good, love, now, and time. From this figure, it is clear that the most frequent words differ between the two types of tweets, and understanding the meaning of the words is important to classify them.
5. Experiments

In our experimental study, we conduct several experiments on the real Twitter data to predict disaster-related tweets. We first describe the experimental settings and model training procedures in this section. Then, we analyze the experimental results in detail.
5.1. Experimental settings
5.1.1. Traditional ML models with BOW embeddings
From Table 2, we find that the training data has 21,940 unique words, of which 6,816 words have a frequency of more than 1. To avoid infrequent words, we considered only a vocabulary of these 6,816 words in our BOW representations. To represent a tweet with BOW embeddings, we used a binary array of length 6,816 that contains 1 if a word of the tweet is present in the vocabulary and 0 otherwise. We used the BOW embeddings to predict the sentiment of a tweet using three traditional machine learning models: 1) decision tree, 2) random forest, and 3) logistic regression. We used the Python scikit-learn package (https://scikit-learn.org/stable/) with all default parameters to train the models on our train dataset. After training the models, we predicted labels for the test data and submitted them to Kaggle to obtain the test scores.
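The vocabulary-filtering step (keeping only words that occur more than once) can be sketched with a word counter; the toy corpus below is illustrative, not the actual training tweets.

```python
from collections import Counter

# Toy corpus of tokenized tweets (illustrative only).
tweets = [["forest", "fire", "near", "town"],
          ["fire", "storm", "hits", "town"],
          ["lovely", "day"]]

# Count word frequencies across all tweets and keep words with frequency > 1.
counts = Counter(word for tweet in tweets for word in tweet)
vocabulary = sorted(w for w, c in counts.items() if c > 1)
print(vocabulary)  # -> ['fire', 'town']
```

The retained words then define the indices of the binary BOW array described above.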
5.1.2. Deep learning models with context-free embeddings
For this experiment, we chose three context-free methods: 1) Skip-gram (Mikolov et al., 2013), 2) FastText (Bojanowski et al., 2016), and 3) GloVe (Pennington et al., 2014). We used publicly available pre-trained embeddings of the Skip-gram and GloVe models that are trained on Wikipedia data (https://nlp.stanford.edu/projects/glove/). The pre-trained embeddings of FastText are collected from Mikolov et al. 2018 (Mikolov et al., 2018). All pre-trained embeddings have a dimension of 300.
5.1.3. Deep learning models with contextual embeddings
To obtain contextual embeddings, we downloaded the publicly available pre-trained BERT model (bert-base-uncased) (Devlin et al., 2018) from the official site of the authors (https://github.com/google-research/bert). We fed tweets as inputs to the BERT model and took the hidden state of the [CLS] token from the last layer of the model as the embedding of the given tweet. Then, the embedding is used in our sigmoid model to predict the sentiment of the tweet. The same setting was used in a previous paper (Ji et al., 2021) to predict patient diagnoses from medical note words using a pre-trained BERT model.
Moreover, we can obtain embeddings of each word of a tweet from the pre-trained BERT model. BERT's pre-trained word embeddings are used as input to our Bi-LSTM model. The authors of (Lu et al., 2020) used a similar setting for sentiment analysis of text data.
5.2. Evaluation metric
Three metrics are used in our experiments to evaluate the performance of the machine learning models on the disaster prediction task: 1) accuracy, 2) F1 score, and 3) Area Under the Curve (AUC). In our experiments, we considered disaster tweets as the positive class and the others as the negative class. Hence, True Positives (TP) are the actual disaster tweets that are predicted as disasters, while False Positives (FP) are the tweets that are actually negative but predicted as positive. True Negatives (TN) and False Negatives (FN) are defined analogously. Accuracy is the fraction of correctly predicted tweets among all tweets and is calculated as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
The F1 score is another popular metric for testing the predictive performance of a model. The F1 score is the harmonic mean of recall and precision, where recall is the number of true labels predicted by the model divided by the total number of existing true labels, and precision is the number of true labels predicted by the model divided by the total number of labels predicted by the model. The F1 score is calculated as follows:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
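The accuracy, precision, recall, and F1 definitions above can be computed directly from the confusion-matrix counts; the counts below are illustrative, not results from our experiments.

```python
# Illustrative confusion-matrix counts (not actual experimental results).
tp, fp, tn, fn = 50, 10, 80, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # predicted positives that are correct
recall = tp / (tp + fn)                      # actual positives that are recovered
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
```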
On the other hand, AUC tells us how well a model can distinguish between classes. A higher AUC score means the model is better at predicting negative classes as zero and positive classes as one.
5.3. Experimental results
5.3.1. Quantitative results
Table 3 provides the results of all machine learning models on the disaster prediction task for all three types of embeddings. The table shows results for both the training and test data. Since the test data results are collected from the Kaggle competition, we can only report the accuracy score for the test data.
The table shows that the logistic regression model achieves the best results with BOW embeddings among the three traditional machine learning models. However, the results of the neural network models with context-free embeddings are better than those of the traditional machine learning models. Among the three context-free embeddings (Skip-gram, FastText, GloVe), GloVe with the Bi-LSTM model has the best train and test scores on all three evaluation metrics. Note that the results also show that a deep learning model like Bi-LSTM performs better than a shallow neural network model such as the softmax model.
Moreover, when we used the same shallow neural network and deep learning models with contextual embeddings (BERT), we found improvements of about 2% in AUC and accuracy over the context-free embeddings. This means that contextual embeddings are helpful and achieve the best performance on the disaster prediction task.
Table 3. Disaster prediction results of all models on the train and test data.
|Model||Train data||Test data|
5.3.2. Qualitative results
Table 3 shows quantitative results for the prediction of disaster tweets, where the neural model with contextual embeddings outperformed the other models. However, it is difficult to understand from these results when the contextual embeddings successfully predict a disaster tweet while the context-free models fail. For this purpose, we examine the predictions of the Bi-LSTM model for both the context-free (GloVe) and contextual (BERT) embeddings. Table 4 shows the model predictions along with the true labels for some sample tweets. From the table, we can see that the GloVe-based predictions for the first two tweets are positive, perhaps because of the word "accident" in the tweets, but the true labels for both are negative. Reading the tweets, we can understand that they are not related to a disaster or crisis. On the other hand, since the BERT model generates word embeddings based on the context words, it successfully predicts these tweets as negative.
The GloVe-based predictions for the third and fourth tweets are negative, while the true labels are positive. Note that no disaster-related words appear in these two tweets, yet they describe serious situations. The BERT-based predictions are correct for the third and fourth tweets. The predictions of both GloVe and BERT embeddings for the fifth and sixth tweets of Table 4 are correct. Since these tweets contain disaster-related words (i.e., suicide, bomber, bombing), both models labeled them successfully.
Analyzing the results of Table 4, we can infer that context-free embeddings are helpful for predicting a tweet as a disaster if disaster-related words (i.e., accident, bomb) appear in the tweet. In contrast, contextual embeddings help to capture the context of a tweet, which is challenging but important for the sentiment analysis task. Although every tweet is a short text, contextual embeddings work efficiently to understand its sentiment.
Table 4. Sample tweets with the Bi-LSTM model's predictions using GloVe and BERT embeddings, along with the true labels.
|No.||Tweet||GloVe||BERT||True label|
|1||I swear someone needs to take it away from me, cuase I'm just accident prone.||Yes||No||No|
|2||Dave if I say that I met her by accident this week- would you be super jelly Dave? :p||Yes||No||No|
|3||Schoolgirl attacked in Seaton Delaval park by 'pack of animals'||No||Yes||Yes|
|4||Not sure how these fire-workers rush into burning buildings but I'm grateful they do. TrueHeroes||No||Yes||Yes|
|5||A suicide bomber has blown himself up at a mosque in the south||Yes||Yes||Yes|
|6||Bombing of Hiroshima 1945||Yes||Yes||Yes|
6. Conclusion

In this paper, we presented an extensive analysis of predicting disasters from Twitter data using different types of word embeddings. Our experimental results show that contextual embeddings achieve the best results for predicting disasters from tweets. We also showed that deep neural network models outperform traditional machine learning methods on the disaster prediction task. Advanced deep neural network models, such as multi-layer convolutional models, could also be applied to this prediction task to achieve higher accuracy.
References

- Classification of disaster specific tweets - a hybrid approach. In 2021 8th International Conference on Computing for Sustainable Global Development (INDIACom), pp. 774–777. Cited by: §2.
- Review of short-text classification. International Journal of Web Information Systems. Cited by: §1.
- Tweedr: mining twitter to inform disaster response.. In ISCRAM, pp. 269–272. Cited by: §2.
- Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: §1, §3.2, §5.1.2.
- Short text classification improved by learning multi-granularity topics. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §1.
- Sentiment analysis with word embedding. In 2018 IEEE 7th International Conference on Adaptive Science & Technology (ICAST), pp. 1–4. Cited by: §1.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3.3, §5.1.3.
- Biomedical named entity recognition with multilingual BERT. In Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, pp. 56–61. Cited by: §1.
- Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
- Satellite imagery feature detection using deep convolutional neural network: a Kaggle competition. arXiv preprint arXiv:1706.06169. Cited by: §4.
- Does the magic of bert apply to medical code assignment? a quantitative study. arXiv preprint arXiv:2103.06511. Cited by: §5.1.3.
- Twitter speaks: a case of national disaster situational awareness. Journal of Information Science 46 (3), pp. 313–324. Cited by: §2.
- Siamese cbow: optimizing word embeddings for sentence representations. arXiv preprint arXiv:1606.04640. Cited by: §3.2.
- Evaluation of four approaches for "sentiment analysis on movie reviews": the Kaggle competition. In Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS), pp. 1–5. Cited by: §4.
- Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345. Cited by: §1.
- VGCN-bert: augmenting bert with graph embedding for text classification. Advances in Information Retrieval 12035, pp. 369. Cited by: §5.1.3.
- Big data analytics of twitter data and its application for physician assistants: who is talking about your profession in twitter?. In Data Management and Analysis, pp. 17–32. Cited by: §1.
- Efficient estimation of word representations in vector space. CoRR abs/1301.3781. Cited by: §1, §3.2, §5.1.2.
- Advances in pre-training distributed word representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §5.1.2.
- Sentiment analysis on social media for stock movement prediction. Expert Systems with Applications 42 (24), pp. 9603–9611. Cited by: §1.
- CrisisLex: a lexicon for collecting and filtering microblogged communications in crises. In Eighth International AAAI Conference on Weblogs and Social Media. Cited by: §2.
- Weakly supervised and online learning of word models for classification to detect disaster reporting tweets. Information Systems Frontiers 20 (5), pp. 949–959. Cited by: §2.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1, §3.2, §5.1.2.
- A comparative sentiment analysis of sentence embedding using machine learning techniques. In 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS), pp. 493–496. Cited by: §1.
- Multilingual evaluation of pre-processing for bert-based sentiment analysis of tweets. Expert Systems with Applications 181, pp. 115119. Cited by: §2.
- Weakly supervised extraction of computer security events from twitter. In Proceedings of the 24th international conference on world wide web, pp. 896–905. Cited by: §1.
- Event classification and location prediction from tweets during disasters. Annals of Operations Research 283 (1), pp. 737–757. Cited by: §2.
- How to fine-tune bert for text classification?. In China National Conference on Chinese Computational Linguistics, pp. 194–206. Cited by: §1.
- Deep learning for diagnosis and segmentation of pneumothorax: the results on the kaggle competition and validation against radiologists. IEEE Journal of Biomedical and Health Informatics. Cited by: §4.
- A computational framework for iceberg and ship discrimination: case study on kaggle competition. IEEE Access 8, pp. 82320–82327. Cited by: §4.
- Deep learning for practical image recognition: case study on kaggle competitions. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 923–931. Cited by: §4.
- Social media contents based sentiment analysis and prediction system. Expert Systems with Applications 105, pp. 102–111. Cited by: §1.
- Mining twitter data for improved understanding of disaster resilience. Annals of the American Association of Geographers 108 (5), pp. 1422–1441. Cited by: §2.