Analysing the Extent of Misinformation in Cancer Related Tweets

03/30/2020 ∙ by Rakesh Bal, et al. ∙ University of North Carolina at Chapel Hill, IIT Kharagpur, Carnegie Mellon University

Twitter has become one of the most sought-after places to discuss a wide variety of topics, including medically relevant issues such as cancer. This helps spread awareness regarding the various causes, cures and prevention methods of cancer. However, no proper analysis of the validity of such claims has been performed. In this work, we aim to tackle the spread of misinformation on such platforms. We collect and present a dataset of tweets which talk specifically about cancer and propose an attention-based deep learning model for automated detection of misinformation along with its spread. We then do a comparative analysis of the linguistic variation in the text corresponding to misinformation and truth. This analysis helps us gather relevant insights on various social aspects related to misinformed tweets.





Social networks play a vital role in information dissemination among a large mass of people. These platforms are easily accessible and can be used to express views and information comfortably. Various microblogging sites like Twitter allow people to share their opinions on topics ranging from personal issues to major political events such as elections. Healthcare is one such popular domain where people freely share their thoughts and experiences. However, these platforms have no check to ensure the veracity of the content posted, and thus the detection and control of misinformation is a task of utmost importance. To this end, we present a new dataset consisting of tweets about various aspects of cancer treatment and causes. We also formulate deep learning models to detect such tweets automatically. Our contributions in this paper can be summarized as follows:

  • We propose a new dataset to identify prevalent causes and cures related to cancer in online social media (tweets).

  • We develop algorithms for identifying “medically relevant” tweets and propose an attention-based sequence labelling model for extracting objects said to cause and cure cancer.

  • We apply our proposed algorithms on our entire Twitter corpus to identify terms frequently associated with cancer. We also analyse the linguistic differences in the text of informed and misinformed tweets. Our analysis provides an insight into the social aspects related to the spread of misinformation related to cancer.

Related Work

Misinformation spread in social media is a common nuisance in modern times, and it has affected our beliefs and social constructs ranging from politics to healthcare. [1] discusses the role of misinformation diffusion in politics and how it has affected democratic institutions. The authors use two datasets of social media posts from Facebook and Twitter and discuss the trend in the diffusion of fake content on these platforms. On a similar note, [20] also discuss the role of misinformation in the political scenario and how old rumours resurface as “news”. [2] discuss how people are misinformed about various news or political scenarios by only reading bits of social media posts. [3] show how false news about global health topics spreads in social media. The authors study ways to limit the spread of misinformation using various correction measures and conduct a human study to understand how effective a correction algorithm is. [8] build a classifier to identify users who are prone to spreading false information on Twitter about treatment techniques for cancer, using several user-based attributes.

There have been several studies on analysing misinformation on YouTube videos across different medical domains like prostate cancer [16], orthodontics [13], and idiopathic pulmonary fibrosis [9]. The studies have highlighted that despite having colossal user-engagement, popular YouTube videos on medical issues are plagued with biases, and inaccuracies and generate unreliable and poor-quality information.

Likewise, there has been seminal work on understanding the multi-faceted nature of health misinformation, such as temporal trends in anti-vaccine discourse [11], identifying people susceptible to spreading misinformation [8], the trajectories of information spread about Ebola on Twitter [15], and the bursty nature of rumour-related topics [7]. In this work, we look at the problem from an information extraction perspective, wherein we attempt to identify medically relevant tweets pertaining to cancer from the Twitter stream and automate the extraction of causes and treatments related to cancer.

Dataset Description

We used the Twitter Streaming API to collect tweets having certain keywords in conjunction with cancer. These keywords were manually identified from the work of [5], which recognises common words associated with cancer rumours on online social media boards. The initial set of keywords included “cure”, “family”, “attack”, “cause”, “radiation” etc., to conform with the different rumour categories mentioned in [5]. The tweets were collected for 29 days, from 11th June 2018 to 9th July 2018.

After removing retweets and duplicates, we finally obtained 20,137 unique tweets. Out of these, some mentioned objects which cause cancer, some talked about entities which cure cancer, whereas the rest involved discussions on things which help prevent cancer. For this particular study, we decided to use tweets which mention ‘cause’ and ‘cure’ related to cancer since they occur more frequently. We deliberately chose to omit tweets which talk about the prevention of cancer, for reasons discussed under Preliminary analysis.


We perform standard preprocessing tasks on the tweets before using them for further analysis. We use regular expressions to remove URLs, mentions, and emoticons. We also segment hashtags based on camel case and common heuristics applied in similar works, since hashtags often contain topical information related to a tweet.
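The preprocessing steps described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the regular expressions, the rough emoticon pattern and the function names `segment_hashtag` and `preprocess` are all hypothetical, and only the camel-case rule for hashtag segmentation is shown.

```python
import re

# Hypothetical patterns; the exact expressions used in the paper are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Very rough pattern for a few ASCII emoticons such as :) ;-( =D
EMOTICON_RE = re.compile(r"[:;=][-']?[)(DPp]")
HASHTAG_RE = re.compile(r"#(\w+)")
# Zero-width position between a lowercase letter or digit and an uppercase letter.
CAMEL_BOUNDARY_RE = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")


def segment_hashtag(match):
    """Drop the '#' and split the hashtag body at camel-case boundaries."""
    return CAMEL_BOUNDARY_RE.sub(" ", match.group(1))


def preprocess(tweet):
    """Remove URLs, mentions and emoticons, then segment hashtags."""
    text = URL_RE.sub(" ", tweet)
    text = MENTION_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = HASHTAG_RE.sub(segment_hashtag, text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


raw = "Rhino horn cures cancer? :) @who https://t.co/abc #TrumpKimSummit #diet"
print(preprocess(raw))
```

All-uppercase hashtags such as #NTVATONE have no camel-case boundary and are left as a single token by this sketch; handling them would need the additional heuristics the text alludes to.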

Identifying medically relevant tweets

The initial filtering of cancer-related tweets was accomplished via specific keywords. Thus, there were several tweets which lacked any relevant medical information related to cancer. We define a tweet to be “medically relevant” if it mentions any object as a cause or cure for cancer. Consequently, we define a tweet to be “medically non-relevant” if it fails to mention any object which causes or cures cancer, or if it uses cancer as a metaphor for some other scenario. We illustrate some “medically relevant” and “medically non-relevant” tweets in Table 1. We leverage human annotation to distinguish between the medically relevant and non-relevant tweets on a part of the dataset. Three annotators were employed, each of whom is competent in English and uses Twitter regularly, and none of whom is an author of this paper. The final labels were determined using majority voting. We identify 408, 207 and 386 tweets as medically relevant for cause, cure and prevent respectively. Similarly, we identify 590, 1297 and 184 tweets as medically non-relevant for cause, cure and prevent respectively. We automate the process of classifying such medically relevant tweets and describe it further in the next section.
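As a small illustration of how three annotators' labels can be combined by majority voting, consider the sketch below. The tweet identifiers, the label strings and the helper `majority_label` are hypothetical and are not taken from the released dataset.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of annotators, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


# Hypothetical annotations from three annotators per tweet.
annotations = {
    "tweet_a": ["relevant", "relevant", "non-relevant"],
    "tweet_b": ["non-relevant", "non-relevant", "non-relevant"],
}

final_labels = {tid: majority_label(votes) for tid, votes in annotations.items()}
print(final_labels)
```

With three annotators and two labels a strict majority always exists; the `None` branch only matters if the number of annotators or labels changes.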

Domain | Medically Relevant | Medically Non-relevant
Causes | 3 weeks ago, two papers, and the news reports that followed, caused a short-lived scare that #CRISPR causes cancer. This wasn’t the first CRISPR scare, and it won’t be the last. | Cancer is my biggest fear. Cause you could live the most moral life, exercise, eat well and still one day you wake up and your life is a mess. HIV is a choice (in most cases).
Cures | The major demand for rhino horn is in Asia where it’s used in ornamental carvings and traditional medicine. Rhino horn is touted as a cure for hangovers, cancer and impotence. Science has proven all this to be false. | No matter what our president does, he will never get the credit he deserves. He could cure cancer, and everyone would say.. he should of been worrying about heart disease. #TrumpKimSummit
Prevents | Garlic has sulfur compounds such as allicin which may help prevent infections by blocking specific enzymes. There is research that links garlic intake to a decreased risk of cancer, specifically stomach, colon and esophagus. #diet #nutrition #food #Health | Uganda Cancer Institute and Childcare for Cancer Foundation start a drive to equip students across the country with knowledge to prevent and fight cancer. This and more on #NTVATONE coming up
Table 1: Examples of medically relevant and non-relevant tweets

Extracting the anchors

Having identified the “medically relevant” tweets, we ask the same set of annotators to extract the objects which are mentioned to cause or cure cancer respectively, on the same portion of the dataset. These objects are hereby referred to as “anchors”. We also discuss automating the process of identifying these anchors in the subsequent section.

Figure 1: Deep Learning model used for the BIO tagging.

Preliminary analysis

On performing an initial analysis on the partly annotated dataset, we estimate the approximate percentage of tweets spreading misinformation in each category. This is done by manually checking the anchors of randomly selected tweets and verifying whether they are indeed known to cause, cure or prevent cancer. The estimated share of misinformation is markedly lower among tweets which discuss preventing cancer than among those which discuss causing or curing it. Thus, we choose to focus our research on tweets which talk about causing or curing cancer only. This result also gives us the insight that any activity which has a positive effect on the mind and body could help in preventing cancer. We also find that across all three categories, the ratio of positive to negative stance was of a similar order. This allowed us to exclude stance detection, since the stance distribution was by and large consistent.


Classifying medically relevant and non-relevant tweets

We handle this classification task using a simple feed-forward, fully connected deep learning network with several hidden layers, trained with binary cross-entropy loss. For analysis, we experimented with both tfidf and tfidf-weighted PubMed [23] [4] pretrained embeddings to convert the tweets to sentence vectors. For our objective, the latter was determined to be more suitable. These sentence vectors are then fed into our model, which is trained to output the label of the tweet, i.e., medically relevant or medically irrelevant. The dataset was divided into training and validation splits. Running the trained model on the entire dataset gives us an estimate of the fraction of tweets that are medically relevant.
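To make the idea of a tfidf-weighted embedding concrete, the sketch below averages word vectors with tf-idf weights to obtain one sentence vector per tweet. It is only an illustration: the three-dimensional `embeddings` are toy values standing in for the pretrained PubMed vectors, the smoothed idf formula is an assumption, and `idf_table` and `sentence_vector` are hypothetical helper names.

```python
import math
from collections import Counter


def idf_table(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def sentence_vector(tokens, idf, embeddings, dim):
    """Tf-idf weighted average of the word vectors of the known tokens."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, count in Counter(tokens).items():
        if tok not in embeddings or tok not in idf:
            continue  # skip out-of-vocabulary tokens
        weight = (count / len(tokens)) * idf[tok]
        total = [t + weight * e for t, e in zip(total, embeddings[tok])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


# Toy corpus and toy 3-dimensional vectors (assumptions, not real PubMed embeddings).
DIM = 3
docs = [
    ["meat", "causes", "cancer"],
    ["coffee", "causes", "cancer"],
    ["bacon", "is", "tasty"],
]
embeddings = {
    "meat": [0.9, 0.1, 0.0],
    "coffee": [0.1, 0.8, 0.1],
    "bacon": [0.8, 0.2, 0.1],
    "causes": [0.2, 0.2, 0.6],
    "cancer": [0.1, 0.3, 0.9],
}

idf = idf_table(docs)
tweet_vector = sentence_vector(docs[0], idf, embeddings, DIM)
print(tweet_vector)
```

The resulting fixed-length vector is what a feed-forward classifier of the kind described above would take as input; rarer tokens such as "meat" contribute more to it than tokens shared by many tweets.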

Attention based sequence labelling to detect anchors.

Sequence labelling model [10] is a generative framework used to map an arbitrary length sequence of inputs to corresponding labels. In this case, we learn the context of the tweet from the sequence of words in it and then generate the list of BIO (Begin-Inside-Outside) tags, as discussed in the next subsections. The BIO tags help to demarcate the “anchors” in the tweet. For example,

Tweet: “Processed meats causes cancer according to #WHO.” BIO Tags: {B-anchor, I-anchor, O, O, O, O, O}
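The way BIO tags demarcate anchors can be illustrated with a short decoding routine. This is a sketch of the standard BIO convention applied to the example above; the function name `extract_anchors` is hypothetical and the tags are given by hand rather than predicted by a model.

```python
def extract_anchors(tokens, tags):
    """Collect the token spans marked by B-anchor / I-anchor tags."""
    anchors, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-anchor":
            # A new anchor begins; close any anchor that is still open.
            if current:
                anchors.append(" ".join(current))
            current = [token]
        elif tag == "I-anchor" and current:
            current.append(token)
        else:
            # An "O" tag (or a stray I-anchor) ends the open anchor.
            if current:
                anchors.append(" ".join(current))
            current = []
    if current:
        anchors.append(" ".join(current))
    return anchors


tokens = ["Processed", "meats", "causes", "cancer", "according", "to", "#WHO"]
tags = ["B-anchor", "I-anchor", "O", "O", "O", "O", "O"]
print(extract_anchors(tokens, tags))  # ['Processed meats']
```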

Automating detection of tweets discussing causes of cancer

In this work, we propose an attention-based BiLSTM-CRF model [12] for the detection of anchor words (Figure 1). We use pretrained PubMed word embeddings for each of the words in a tweet, weight them through an attention network, and feed them through a Bidirectional LSTM. Finally, a Conditional Random Field (CRF) [14] layer is used before producing the output. We used a simple attention network featuring two hidden layers.

We also try out other model variants such as BiLSTM with softmax, simple CRF, etc. We experimented with self-attention inspired by the transformer [22] encoder network. We also tried various kinds of word embeddings, such as PubMed embeddings and skip-gram embeddings learnt from the text itself. The results are presented in Table 3. As is evident, the attention-based BiLSTM-CRF with PubMed embeddings gave the best results.

Simple tfidf tfidf weighted PubMed
Domain F1 score Accuracy F1 score Accuracy
causes 0.7818
prevents 0.8341
cures 0.5106
Table 2: Medical relevance detection

Automating detection of tweets discussing cures of cancer

The above-mentioned methodologies were also implemented for tweets which discuss curing cancer. However, we found that the results from these approaches were not up to the mark. Hence, we took an alternate approach.

According to the National Cancer Institute [21], chemotherapy, radiation therapy, immunotherapy, targeted therapy, or hormone therapy are the very specific techniques that are recognised to potentially cure cancer. Misinformation can be detected in tweets referring to “cures” of cancer by determining the presence of objects or techniques similar to this fixed set of proven cures by comparing their similarity in a vector space. This enabled us to formulate our approach towards detecting “anchors” in tweets which talk about curing cancer. The word vectors, in their PubMed embedding vector form, are compared with the word vectors of the exhaustive list of items that cure cancer. We take two simple linear combinations of the dimensions of the vector space spanned by the word embeddings. If the first linear combination is higher than the second, we conclude the presence of a word in the tweet, which can cure cancer. Absence of any such “similar” word signifies that the tweet speaks about a technique or item that is not proven to cure cancer.
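The general idea of comparing tweet words against a fixed set of proven cures in a vector space can be sketched as below. Note the substitution: the paper describes a rule based on two linear combinations of the embedding dimensions, whose details are not given here, so this sketch uses plain cosine similarity with a threshold instead. The vectors are toy values rather than PubMed embeddings, the threshold of 0.8 is arbitrary, and `mentions_proven_cure` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mentions_proven_cure(tokens, embeddings, proven_cures, threshold=0.8):
    """True if any token is close, in embedding space, to a proven cure."""
    for tok in tokens:
        if tok not in embeddings:
            continue
        for cure in proven_cures:
            if cure in embeddings and cosine(embeddings[tok], embeddings[cure]) >= threshold:
                return True
    return False


# Toy 3-dimensional vectors (assumptions for illustration only).
embeddings = {
    "chemotherapy": [0.9, 0.1, 0.0],
    "immunotherapy": [0.8, 0.3, 0.1],
    "radiotherapy": [0.85, 0.2, 0.05],
    "carrot": [0.0, 0.2, 0.9],
    "juice": [0.1, 0.1, 0.95],
    "cures": [0.3, 0.9, 0.1],
    "cancer": [0.2, 0.9, 0.2],
}
proven_cures = ["chemotherapy", "immunotherapy"]

print(mentions_proven_cure(["radiotherapy", "cures", "cancer"], embeddings, proven_cures))
print(mentions_proven_cure(["carrot", "juice", "cures", "cancer"], embeddings, proven_cures))
```

Under this sketch, a tweet for which the function returns False would be flagged as mentioning an item that is not similar to any proven cure. Multi-word treatments such as "radiation therapy" would need phrase-level vectors, which are not handled here.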

Method F1 score - Causes F1 score - Cures
CRF (skip-gram)
CRF (PubMed)
CRF (skip-gram + POS)
CRF (skip-gram + POS + deptag)
CRF (PubMed + POS + deptag)
BiLSTM-Softmax (skip-gram)
BiLSTM-CRF (PubMed + POS + deptag)
BiLSTM-CRF (skip-gram)
Simple attention BiLSTM-CRF (PubMed) 0.6846 0.5821
Self attention BiLSTM-CRF (PubMed)
Table 3: Results from various methods on anchor detection for causes and cures of cancer


Detection of medically relevant tweets

As is evident from Table 2, tfidf-weighted PubMed embeddings outperformed simple tfidf for word representation. The F1 scores obtained for “causes”, “prevents” and “cures” using tfidf-weighted PubMed embeddings are reported in Table 2.

Analysing tweets discussing causes of cancer

We curated the top 20 most discussed keywords on things which cause cancer and measured the share of all the tweets concerned that is accounted for by the keywords associated with misinformation. Using keywords as the basis for estimating misinformation highlights its spread, since we take into consideration only the most discussed topics. This share is significantly greater than the earlier estimate (subsection Preliminary analysis), which was based on the number of tweets containing misinformation. Thus, we can see that the spread of misinformation is higher than the number of misinformed tweets alone would suggest. Top keywords featuring in misinformed tweets were “sun”, “meat”, “coffee”, “bacon”, “sunscreen”, “crispr”, “sugar” etc. As mentioned earlier, the best F1 score of 0.6846 (Table 3) was achieved using an attention-based BiLSTM-CRF model.
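A keyword-level estimate of this kind can be sketched as follows. This is only one possible reading of the computation: the denominator is assumed to be the tweets covered by the top-k keywords, the anchor list is toy data, the set of keywords judged to be misinformation is assumed to come from manual verification, and `keyword_misinformation_share` is a hypothetical name.

```python
from collections import Counter


def keyword_misinformation_share(anchors, misinformed_keywords, top_k=20):
    """Share of top-k keyword mentions that belong to misinformation keywords."""
    top = Counter(anchors).most_common(top_k)
    covered = sum(count for _, count in top)
    flagged = sum(count for keyword, count in top if keyword in misinformed_keywords)
    return flagged / covered if covered else 0.0


# Toy data: one extracted anchor per tweet (illustrative, not from the dataset).
anchors = ["sun"] * 4 + ["coffee"] * 3 + ["tobacco"] * 2 + ["asbestos"]
misinformed_keywords = {"sun", "coffee"}

share = keyword_misinformation_share(anchors, misinformed_keywords, top_k=3)
print(round(share, 3))
```

Because frequently discussed keywords carry many tweets each, a handful of misinformation keywords in the top-k list can account for a large share of the discussion, which is the effect the paragraph above describes.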

Analysing tweets discussing cures of cancer

We also curated the top 20 most discussed keywords on things which cure cancer and, as before, measured the share of all the tweets concerned that is accounted for by the keywords associated with misinformation. Once again, the spread of misinformation based on keyword analysis is significantly greater than the earlier estimate (subsection Preliminary analysis). Top keywords featuring in misinformed tweets were “cannabi”, “hemp”, “dog ’s urin”, “carrot juic”, “nimbolid”, “herbal”, “immunotherapy” etc. As shown in Table 3, the best F1 score of 0.5821 was achieved using an attention-based BiLSTM-CRF model. Given this score, we took the alternate approach described in the subsection Automating detection of tweets discussing cures of cancer and evaluated it using the F1 score as well.


Discussion

We derived certain insights based on the social aspects. We separated tweets spreading misinformation from those with correct information and examined the linguistic variations between the two groups. We performed simple two-tailed t-tests to find statistically significant variations between the two groups. We used LIWC [19], NRC lexicons based on word emotion [17] and Valence, Arousal and Dominance [18]. We also used Empath [6] and extracted a total of 251 different features from the texts. Here, we mostly discuss a few social terminologies which were recognised in this process. The complete odds ratios are given in Figure 2.


Figure 2: Linguistic variations between correct tweets (left) and the tweets with misinformation (right).
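For readers unfamiliar with odds ratios, the sketch below shows one common way to compute the odds ratio of a lexical category between misinformed and correct tweets. The exact computation behind Figure 2 is not specified in the text, so the 2x2 formulation, the optional Haldane-Anscombe smoothing of 0.5 and the counts are all assumptions for illustration.

```python
def odds_ratio(mis_with, mis_without, cor_with, cor_without, smoothing=0.5):
    """Odds of a feature in misinformed tweets divided by its odds in correct tweets.

    The smoothing constant is added to every cell to avoid division by zero.
    """
    a = mis_with + smoothing
    b = mis_without + smoothing
    c = cor_with + smoothing
    d = cor_without + smoothing
    return (a / b) / (c / d)


# Hypothetical counts: tweets with / without a given lexical category.
ratio = odds_ratio(mis_with=30, mis_without=70, cor_with=10, cor_without=90, smoothing=0)
print(round(ratio, 3))  # 3.857
```

A value above 1 indicates that the category is more characteristic of misinformed tweets, and a value below 1 that it is more characteristic of correct tweets.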

First, we find that tweets containing misinformation were attractive in nature. They are also prone to attracting the notice of journalists. These could be potential reasons why the spread of such tweets increases drastically compared to their initial count. Text backed by news agencies, though it might contain misinformation, is likely to gain more attention. Moreover, with the advancement of science, certain accepted attributes related to cancer might change over time. Hence a tweet which provided correct information at an earlier time could be misleading in the future. One such example is the first tweet in Table 1, under the Medically Relevant column.

Trust issues and deception also regularly featured in tweets containing misinformation. Both these attributes appear to be interlinked, given that misleading information would naturally be pretentious.

Certain other aspects which we found to be common in misinformed tweets were swear terms and joy. On the flip side, sadness was one of the key attributes in the tweets containing correct information. These tweets were also found to be more scientific in their content. Besides, such tweets also accounted for the presence of elements of fear and death. From this analysis, we see that tweets spreading misinformation tend to use more pronouns, personal pronouns (i, we, you) and certainty words (always, never).


We discuss the spread of misinformation about diseases and health issues, specifically about cancer, in social media (Twitter). To this end, we present a dataset of tweets with partial annotations (to be publicly released). We build classifiers to distinguish medically relevant tweets from non-relevant ones and design neural approaches to identify the objects, items or techniques which are said to cause or cure cancer in a medically relevant tweet. We also present insights based on social aspects which are important to understand and curb the spread of misinformation. For instance, our results showed that incorrect journalism is a major cause behind the spread of such misinformation. We believe this will motivate further work and research in designing methods to curb the spread of misinformation in social media.


We would like to thank Prof. Mainack Mondal for his continued support throughout the development of this work.


  • [1] H. Allcott, M. Gentzkow, and C. Yu (2019) Trends in the diffusion of misinformation on social media. Research & Politics 6 (2), pp. 2053168019848554. Cited by: Related Work.
  • [2] N. M. Anspach and T. N. Carlson (2018) What to believe? social media commentary and belief in misinformation. Political Behavior, pp. 1–22. Cited by: Related Work.
  • [3] L. Bode and E. K. Vraga (2018) See something, say something: correction of global health misinformation on social media. Health communication 33 (9), pp. 1131–1140. Cited by: Related Work.
  • [4] Q. Chen, Y. Peng, and Z. Lu (2019) BioSentVec: creating sentence embeddings for biomedical texts. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), pp. 1–5. Cited by: Classifying medically relevant and non-relevant tweets.
  • [5] N. DiFonzo, N. M. Robinson, J. M. Suls, and C. Rini (2012) Rumors about cancer: content, sources, coping, transmission, and belief. Journal of Health Communication 17 (9), pp. 1099–1115. Note: PMID: 22724591 External Links: Document, Link, Cited by: Dataset Description.
  • [6] E. Fast, B. Chen, and M. S. Bernstein (2016) Empath: understanding topic signals in large-scale text. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4647–4657. Cited by: Discussion.
  • [7] A. Ghenai and Y. Mejova. Catching zika fever: application of crowdsourcing and machine learning for tracking health misinformation on twitter. Cited by: Related Work.
  • [8] A. Ghenai and Y. Mejova (2018) Fake cures: user-centric modeling of health misinformation in social media. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 58. Cited by: Related Work, Related Work.
  • [9] G. C. Goobie, S. A. Guler, K. A. Johannson, J. H. Fisher, and C. J. Ryerson (2019) YouTube videos as a source of misinformation on idiopathic pulmonary fibrosis. Annals of the American Thoracic Society 16 (5), pp. 572–579. Cited by: Related Work.
  • [10] A. Graves (2012) Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pp. 5–13. Cited by: Attention based sequence labelling to detect anchors..
  • [11] K. Gunaratne, E. A. Coomes, and H. Haghbayan (2019) Temporal trends in anti-vaccine discourse on twitter. Vaccine 37 (35), pp. 4867–4871. Cited by: Related Work.
  • [12] Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: Automating detection of tweets discussing causes of cancer.
  • [13] D. D. Kılınç and G. Sayar (2019) Assessment of reliability of youtube videos on orthodontics. Turkish journal of orthodontics 32 (3), pp. 145. Cited by: Related Work.
  • [14] J. Lafferty, A. McCallum, and F. C. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. Cited by: Automating detection of tweets discussing causes of cancer.
  • [15] H. Liang, I. C. Fung, Z. T. H. Tse, J. Yin, C. Chan, L. E. Pechta, B. J. Smith, R. D. Marquez-Lameda, M. I. Meltzer, K. M. Lubell, et al. (2019) How did ebola information spread on twitter: broadcasting or viral spreading?. BMC public health 19 (1), pp. 438. Cited by: Related Work.
  • [16] S. Loeb, S. Sengupta, M. Butaney, J. N. Macaluso Jr, S. W. Czarniecki, R. Robbins, R. S. Braithwaite, L. Gao, N. Byrne, D. Walter, et al. (2019) Dissemination of misinformative and biased information about prostate cancer on youtube. European urology 75 (4), pp. 564–567. Cited by: Related Work.
  • [17] S. M. Mohammad and P. D. Turney (2013) Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29 (3), pp. 436–465. Cited by: Discussion.
  • [18] S. M. Mohammad (2018) Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 english words. In Proceedings of The Annual Conference of the Association for Computational Linguistics (ACL), Melbourne, Australia. Cited by: Discussion.
  • [19] J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn (2015) The development and psychometric properties of liwc2015. Technical report Cited by: Discussion.
  • [20] J. Shin, L. Jian, K. Driscoll, and F. Bar (2018) The diffusion of misinformation on social media: temporal pattern, message, and source. Computers in Human Behavior 83, pp. 278–287. Cited by: Related Work.
  • [21] Treatment for cancer - national cancer institute. Cited by: Automating detection of tweets discussing cures of cancer.
  • [22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Automating detection of tweets discussing causes of cancer.
  • [23] Y. Zhang, Q. Chen, Z. Yang, H. Lin, and Z. Lu (2019) BioWordVec, improving biomedical word embeddings with subword information and mesh. Scientific data 6 (1), pp. 1–9. Cited by: Classifying medically relevant and non-relevant tweets.