What Are People Posting About Symptoms Related to Coronavirus in Bogota, Colombia?

In recent months there has been increasing alarm about a new coronavirus disease, named covid-19 by the World Health Organization (WHO), with an impact in many areas: the economy, health, politics and others. WHO declared the situation a pandemic because of its fast expansion across many countries. At the same time, people are using Social Networks to express what they think, feel or experience, so they act as Social Sensors and help to analyze what is happening in their city. The objective of this paper is to analyze the publications of people located within a radius of 50 km of Bogota, Colombia, using Text Mining techniques from a symptomatology approach. The results support the understanding of the spread in Colombia as it relates to symptoms of covid-19.




I Introduction

In a globalized world with ever more communication tools, the coronavirus disease 2019 (COVID-19) epidemic is accompanied by instantaneous communication, in many cases without verification of the source of the information, which may have adverse consequences for society [1].

On the other hand, infoveillance through Twitter (www.twitter.com) can be useful for longitudinal text mining and analysis, allowing some epidemiological conditions to be analyzed in real time, as previously described by Chew and Eysenbach for the 2009 H1N1 pandemic [2].

Meanwhile, public health professionals have an increasing need to establish a feedback loop and monitor the real-time online public response and insights during emergency situations, in order to examine the effectiveness of knowledge translation strategies and to adapt future communications and educational campaigns that help the population face this pandemic [3].

The dissemination of information can strongly influence people’s behavior and alter the effectiveness of countermeasures implemented by governments. In this regard, models to predict the spread of the virus are beginning to monitor the behavioral response of the population with respect to public health interventions and the communication dynamics behind content consumption [4].

In recent weeks, great interest in the coronavirus arose from an infection located in the city of Wuhan, China. The epidemic scale of the recently emerged novel coronavirus has increased rapidly, with cases arising across China and in other countries and regions. Using a transmission model, an estimate of 81008 cases was obtained, and Wuhan city was estimated to have had 21022 (11090-33490) total infections from 1 to 22 January [5].


In Colombia, the first case was registered in Bogota: a 19-year-old woman who returned to Bogota on 26 February from Milan, Italy. She was placed in quarantine at her place of residence, with constant medical supervision, and after approximately 10 days it was confirmed that she had overcome the virus and was no longer infected with covid-19. The mayor pointed out that “contagion to her relatives was also avoided” [6]. Regarding the incidence of COVID-19, by March 18, 2020 Colombia had recorded 93 cases and 2 deaths, according to the record of the Colombian Secretary of Health [7].

The objective of this article is to describe the epidemiological impact of COVID-19 on press publications over the 7 days before the first case of COVID-19 in Bogota, Colombia, was described. To that end, it describes the publications on Twitter associated with the signs of the coronavirus as the pandemic advanced, and the persistence of the people of Bogota in this regard. The paper is organized as follows: Section 2 explains the methodology for the experiments, and Section 3 presents the results and analysis. Section 4 states the conclusions, and Section 5 gives recommendations for related studies.

II Methodology

The present work performs experiments on source data from Twitter using Natural Language Processing and Data Mining (Text Mining), following these steps:

  • Gather the relevant terms to search on Twitter

  • Build the query for Twitter and collect data

  • Preprocess the data to eliminate words with no relevance (stopwords)

  • Visualize the results

II-A Gather Relevant Terms

The following terms were extracted from [8, 9] and translated to Spanish:

  • ’fiebre’,’tos’,’gripe’,’estornudar’,’contagio’,’garganta’

  • ’dolor_cabeza’,’dificultad_respirar’,’congestion nasal’,’mialgia’

  • ’produccion_esputo’, ’hipoxemia’, ’fatiga’

II-B Build the Query and Collect Data

Tweets are extracted through the Twitter API with the following parameters:

  • date: from 29-12-2019 to 14-03-2020

  • terms: the words about symptoms in the previous subsection

  • geolocation: Bogota, the capital of Colombia (4.6, -74.083333)

  • language: Spanish

  • radius: around 50 km
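The parameters above can be sketched as a request dictionary. This is only an illustration, not the authors' collection script: it assumes the parameter names of Twitter's v1.1 standard search endpoint (`q`, `geocode`, `lang`, `until`), and the helper `build_search_params` is a hypothetical name. The symptom terms are those listed earlier; writing underscores as spaces and quoting multi-word terms is an assumption about how they were searched.

```python
# Symptom terms from the list above (as given in the paper).
SYMPTOM_TERMS = [
    "fiebre", "tos", "gripe", "estornudar", "contagio", "garganta",
    "dolor_cabeza", "dificultad_respirar", "congestion nasal", "mialgia",
    "produccion_esputo", "hipoxemia", "fatiga",
]


def build_search_params(terms, lat, lon, radius_km, lang, until):
    """Build a dict of search parameters (hypothetical helper).

    Parameter names are assumed from the v1.1 standard search endpoint.
    """
    phrases = []
    for term in terms:
        # Assumption: underscores stand for spaces in multi-word terms.
        phrase = term.replace("_", " ")
        # Quote multi-word terms so they are searched as exact phrases.
        phrases.append(f'"{phrase}"' if " " in phrase else phrase)
    return {
        "q": " OR ".join(phrases),                   # any of the symptom terms
        "geocode": f"{lat},{lon},{radius_km}km",     # "latitude,longitude,radius"
        "lang": lang,                                # restrict to one language
        "until": until,                              # tweets created before this date
    }


# Bogota coordinates and radius from the paper; `until` is set to the day
# after the last day of the window (an assumption about how the end date was handled).
params = build_search_params(SYMPTOM_TERMS, 4.6, -74.083333, 50, "es", "2020-03-15")
print(params["geocode"])
```

Keeping the query construction in a pure function makes it easy to inspect the exact search string before any request is sent.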

II-C Preprocessing Data

  • Change the datetime format to year-month-day

  • Eliminate non-alphanumeric symbols

  • Convert uppercase to lowercase

  • Eliminate words of length less than or equal to 3

  • Add some exceptions
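A minimal sketch of these preprocessing steps is given below, under several assumptions: the timestamp is taken to be in the `created_at` format of the v1.1 API, the symbol-removal step is read as dropping every character that is not a letter, digit, underscore or whitespace, the stopword set is a tiny illustrative subset, and the exception set (here only "tos", a three-letter search term) is hypothetical because the paper does not list its exceptions.

```python
import re
from datetime import datetime

# Illustrative subset only; a full Spanish stopword list would be used in practice.
STOPWORDS = {"para", "como", "pero", "esta", "este", "tengo"}
# Hypothetical exception: a short symptom term kept despite the length rule.
EXCEPTIONS = {"tos"}


def format_date(created_at):
    """Convert e.g. 'Sun Mar 08 14:05:11 +0000 2020' to '2020-03-08'."""
    parsed = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    return parsed.strftime("%Y-%m-%d")


def clean_text(text):
    """Return the list of relevant lowercase tokens in a tweet."""
    # Replace every character that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Uppercase to lowercase.
    text = text.lower()
    tokens = text.split()
    # Drop words of length <= 3 unless they are listed as exceptions.
    kept = [t for t in tokens if len(t) > 3 or t in EXCEPTIONS]
    # Drop stopwords.
    return [t for t in kept if t not in STOPWORDS]


print(format_date("Sun Mar 08 14:05:11 +0000 2020"))
print(clean_text("¡Tengo TOS y fiebre, que dolor de cabeza!"))
```

Because `\w` is Unicode-aware for Python strings, accented Spanish letters survive the symbol-removal step.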

II-D Visualization

  • The date of user account creation

  • Tweets per day to analyze the increasing number of posts

  • Cloud of words to analyze the most frequent terms involved per day
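The counts behind the last two visualizations can be sketched with the standard library alone. The function names and the sample records below are hypothetical placeholders, not the collected data; each record is assumed to be a pair of a year-month-day string and the list of tokens left after preprocessing. The resulting frequencies are what a plotting or word cloud tool would then render.

```python
from collections import Counter


def tweets_per_day(tweets):
    """Count tweets per date; `tweets` is an iterable of (date, tokens) pairs."""
    return Counter(date for date, _ in tweets)


def top_terms_per_day(tweets, n=10):
    """Return the n most frequent tokens for each date."""
    per_day = {}
    for date, tokens in tweets:
        # One Counter per date, updated with the tokens of each tweet.
        per_day.setdefault(date, Counter()).update(tokens)
    return {date: counts.most_common(n) for date, counts in per_day.items()}


# Placeholder records for illustration only (not data from the study).
sample = [
    ("2020-03-08", ["fiebre", "tos"]),
    ("2020-03-08", ["fiebre"]),
    ("2020-03-09", ["gripe"]),
]
print(tweets_per_day(sample))
print(top_terms_per_day(sample, n=1))
```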

III Results

The following figures present the results of the experiments and answer several questions that help to understand the phenomenon in the population.

III-A What about the veracity of the posts?

Nowadays, many users post their ideas on Social Networks, and there is no control over the veracity of the information. One field relevant to this question is the creation date of the accounts, which is presented in Fig. 1.

Fig. 1: Data User Creation

In Fig. 1, the creation dates are concentrated around 2010 and 2011, so the age of these accounts is greater than 6 years. By contrast, if fake users wanted to post false information, the age of their accounts would usually be less than 1 year.
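The account-age heuristic can be expressed as a small computation. This is a sketch under stated assumptions rather than the procedure used for Fig. 1: the function names are hypothetical, the reference date is taken to be the last day of the collection window, and the creation dates shown are placeholders, not the study's accounts.

```python
from datetime import date


def account_age_years(created, reference):
    """Age of an account in years at the reference date."""
    return (reference - created).days / 365.25


def share_younger_than(creation_dates, reference, years=1.0):
    """Fraction of accounts younger than `years` at the reference date."""
    if not creation_dates:
        raise ValueError("at least one creation date is required")
    ages = [account_age_years(d, reference) for d in creation_dates]
    return sum(age < years for age in ages) / len(ages)


# Placeholder creation dates for illustration only.
created = [date(2010, 5, 1), date(2011, 7, 15), date(2019, 12, 1)]
reference = date(2020, 3, 14)  # assumed: end of the collection window
print(share_younger_than(created, reference))
```

A low share of accounts younger than one year is consistent with the argument above, although account age alone does not establish that a post is truthful.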

III-B How often do people post and when did they start?

The window for this analysis was from 29-12-2019 to 14-03-2020, so posts were expected for every day, but people were not posting about the topic before 08-03-2020. Fig. 2 shows an increasing number of posts during the last days.

Fig. 2: Tweets during the last seven days

III-C What are people posting about Covid19 symptoms?

After preprocessing the tweets and removing stopwords, the predominant words from 2020-03-08 to 2020-03-14 are: dolor, cabeza, ivanduque, coronavirus, uribi, fiebre, contagio, manos, gripe, evitar, estornudar, taken from the word cloud in Fig. 3.

Fig. 3: Cloud of Words following the histogram of words

The most frequent words thus introduce topics related to symptoms (health); in addition, the figure shows interest in politics.

III-D How is covid19 progressing in Colombia?

Finally, Fig. 4 shows the actual increase of infections in Colombia from the start of March; there is a natural correlation between the increasing number of posts per day and the number of infections.
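The relationship is reported here only visually. One possible way to quantify it, which is not a measure used in the original analysis, is Pearson's correlation coefficient between the daily post counts and the daily infection counts. The sketch below uses a hypothetical `pearson` helper and placeholder series that are not the study's data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Placeholder series for illustration only (not data from the study).
posts_per_day = [1, 2, 3, 4]
infections_per_day = [2, 4, 6, 8]
print(pearson(posts_per_day, infections_per_day))
```

A value close to 1 would indicate that the two series rise together, but with only seven days of data such a coefficient should be read with caution.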

Fig. 4: Number of infections in Colombia

This preliminary analysis helps to understand what is happening in the population of Bogota, and the data can be useful for analyzing other aspects of the phenomenon from different approaches: Economy, Sociology, etc.

IV Conclusions

A Text Mining approach helps to visualize what is happening regarding symptoms of covid19 in Bogota: the relevance of the topic for the people, the increasing number of posts, the most relevant terms per day, and how these are naturally correlated with the number of infected people in Colombia.

V Recommendations

The Twitter API has a limitation of seven days, so if you need to collect data you must set the time range accordingly. The preprocessing step is necessary because people post without any rule in mind.


References

  • [1] K. Sun, J. Chen, and C. Viboud, “Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study,” The Lancet Digital Health, 2020.
  • [2] C. Chew and G. Eysenbach, “Pandemics in the age of Twitter: content analysis of tweets during the 2009 H1N1 outbreak,” PLoS ONE, vol. 5, no. 11, 2010.
  • [3] V. K. Jain and S. Kumar, “An effective approach to track levels of influenza-A (H1N1) pandemic in India using Twitter,” Procedia Computer Science, vol. 70, pp. 801–807, 2015.
  • [4] J. Shaman, A. Karspeck, W. Yang, J. Tamerius, and M. Lipsitch, “Real-time influenza forecasts during the 2012–2013 season,” Nature Communications, vol. 4, no. 1, pp. 1–10, 2013.
  • [5] J. M. Read, J. R. Bridgen, D. A. Cummings, A. Ho, and C. P. Jewell, “Novel coronavirus 2019-nCoV: early estimation of epidemiological parameters and epidemic predictions,” medRxiv, 2020.
  • [6] Semana, “‘Ya el primer caso de coronavirus en Bogota fue superado’: Claudia Lopez.” [Online]. Available: https://www.semana.com/nacion/articulo/coronavirus-primercaso-de-covid-19-en-bogota-fue-superado/657012
  • [7] E. Dong, H. Du, and L. Gardner, “An interactive web-based dashboard to track COVID-19 in real time,” The Lancet Infectious Diseases, 2020.
  • [8] Y. Dong, X. Mo, Y. Hu, X. Qi, F. Jiang, Z. Jiang, and S. Tong, “Epidemiological characteristics of 2143 pediatric patients with 2019 coronavirus disease in China,” Pediatrics, 2020. [Online]. Available: https://pediatrics.aappublications.org/content/early/2020/03/16/peds.2020-0702
  • [9] V. Jain and J.-M. Yuan, “Systematic review and meta-analysis of predictive symptoms and comorbidities for severe COVID-19 infection,” medRxiv, 2020. [Online]. Available: https://www.medrxiv.org/content/early/2020/03/16/2020.03.15.20035360