Survey of Text-based Epidemic Intelligence: A Computational Linguistic Perspective

03/14/2019
by Aditya Joshi, et al. (CSIRO; UNSW)

Epidemic intelligence deals with the detection of disease outbreaks using formal sources of information (such as hospital records) and informal ones (such as user-generated text on the web). In this survey, we discuss approaches for epidemic intelligence that use textual datasets, referring to this as ‘text-based epidemic intelligence’. We view past work in terms of two broad categories: health mention classification (selecting relevant text from a large volume) and health event detection (predicting epidemic events from a collection of relevant text). The focus of our discussion is the underlying computational linguistic techniques in the two categories. The survey also provides details of the state-of-the-art in annotation techniques, resources and evaluation strategies for epidemic intelligence.

1 Introduction

Epidemics have adversely impacted the lives and well-being of individuals, and consequently economies, for centuries (see, e.g., http://edition.cnn.com/interactive/2014/10/health/epidemics-through-history/). While limitations of medical knowledge accounted for most of this impact, delayed detection and the latency of prevalent communication channels were also bottlenecks. For instance, about a century ago, slow word-of-mouth communication between medical professionals would aggravate the impact of disease outbreaks due to late detection. The rise of information technology provided a new channel for medical professionals to exchange information and possibly detect anomalous events in the health of a community. Referred to as ‘epidemic intelligence’ (https://www.who.int/csr/alertresponse/epidemicintelligence/en/), such systems aim to provide early warnings of public health emergencies. Using information from digital sources, such surveillance can potentially mobilise a rapid response, resulting in a reduced rate of morbidity and mortality Yan et al. (2017). Epidemic intelligence has proven useful during several past outbreaks, including the early detection of the A(H1N1) pandemic Brownstein et al. (2009). An overlapping area of research is syndromic surveillance Henning (2004); Yan et al. (2006), where the goal is to detect syndromes: collections of related symptoms. A syndrome is a condition characterised by associated symptoms, while a symptom is an indication of a medical condition (source: Oxford Dictionary). However, ‘syndromic surveillance’ is no longer restricted to syndromes in the true sense of the term Hopkins et al. (2017).

Traditionally, epidemic intelligence relies on structured information from medical institutions and governing bodies, such as medical records or weather information respectively Yan et al. (2006). The use of the internet to obtain information from such official sources has recently become a popular paradigm for epidemic intelligence Brownstein et al. (2009). Epidemic intelligence has also been impacted by the rise of textual content on the internet. In particular, Web 2.0 (https://en.wikipedia.org/wiki/Web_2.0) enabled users to generate textual content on the internet. As a result, many informal sources of data (such as online discussion forums) are often the first reports of an epidemic (https://www.who.int/csr/alertresponse/epidemicintelligence/en/). The data under consideration for epidemic intelligence may span a spectrum of sources: short phrases in the form of queries entered in search engines, or long documents such as news articles or blogs written by users. Intermediate between the two are posts on digital social media (referred to as ‘social media’ in the rest of the paper). Due to its accessibility and popularity, social media is an attractive source of data for epidemic intelligence, as it is for other fields of analytics. The value of textual data for epidemic intelligence can be understood from the observation that more than 60% of initial reports of epidemics are received from unofficial, informal sources, including text-based ones Wagner et al. (2011). We refer to the detection of epidemics using health-related textual data as text-based epidemic intelligence.

Since it involves textual data, text-based epidemic intelligence uses techniques in computational linguistics/natural language processing (NLP). Text-based epidemic intelligence can be viewed as a two-step process, indicated in Figure X, as follows (Figure will be made available in the full paper):

  • Health mention classification: This step refers to the identification of text that is relevant to epidemic intelligence. For example, Aramaki et al. (2011) predict if a tweet reports an influenza outbreak. As shown in the figure, this step selects text concerning public health risks of interest, from a large pool of textual data (such as Twitter streams).

  • Health event detection: In this step, the information extracted from the relevant text is applied in order to identify health events. An example is the work by Sparks et al. (2010b), who use an exponentially weighted moving average to predict influenza counts over time (a minimal sketch of such a detector follows this list). As shown in the figure, this step aims to detect epidemics by taking into account a collection of textual units (as opposed to one textual unit in the previous step).
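
To make the event-detection step concrete, the following minimal sketch (our illustration, not the published method of Sparks et al. (2010b); the smoothing factor and alerting threshold are assumptions) flags days on which a count series exceeds an exponentially weighted moving average forecast:

    # Minimal sketch of temporal outbreak detection with an exponentially
    # weighted moving average (EWMA); parameters are illustrative.
    import statistics

    def ewma_alerts(daily_counts, alpha=0.3, threshold=3.0):
        """Flag days whose count exceeds the EWMA forecast by a margin."""
        ewma = daily_counts[0]
        alerts, residuals = [], []
        for day, count in enumerate(daily_counts[1:], start=1):
            sigma = statistics.pstdev(residuals) if len(residuals) > 1 else 1.0
            if count > ewma + threshold * sigma:
                alerts.append(day)
            residuals.append(count - ewma)
            ewma = alpha * count + (1 - alpha) * ewma
        return alerts

    print(ewma_alerts([10, 12, 11, 13, 12, 40, 45, 50]))  # [5]: onset of the spike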

Figure X (Figure will be made available in the full paper) presents an overview of the research in text-based epidemic intelligence in terms of these steps. Events in the real world are manifested as online textual content such as news articles and social media text. In addition, structured textual data such as medical ontologies provide knowledge about the domain. Health mention classification involves computational linguistic techniques based on ontologies, statistical classifiers or topic models. In contrast, health event detection involves identifying a health event corresponding to a possible outbreak in the community. Temporal outbreaks refer to outbreaks over time, where the textual units are arranged in a time series. Spatial outbreaks refer to outbreaks over space, where the textual units are arranged in a geographical region.

Because it aims to select relevant text from large volumes of text, health mention classification has witnessed more diversity in terms of computational linguistic approaches in comparison with health event detection. Hence, we survey different approaches reported for health mention classification, and highlight strategies peculiar to textual datasets that have been used in health event detection.

The survey paper is organised as follows. Section 2 positions this survey paper among related survey papers. In Section 3, we look at different ways in which past works define their scope. Following this, we describe resources for epidemic intelligence in Section 4. We then survey past work in health mention classification in Section 5 followed by health event detection in Section 6. Section 7 describes the evaluation techniques that have been used. A list of possible future directions is in Section 8, while the conclusion is in Section 9.

2 Motivation

The earliest literature review of syndromic surveillance using open-source data was by Yan et al. (2006) in 2006. It surveyed approaches of the time, primarily based on structured information sources (such as hospital records). Several years later, other papers applied systematic review techniques to summarise syndromic surveillance Charles-Smith et al. (2015); Al-garadi et al. (2016). Bernardo et al. (2013) used a structured scoping review method on a dataset of 101 scientific articles, and report statistics in terms of demographic attributes of authors and themes of these publications. A comprehensive list of existing surveillance systems can be found in Yan et al. (2017). Surveys of other specific problems in health informatics (namely adverse drug reaction detection) are by Karimi et al. (2015) and Sarker et al. (2015). A systematic review closely related to ours is by Velasco et al. (2014), who provided a high-level view of approaches for syndromic surveillance using social media text. Our computational linguistic perspective lies in viewing past work in epidemic intelligence through the lens of typical computational linguistic techniques. Our survey is targeted at enabling both computational linguistics and epidemic intelligence researchers to understand the state-of-the-art. The novelty of this survey is as follows:

  1. Our survey paper views epidemic intelligence as an application of computational linguistics and presents approaches for syndromic surveillance that process textual data.

  2. Our view of epidemic intelligence divides past work into two steps: Health mention classification and health event detection.

  3. We classify past approaches in health mention classification in terms of well-known NLP paradigms. This helps in understanding trends in the types of approaches that have been reported over time.

The need for application of advanced computational linguistic techniques arises from typical challenges involved in text-based epidemic intelligence:

  1. Term presence is not sufficient: Ambiguity is a key challenge for natural language processing Manning et al. (1999), and this holds for epidemic intelligence as well. A sentence containing a symptom/illness may not always be the report of the illness. For example, ‘I have the flu’ is a report of an illness, while ‘Flu is common in winters’ is not. Therefore, health mention classification must be able to distinguish between health reports (where a person reports experiencing certain symptoms) and other tweets. This maps to a two-class classification task (see the sketch after this list). The classification becomes more challenging if it is more fine-grained. For example, it may be useful to classify whether a given sentence is a suspicion (‘I have a head ache. Maybe I have a flu’), a fact (‘WHO reported that bird flu is likely this year’), or a question (‘Do you have flu?’).

  2. Targets may be important: The target of a health mention is the person who has contracted the disease. For example, in ‘my flu got cured’, the target is the speaker, while in ‘my mother has been down with a flu’, the target is the mother. In the case of health event detection, the target may play an important role. This means that, in addition to detecting the presence of a health mention, it may be necessary to detect who has the health mention. If a user mentions their flu in many tweets, or if a famous person falls ill, there may be a high number of health mention tweets, but this may not always warrant signalling a health event. Kanouchi et al. (2015) present an approach to detect the target of a tweet as one among: 1st person, 3rd person, referred person (e.g., @target in the tweet), not human, and none. They deal with seven symptoms: cough, cold, headache, chill, runny nose, fever, and sore throat. The target of a health mention may be challenging to determine due to typical challenges in social media text: (a) the subject may be somebody else (‘my brother got flu’), (b) the verb may indicate the target (‘my brother has passed his flu to me’), or (c) the subject may be dropped (‘Awake with a headache’).
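
To make challenge 1 concrete, the sketch below shows why term presence alone is insufficient: both example sentences mention ‘flu’, but only the first passes a crude first-person heuristic. The keyword list and regular expression are deliberately simple stand-ins for the classifiers surveyed in Section 5, not a method from the literature:

    import re

    # Toy illustration: both sentences mention 'flu', but only the first
    # is a personal health report. A first-person cue is a crude proxy
    # for the report/non-report distinction.
    FIRST_PERSON = re.compile(r"\b(i|i've|my|me)\b", re.IGNORECASE)
    SYMPTOM_TERMS = {"flu", "fever", "headache", "cough"}

    def is_health_report(sentence):
        tokens = {t.lower().strip(".,!?'") for t in sentence.split()}
        mentions_symptom = bool(tokens & SYMPTOM_TERMS)
        return mentions_symptom and bool(FIRST_PERSON.search(sentence))

    print(is_health_report("I have the flu"))            # True: a report
    print(is_health_report("Flu is common in winters"))  # False: a general fact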

Given these challenges, this survey presents nuances of computational linguistics techniques in terms of annotation strategies, approaches and evaluation techniques for epidemic intelligence.

3 Scope Definition

We now describe how past papers define their scope. For example, some papers focus on tweets about a mass gathering where there may be a risk of an outbreak, while others focus on specific contagious diseases. In this section, we discuss several such dimensions that could be useful in formulating a research problem in text-based epidemic intelligence. Gomide et al. (2011) describe four requirements of an epidemic intelligence system: how much (extent), where (location), when (time) and how (manner of spread). To this list, we add ‘what’ to refer to the health condition of interest. We discuss the scope of past work in light of these requirements.

3.1 Illness (‘What’)

The ‘what’ of an epidemic intelligence system is the health condition of interest. This could be an illness characterised by symptoms or a syndrome. Instead of a generic, illness-agnostic system, past work focuses on specific symptoms. A scoping review shows that most past work deals with influenza and influenza-like illnesses Bernardo et al. (2013). Other forms of illnesses that have been studied include sexual health, alcoholism, and drug abuse Charles-Smith et al. (2015). In general, the parameters that influence the choice of syndrome in a study are:

  1. Karisani and Agichtein (2018) state that symptoms that are apparent to a patient can be a good choice if social media datasets are to be used. They refer to text reporting such symptoms as a ‘personal health mention’. The goal of such research is to identify whether a given piece of text contains a person reporting an illness.

  2. Since epidemic intelligence approaches need to be validated, the availability of reference counts from official records is an important determiner. For example, counts from the Centers for Disease Control and Prevention (CDC) or its equivalents in other countries have been used as a source of validation sets Boyle et al. (2011).

  3. Social stigma and privacy concerns arising due to it may prevent certain illnesses from being discussed on social media Fung et al. (2015). In such cases, the volume of social media content reporting the illness may be insufficient to detect an epidemic.

3.2 Time period (‘When’)

The second parameter is the time period. One way to view the time period is in relation to a known outbreak. Sparks et al. (2017) provide two categories based on this relation:

  1. Retrospective surveillance: Retrospective surveillance looks at an outbreak in retrospect in order to understand past unusual behaviour, using validation data of past outbreaks. Insights from retrospective surveillance can have an impact on prospective surveillance.

  2. Prospective surveillance: This kind of surveillance involves monitoring for health indicators and triggering appropriate flags, if an outbreak is detected. These systems are time-critical in that they aim to detect outbreaks as early as possible Sparks et al. (2010a). Mobile phones have been used for prospective surveillance to ensure that the detection of health signals is within an acceptable time period Rosewell et al. (2013).

3.3 Location (‘Where’)

Some past work also defines a specific event in its scope. These are typically high-risk events that may trigger medical emergencies and may, as a result, cause people to post on the web to express fears, report symptoms, and so on. The objective of such work is to harness this web content to detect the corresponding outbreaks. For example, Brownstein et al. (2009) discuss how epidemic intelligence was useful early in the A(H1N1) pandemic. Other event-based focuses have been an Ebola outbreak in London Ofoghi et al. (2016), a Zika virus outbreak around the world Adam et al. (2017), and the 2002 Winter Olympic Games Chapman et al. (2005). Henning (2004) refers to this kind of surveillance as short-duration or drop-in surveillance, since it is centered around the interval of time for which the event lasts. In addition, some work focuses on specific cities Chapman et al. (2005), or trains systems on multiple geographical locations Zou et al. (2018).

3.4 Indirect Indicators (‘How much’)

Some research estimates the extent of impact of an outbreak using indirect indicators.

Ofoghi et al. (2016) relate public health threats to public mood about disease names. The hypothesis is that real-world health threats may be discovered by detecting emotions in related social media text. Thus, the context in focus here is the sentiment about a syndrome rather than the actual incidence of the syndrome. In order to detect public mood, they formulate an emotion analysis task on health-related tweets, and use it to identify possible threats. Similarly, Larsen et al. (2015) use an emotion analysis tool to detect emotions in social media text as an indicator of health signals.

A Note on Ethics

In addition to the requirements that help define the scope of epidemic intelligence research, a note on ethical considerations is imperative. Benton et al. (2017) state that public datasets are exempt from being regarded as sensitive, and that social media may therefore be useful since it is potentially a public dataset. They describe the importance of ethical guidelines for health surveillance research using social media, and prescribe mechanisms such as an institutional review board, informed consent from participants, and protection of sensitive data; these provide a starting point for researchers. Similarly, Ginsberg et al. (2009) describe techniques like anonymisation of health data using identifiers, or the use of normalised counts instead of specific instances. A detailed discussion on ethics is beyond the scope of this survey.

4 Resources

Textual resources are the foundation of a computational linguistic system Joshi et al. (2017). Structured resources include ontologies or lexicons that provide semantic information about the domain. Unstructured resources include labeled datasets that contain instances assigned with labels of interest.

4.1 Ontologies

An ontology is a formal, explicit specification of a shared conceptualisation Gruber (1993). Ontologies consist of concepts, and relations that link two or more concepts. Medical ontologies have helped to organise the knowledge of the medical domain Bertaud-Gounot et al. (2012). It must be noted that, apart from epidemic intelligence, these medical ontologies have been used for applications such as medical information extraction and information retrieval.

The Unified Medical Language System (UMLS) provides a popular medical ontology (Lindberg et al., 1993). This is a meta-thesaurus that defines biomedical concepts and relations between them. UMLS was created by experts using information captured in multiple related ontologies, such as a gene ontology. UMLS captures relationships of two types: associative (such as ‘has’) and hierarchical (such as ‘isa’). Associative relationships may be used to relate symptoms with a syndrome in which the symptoms are observed. Hierarchical relationships may represent a specialisation chain of illnesses. Other biomedical sources have also been associated with concepts and relations in UMLS. Bodenreider (2004) describes tools that allow customisation and enrichment of the UMLS ontology.
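
As an illustration of how these relationship types can be used programmatically, the sketch below encodes a miniature UMLS-style ontology and walks its hierarchical chain; the concept names are invented for illustration and are not actual UMLS entries:

    # Miniature UMLS-style ontology with hierarchical ('isa') and
    # associative ('has') relations; concept names are illustrative.
    ONTOLOGY = {
        ("influenza", "isa"): "respiratory_illness",
        ("respiratory_illness", "isa"): "illness",
        ("influenza", "has"): ["fever", "cough", "sore_throat"],
    }

    def ancestors(concept):
        """Follow the hierarchical 'isa' chain upward from a concept."""
        chain = []
        while (concept, "isa") in ONTOLOGY:
            concept = ONTOLOGY[(concept, "isa")]
            chain.append(concept)
        return chain

    print(ancestors("influenza"))          # ['respiratory_illness', 'illness']
    print(ONTOLOGY[("influenza", "has")])  # symptoms associated with influenza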

Collier et al. (2007) describe an ontology that captures syndromic knowledge. This ontology, known as the BioCaster ontology, consists of: (a) concepts such as disease, symptom, virus, or syndrome, and (b) relations, such as those that link a disease with a symptom, or a disease with the virus that causes it. The ontology is multilingual, with support for 12 languages, including English, Japanese, French, Arabic and Thai.

Okhmatovskaia et al. (2009) present a syndromic surveillance ontology (SSO) for certain classes of illnesses: respiratory, gastrointestinal, constitutional and influenza-like. The ontology allows finer-grained definitions of syndromes to be captured. Conway et al. (2011) report an extended version of the SSO that covers a broader range of illnesses.

  • Search queries Zou et al. (2018); Ginsberg et al. (2009); Hulth et al. (2009). Nature of text: short phrases by users of search engines. Advantages: search queries can be aggregated in terms of counts, allowing large-scale monitoring. Challenges: there may not be a direct correlation between a need for information and the occurrence of an outbreak; search queries may also not be readily available.

  • News articles Doan et al. (2008); Yangarber et al. (2008); Lejeune et al. (2010); Freifeld et al. (2008). Nature of text: well-formed text written by journalists. Advantages: high-quality NLP tools such as parsers and part-of-speech (POS) taggers can be used. Challenges: news articles may introduce latency because they are written periodically.

  • Medical reports Crubézy et al. (2005); Conway et al. (2011); Aamer et al. (2016); Olszewski (2003). Nature of text: text written by medical experts. Advantages: they represent reliable information reported by medical professionals. Challenges: privacy concerns may make these datasets difficult to obtain; the short nature of the text may make it difficult to infer information.

  • Social media text Yepes et al. (2015); Adam et al. (2017); Lampos et al. (2017); Ofoghi et al. (2016). Nature of text: short text written by users of social media. Advantages: the frequency and volume make it an attractive source of data. Challenges: the text may be noisy and unreliable, which may result in false signals.

Table 1: Summary of Unstructured Data Sources.

While medical ontologies use different representations and may differ in the illnesses they cover, they play a key role as a knowledge base for several epidemic intelligence approaches. It must be noted that medical ontologies capture medical names as well as colloquial names of symptoms. This becomes crucial because health mentions in informal text, such as social media, may not contain scientific/technical terms.

4.2 Datasets

In the previous section, we described ontologies that provide structured background knowledge for epidemic intelligence. In this section, we present labeled datasets used for epidemic intelligence. We first describe the sources from which the data is obtained, and then the annotation strategies that have been employed.

4.3 Data Sources

By data sources, we refer to different classes of textual content that may be used to create data for epidemic intelligence. Each of these data sources offers interesting opportunities and poses specific challenges. Velardi et al. (2014) categorise these sources as follows:

  1. Demand-based data sources: These are sources that reflect a demand for information. A popular type of demand-based data source is search engines. In this case, the assumption is that a large volume of search queries is likely to indicate a prevalent health risk in the community. However, demand-based data sources may not provide good estimates of the health risks of interest. For example, in the case of search queries, not all searches may be linked to personal symptoms. In the wake of the outbreak of a disease, media coverage may create fears in the minds of people, resulting in a higher demand for information regarding the disease Alicino et al. (2015). In addition, demand-based data sources may not be readily available.

  2. Supply-based data sources: Supply-based sources are those where the data originates on large-scale platforms designed to share information. Examples of such platforms are discussion forums and social media. While such platforms may provide large-scale information, the text tends to be longer than search queries. Extraction of relevant textual items from a large pool is a key challenge with supply-based data sources. This has motivated the majority of the research in health mention classification, where labeled datasets from supply-based sources are used to train systems for this classification.

Datasets originate from these two categories of sources, as shown in Table 1. They are:

  1. Search Queries: Search queries provide a large-scale view of the interests that people express, along with reasonable anonymity Hulth et al. (2009). Therefore, search queries are an attractive data source for text-based epidemic intelligence. A seminal work by Ginsberg et al. (2009) describes a system which uses volumes of queries on Google to detect disease outbreaks. Popularly known as Google Flu Trends, this system uses counts of search queries to predict influenza-like illness (ILI). However, in recent times, Google Flu Trends has received criticism for over-estimating flu counts due to modifications in Google’s search algorithms (http://time.com/23782/google-flu-trends-big-data-problems/).

  2. News Articles: News feed monitors based on RSS feeds (https://en.wikipedia.org/wiki/RSS) enable access to news websites in various languages, which allows them to be monitored for relevant news articles. Since news articles are typically written by professional writers, they adopt a formal style of writing. This makes it possible to use NLP tools such as semantic taggers and parsers to extract information about health incidents. Early work in epidemic intelligence relies on news articles. Doan et al. (2008) present Global Health Monitor, a system that uses news feeds to identify health issues. The system uses a pipeline of NLP tools to identify health risks that communities may be facing, and periodically scans 1500 news feeds for relevant content. Yangarber et al. (2008) also rely on news articles to detect health outbreaks. In each of these cases, a set of pre-determined keywords corresponding to a set of illnesses is used to monitor the news feeds. In addition to using news articles as a source of data, peculiarities of news articles can also be leveraged. Lejeune et al. (2010) show how news articles from multilingual sources can be monitored for health mentions. Their approach relies on a popular journalistic style in which the head of a news article contains details of a health incident in terms of location, time, etc. Using a set of rules based on entities in the head and the body of a news article, they fill a template that corresponds to information about a health incident. Their experiments are conducted on news articles in English, Chinese and French. Freifeld et al. (2008) describe HealthMap, a system that uses news reports to monitor diseases.

  3. Medical reports: Reports from medical sources have also been used as textual datasets. Chief complaints are primary reports created by emergency departments of hospitals Conway et al. (2013). They are often short strings that describe the medical condition of a patient when they first report to a hospital. Along similar lines, Crubézy et al. (2005) use 2256 records in a medical record repository. Olszewski (2003) use a dataset of 28,990 triage diagnoses (ranging from 1 to 10 words in length). Aamer et al. (2016) use the SynSurv dataset, which contains 2006 reports from two Melbourne hospitals.

  4. Social Media Text: Online social media such as Twitter allow users to post text, and the availability of APIs to access Twitter has been useful. Yepes et al. (2015) state that social media provides targeted health information without the legal and technical obstacles that exist for other sources such as official records. However, Adam et al. (2017) point out that social media text may not be as credible or reliable as official records. Lampos et al. (2017) use a dataset of 35000 tweets posted over 449 weeks. Ofoghi et al. (2016) collect three datasets of tweets. The first two are related to an outbreak of Ebola: pre-event (i.e., tweets before the outbreak) and post-event (i.e., tweets after the outbreak). In addition, they use an Ebola background dataset collected long before any outbreak occurred. This third dataset therefore acts as a negative dataset which, although it contains mentions of Ebola, is not related to a health outbreak. They obtain tweet-level manual annotations, where each tweet is labeled with emotion classes such as happiness, criticism, disgust and sarcasm.

  5. Combination: Some past work combines textual datasets from different data sources, either to validate that an approach holds for different text forms, or so that information from multiple sources can supplement each other. Névéol et al. (2009) experiment with two datasets: 551 sentences from medical literature, and around 500 queries from the PubMed website. Woo et al. (2016) use data from multiple textual sources in Korean, such as search queries and social media data such as blogs, and correlate them with national influenza data.

4.4 Dataset Considerations

To create labeled datasets for health mention classification, datasets must be annotated with labels of interest. One must then consider: (a) what instructions are given to the annotators? and (b) how is the quality of annotations ensured and validated? It must be noted that the strategies described below are implemented in conjunction with each other, and not in isolation.

In terms of obtaining annotated datasets, appropriate guidelines for annotators are key. Aramaki et al. (2011) create a dataset of 5000 training tweets, manually labeled as positive/negative for the task of detecting health mentions. The annotator guidelines state that a tweet should be labeled positive if: (a) one or more people with flu exist around the tweet author; and (b) the tense is present or recent past. The authors also mandate that the tweet should be affirmative and not speculative (for example, ‘Seems like I might have flu’).

Despite annotation guidelines, some annotators may not perform well due to various factors. To identify reliable annotators, Lamb et al. (2013) create a gold dataset of tweets whose labels are known. Manual annotations are then obtained for around 12000 tweets from multiple annotators; these tweets include the gold dataset. Annotations by annotators with greater than 60% accuracy on the gold dataset are retained. This annotation strategy is illustrated in Figure X (Figure will be made available in the full paper).
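
The filtering step can be sketched as follows; the data structures and the cutoff handling are our assumptions, not details from Lamb et al. (2013):

    # Keep only annotators whose accuracy on embedded gold tweets
    # exceeds a cutoff (60% in Lamb et al. (2013)).
    def reliable_annotators(annotations, gold, cutoff=0.6):
        """annotations: {annotator: {tweet_id: label}}; gold: {tweet_id: label}."""
        kept = []
        for annotator, labels in annotations.items():
            scored = [tid for tid in labels if tid in gold]
            if not scored:
                continue
            accuracy = sum(labels[t] == gold[t] for t in scored) / len(scored)
            if accuracy > cutoff:
                kept.append(annotator)
        return kept

    gold = {"t1": "pos", "t2": "neg"}
    annotations = {"a1": {"t1": "pos", "t2": "neg"}, "a2": {"t1": "neg", "t2": "neg"}}
    print(reliable_annotators(annotations, gold))  # ['a1']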

Because health event detection deals with a large volume of data generated over a period of time, it may not be possible to obtain annotations for all instances. Therefore, a combination of manual and automatic annotation has also been used. Figure X illustrates a typical combination (Figure will be made available in the full paper). In general, a classifier is trained on a small set of manually annotated instances, and predictions are obtained on the complete dataset. These predictions are used as annotations for the dataset. Paul and Dredze (2011) download a set of 2 million tweets. A subset of 5,128 tweets is manually labeled. A classifier is trained on these manually labeled tweets and predictions on the remaining tweets are obtained. These predictions are then used as labels for the tweets. A similar technique is used to obtain annotations for a large dataset of around 11 million tweets by Paul and Dredze (2012).
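
The general pattern can be sketched as follows; scikit-learn is used purely for illustration, and Paul and Dredze (2011) do not prescribe this particular implementation:

    # Train on a small hand-labeled set, then use the classifier's
    # predictions as labels for the remaining (unlabeled) tweets.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def auto_label(labeled_texts, labels, unlabeled_texts):
        clf = make_pipeline(CountVectorizer(), MultinomialNB())
        clf.fit(labeled_texts, labels)
        return clf.predict(unlabeled_texts)  # used as annotations downstream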

Sadilek et al. (2012) use a more sophisticated approach to combining manual and automatic labels. Since geolocations are crucial to their approach, they first identify 6237 users who have turned on geotagging. The annotation is then carried out as follows. As the first step, a set of tweets is manually labeled. Then, two classifiers are trained. The first classifier is trained with a high misclassification cost for the majority class, which implies that it is expected to do well on the majority class. The second classifier, on the contrary, has a high misclassification cost for the minority class. Both classifiers are used to obtain predictions on the unlabeled portion of the dataset, and predictions made with high confidence by either classifier are added to the labeled dataset. Another method of combining automatic and manual annotation is used by Jiang et al. (2016) to create a dataset of personal health mentions. They employ an iterative algorithm that begins with a seed set of manually labeled instances. A classifier is trained on these labeled instances and predictions are obtained for an unlabeled set of instances. The automatically labeled instances are then randomly sampled for manual verification of the labels, after which these samples are added back to the seed set. The process repeats until the data imbalance is within an acceptable threshold.
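
A hedged sketch of the dual-classifier scheme follows; the class weights, confidence threshold and scikit-learn components are our assumptions rather than the published configuration of Sadilek et al. (2012):

    # Two classifiers with opposite misclassification costs label the
    # unlabeled pool; confident predictions from either are promoted.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def grow_labels(texts, labels, unlabeled, threshold=0.9):
        majority = make_pipeline(TfidfVectorizer(),
                                 LogisticRegression(class_weight={0: 10, 1: 1}))
        minority = make_pipeline(TfidfVectorizer(),
                                 LogisticRegression(class_weight={0: 1, 1: 10}))
        majority.fit(texts, labels)
        minority.fit(texts, labels)
        new_texts, new_labels, remaining = list(texts), list(labels), []
        for text in unlabeled:
            promoted = False
            for clf in (majority, minority):
                probs = clf.predict_proba([text])[0]
                if probs.max() >= threshold:
                    new_texts.append(text)
                    new_labels.append(int(clf.classes_[probs.argmax()]))
                    promoted = True
                    break
            if not promoted:
                remaining.append(text)
        return new_texts, new_labels, remaining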

  • Ontology-enhanced Collier et al. (2010); Huang et al. (2016); Lu et al. (2009); Crubézy et al. (2005); Conway et al. (2013). General idea: given a text, map its terms to appropriate concepts in an ontology to determine whether a syndrome can be detected. Motivation: medical ontologies capture useful information in a structured form. Challenges: (a) ontologies may not be complete; (b) ontologies may contain medical terms while the text may contain colloquial terms.

  • Similarity-based Freifeld et al. (2008); Aamer et al. (2016); Ofoghi et al. (2016). General idea: similarity between distributions and between concepts is used as an indicator of an illness. Motivation: a text that is similar to illness-related concepts or text is likely to be about the illness. Challenges: the choice of similarity metric determines the benefit.

  • Topic model-based Wang et al. (2014); Chen et al. (2016); Paul and Dredze (2012, 2011). General idea: extensions of the Latent Dirichlet Allocation (LDA) model, with additional latent variables, provide structured information about illnesses from social media datasets. Motivation: topic models can process unlabeled or partially labeled data and provide valuable information. Challenges: interpreting the generated topics and applying them to health mention classification may not be straightforward.

  • Pipeline-based Doan et al. (2008); Yangarber et al. (2008); Yepes et al. (2015); Yates et al. (2014). General idea: combine existing NLP components, typically named entity extraction and text classification, into effective deployments. Motivation: health mention classification can be broken down into a sequence of NLP components fitting into one another. Challenges: NLP components may be trained on documents from domains unrelated to healthcare; their efficacy for health mention classification then needs validation.

  • Statistical Olszewski (2003); Aramaki et al. (2011); Chapman et al. (2005); Lamb et al. (2013); Kanouchi et al. (2015); Jiang et al. (2016). General idea: features based on words, emotion scores, medical concepts and POS tags, combined with typical classifier learning algorithms. Motivation: supervised classifiers trained on labeled datasets have been found useful in many NLP applications. Challenges: selecting appropriate features and ensuring they generalise may be challenging.

  • Deep learning-based Karisani and Agichtein (2018); Dai et al. (2017); Lampos et al. (2017); Wang et al. (2017). General idea: features based on word embeddings, adaptations of general-purpose embeddings to the specific domain, and typical neural network models. Motivation: deep learning approaches do not rely on human-engineered features. Challenges: the lack of large labeled datasets may be an impediment.

Table 2: Summary of Approaches for Health Mention Classification.

5 Health Mention Classification

In the survey so far, we introduced the problem of epidemic intelligence, motivated it in terms of its challenges, and described the resources that can be used for it. This section describes text-based approaches that have been reported for health mention classification. We classify past approaches into categories that are commonly used in computational linguistics, summarised in Table 2, and detail these approaches in the forthcoming subsections.

5.1 Ontology-enhanced Approaches

In ontology-enhanced approaches, an ontology provides relevant medical knowledge that is used to make appropriate predictions. In general, an ontology can play the following roles in an epidemic intelligence system:

  • To extract entities of different types using patterns (for example, ‘X leads to Y’ can be used as a pattern to infer that X is a cause of an illness Y);

  • To identify diseases based on their common name, medical name or symptoms; and

  • To use inference rules from medical-domain ontologies to predict medical events of interest

An application that uses the BioCaster ontology is reported by Collier et al. (2010). Their system monitors news feeds for health-related news from multilingual sources. Based on target keywords, relevant news articles are translated to English. Following this, topic classification is used to further filter news relevant to the medical domain. Information extraction techniques, such as named entity recognition or semantic role labeling, are then used to construct relationships in the BioCaster ontology. Based on the relation tuples derived from the ontology, the system predicts public health events.

Crubézy et al. (2005) combine concepts in two ontologies by measuring relatedness between them. With this concept mapping, they classify a chief complaint into one of many syndrome categories using a rule-based inference technique; a problem solver carries out the inference over the ontologies. They report that 44% of the errors analysed are due to a concept mapping not being found. Lu et al. (2009) use an ontology along with cross-lingual projections for the classification of chief complaints in Chinese. A chief complaint is first pre-processed to account for stylistic properties of the Chinese script and language. The words are projected to English using translation, and a chief complaint classifier trained on English documents is used for classification. Then, significant terms related to the medical domain in the complaint are identified and matched with those in an ontology. Conway et al. (2013) perform a review of chief complaint classification systems in North America; these systems use a combination of ontology-enhanced and statistical approaches. Huang et al. (2016) use a medical ontology which contains associative relationships between medical concepts, i.e., information on how these concepts relate to each other. To use this ontology, if a word in a tweet is predicted as an entity of interest, it is mapped to a concept present in the ontology using similarity values. The concepts themselves become the features of a classifier that detects an illness.

5.2 Similarity-based Approaches

Similarity-based approaches use notions of similarity to model syndromes. In general, the idea is to obtain semantic distances between words in a text and terms related to a syndrome of interest. Several similarity-based approaches have been reported. Freifeld et al. (2008) use an N-gram-based approach that matches n-grams in a news report with those in a known dictionary of terms, based on semantic distances. Based on this matching, they classify every news report in terms of two parameters: primary location and disease name.

Aamer et al. (2016) present a semi-supervised algorithm that uses similarity to an illness-related concept as an illness indicator. They use Jensen-Shannon divergence to compute the similarity between terms for a dataset of chief complaints, and a log-likelihood-based technique to filter terms. Similarly, Ofoghi et al. (2016) use Naïve Bayes and lexicon-based approaches, and report the Kullback-Leibler (KL) divergence between emotion class distributions for pre-event and post-event datasets.
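
For example, the KL divergence between two emotion class distributions can be computed as below; the distributions themselves are made up for illustration:

    import math

    def kl_divergence(p, q):
        """KL(p || q) for two discrete distributions over the same classes."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Hypothetical emotion distributions (e.g. joy, criticism, disgust, fear)
    pre_event  = [0.50, 0.30, 0.15, 0.05]
    post_event = [0.20, 0.25, 0.25, 0.30]
    print(kl_divergence(post_event, pre_event))  # larger value = bigger shift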

5.3 Topic Model-based Approaches

Topic models allow the discovery of thematic concepts underlying large datasets. The Latent Dirichlet Allocation (LDA) model is a popular topic model based on the assumption that a document is composed of a mixture of concepts (referred to as ‘topics’) Blei et al. (2003). While LDA models have been used to obtain themes underlying health-related datasets, there have been two extensions of the LDA model designed to understand aspects of syndromes. The first is by Paul and Dredze (2012), called the Ailment Topic Aspect Model (ATAM). The model includes a latent label each for: (a) switching between general and health-related words; (b) identifying background words; and (c) an ailment. Using an observed label to select between ailment, treatment and general health-related words, the model discovers topics corresponding to ailments. A stochastic version of the Expectation-Maximisation algorithm is used to estimate the latent variables in the model, maximising the likelihood of the data. While this work focuses on flu, an extension of this work by Paul and Dredze (2011) reports findings on a wider range of symptoms. Wang et al. (2014) use the ATAM to extract topics from Chinese micro-blogs, and discover topics corresponding to health. The second extension of the LDA model is by Chen et al. (2016). This model, called the Hidden Flu-State Tweet Model (HFSTM), uses the Twitter timeline of a user as a temporal series. For each tweet, the model estimates the health state of the user as one of: healthy, exposed, or infected. It uses latent variables similar to the ATAM: (a) a word-level variable that indicates background words, (b) a word-level variable that indicates general domain words, (c) a word-level switch variable between symptom and general words, and (d) a tweet-level symptom variable. The symptom variable for a tweet, in addition to the local word-level dependencies, depends on the symptom variable of the previous tweet in the sequence. This way, the model incorporates the temporal properties of Twitter timelines. In all these approaches, the datasets are created using symptom keywords, so as to ensure that the topics are relevant. The following holds for both topic model-based extensions (a plain-LDA baseline sketch follows the list below):

  • Additional latent variables and dependencies are constructed to incorporate semantic relationships between types of word clusters. These relationships may be in the form of symptoms, ailments and medication for a syndrome, where symptoms, ailments and medication have topics corresponding to each.

  • If a list of words indicating infections and another indicating medicines are available, asymmetric priors may be set on these words for a certain set of topics. This appears to be helpful in discovering symptoms beyond the known set, when only a smaller set of symptoms is known from medical expertise.
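
As a baseline illustration, the sketch below fits a plain LDA model (not the ATAM or HFSTM extensions, which add the latent variables described above) to a tiny keyword-filtered corpus; the corpus and settings are invented:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    tweets = [
        "down with a fever and a sore throat",
        "flu season again, awful cough",
        "got a flu shot at the pharmacy today",
        "headache and chills all night",
    ]
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(tweets)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
    terms = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top_terms = [terms[i] for i in topic.argsort()[-3:]]
        print(f"topic {k}: {top_terms}")  # clusters of co-occurring words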

5.4 Pipeline-based Approaches

The next category of approaches is the pipeline-based approaches. We call them such because these approaches present solutions in the form of a pipeline of computational linguistic modules. Some pipeline-based approaches for epidemic intelligence are as follows. Doan et al. (2008) use a four-step approach in their news monitoring system: (a) topic classification using a Naïve Bayes classifier first identifies whether a news item is health-related; (b) named entity recognition extracts entities; (c) disease and location detection extracts these terms; and (d) visualisation represents the news on a map. Yangarber et al. (2008) describe a media monitor for healthcare called MedISys. The system serves as an information retrieval engine for the medical domain, as part of the Europe Media Monitor. It operates as follows: (i) it first retrieves news articles from feeds and categorises them as health-related or not; (ii) for the articles predicted as health-related, it runs an information extraction system called PULS, which uses a pattern-based technique to extract incidents; (iii) MedISys eliminates redundancies arising from the same incident being reported in multiple places. An incident is defined by four sets of attributes: (a) location, (b) disease name, (c) date/period, and (d) victim information, where victim information is characterised by features such as type (human, animal), number, and whether the victim survived. Yepes et al. (2015) describe a system that identifies tweets related to illnesses of interest, and then places each tweet on a map. The pipeline consists of the following modules: (i) a medical named entity recognition (NER) tagger (using a conditional random field (CRF), it labels words as one of diseases, symptoms and pharmacological substances); (ii) a geotagger (if a Global Positioning System (GPS) location is not present, it uses a gazetteer list and the user profile location to tag tweets with a location). Yates et al. (2014) describe a framework for epidemic intelligence using social media. Their framework also consists of a pipeline of three steps: concept extraction, concept aggregation and trend detection. In the concept extraction step, they use taggers for named entity recognition and information extraction. In concept aggregation, they identify relationships between the concepts extracted in the previous step. To do so, they use an approach based on word sense disambiguation in which different concept words map to the same concept. The third step is trend detection, where they plot concept counts on a timeline. In general, a typical pipeline-based approach consists of the following steps (a schematic sketch follows this list):

  • Use an information extraction tool to get terms of interest.

  • Train machine learning models for relevant predictions.

  • Map these to appropriate data structures.
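
Schematically, such a pipeline can be sketched as below; each stage is a stub standing in for a real component (an NER tagger, a trained classifier), so the keyword lists are placeholders:

    def extract_entities(text):
        """Stand-in for a medical named entity tagger."""
        return [w for w in text.lower().split() if w in {"flu", "fever"}]

    def is_health_related(text):
        """Stand-in for a trained topic classifier."""
        return bool(extract_entities(text))

    def run_pipeline(documents):
        events = []
        for doc in documents:
            if is_health_related(doc):            # stage 1: classification
                entities = extract_entities(doc)  # stage 2: extraction
                events.append({"text": doc, "entities": entities})  # stage 3: structure
        return events

    print(run_pipeline(["Fever outbreak in town", "Football results today"]))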

5.5 Statistical Approaches

Statistical approaches model health mention classification as a supervised classification problem. In order to describe these approaches, we consider two parameters: (a) the learning algorithm, and (b) the features used to represent an instance. (Here, we consider as statistical those approaches that require feature engineering; deep learning-based approaches are covered in the next subsection.)

Olszewski (2003) use a Naïve Bayes classifier to predict the class of illness, with unigrams and bigrams as features. Chapman et al. (2005) use a probabilistic Bayesian parser to generate semantic frames from chief complaints. These semantic frames are then used to predict the presence of a set of diseases. Aramaki et al. (2011) use features based on bags of words with feature windows of multiple sizes. They report results for a variety of classifiers such as AdaBoost, Naïve Bayes, and support vector machines (SVM).

Lamb et al. (2013) propose features for the task of infection report detection, such as: (a) a manually created set of word classes, (b) tense and person, (c) words indicating concern and awareness, (d) POS n-grams, (e) emoticons, and (f) tuples of subject, object and verb combinations. Névéol et al. (2009) perform disease detection in medical literature and search queries. They use a technique called the priority model, which assigns to every query term a probabilistic estimate of belonging to either of two classes: disease mention and disease non-mention. Kanouchi et al. (2015) use features such as unigrams, weblinks, word classes, length, n-grams and retweets for identification of the target in tweets that mention health concerns. Jiang et al. (2016) train a classifier that predicts whether a given tweet describes a personal health experience. Towards this, they use features such as emotion words, emotion scores, user mentions, and the number of first/second/third person pronoun mentions.
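
For illustration, a small subset of such features can be extracted as below; the particular feature set is invented and only loosely inspired by those cited above:

    def features(tweet):
        """Toy lexical/stylistic features for a statistical classifier."""
        tokens = tweet.lower().split()
        feats = {"unigram=" + t: 1 for t in tokens}
        feats["first_person"] = int(any(t in {"i", "my", "me"} for t in tokens))
        feats["num_tokens"] = len(tokens)
        feats["is_retweet"] = int(tweet.startswith("RT"))
        return feats

    print(features("RT my flu is getting worse"))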

5.6 Deep learning-based Approaches

The last class of approaches uses neural models based on deep learning. Deep learning allows the discovery of underlying task-relevant semantics without the use of human-engineered features. At the heart of deep learning are distributional representations known as ‘embeddings’. Learned from neural models, a word embedding captures the semantic properties of a word.

Lampos et al. (2017) use word embedding distances to select unigram features for flu detection. They use two types of word embeddings: (i) embeddings from Wikipedia articles, and (ii) embeddings learned from 215 million tweets from 2014-16. Dai et al. (2017) use word embeddings to create concept clusters that are then used to classify tweets as flu-related or not. Specifically, they compute disease vectors based on related terms. In order to make a prediction, they create semantic clusters of words using word embeddings such that, for every word, the algorithm randomly chooses between creating a new cluster or adding the word to an existing cluster. If any cluster in a tweet is within a threshold distance of the disease clusters, the tweet is predicted as flu-related. The clustering-based approach is compared against Naïve Bayes classifiers. Along similar lines, Karisani and Agichtein (2018) present a word embeddings space partitioning and distortion (WESPAD) model for detecting whether a given tweet contains a personal health mention. Towards this, they augment traditional features with word embeddings, and show that the use of word embeddings results in an improvement. However, since the authors believe that word embeddings may not be well-separated for the task, they suggest two innovations: (a) partitioning: depending on confidence values for each class, they partition the embedding space, and for each partition add two additional features for the positive and negative classes; (b) distortion: instead of averaging word vectors to obtain sentence vectors, they create sentence vectors by applying info-gain-based weighting to the word vectors. Finally, Wang et al. (2017) employ an architecture based on recurrent neural networks (RNNs) and compare it with statistical baselines based on classifiers such as SVM.
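
In the spirit of these embedding-distance methods (though not reproducing any specific published system), a tweet can be scored by the cosine similarity between its averaged word vectors and a ‘disease vector’ built from related terms; the tiny embeddings below are made up:

    import numpy as np

    EMB = {  # toy 3-d embeddings; real systems use pretrained vectors
        "flu": np.array([0.9, 0.1, 0.0]),
        "fever": np.array([0.8, 0.2, 0.1]),
        "cough": np.array([0.7, 0.3, 0.0]),
        "pizza": np.array([0.0, 0.1, 0.9]),
    }

    def avg_vec(words):
        vecs = [EMB[w] for w in words if w in EMB]
        return np.mean(vecs, axis=0) if vecs else np.zeros(3)

    disease_vec = avg_vec(["flu", "fever", "cough"])

    def flu_score(tweet):
        v = avg_vec(tweet.lower().split())
        denom = np.linalg.norm(v) * np.linalg.norm(disease_vec)
        return float(v @ disease_vec / denom) if denom else 0.0

    print(flu_score("home with fever and cough"))  # high similarity
    print(flu_score("pizza night"))                # low similarity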

5.7 Shared Tasks

In the context of text-based epidemic intelligence, the following shared tasks have been conducted. Adam et al. (2017) describe a hackathon called ZikaHack held in 2016. The objective was to perform a retrospective analysis of the Zika virus outbreak (Zika infection is associated with microcephaly) based on different social media sources. The authors describe the systems that participated in the competition. These systems use textual data from sources such as Twitter, Facebook, Instagram, Google Maps, Wikipedia and government reports, in addition to structured data based on medical counts. The winning system uses translation systems to collect information from multilingual sources, and change point detection algorithms to identify outbreaks. Sarker et al. (2016) describe three related tasks: adverse drug reaction detection, drug reaction type classification, and drug consumption normalisation. The competition reported in Weissenbacher et al. (2018) describes four shared tasks: drug mention detection, medication intake classification, adverse drug reaction detection and vaccination behaviour detection.

  • Temporal outbreaks Ginsberg et al. (2009); Sparks et al. (2017); Hayate et al. (2016); Velardi et al. (2014); Aamer et al. (2016); Huang et al. (2016). General idea: time series analysis is performed on text bearing timestamps. Motivation: health events such as epidemic outbreaks can be detected through an unexpected rise or fall in text with certain content. Challenges: a spike may not necessarily correspond to an event; stigma about certain illnesses may prevent people from writing about them.

  • Spatial outbreaks Sadilek et al. (2012); Chapman et al. (2005); Ofoghi et al. (2016); Gomide et al. (2011); Shao et al. (2017). General idea: location information may be used to filter relevant text from a region, or as features that account for location. Motivation: location information in text can help identify possible locations of outbreaks or focus on regions of interest. Challenges: (a) location may not always be available; (b) outbreaks may trigger interest about the illness around the world, without an actual outbreak in that part of the world.

Table 3: Summary of Approaches for Health Event Detection.

6 Health Event Detection

In the previous section, we described approaches that look at the detection of symptoms and syndromes. We referred to them as health mention classification. The second crucial component of epidemic intelligence is health event detection. This refers to approaches that have been used to predict events from a collection of textual units, in terms of time and space. Table 3 summarises the approaches for health event detection.

6.1 Temporal Outbreaks

Detecting a temporal outbreak involves detecting anomalies in the trend of a sequence of textual units. This means that the textual units need to bear timestamps. Also, the predictions for individual textual units need not be accurate as long as the overall distribution sufficiently points towards an outbreak. There are two broad approaches to temporal outbreak detection: prediction of infection counts (where a number is predicted) and outbreak detection (where the algorithm needs to predict known events).

The first set of approaches predicts infection counts from series of textual data. Ginsberg et al. (2009) use search counts of the 45 million top queries across a subset of US states. Weekly counts of top queries are normalised by the total number of queries, and these counts are stored for every week. They then use a model trained to predict influenza-like illness (ILI) counts given the search query proportions. Sparks et al. (2017) predict tweet counts using a Poisson regression model with features such as hour, day of week, and day number in a sequence, as seen in past data. They use process control techniques to detect events depending on whether actual tweet counts are within an acceptable range of the predicted/expected tweet counts. Woo et al. (2016) use support vector regression to predict influenza counts using numerical features derived from keyword mentions in textual data such as blogs and search queries. Hayate et al. (2016) use a frequency-based approach that factors in the time lag of different words. For example, they observe that the word ‘injection’ lags behind the actual occurrence of influenza much more than the word ‘fever’. Therefore, they construct a word-day frequency matrix and shift the vector of each word so as to maximise its cross-correlation with the reported flu counts (a minimal sketch of this alignment follows).
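
The lag-alignment idea can be sketched as follows; the series are synthetic, and the Pearson-correlation search is our simplification of the cross-correlation step:

    import numpy as np

    def best_lag(word_series, flu_counts, max_lag=7):
        """Return the shift (in days) that maximises Pearson correlation."""
        best, best_corr = 0, -1.0
        for lag in range(max_lag + 1):
            w = word_series[:-lag] if lag else word_series
            f = flu_counts[lag:]
            corr = np.corrcoef(w, f)[0, 1]
            if corr > best_corr:
                best, best_corr = lag, corr
        return best

    flu = np.array([1, 2, 4, 8, 12, 9, 5, 3, 2, 1, 1, 1, 1, 1, 1, 1])
    fever = np.roll(flu, -2)  # 'fever' mentions lead the flu counts by two days
    print(best_lag(fever, flu))  # 2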

The second set of approaches aims to predict known outbreaks. Aamer et al. (2016) perform experiments that consider three formulations: (i) intersecting seven-day windows, (ii) disjoint seven-day windows, and (iii) disjoint one-day windows. They experiment with three degrees of sensitivity, depending on how large the divergence is allowed to be; the degrees of sensitivity reflect the intensity of a likely outbreak. Similarly, Huang et al. (2016) show how the prevalence of flu and Lyme disease can be detected one week ahead of reported CDC data. They download English tweets of interest based on keywords from a pre-determined period. Then, they perform named entity recognition to label the text with named entities. These named entities are then labeled as one of disorders, symptoms, or pharmacological substances. The entities in a tweet are mapped to corresponding clusters using an ontology. This allows medical names and multiple common names to be mapped to the same feature. Velardi et al. (2014) use a two-step approach. As the first step, they employ a term extraction algorithm. This algorithm starts with a seed set of technical words and symptom words. The seed set is then iteratively expanded using pattern-matching. This is first done on Google snippets, Wikipedia and a medical corpus, where related technical words and symptom words are learned. The matching step is then repeated on micro-blogs, where only symptom words are learned. Once the terms are extracted, they count social media mentions of these words as indicators of a syndromic outbreak.

6.2 Spatial Outbreaks

Specifications regarding space have also been considered for health event detection. This may be done either to focus on a certain geographical region, or to use such information as additional context. To cover multiple regions, Zou et al. (2018) use multi-task learning where different geographical regions constitute tasks in a disease prediction problem; multi-task learning allows correlations between geographical regions to be exploited.

The region from which datasets are sourced often impacts the performance of systems. For example, Chapman et al. (2005) train their system on chief complaints from Pennsylvania and test it on those from Utah. Ofoghi et al. (2016) use a negative dataset from Australia, while their positive dataset is from London. Sadilek et al. (2012) predict individual infections based on indicators within text and on geo-tagged co-locations with other people, including infected people. To do so, they use a conditional random field (CRF) in which, in addition to textual features, each observed instance is characterised by: (a) the number of co-locations in the past seven days with anyone, (b) the number of co-locations in the past seven days with people who have reported illnesses, and (c) the number of co-locations in the past seven days with friends who have reported illnesses. Gomide et al. (2011) estimate the counts of dengue incidences in different geographical areas from a dataset of tweets from Brazil. Based on these counts, they use a spatial clustering algorithm that groups cities by their physical proximity and number of dengue incidences; this approach helps in understanding the spread of the disease. On the other hand, Shao et al. (2017) use co-mentions in tweets to create a social network of users, and apply scan statistics to identify health events in the network. In this case, the notion of space is a virtual network on social media.
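As an illustration of clustering by proximity and incidence, the sketch below weights cities by their tweet-derived case counts when clustering with DBSCAN. The algorithm, the coordinates, the counts and the parameters are illustrative, not those of Gomide et al. (2011).

    import numpy as np
    from sklearn.cluster import DBSCAN

    # (latitude, longitude) of cities and their tweet-derived dengue counts.
    cities = np.array([
        [-23.55, -46.63],  # São Paulo
        [-22.91, -43.17],  # Rio de Janeiro
        [-22.90, -47.06],  # Campinas (near São Paulo)
        [-3.72, -38.52],   # Fortaleza (far north)
    ])
    counts = np.array([120, 90, 80, 5])

    # Weight each city by its incidence count so that low-count cities
    # cannot form clusters on proximity alone. eps is in degrees here,
    # a crude stand-in for true geographic distance.
    labels = DBSCAN(eps=5.0, min_samples=100).fit(
        cities, sample_weight=counts).labels_
    print(labels)  # cities sharing a label form one high-incidence cluster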

7 Evaluation

This section presents methodologies used to evaluate the performance of epidemic intelligence systems. Evaluation using manually labeled datasets is common, as it is for most supervised classification tasks. In addition, the following trends emerge from past work in terms of evaluation methodologies:

  1. Correlation with publicly available health data: Health data may be publicly available in terms of counts of infections or known health outbreaks. Either of the two can be used to evaluate epidemic intelligence. In general, the framework to evaluate epidemic intelligence comprises the following steps Lampos et al. (2017); Lamb et al. (2013):

    1. To evaluate health mention classification, past work reports classification performance on a labeled dataset, annotated using manual, automatic or combined annotation strategies.

    2. This may be followed by correlation with publicly available infection counts, often by training appropriate regression models that predict infection counts (see the sketch following this list). Alternatively, the outbreaks returned by health event detection can be compared against health events known to have occurred, based on other sources such as news.

    Chen et al. (2016) use Pan American Health Organization (PAHO) case counts to validate their predictions. Aamer et al. (2016) report the number of incidents and the top terms discovered by three sliding-window configurations. Velardi et al. (2014) identify a set of health-related terms and download tweets containing these terms; the tweet counts are then correlated with ILI counts. Huang et al. (2016) experiment with Lyme disease and flu, and show the correlation for two series of CDC counts: counts of the current week and counts of the next week, demonstrating that outbreaks can be predicted a week in advance from tweet streams. Aramaki et al. (2011) first evaluate classification performance using 10-fold cross-validation, and then correlate Google Flu Trends with the disease outbreaks predicted for the test datasets; they observe that excessive news reporting may lead to false alerts from social media. Similarly, Pervaiz et al. (2012) present a comparison of three classes of algorithms for epidemic detection from Google Flu Trends: normal distribution algorithms, Poisson distribution algorithms, and negative binomial distribution algorithms.

  2. Validation on multiple datasets: Ofoghi et al. (2016) experiment with two datasets: a positive dataset that contains a syndromic outbreak and a negative dataset from a year before the outbreak. They then validate that the outbreak in the positive dataset is detected and that no outbreak is flagged in the negative dataset. Similarly, to validate their approach of distorting and partitioning word embeddings, Karisani and Agichtein (2018) report their results on multiple illnesses. Hayate et al. (2016) create three datasets for three seasons/spells of influenza and show correlation with the reported flu counts, training the model on one season while testing on another. Yates et al. (2014) report their observations for two medical conditions: allergies and flu.

  3. Evaluation of components: Epidemic intelligence systems often consist of components, as is typical of pipeline-based approaches. For example, Yepes et al. (2015), who take a pipeline-based approach, report the performance of every stage of the pipeline, namely: (i) Performance of medical NER and the geotagger on a manually labeled dataset, (ii) Top medical terms extracted from tweets, (iii) Seven-day rolling averages for three cities, namely New York, London and Chicago. Components may also be evaluated against a downstream task. For example, Kanouchi et al. (2015) present an approach to identify the target of a personal health mention. To validate that this is useful, they interface it with a downstream task of health event detection (referred to as ‘episode prediction’ in their paper), and show that performing target identification before personal health mention detection improves the performance of health event detection.

  4. Qualitative evaluation: Ginsberg et al. (2009) report top topics in health-related search queries across different localities. Similarly, Paul and Dredze (2011) describe topics generated by their topic model in terms of word clusters.
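As a concrete illustration of the correlation step in the first item above, the following sketch compares predicted weekly counts against official counts at several leads, where a lead of one week corresponds to predicting outbreaks a week in advance. The toy data and the choice of Pearson correlation via scipy are our assumptions.

    import numpy as np
    from scipy.stats import pearsonr

    def correlate_at_leads(predicted, official, max_lead=2):
        """Pearson correlation between predicted and official weekly counts;
        a `lead` of k weeks means predictions precede the official data."""
        results = {}
        for lead in range(max_lead + 1):
            if lead == 0:
                r, _ = pearsonr(predicted, official)
            else:
                r, _ = pearsonr(predicted[:-lead], official[lead:])
            results[lead] = r
        return results

    # Toy series in which predictions run one week ahead of official counts.
    official = np.array([3, 4, 8, 15, 22, 18, 10, 6, 4, 3], dtype=float)
    predicted = np.concatenate([official[1:], [3.0]])
    print(correlate_at_leads(predicted, official))  # highest r at lead=1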

8 Open Research Avenues

Based on this survey, we now identify the following research directions for epidemic intelligence using computational linguistic techniques. These avenues represent enhancements in three respects: (a) quality of the surveillance output (by mitigating false alerts), (b) timeliness of the surveillance output (to achieve near-real-time indicators), and (c) coverage of the system (by factoring in relationships between symptoms/syndromes).

  1. Mitigation of False Alerts: Ginsberg et al. (2009) observe possible false alerts due to reliance on web data. This implies that improving precision without hurting recall is a useful research avenue for epidemic intelligence. Towards this, we note two possibilities. The first is detection of spam: given the possibly malicious intent of social media users, false information may be published in social media posts. The second is figurative language: since many symptoms have figurative usages (‘my naughty kids almost gave me a heart attack’), separating figurative language from literal language may prove beneficial (a sketch follows this list). Additional checks such as these could be incorporated into existing epidemic intelligence approaches to avoid false alerts.

  2. Opportunities for Near-Real-time Indicators: Epidemic intelligence can be useful for managing ambulance networks in times of an outbreak Sparks et al. (2010a). It would be useful to investigate whether social media text provides real-time signals that identify outbreaks at sub-daily intervals. Velasco et al. (2014) list challenges in integrating event-based techniques into social media surveillance.

  3. Overlapping Syndromes and Symptoms: Past work treats different illnesses/syndromes merely as different datasets or systems against which experiments are run. However, many of these syndromes may be related to each other. An interesting future direction is to consider how syndromes resemble one another in terms of their symptoms. It follows that epidemic intelligence for physical illnesses may be combined with mental health surveillance or animal health surveillance: the former is crucial since mental health conditions may involve physical symptoms, while the latter is important due to zoonotic diseases that may be transmitted from animals to humans. Initial work in the direction of animal monitoring is by Welvaert et al. (2017), who use social media as a monitoring tool for exotic species.
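To illustrate the figurative-language check proposed in the first item above, a minimal sketch of a literal-versus-figurative health mention classifier follows. The tiny training set, its labels, and the choice of a tf-idf plus logistic regression pipeline are invented for illustration; a real system would need a much larger labelled corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Invented toy examples: 1 = literal health mention, 0 = figurative.
    texts = [
        "been coughing all night with a high fever",
        "diagnosed with the flu, staying home this week",
        "my naughty kids almost gave me a heart attack",
        "that plot twist gave me a fever of excitement",
        "terrible headache and sore throat since monday",
        "this exam is killing me, total brain fever",
    ]
    labels = [1, 1, 0, 0, 1, 0]

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),  # word uni- and bigrams
        LogisticRegression(),
    ).fit(texts, labels)

    # Posts predicted figurative (0) could be filtered out before
    # feeding counts into health event detection.
    print(clf.predict(["grandma's cooking gave me a heart attack"]))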

9 Conclusions

Text-based epidemic intelligence has received attention due to the informativeness and timeliness of textual data on the web. Techniques involving different levels of sophistication of computational linguistic approaches have been reported, and in this paper we survey these past approaches. We first introduce textual datasets, highlighting the strengths and challenges of each. We note that ontologies that capture medical concepts have been valuable for text-based epidemic intelligence. Since social media is an accessible medium today, social media text such as tweets also provides an opportunity for text-based epidemic intelligence.

We then view past work in terms of health mention classification (which deals with detecting syndromes in individual textual units) and health event detection (which deals with detecting outbreaks using a collection of textual units). Health mention classification techniques may use ontologies, pipelines of NLP components, statistical classifiers with task-specific features, or neural network architectures, and advances in natural language processing and machine learning have driven newer approaches. In terms of health event detection, we investigate how large volumes of text have been used to detect health events relevant to a community, and how geographical information has been used to refine these predictions. Based on our survey, we believe that avenues for future work in this area lie in improving the quality (by mitigating false alerts), the timeliness (by making health event detection as close to real-time as possible) and the coverage of epidemic intelligence (by combining related symptoms).

We hope that our computational linguistic perspective on epidemic intelligence serves as a useful resource for computational linguists and health practitioners alike.

References

  • Aamer et al. (2016) Hafsah Aamer, Bahadorreza Ofoghi, and Karin Verspoor. 2016. Syndromic surveillance through measuring lexical shift in emergency department chief complaint texts. In Proceedings of the Australasian Language Technology Association Workshop 2016, pages 45–53.
  • Adam et al. (2017) Dillon C Adam, Jitendra Jonnagaddala, Daniel Han-Chen, Sean Batongbacal, Luan Almeida, Jing Z Zhu, Jenny J Yang, Jumail M Mundekkat, Steven Badman, Abrar Chughtai, et al. 2017. Zikahack 2016: A digital disease detection competition. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017), pages 39–46.
  • Al-garadi et al. (2016) Mohammed Ali Al-garadi, Muhammad Sadiq Khan, Kasturi Dewi Varathan, Ghulam Mujtaba, and Abdelkodose M Al-Kabsi. 2016. Using online social networks to track a pandemic: A systematic review. Journal of biomedical informatics, 62:1–11.
  • Alicino et al. (2015) Cristiano Alicino, Nicola Luigi Bragazzi, Valeria Faccio, Daniela Amicizia, Donatella Panatto, Roberto Gasparini, Giancarlo Icardi, and Andrea Orsi. 2015. Assessing ebola-related web search behaviour: insights and implications from an analytical study of google trends-based query volumes. Infectious diseases of poverty, 4(1):54.
  • Aramaki et al. (2011) Eiji Aramaki, Sachiko Maskawa, and Mizuki Morita. 2011. Twitter catches the flu: detecting influenza epidemics using twitter. In Proceedings of the conference on empirical methods in natural language processing, pages 1568–1576. Association for Computational Linguistics.
  • Benton et al. (2017) Adrian Benton, Glen Coppersmith, and Mark Dredze. 2017. Ethical research protocols for social media health research. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 94–102.
  • Bernardo et al. (2013) Theresa Marie Bernardo, Andrijana Rajic, Ian Young, Katie Robiadek, Mai T Pham, and Julie A Funk. 2013. Scoping review on search queries and social media for disease surveillance: a chronology of innovation. Journal of medical Internet research, 15(7).
  • Bertaud-Gounot et al. (2012) Valérie Bertaud-Gounot, Régis Duvauferrier, and Anita Burgun. 2012. Ontology and medical diagnosis. Informatics for Health and Social Care, 37(2):51–61.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022.
  • Bodenreider (2004) Olivier Bodenreider. 2004. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270.
  • Boyle et al. (2011) Justin R Boyle, Ross S Sparks, Gerben B Keijzers, Julia L Crilly, James F Lind, and Louise M Ryan. 2011. Prediction and surveillance of influenza epidemics. Medical Journal of Australia, 194(4):S28.
  • Brownstein et al. (2009) John S Brownstein, Clark C Freifeld, and Lawrence C Madoff. 2009. Digital disease detection—harnessing the web for public health surveillance. New England Journal of Medicine, 360(21):2153–2157.
  • Chapman et al. (2005) Wendy W Chapman, Lee M Christensen, Michael M Wagner, Peter J Haug, Oleg Ivanov, John N Dowling, and Robert T Olszewski. 2005. Classifying free-text triage chief complaints into syndromic categories with natural language processing. Artificial Intelligence in Medicine, 33(1):31–40.
  • Charles-Smith et al. (2015) Lauren E Charles-Smith, Tera L Reynolds, Mark A Cameron, Mike Conway, Eric HY Lau, Jennifer M Olsen, Julie A Pavlin, Mika Shigematsu, Laura C Streichert, Katie J Suda, et al. 2015. Using social media for actionable disease surveillance and outbreak management: a systematic literature review. PloS one, 10(10):e0139701.
  • Chen et al. (2016) Liangzhe Chen, KSM Tozammel Hossain, Patrick Butler, Naren Ramakrishnan, and B Aditya Prakash. 2016. Syndromic surveillance of flu on twitter using weakly supervised temporal topic models. Data Mining and Knowledge Discovery, 30(3):681–710.
  • Collier et al. (2007) N Collier, A Kawazoe, L Jin, M Shigematsu, D Dien, et al. 2007. The biocaster ontology: A multilingual ontology for infectious disease outbreak surveillance: Rationale, design and challenges. Journal of Language Resources and Evaluation, pages 405–413.
  • Collier et al. (2010) Nigel Collier, Reiko Matsuda Goodwin, John McCrae, Son Doan, Ai Kawazoe, Mike Conway, Asanee Kawtrakul, Koichi Takeuchi, and Dinh Dien. 2010. An ontology-driven system for detecting global health events. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 215–222. Association for Computational Linguistics.
  • Conway et al. (2011) Mike Conway, John Dowling, and Wendy Chapman. 2011. Developing an application ontology for mining free text clinical reports: the extended syndromic surveillance ontology. In 3rd International Workshop on Health Document Text Mining and Information Analysis (LOUHI 2011), pages 75–82. Citeseer.
  • Conway et al. (2013) Mike Conway, John N Dowling, and Wendy W Chapman. 2013. Using chief complaints for syndromic surveillance: a review of chief complaint based classifiers in north america. Journal of Biomedical Informatics, 46(4):734–743.
  • Crubézy et al. (2005) Monica Crubézy, Martin O’Connor, Zachary Pincus, Mark A Musen, and David L Buckeridge. 2005. Ontology-centered syndromic surveillance for bioterrorism. IEEE Intelligent Systems, 20(5):26–35.
  • Dai et al. (2017) Xiangfeng Dai, Marwan Bikdash, and Bradley Meyer. 2017. From social media to public health surveillance: Word embedding based clustering method for twitter classification. In SoutheastCon, 2017, pages 1–7. IEEE.
  • Doan et al. (2008) Son Doan, Ai Kawazoe, Nigel Collier, et al. 2008. Global health monitor-a web-based system for detecting and mapping infectious diseases. In Proceedings of the Third International Joint Conference on Natural Language Processing.
  • Freifeld et al. (2008) Clark C Freifeld, Kenneth D Mandl, Ben Y Reis, and John S Brownstein. 2008. Healthmap: global infectious disease monitoring through automated classification and visualization of internet media reports. Journal of the American Medical Informatics Association, 15(2):150–157.
  • Fung et al. (2015) Isaac Chun-Hai Fung, Zion Tsz Ho Tse, and King-Wa Fu. 2015. The use of social media in public health surveillance. Western Pacific surveillance and response journal: WPSAR, 6(2):3.
  • Ginsberg et al. (2009) Jeremy Ginsberg, Matthew H Mohebbi, Rajan S Patel, Lynnette Brammer, Mark S Smolinski, and Larry Brilliant. 2009. Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012.
  • Gomide et al. (2011) Janaína Gomide, Adriano Veloso, Wagner Meira Jr, Virgílio Almeida, Fabrício Benevenuto, Fernanda Ferraz, and Mauro Teixeira. 2011. Dengue surveillance based on a computational model of spatio-temporal locality of twitter. In Proceedings of the 3rd International Web Science Conference, page 3. ACM.
  • Gruber (1993) Thomas R Gruber. 1993. A translation approach to portable ontology specifications. Knowledge acquisition, 5(2):199–220.
  • Hayate et al. (2016) ISO Hayate, Shoko Wakamiya, and Eiji Aramaki. 2016. Forecasting word model: Twitter-based influenza surveillance and prediction. In Proceedings of the 26th International Conference on Computational Linguistics, pages 76–86.
  • Henning (2004) Kelly J Henning. 2004. What is syndromic surveillance? Morbidity and Mortality Weekly Report, pages 7–11.
  • Hopkins et al. (2017) Richard S Hopkins, Catherine C Tong, Howard S Burkom, Judy E Akkina, John Berezowski, Mika Shigematsu, Patrick D Finley, Ian Painter, Roland Gamache, Victor J Del Rio Vilas, et al. 2017. A practitioner-driven research agenda for syndromic surveillance. Public Health Reports, 132(1_suppl):116S–126S.
  • Huang et al. (2016) Pin Huang, Andrew MacKinlay, and Antonio Jimeno Yepes. 2016. Syndromic surveillance using generic medical entities on twitter. In Proceedings of the Australasian Language Technology Association Workshop 2016, pages 35–44.
  • Hulth et al. (2009) Anette Hulth, Gustaf Rydevik, and Annika Linde. 2009. Web queries as a source for syndromic surveillance. PloS one, 4(2):e4378.
  • Jiang et al. (2016) Keyuan Jiang, Ricardo Calix, and Matrika Gupta. 2016. Construction of a personal experience tweet corpus for health surveillance. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, pages 128–135.
  • Joshi et al. (2017) Aditya Joshi, Pushpak Bhattacharyya, and Sagar Ahire. 2017. Sentiment resources: Lexicons and datasets. In A Practical Guide to Sentiment Analysis, pages 85–106. Springer.
  • Kanouchi et al. (2015) Shin Kanouchi, Mamoru Komachi, Naoaki Okazaki, Eiji Aramaki, and Hiroshi Ishikawa. 2015. Who caught a cold?-identifying the subject of a symptom. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1660–1670.
  • Karimi et al. (2015) Sarvnaz Karimi, Chen Wang, Alejandro Metke-Jimenez, Raj Gaire, and Cecile Paris. 2015. Text and data mining techniques in adverse drug reaction detection. ACM Computing Surveys, 47(4):56.
  • Karisani and Agichtein (2018) Payam Karisani and Eugene Agichtein. 2018. Did you really just have a heart attack?: Towards robust detection of personal health mentions in social media. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 137–146. International World Wide Web Conferences Steering Committee.
  • Lamb et al. (2013) Alex Lamb, Michael J Paul, and Mark Dredze. 2013. Separating fact from fear: Tracking flu infections on twitter. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 789–795.
  • Lampos et al. (2017) Vasileios Lampos, Bin Zou, and Ingemar Johansson Cox. 2017. Enhancing feature selection using word embeddings: The case of flu surveillance. In Proceedings of the 26th International Conference on World Wide Web, pages 695–704. International World Wide Web Conferences Steering Committee.
  • Larsen et al. (2015) Mark E Larsen, Tjeerd W Boonstra, Philip J Batterham, Bridianne O’Dea, Cecile Paris, and Helen Christensen. 2015. We feel: mapping emotion on twitter. IEEE journal of Biomedical and Health Informatics, 19(4):1246–1252.
  • Lejeune et al. (2010) Gaël Lejeune, Antoine Doucet, Roman Yangarber, and Nadine Lucas. 2010. Filtering news for epidemic surveillance: towards processing more languages with fewer resources. In 4th International workshop on cross-lingual information access.
  • Lindberg et al. (1993) Donald AB Lindberg, Betsy L Humphreys, and Alexa T McCray. 1993. The unified medical language system. Methods of information in medicine, 32(04):281–291.
  • Lu et al. (2009) Hsin-Min Lu, Hsinchun Chen, Daniel Zeng, Chwan-Chuen King, Fuh-Yuan Shih, Tsung-Shu Wu, and Jin-Yi Hsiao. 2009. Multilingual chief complaint classification for syndromic surveillance: an experiment with chinese chief complaints. International Journal of Medical Informatics, 78(5):308–320.
  • Manning et al. (1999) Christopher D Manning and Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT press.
  • Névéol et al. (2009) Aurélie Névéol, Won Kim, W John Wilbur, and Zhiyong Lu. 2009. Exploring two biomedical text genres for disease recognition. In Proceedings of the Workshop on current trends in Biomedical Natural Language Processing, pages 144–152. Association for Computational Linguistics.
  • Ofoghi et al. (2016) Bahadorreza Ofoghi, Meghan Mann, and Karin Verspoor. 2016. Towards early discovery of salient health threats: A social media emotion classification technique. In Biocomputing 2016: Proceedings of the Pacific Symposium, pages 504–515. World Scientific.
  • Okhmatovskaia et al. (2009) A Okhmatovskaia, W Chapman, N Collier, J Espino, and DL Buckeridge. 2009. Sso: the syndromic surveillance ontology. In Proceedings of the International Society for Disease Surveillance.
  • Olszewski (2003) Robert T Olszewski. 2003. Bayesian classification of triage diagnoses for the early detection of epidemics. In International Florida Artificial Intelligence Research Society Conference, pages 412–416.
  • Paul and Dredze (2011) Michael J Paul and Mark Dredze. 2011. You are what you tweet: Analyzing twitter for public health. In International AAAI Conference on Web and Social Media, volume 20, pages 265–272.
  • Paul and Dredze (2012) Michael J Paul and Mark Dredze. 2012. A model for mining public health topics from twitter. Health, 11:16–6.
  • Pervaiz et al. (2012) Fahad Pervaiz, Mansoor Pervaiz, Nabeel Abdur Rehman, and Umar Saif. 2012. Flubreaks: early epidemic detection from google flu trends. Journal of medical Internet research, 14(5).
  • Rosewell et al. (2013) Alexander Rosewell, Berry Ropa, Heather Randall, Rosheila Dagina, Samuel Hurim, Sibauk Bieb, Siddhartha Datta, Sundar Ramamurthy, Glen Mola, Anthony B Zwi, et al. 2013. Mobile phone–based syndromic surveillance system, papua new guinea. Emerging Infectious Diseases, 19(11):1811.
  • Sadilek et al. (2012) Adam Sadilek, Henry A Kautz, and Vincent Silenzio. 2012. Predicting disease transmission from geo-tagged micro-blog data. In Conference on Artificial Intelligence (AAAI), pages 136–142.
  • Sarker et al. (2015) Abeed Sarker, Rachel Ginn, Azadeh Nikfarjam, Karen O’Connor, Karen Smith, Swetha Jayaraman, Tejaswi Upadhaya, and Graciela Gonzalez. 2015. Utilizing social media data for pharmacovigilance: a review. Journal of biomedical informatics, 54:202–212.
  • Sarker et al. (2016) Abeed Sarker, Azadeh Nikfarjam, and Graciela Gonzalez. 2016. Social media mining shared task workshop. In Biocomputing 2016: Proceedings of the Pacific Symposium, pages 581–592. World Scientific.
  • Shao et al. (2017) Minglai Shao, Jianxin Li, Feng Chen, Hongyi Huang, Shuai Zhang, and Xunxun Chen. 2017. An efficient approach to event detection and forecasting in dynamic multivariate social media networks. In Proceedings of the 26th International Conference on World Wide Web, pages 1631–1639. International World Wide Web Conferences Steering Committee.
  • Sparks et al. (2010a) Ross Sparks, Chris Carter, Petra Graham, David Muscatello, Tim Churches, Jill Kaldor, Robyn Turner, Wei Zheng, and Louise Ryan. 2010a. Understanding sources of variation in syndromic surveillance for early warning of natural or intentional disease outbreaks. IIE Transactions, 42(9):613–631.
  • Sparks et al. (2010b) Ross Sparks, Tim Keighley, and David Muscatello. 2010b. Exponentially weighted moving average plans for detecting unusual negative binomial counts. IIE Transactions, 42(10):721–733.
  • Sparks et al. (2017) Ross S Sparks, Bella Robinson, Robert Power, Mark Cameron, and Sam Woolford. 2017. An investigation into social media syndromic monitoring. Communications in Statistics-Simulation and Computation, 46(8):5901–5923.
  • Velardi et al. (2014) Paola Velardi, Giovanni Stilo, Alberto E Tozzi, and Francesco Gesualdo. 2014. Twitter mining for fine-grained syndromic surveillance. Artificial Intelligence in Medicine, 61(3):153–163.
  • Velasco et al. (2014) Edward Velasco, Tumacha Agheneza, Kerstin Denecke, Goeran Kirchner, and Tim Eckmanns. 2014. Social media and internet-based data in global systems for public health surveillance: A systematic review. The Milbank Quarterly, 92(1):7–33.
  • Wagner et al. (2011) Michael M Wagner, Andrew W Moore, and Ron M Aryel. 2011. Handbook of Biosurveillance. Elsevier.
  • Wang et al. (2017) Chen-Kai Wang, Onkar Singh, Zhao-Li Tang, and Hong-Jie Dai. 2017. Using a recurrent neural network model for classification of tweets conveyed influenza-related information. In Proceedings of the International Workshop on Digital Disease Detection using Social Media 2017 (DDDSM-2017), pages 33–38.
  • Wang et al. (2014) Shiliang Wang, Michael J Paul, and Mark Dredze. 2014. Exploring health topics in chinese social media: An analysis of sina weibo. In AAAI Workshop on the World Wide Web and Public Health Intelligence, volume 31, page 59.
  • Weissenbacher et al. (2018) Davy Weissenbacher, Abeed Sarker, Michael J Paul, and Graciela Gonzalez-Hernandez. 2018. Overview of the third social media mining for health (smm4h) shared tasks at emnlp 2018. In Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop and Shared Task, pages 13–16.
  • Welvaert et al. (2017) Marijke Welvaert, Omar Al-Ghattas, Mark Cameron, and Peter Caley. 2017. Limits of use of social media for monitoring biosecurity events. PloS one, 12(2):e0172457.
  • Woo et al. (2016) Hyekyung Woo, Youngtae Cho, Eunyoung Shim, Jong-Koo Lee, Chang-Gun Lee, and Seong Hwan Kim. 2016. Estimating influenza outbreaks using both search engine query data and social media data in south korea. Journal of medical Internet research, 18(7).
  • Yan et al. (2006) Ping Yan, Daniel Zeng, and Hsinchun Chen. 2006. A review of public health syndromic surveillance systems. In International Conference on Intelligence and Security Informatics, pages 249–260. Springer.
  • Yan et al. (2017) SJ Yan, AA Chughtai, and CR Macintyre. 2017. Utility and potential of rapid epidemic intelligence from internet-based sources. International Journal of Infectious Diseases, 63:77–87.
  • Yangarber et al. (2008) Roman Yangarber, Peter Von Etter, and Ralf Steinberger. 2008. Content collection and analysis in the domain of epidemiology. In Proceedings of DrMED-2008: International Workshop on Describing Medical Web Resources.
  • Yates et al. (2014) Andrew Yates, Jon Parker, Nazli Goharian, and Ophir Frieder. 2014. A framework for public health surveillance. In Language Resources and Evaluation Conference, pages 475–482.
  • Yepes et al. (2015) Antonio Jimeno Yepes, Andrew MacKinlay, and Bo Han. 2015. Investigating public health surveillance using twitter. Proceedings of BioNLP, pages 164–170.
  • Zou et al. (2018) Bin Zou, Vasileios Lampos, and Ingemar Cox. 2018. Multi-task learning improves disease models from web search. In Proceedings of the 2018 World Wide Web Conference, pages 87–96. International World Wide Web Conferences Steering Committee.