Temporal Topic Modeling to Assess Associations between News Trends and Infectious Disease Outbreaks

06/01/2016 ∙ by Saurav Ghosh, et al.

In retrospective assessments, internet news reports have been shown to capture early reports of unknown infectious disease transmission prior to official laboratory confirmation. In general, media interest and reporting peak and wane during the course of an outbreak. In this study, we quantify the extent to which media interest during infectious disease outbreaks is indicative of trends of reported incidence. We introduce an approach that uses supervised temporal topic models to transform large corpora of news articles into temporal topic trends. The key advantages of this approach include applicability to a wide range of diseases and the ability to capture disease dynamics, including seasonality and abrupt peaks and troughs. We evaluated the method using data from multiple infectious disease outbreaks reported in the United States of America (U.S.), China and India. We noted that temporal topic trends extracted from disease-related news reports successfully captured the dynamics of multiple outbreaks such as whooping cough in the U.S. (2012) and dengue outbreaks in India (2013) and China (2014). Our observations also suggest that efficient modeling of temporal topic trends using time-series regression techniques can estimate disease case counts with increased precision before official reports by health organizations.

Introduction

Infectious diseases are a threat to the public health and economic stability of many countries. Open source indicators (e.g., news articles [brownstein2008surveillance, linge2009internet], blogs [corley2010text], search engine query volume [yuan2013monitoring, ginsberg2009detecting, santillana2014using, gu2015early], social media chatter [denecke2012making, lee2013real, sugumaran2012real, paul2011you] and other sources [nsoesie2015monitoring]) are an attractive option for monitoring infectious disease progression, primarily due to their sheer volume and capacity to capture early signals of disease outbreaks and, in some cases, trends in population health-seeking behavior. However, most prior work in digital surveillance using open source indicators has targeted specific diseases, such as influenza [nsoesie2015monitoring, chakraborty2014forecasting] and hantavirus pulmonary syndrome (HPS) [ghosh2015rare]. Therefore, there is a need to develop generic frameworks that are applicable to multiple infectious diseases.

Official surveillance reports released by health organizations (e.g., CDC, WHO, PAHO) are published with a considerable delay of weeks, months or even a year. Therefore, traditional surveillance systems are not always effective at real-time monitoring of emerging public health threats. Unlike traditional surveillance data, informal digital sources, such as news media, blogs, and micro-blogging sites (e.g., Twitter), are typically available in (near) real-time. Proper mining of signals from these digital sources can help minimize the time lag between the start of an outbreak and its formal recognition, allowing for an accelerated response to public health threats. The gains in supplementing traditional surveillance using digital sources have been discussed in Nsoesie et al. [nsoesie2015computational], Salathé et al. [salathe2012digital, salathe2013influenza] and Hartley et al. [hartley2013overview].

Our key contributions are as follows. (i) We introduce EpiNews, a generic temporal framework for analyzing disease-related news reports using a supervised topic model. The supervised topic model discovers multiple disease topics of interest and their associated temporal trends of prominence in news media. (ii) EpiNews captures trends in disease progression, such as periodicity, peaks and troughs, via temporal trends of disease topics in news media. (iii) EpiNews also estimates disease incidence before official reports by health agencies using time-series regression models fitted to the temporal trends of disease topics.

We validated our method against disease case count reports, as available from public health agencies, in the U.S., China and India. Disease-related news articles were provided by HealthMap [freifeld2008healthmap], an internationally recognized, global disease alert system capturing outbreak reports from over 200,000 electronic news sources. EpiNews was evaluated on multiple outbreaks in the recent past, such as whooping cough in the U.S. (2012) [cherry2012epidemic], periodic outbreaks of avian influenza A(H7N9) [yang2014avian, gao2013clinical] and hand, foot, and mouth disease (HFMD) in China (2013 and 2014), periodic outbreaks of acute diarrheal disease (ADD) in India (2013 and 2014), and major dengue outbreaks in China (2014) [shen2015multiple] and India (2013). Our experiments indicate that EpiNews successfully captured the dynamics of the mentioned outbreaks and estimated the case counts before official reports were published. However, inconsistent news coverage was found to adversely affect the performance of our method in certain scenarios.

Materials and Methods

Data sources

In this section, we discuss the data sources used to analyze the infectious disease outbreaks. We first describe the case count reports collected from public health agencies and then describe the HealthMap data used in this study.

Disease case counts.

For each country, we collected case count data corresponding to multiple diseases over a certain time period. In Table 1, we show the disease names (along with methods of transmission), health agencies from which case counts were collected, time period over which case counts were obtained and temporal granularity (daily, monthly, weekly or yearly) of the obtained case counts corresponding to each country.

Country | Disease names (methods of transmission) | Health agencies | Time period | Temporal granularity
U.S. | Whooping cough (airborne, direct contact); Rabies (zoonotic); Salmonellosis (food-borne); E. coli infection (waterborne, food-borne) | Project Tycho [van2013contagious] (https://www.tycho.pitt.edu/) | January 2010 - December 2013 | Weekly
China | H7N9 (zoonotic); HFMD (direct contact, airborne); Dengue (vector-borne) | National Health and Family Planning Commission (http://en.nhfpc.gov.cn/) | January 2013 - December 2014 | Monthly
India | ADD (food-borne); Dengue (vector-borne); Malaria (vector-borne) | Integrated Disease Surveillance Programme (http://www.idsp.nic.in/) | January 2013 - December 2014 | Weekly

Table 1: Disease names (along with routes of transmission), health agencies from which case counts were collected, time period over which case counts were obtained and temporal granularity (daily, monthly, weekly or yearly) of the obtained case counts corresponding to each country. H7N9 stands for avian influenza A(H7N9), ADD stands for acute diarrheal disease and HFMD stands for hand, foot, and mouth disease.

HealthMap.

Disease-related news articles have been found to be indicative of infectious disease outbreaks [ghosh2015rare]. For each country under consideration, we collected articles related to the diseases listed in Table 1 from HealthMap. The HealthMap corpus is a publicly available database from which we collected the disease-related articles reported during the time period of interest. Each article contains the reported date and the corresponding location information in the form of (lat, long) coordinate pairs. We converted the location coordinates to location names (country, state) via reverse geocoding, defined as the process of finding a readable address or place name for a given (lat, long) pair. For example, a (lat, long) pair falling within Florida was converted to (United States, Florida) after reverse geocoding. Each HealthMap article was passed through a series of preprocessing steps. For China, the majority of the articles were published in either Traditional Chinese or Simplified Chinese. We translated the textual content of these articles to English for ease of analysis. The articles were preprocessed by removing non-textual elements, followed by tokenization [webster1992tokenization, singh2014effective], lemmatization [kanis2010comparison] and removal of stop words via BASIS Technologies’ Rosette Language Processing (RLP) tools [naren2014forecasting, doyle2014forecasting]. For more details on these steps, see subsection ‘HealthMap preprocessing’ within the section ‘Supplementary Information’ at the end of the manuscript. The set of unique words in these processed articles was found to contain general (e.g., cold, contagious, nausea, blood, food-borne, waterborne, sanitation) as well as specific (e.g., rabies, whooping, h7n9, dengue, salmonella, malaria) disease-related terms. In Table 2, we show the country-wise distribution of the total number of HealthMap news articles along with unique words and location names extracted from all the corresponding articles.

Country | Total number of HealthMap news articles | Total number of unique words | Total number of unique location names or (country, state) pairs
China | 11,209 | 21,879 | 30
India | 1,204 | 17,160 | 30
U.S. | 9,872 | 59,687 | 51

Table 2: Country-wise distribution of the total number of HealthMap news articles along with unique words and location names extracted from all the corresponding articles.

Our next step was to extract the underlying topics related to the diseases listed in Table 1, and their associated temporal trends, from the processed articles for each country. Following Rekatsinas et al. [ghosh2015rare], the processed corpus for each country was transformed into a collection of tuples of the form (word, location, time point, count), where count is the number of news articles mentioning the word in association with the location and time point in the tuple. For this transformation, we assumed that, for each country, each processed article consists of words from a fixed vocabulary, corresponds to a discretized time window and is geotagged with a location from the set of locations in the country. For China, disease case counts were available at monthly granularity, so each time point represents a period of 1 month. For diseases in the U.S. and India, case counts were obtained on a weekly basis, so each time point represents a period of 1 week or, more specifically, an epidemiological week (hereafter referred to as epi week). For example, a tuple with the word salmonella, the location (United States, Kansas), an epi week in October 2013 and a count of 9 denotes that the word salmonella was mentioned in 9 articles referring to the state of Kansas in the U.S. over that epi week. For each country, the tuples were grouped into one collection per location, and the set of all such collections up to the current time point formed the input to the topic model. This transformed set was analyzed to extract the temporal trends of disease topics as discussed in the following section. Both the per-location collections and the overall set were updated for each country as we proceeded along the time window.
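The corpus-to-tuple transformation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the article dictionaries, the function names `epi_week_key` and `build_tuples`, and the use of the ISO calendar week as a stand-in for the epi week are all assumptions made for the example (official epi week definitions may start the week on a different day).

```python
from collections import Counter
from datetime import date


def epi_week_key(d):
    """Map a date to a (year, week) key.

    Assumption: the ISO calendar week approximates the epi week.
    """
    iso = d.isocalendar()
    return (iso[0], iso[1])


def build_tuples(articles):
    """Turn processed articles into (word, location, time point, count) tuples.

    count = number of articles mentioning the word for that location and week.
    """
    counts = Counter()
    for art in articles:
        week = epi_week_key(art["date"])
        # set(...) so that an article is counted once per word it mentions
        for word in set(art["words"]):
            counts[(word, art["location"], week)] += 1
    return [(w, loc, t, c) for (w, loc, t), c in sorted(counts.items())]


# Hypothetical toy corpus: three processed articles geotagged to Kansas.
kansas = ("United States", "Kansas")
articles = [
    {"date": date(2013, 10, 7), "location": kansas,
     "words": ["salmonella", "outbreak", "salmonella"]},
    {"date": date(2013, 10, 9), "location": kansas,
     "words": ["salmonella", "recall"]},
    {"date": date(2013, 10, 16), "location": kansas,
     "words": ["salmonella"]},
]
tuples = build_tuples(articles)
for row in tuples:
    print(row)
```

For monthly data (as for China in this study), the same sketch would apply with a (year, month) key in place of the weekly key.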

EpiNews

In this section, we describe in detail the components of our proposed framework EpiNews. The first component is the supervised topic model used to extract temporal topic trends from the transformed tuple collections. The second component, referred to as EpiNews-ARNet, is responsible for generating estimates of disease case counts using past available case counts and temporal topic trends extracted by the supervised topic model.

Temporal topic modeling

The first component of EpiNews deals with the topic and pattern discovery problem. The set of all tuple collections can be treated as a three-dimensional matrix whose dimensions are words, locations and time points. Each element of this matrix is the total number of articles mentioning a given word and referring to a given location over a given time point. We assume that each non-zero element of the matrix is associated with a latent disease topic and, therefore, that such hidden disease topics can be modeled in terms of the three dimensions of the matrix. Our goal is to extract the hidden disease topics and their corresponding associations with each dimension. Following previous literature on topic models [blei2003latent, blei2008supervised, ghosh2015rare, jagarlamudi2012incorporating], we implemented a supervised temporal topic model for this purpose. We supervise the discovery process of each disease topic by providing a set of prior words (also called seed words) [jagarlamudi2012incorporating]. These seed words are user-provided prior knowledge of each infectious disease and they encourage the topic model to find evidence of these disease topics in the HealthMap corpus. This supervised method helps in improving the discovery of word co-occurrences within each topic, as the model tends to discover words that are related to the words in the seed set. Additionally, we model time and location jointly [ghosh2015rare] with the word co-occurrence patterns. This enables tracking of temporal and spatial patterns of these disease topics in the news. For more details on the supervised topic model, see subsection ‘Generative process of the supervised topic model’ within the section ‘Supplementary Information’ at the end of the manuscript.

The supervised topic model takes the three-dimensional count matrix as input, discovers the disease topics and decomposes the input into four two-dimensional matrices as shown below. Each two-dimensional matrix represents the association between the discovered disease topics and one dimension of the input.

  • Temporal topic trends: A (topics × time points) matrix where each row represents a discrete probability distribution over the time points for a specific topic. Each row represents the temporal topic trends or distribution for the corresponding disease topic.

  • Seed topic distribution: A (topics × seed words) matrix where each row represents a discrete probability distribution over the set of seed words for a specific topic.

  • Regular topic distribution: A (topics × regular words) matrix where each row represents a discrete probability distribution over the set of regular words for a specific topic. The set of regular words refers to all the words in the vocabulary, including the seed words.

  • Location topic distribution: A (locations × topics) matrix where each row represents a discrete probability distribution over topics for a specific location.

For more details on these four matrices, see subsection ‘Generative process of the supervised topic model’ within the section ‘Supplementary Information’ at the end of the manuscript.

Inference.

To compute the four output matrices of the supervised topic model from the observed input data, we need to solve an inference problem. In topic models, exact computation is intractable [blei2003latent] and thus we are interested in approximate inference of the model parameters. Since collapsed Gibbs sampling [steyvers2007probabilistic, matsubara2012fast, porteous2008fast] is a straightforward, easy to implement, and unbiased approach that converges rapidly to a known ground-truth, it is typically preferred over other possible approaches [blei2003latent, minka2002expectation] in large scale applications of topic models [ghosh2015rare, matsubara2012fast, rosen2004author]. Thus we used collapsed Gibbs sampling as the inference scheme for the supervised topic model. For more details on the inference process, see subsection ‘Inference via collapsed Gibbs sampling’ within the section ‘Supplementary Information’ at the end of the manuscript.
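To give a feel for the inference scheme, the sketch below implements collapsed Gibbs sampling for plain latent Dirichlet allocation [blei2003latent]. It is deliberately not the supervised temporal model of this paper: seed words, locations and time points are left out, and the function name, hyperparameter values and toy documents are assumptions chosen for illustration.

```python
import numpy as np


def collapsed_gibbs_lda(docs, n_topics, n_words, n_iter=200,
                        alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for plain LDA (illustrative stand-in only)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))  # document-topic counts
    n_kw = np.zeros((n_topics, n_words))    # topic-word counts
    n_k = np.zeros(n_topics)                # total tokens per topic
    z = []                                  # topic assignment of every token

    # Random initial assignment of each token to a topic.
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from all counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional over topics given all other assignments.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_words * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Add the token back under its newly sampled topic.
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    # Point estimates of topic-word and document-topic distributions.
    phi = (n_kw + beta) / (n_kw.sum(axis=1, keepdims=True) + n_words * beta)
    theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)
    return phi, theta


# Toy corpus: documents are lists of word ids from a 4-word vocabulary.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0], [3, 2, 2, 3]]
phi, theta = collapsed_gibbs_lda(docs, n_topics=2, n_words=4, n_iter=50)
print(phi.round(2))
```

Each row of `phi` and `theta` is a discrete probability distribution, mirroring the row-wise distributions that the supervised model outputs.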

Seed word extraction.

Seed words for each disease topic were extracted by examining the content of a subset of news articles mentioning the disease. Additionally, following similar techniques as in Chakraborty et al. [chakraborty2014forecasting], we also examined a number of expert websites, such as those of the CDC and WHO, to identify the most important keywords for a particular disease. Seed words used in this study are shown in Tables 5, 6 and 7 corresponding to diseases in the U.S., China and India respectively.

Estimation of disease case counts

The second component of EpiNews is concerned with the estimation of disease case counts using relevant information such as past case counts and temporal topic trends. Let $d$ be the disease of interest and, without loss of generality, let the disease topic $k$ correspond to $d$. Furthermore, let $y_t$ denote the case count of $d$ and $x_t$ denote the temporal trend value for disease topic $k$ at a time point $t$. In general, reports of case counts published by health organizations are delayed (see Chakraborty et al. [chakraborty2014forecasting], Wang et al. [wang2015dynamic]) and hence, at time point $T$, case counts are available only till $T - \delta$, where $\delta$ is the delay. However, temporal topic trend values are available till $T$. Hence, we can formally define the case count estimation problem as estimating $y_T$ using past case counts available till $T - \delta$ and temporal topic trends available till $T$. In general, disease case counts have a publication delay of 1 time point ($\delta = 1$) and hence, estimating $y_T$ at $T$ is equivalent to 1-step ahead estimation.

EpiNews-ARNet.

For 1-step ahead case count estimation, we used a regularized version of the autoregressive model with external input variables (ARX), where the external input variables are represented by the temporal topic trends. We used Elastic Net [zou2005regularization] as the regularization model in ARX. This estimating component of EpiNews is designated as EpiNews-ARNet and defined below in equation (1).

(1)

where the left-hand side is the estimated case count for the disease of interest at the current time point, and the regression coefficients on the right-hand side are fitted under Elastic Net constraints as given below in equation (2).

(2)

where the two regularization coefficients weight the $\ell_1$ and $\ell_2$ penalty components of Elastic Net, respectively. The Elastic Net combines the properties of the Least Absolute Shrinkage and Selection Operator (LASSO) [tibshirani1996regression, hastie2009elements] and Ridge regression [hastie2009elements] models. This combination allows for learning a sparse model like LASSO, while still maintaining the regularization properties of Ridge. If the $\ell_1$ coefficient equals 0, equation (2) reduces to a Ridge estimator. On the other hand, if the $\ell_2$ coefficient equals 0, equation (2) corresponds to a LASSO estimator.
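As a sketch of how equations (1) and (2) could be written, consider the form below. Every symbol here is an illustrative assumption rather than the authors' notation: $y_t$ is the case count and $\tilde{x}_t$ the shifted, rolling-transformed temporal topic trend aligned to time point $t$, $p$ is the order of autoregression, $b$ is the number of time points to look back over the trends, $\alpha_i$ and $\beta_j$ are regression coefficients, and $\lambda_1$, $\lambda_2$ are the $\ell_1$ and $\ell_2$ regularization coefficients. The index ranges are likewise assumed.

```latex
% Assumed form of an ARX estimator with delay 1 (cf. equation (1))
\hat{y}_T \;=\; \sum_{i=1}^{p} \alpha_i \, y_{T-i}
          \;+\; \sum_{j=0}^{b-1} \beta_j \, \tilde{x}_{T-j}

% Standard Elastic Net objective over the coefficients (cf. equation (2))
(\hat{\alpha}, \hat{\beta}) \;=\; \arg\min_{\alpha,\,\beta}\;
    \sum_{t} \bigl( y_t - \hat{y}_t \bigr)^2
    \;+\; \lambda_1 \bigl( \lVert \alpha \rVert_1 + \lVert \beta \rVert_1 \bigr)
    \;+\; \lambda_2 \bigl( \lVert \alpha \rVert_2^2 + \lVert \beta \rVert_2^2 \bigr)
```

Setting $\lambda_1 = 0$ leaves only the squared $\ell_2$ penalty (Ridge), and setting $\lambda_2 = 0$ leaves only the $\ell_1$ penalty (LASSO), which matches the two limiting cases described in the text.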

There are broadly two components to equation (1), which capture different signals about the diseases as follows. (i) Internal component: This component is an autoregressive model that captures the signal embedded in past case counts and thus describes a delayed model. Its single parameter is the order of autoregression. (ii) External component: This component can also be thought of as an autoregressive component over the temporal topic trends, parameterized by the number of time points to look back. The temporal topic trends are subjected to two additional transformations as follows. (a) Shift indicator: Often, the incidence of news reports is not concurrent with the incidence of diseases, as recorded in the case counts. EpiNews-ARNet incorporates this information by shifting the temporal topic trend value by a given number of steps. The shift can be positive (indicating a lagging trend), negative (indicating a leading trend) or zero (indicating a co-incident trend). (b) Rolling transformation: Disease case counts do not follow a strictly linear relationship with temporal topic trends. One of the simplest methods is to detrend the signals using the difference of trend values instead of absolute values. However, our experiments showed that such transformations using a single time point often lead to unstable estimates. As such, we define a rolling transformation over a window length, given below in equation (3).

(3)

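Essentially, such transformations aim to capture the changes in trend values over a period and were found to be more indicative than absolute values. We ran a cross-validation step to find the optimal values of the four parameters: the order of autoregression, the number of trend time points to look back, the shift and the rolling window length.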
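The shift and rolling transformation can be sketched in code as below. The exact form of equation (3) is not assumed to be known here; the sketch takes the rolling change to be the current trend value minus the mean of the previous `r` values, which reduces to a simple first difference when `r = 1`. The function names, the parameter names `p`, `b`, `s`, `r`, and the sign convention of the shift are all illustrative assumptions.

```python
import numpy as np


def rolling_change(x, r):
    """Assumed rolling transform: x[t] minus the mean of the previous r values."""
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)  # undefined for the first r time points
    for t in range(r, len(x)):
        out[t] = x[t] - x[t - r:t].mean()
    return out


def shift_series(x, s):
    """Align x[t + s] with time t.

    Assumed convention: negative s pulls in earlier trend values (trend
    leads case counts); positive s pulls in later ones (trend lags).
    """
    x = np.asarray(x, dtype=float)
    out = np.full(len(x), np.nan)
    if s < 0:
        out[-s:] = x[:s]
    elif s > 0:
        out[:-s] = x[s:]
    else:
        out[:] = x
    return out


def design_matrix(y, x, p, b, s, r):
    """Rows of [y[t-1..t-p], x~[t..t-b+1]] with target y[t]; incomplete rows dropped."""
    x_tilde = shift_series(rolling_change(x, r), s)
    rows, targets = [], []
    for t in range(max(p, b - 1), len(y)):
        feats = [y[t - i] for i in range(1, p + 1)]
        feats += [x_tilde[t - j] for j in range(b)]
        if not np.any(np.isnan(feats)):
            rows.append(feats)
            targets.append(y[t])
    return np.array(rows), np.array(targets)


# Toy series: 10 time points of case counts and trend values.
y = [float(v) for v in range(10)]
x = [float(v * v) for v in range(10)]
X, target = design_matrix(y, x, p=2, b=2, s=-1, r=1)
print(X.shape, target)
```

The resulting matrix can be passed to any regularized linear regressor; a cross-validation loop over (`p`, `b`, `s`, `r`) would then rebuild it for each candidate setting.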

Converting temporal topic trends to sampled case counts.

We described EpiNews-ARNet using the temporal topic trends or distribution as the external input variables. It is to be noted that the disease case counts and the temporal topic distribution are typically at different numerical scales, since values in a distribution range from 0 to 1. To improve numerical stability, we converted the temporal topic distributions to estimated case counts using multinomial sampling [kerns2010introduction] over the time range. In multinomial sampling, samples are drawn from a multinomial distribution [kerns2010introduction]. The case counts estimated via multinomial sampling from the temporal topic distributions are hereafter referred to as sampled case counts. To calculate the sampled case counts for a disease, the temporal distribution of the corresponding topic was used as the multinomial distribution, and the total number of case counts available at the current time point (which, due to the delay in reporting, covers only earlier time points) was used as the number of samples to be drawn from the distribution. See Algorithm 1 for more details.

Input: Temporal topic distribution of the disease topic; total number of case counts reported till the latest available time point.
Output: Sampled case counts from the temporal topic distribution.
Step 1: Draw time points using multinomial sampling, where the temporal topic distribution is the multinomial distribution and the total number of case counts is the total number of samples to be drawn.
Step 2: For each time point, calculate the sampled case count as the frequency of occurrence of that time point among the samples (time points) drawn from the multinomial distribution.
Algorithm 1: Multinomial sampling to convert temporal topic distribution to sampled case counts.
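Algorithm 1 can be sketched with NumPy as follows. Drawing the given number of time points from the distribution and counting how often each occurs is what a single multinomial draw returns, so one call suffices. The function name, the seed argument and the toy distribution are assumptions for illustration.

```python
import numpy as np


def sampled_case_counts(topic_distribution, total_cases, seed=0):
    """Convert a temporal topic distribution into sampled case counts.

    topic_distribution: probabilities over time points (one topic's trend).
    total_cases: number of samples to draw (total reported case count).
    """
    rng = np.random.default_rng(seed)
    p = np.asarray(topic_distribution, dtype=float)
    p = p / p.sum()  # guard against small rounding drift in the input
    # Entry t of the result is how many of the draws landed on time point t.
    return rng.multinomial(total_cases, p)


# Toy trend over five time points and 1,000 reported cases (made-up values).
trend = [0.1, 0.2, 0.0, 0.4, 0.3]
counts = sampled_case_counts(trend, total_cases=1000)
print(counts, counts.sum())
```

The output keeps the shape of the trend (time points with higher probability receive more samples) while living on the same numerical scale as the case counts.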

Results

In this section, we present an empirical evaluation of our proposed framework EpiNews. We first evaluated the disease topics discovered by the supervised topic model. Next, we analyzed whether the temporal topic trends extracted by the supervised topic model are able to capture disease dynamics, including seasonality and abrupt peaks and troughs. Finally, we evaluated the quality of case counts estimated by EpiNews-ARNet against the actual disease case counts.

Disease topic discovery

To evaluate the discovered disease topics, we looked at the words having higher probabilities in the seed topic distributions and regular topic distributions. We present the analysis of both distributions in Tables 5, 6 and 7 corresponding to disease topics in the U.S., China and India respectively. For each country, both distributions were extracted from HealthMap data spanning the entire time period shown in Table 1. For each disease topic, we show the seed words and their corresponding probabilities (sorted in descending order) in the seed topic distribution. Seed words having higher probabilities in the seed topic distribution serve as informative prior words in the topic discovery process, as they are mentioned frequently in news articles related to the disease topic. For example, seed words such as food, salmonella, product, fda, drug, contamination serve as informative prior words for the discovery of the salmonellosis topic in the U.S., since they have higher probabilities in the seed topic distribution (see Table 5). On the other hand, seed words such as enteritidis, newport provide less prior information due to their low probability values in the seed topic distribution. To understand how the supervised topic model discovers words from the HealthMap corpus related to these input seed words, we also show some of the regular words having higher probabilities in the regular topic distribution. For a particular disease topic, these regular words with higher probabilities are mentioned frequently in news articles related to that disease and also capture different aspects (causes and clinical symptoms, methods of transmission, etc.) of the disease that the topic represents. For example, in Table 5 we show these regular words for the salmonellosis topic in the U.S. Words such as diarrhea, nausea, vomit are related to clinical symptoms of salmonellosis. On the other hand, words such as eat, contaminated, restaurant, meat, beef are related to causes of salmonellosis.

Detection of outbreak patterns

We also examined the temporal distribution or trends for each disease topic in a specific country (Figures 1, 2 and 3) and their correlations with the disease case counts. For each country, temporal topic trends were extracted from HealthMap data spanning the entire time period shown in Table 1. We made several important observations as follows.

Disease seasonality.

In the U.S., case counts of salmonellosis and E. coli infection exhibit strong periodic outbreaks, both peaking during the summer (see Figures 1 (e) and (g)). Temporal topic trends extracted by EpiNews were able to capture the periodicity of these two diseases, particularly the periodic outbreaks of salmonellosis and E. coli infection in 2010, 2012 and 2013. However, during 2011, temporal topic trends failed to track the peak season properly, though they show a tendency to increase during summer. For salmonellosis in 2013, the temporal topic trends captured the major peak of the outbreak at the start of the season while failing to capture the seasonal activity towards the end. For rabies, although the topic trends captured the general characteristics, they failed to detect some major outbreaks, such as the outbreak in the summer of 2010 (see Figure 1 (c)).

In China, H7N9 and HFMD case counts exhibit strong periodic outbreaks, with H7N9 peaking during the winter and HFMD peaking during the summer (see Figures 2 (a) and (c)). For H7N9, temporal topic trends extracted by EpiNews were able to detect the seasonal outbreaks during March-April 2013 and January-February 2014. However, for HFMD, peaks in temporal topic trends precede the peaks in case counts during the summer of 2013 and 2014 respectively. Therefore, temporal topic trends for HFMD exhibit a negative shift (leading indicator) with respect to the case counts.

In India, case counts of ADD exhibit periodic outbreaks, peaking during the summer of 2013 and 2014 (see Figure 3 (a)). Temporal topic trends detected the seasonal outbreak in the summer of 2013 but failed to capture the outbreak in the summer of 2014.

Sudden peaks/troughs.

In the U.S., whooping cough outbreaks do not exhibit yearly periodicity, unlike salmonellosis and E. coli infection (see Figure 1 (a)). There was a major outbreak of whooping cough during the summer of 2012, and EpiNews detected this sudden increase (peak) in case counts by displaying higher topic trends during the entire period of the outbreak. EpiNews also displayed lower topic trends during periods (summer of 2011 and 2013) known to have low incidence (troughs) of whooping cough, raising no outbreak signal in those periods and suggesting a low false alarm rate.

In China and India, dengue case counts exhibit seasonal outbreaks with peaks in case counts appearing during the months of September and October. However, China experienced a severe dengue outbreak in 2014 [shen2015multiple] compared with the outbreak in 2013, with the peak value of case counts exceeding 25,000 in the month of October (see Figure 2 (e)). Temporal topic trends detected this sudden massive increase in case counts by displaying a sharp spike during the outbreak period. India also experienced a large dengue outbreak in 2013, with the peak value of case counts exceeding 3,000 during a particular epi week in October (see Figure 3 (c)). EpiNews was able to detect this outbreak by displaying higher topic trends during the peak period. Malaria case counts in India exhibit irregular outbreaks or peaks (see Figure 3 (e)). EpiNews was successful in capturing the majority of these outbreaks, though it failed to detect some major peaks, such as the peak during the month of June 2014.

Sampled case counts.

Along with the temporal topic trends, we also show the corresponding sampled case counts generated via multinomial sampling (see Algorithm 1) from the temporal topic trends for each disease in Figures 1 ((b), (d), (f) and (h)), 2 ((b), (d) and (f)) and 3 ((b), (d) and (f)). The figures show that the sampled case count values share a similar numerical range with the disease case counts while maintaining the shapes of the temporal topic trends. On the other hand, the temporal topic trend values lie in a different numerical range (from 0 to 1) with respect to the case counts.

Estimating case counts

As official reports of case counts by health agencies are usually lagged by a single time point (week or month), reliable early estimates of disease incidence can facilitate the allocation of public health resources to enable effective control measures. Therefore, we aimed to perform 1-step ahead estimation of disease case counts starting from a particular time point (for the definition of 1-step ahead estimation, see subsection ‘Estimation of disease case counts’ within section ‘EpiNews’ of ‘Materials and Methods’). For the purpose of experimental validation, we used historical HealthMap data over a certain time period as the static training set in a specific country (referred to as the static training period) and progressively utilized the remaining time points as the evaluation period over which we evaluated the case count estimates of EpiNews-ARNet. To estimate case counts at a particular time point $T$ within the evaluation period, we utilized HealthMap data up to $T$ and extracted disease topics using the supervised topic model. The disease case counts at $T$ were next estimated using past case counts available up to $T - 1$ (a reporting delay of 1 time point) and temporal topic trends (or sampled case counts) available up to $T$. In Table 3, we show the total time period of study, static training period and the evaluation period for each country.
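The evaluation protocol above amounts to an expanding-window loop, sketched below. Two simplifications are assumptions of this example and not part of the study: ordinary least squares stands in for the Elastic Net fit, and the trend series `x` is treated as given rather than re-extracted by the topic model at every time point. The function name, the order `p` and the synthetic series are hypothetical.

```python
import numpy as np


def one_step_ahead(y, x, train_end, p=2):
    """Expanding-window 1-step-ahead estimates for T = train_end .. len(y) - 1.

    At each T the model is fitted on targets y[t] with t < T only, then
    y[T] is estimated from y[T-1], ..., y[T-p] and the trend value x[T].
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    estimates = []
    for T in range(train_end, len(y)):
        # Design rows: intercept, p most recent case counts, current trend.
        rows = [np.r_[1.0, y[t - p:t][::-1], x[t]] for t in range(p, T)]
        A = np.array(rows)
        b = y[p:T]
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        feats = np.r_[1.0, y[T - p:T][::-1], x[T]]  # x[T] is already available
        estimates.append(float(feats @ coef))
    return np.array(estimates)


# Synthetic check: y follows an exact ARX rule, so estimates should match.
rng = np.random.default_rng(1)
x = rng.random(40)
y = np.zeros(40)
y[0] = 1.0
for t in range(1, 40):
    y[t] = 0.5 * y[t - 1] + 2.0 * x[t]

est = one_step_ahead(y, x, train_end=20)
print(np.abs(est - y[20:]).max())
```

On real data the loop would be preceded by a static training period, as in Table 3, and the per-step fit would be replaced by the regularized model.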

Baselines.

For the task of 1-step ahead estimation, we compared the performance of EpiNews-ARNet against two baseline methods, namely EpiNews-ARMAX and Casecount-ARMA. In Casecount-ARMA, we fitted an autoregressive-moving-average model (ARMA [box2011time]) over past disease case counts to generate case count estimates. Casecount-ARMA does not use any information related to temporal topic trends. In EpiNews-ARMAX, however, we used an autoregressive-moving-average model with external input variables (ARMAX [box2011time]), where the external input variables incorporate the information embedded in temporal topic trends. For more details on the baseline methods, see subsection ‘Baseline methods for case count estimation’ within the section ‘Supplementary Information’ at the end of the manuscript. We also compared temporal topic trends against sampled case counts (generated by multinomial sampling from the temporal topic trends) as the external input variables, for the applicable methods EpiNews-ARNet and EpiNews-ARMAX.

Evaluation.

We evaluated the case count estimates of each method over the evaluation period by comparing them against the actual case counts using normalized root-mean-square error (NRMSE). In Table 4, we present a comparative performance evaluation of the methods for 1-step ahead estimation in terms of NRMSE values corresponding to diseases in the U.S., China and India respectively. Table 4 provides multiple insights as follows. (i) EpiNews-ARNet with sampled case counts as external variables is the best performing method, achieving the lowest NRMSE values for the majority (8 out of 10) of the (country, disease) combinations. (ii) The two exceptions are (China, HFMD) and (U.S., E. coli infection), where EpiNews-ARNet and EpiNews-ARMAX, respectively, achieve the lowest NRMSE values with temporal topic trends as external variables. (iii) Both EpiNews-ARNet and EpiNews-ARMAX perform better overall with sampled case counts as external variables than with temporal topic trends. (iv) Casecount-ARMA does not achieve the lowest NRMSE value for any (country, disease) combination, indicating the significance of incorporating temporal topic trends or sampled case counts as external variables for estimating case counts.
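For reference, NRMSE can be computed as in the sketch below. The normalization is an assumption: several conventions exist (dividing the RMSE by the mean, the range or the standard deviation of the actual values), and this example divides by the mean of the actual case counts. The function name and the numbers are made up for illustration.

```python
import numpy as np


def nrmse(actual, estimated):
    """Root-mean-square error divided by the mean of the actual values.

    Assumption: mean normalization; other conventions use range or std.
    """
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    rmse = np.sqrt(np.mean((actual - estimated) ** 2))
    return rmse / actual.mean()


# Toy example: errors of 2, -2 and 3 cases around a mean of 20 cases.
score = nrmse([10, 20, 30], [12, 18, 33])
print(round(score, 4))
```

Lower values indicate estimates closer to the reported case counts, which is how the columns of Table 4 are compared.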

| Country | Total time period of study | Static training period | Evaluation period |
|---|---|---|---|
| U.S. | January 2010 - December 2013 | January 2010 - December 2011 | January 2012 - December 2013 |
| China | January 2013 - December 2014 | January 2013 - March 2013 | April 2013 - December 2014 |
| India | January 2013 - December 2014 | January 2013 - November 2013 | December 2013 - December 2014 |

Table 3: Total time period of study, static training period and the evaluation period for estimating disease case counts in each country.

| Country | Disease | Casecount-ARMA | EpiNews-ARMAX with temporal topic trends | EpiNews-ARMAX with sampled case counts | EpiNews-ARNet with temporal topic trends | EpiNews-ARNet with sampled case counts |
|---|---|---|---|---|---|---|
| U.S. | Whooping cough | 0.584 | 0.577 | 0.582 | 0.583 | 0.558 |
| U.S. | Rabies | 0.875 | 0.888 | 0.886 | 0.877 | 0.865 |
| U.S. | Salmonellosis | 0.445 | 0.978 | 0.450 | 0.441 | 0.430 |
| U.S. | E. coli infection | 0.685 | 0.657 | 0.663 | 0.686 | 0.671 |
| China | H7N9 | 1.096 | 0.850 | 0.888 | 1.027 | 0.712 |
| China | HFMD | 1.574 | 1.524 | 1.538 | 0.622 | 0.626 |
| China | Dengue | 1.076 | 0.639 | 0.634 | 1.094 | 0.549 |
| India | ADD | 1.226 | 1.285 | 1.119 | 0.844 | 0.833 |
| India | Dengue | 0.966 | 1.086 | 1.021 | 1.073 | 0.878 |
| India | Malaria | 1.060 | 1.062 | 1.047 | 1.016 | 0.963 |

Table 4: Comparing the performance of EpiNews-ARNet against the baseline methods EpiNews-ARMAX and Casecount-ARMA for 1-step ahead estimation of disease case counts. The metric used for comparing the case counts estimated by the methods against the actual case counts is the normalized root-mean-square error (NRMSE).
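The tallies quoted in the Evaluation paragraph can be checked directly from the NRMSE values in Table 4. The short script below is only a reading aid: the values are copied from the table, while the variable names and the method labels are ours.

```python
from collections import Counter

METHODS = [
    "Casecount-ARMA",
    "EpiNews-ARMAX (topic trends)",
    "EpiNews-ARMAX (sampled counts)",
    "EpiNews-ARNet (topic trends)",
    "EpiNews-ARNet (sampled counts)",
]

# NRMSE values from Table 4, in the column order of METHODS.
TABLE_4 = {
    ("U.S.", "Whooping cough"): [0.584, 0.577, 0.582, 0.583, 0.558],
    ("U.S.", "Rabies"): [0.875, 0.888, 0.886, 0.877, 0.865],
    ("U.S.", "Salmonellosis"): [0.445, 0.978, 0.450, 0.441, 0.430],
    ("U.S.", "E. coli infection"): [0.685, 0.657, 0.663, 0.686, 0.671],
    ("China", "H7N9"): [1.096, 0.850, 0.888, 1.027, 0.712],
    ("China", "HFMD"): [1.574, 1.524, 1.538, 0.622, 0.626],
    ("China", "Dengue"): [1.076, 0.639, 0.634, 1.094, 0.549],
    ("India", "ADD"): [1.226, 1.285, 1.119, 0.844, 0.833],
    ("India", "Dengue"): [0.966, 1.086, 1.021, 1.073, 0.878],
    ("India", "Malaria"): [1.060, 1.062, 1.047, 1.016, 0.963],
}

# Best method per country-disease combination = lowest NRMSE in the row.
best = {}
for key, scores in TABLE_4.items():
    lowest = min(range(len(scores)), key=lambda i: scores[i])
    best[key] = METHODS[lowest]

wins = Counter(best.values())
# Rows where ARNet with sampled counts beats ARNet with topic trends.
arnet_sampled_beats_topic = sum(1 for s in TABLE_4.values() if s[4] < s[3])
```

Run as is, `wins` attributes 8 of the 10 rows to EpiNews-ARNet with sampled case counts and none to Casecount-ARMA, and `arnet_sampled_beats_topic` is 9, HFMD in China being the only row where the topic-trend variant of EpiNews-ARNet is ahead.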

Discussion

In this paper, we studied the problem of monitoring and estimating outbreaks of multiple infectious diseases using disease-related online news reports obtained from HealthMap. We introduced EpiNews, a novel and generic temporal framework that combines supervised temporal topic models with time-series regression techniques to monitor and estimate disease incidence. Experimental results demonstrate that EpiNews is able to capture the time varying incidence of multiple diseases via temporal topic trends. Our experiments also illustrate that EpiNews can estimate disease incidence 1-step ahead with increased accuracy using information from temporal topic trends.

EpiNews uses online news reports as the sole data source to capture disease dynamics during outbreaks. It is therefore generic in the sense that it is not tailored to a particular disease or class of diseases. Moreover, the set of diseases selected for each country represents a diversity of transmission pathways, as shown in Table 1. The applicability of EpiNews to these diverse sets of diseases, as demonstrated in this study, showcases the potential generalizability of our approach to different classes of diseases.

Temporal topic trends extracted by EpiNews from HealthMap news reports successfully captured the dynamics of multiple outbreaks, such as whooping cough in U.S. during the summer of 2012, periodic outbreaks of salmonellosis and E. coli infection in U.S., periodic outbreaks of H7N9 and HFMD in China, and dengue outbreaks in India (2013) and China (2014). However, there are certain deviations where temporal topic trends did not track the trends in disease outbreaks properly, such as the salmonellosis and E. coli infection outbreaks in 2011, the rabies outbreak in 2010 and the ADD outbreak in 2014. We posit that such deviations are a consequence of news media coverage during disease outbreaks, which is driven by interest. Moreover, our framework relies heavily on news corpora and does not account for possible reporting errors. As such, inconsistent interest-driven news coverage and articles with missing content affect the performance of our framework.

EpiNews supports both monitoring and 1-step ahead estimation of disease case counts with increased precision. Table 4 shows that EpiNews-ARNet yields the lowest NRMSE values for the majority of the diseases when compared to the baseline methods EpiNews-ARMAX and Casecount-ARMA. This implies that incorporating information from temporal topic trends via EpiNews-ARNet results in improved estimation of case counts. Note also that EpiNews-ARNet with sampled case counts as external variables achieves lower NRMSE for most of the diseases than the variant using temporal topic trends. This supports our claim that using sampled case counts instead of actual topic trends as the external variables adds numerical stability to EpiNews-ARNet. EpiNews-ARMAX, however, does not provide a significant performance improvement over Casecount-ARMA, which highlights the limitations of using ARMAX models in our framework for estimating case counts.

For dengue and HFMD in China, EpiNews-ARNet shows considerable improvement in 1-step ahead estimation of disease incidence when compared to the baselines, specifically Casecount-ARMA (see Table 4). To better understand the improved performance of EpiNews-ARNet with respect to the baselines, we plotted the temporal correlation between actual case counts and the case counts estimated by the methods for dengue and HFMD in China in Figure 4. EpiNews-ARNet with sampled case counts as external variables estimates the peak in dengue case counts more accurately than the baselines (see Figure 4 (a)). For HFMD, EpiNews-ARNet estimates the peak in case counts with either topic trends or sampled case counts as external variables, while the baselines fail to do so (see Figure 4 (b)). Casecount-ARMA’s inability to estimate the peaks in case counts for both dengue and HFMD implies that past case counts are not reliable indicators for estimating sudden increases or peaks in disease incidence, and therefore need to be augmented with disease signals from online news media for accurate estimation of outbreaks.

However, inconsistent news coverage can adversely affect the timely estimation of outbreaks by EpiNews-ARNet, as shown in Figure 4 (c). India experienced periodic outbreaks of ADD, with peaks in case counts during the summers of 2013 and 2014. We observe a lack of news coverage (no peak in temporal topic trends) during the peak in 2014 compared to the peak in 2013 (see Figures 3 (a) and (b)). As a result, the case count estimates generated by EpiNews-ARNet have a delayed peak with respect to the actual peak in case counts during the outbreak in 2014 (see Figure 4 (c)). This delayed peak is due to the internal component in equation (1), which extracts information from past case counts.

Additional studies can focus on adapting EpiNews to news-corpora inconsistencies by leveraging information from other sources, such as climatic attributes (temperature [akil2014effects], precipitation [curriero2001association], humidity [hales2002potential]) for calibration purposes. Nevertheless, observations in this study suggest that monitoring the progression of infectious diseases is possible and that disease incidence can be estimated with increased precision via efficient capturing of signals from online news media.

References

Acknowledgements

Saurav Ghosh, Prithwish Chakraborty, Sumiko R. Mekaru, John S. Brownstein and Naren Ramakrishnan are supported by a research grant from the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center (DoI/NBC) contract number D12PC000337. Elaine O. Nsoesie is supported by funding from the National Institute of Environmental Health Sciences of the National Institutes of Health (Award Number K01ES025438). The US Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the US Government.

Supplementary Data

The data set used in this paper can be found at https://github.com/sauravcsvt/EpiNews_supplementary_data.

Supplementary Information

HealthMap preprocessing

Each HealthMap article was preprocessed using the following techniques.

  1. Removing non-textual elements. We extracted the main textual content of each article using Dragnet [peters2013content] and Goose (https://github.com/grangier/python-goose), ignoring the non-textual elements such as images within the article.

  2. Tokenization and lemmatization. Tokenization [webster1992tokenization, singh2014effective] is the process of segmenting textual content into words, phrases, symbols or other meaningful elements, commonly referred to as tokens. Lemmatization [kanis2010comparison] is performed after tokenization and can be defined as the normalization process in which the various inflected forms of a word are converted to the same underlying lemma so that they can be analyzed as a single term. For example, terms such as travel, traveled, travels, TRAVEL, traveling, Travelling, travelling, travelled, Travel and Traveling were converted to the same underlying lemma travel. Both tokenization and lemmatization were performed on the extracted textual content using BASIS Technologies’ Rosette Language Processing (RLP) tools [naren2014forecasting, doyle2014forecasting] to generate a set of unique words or phrases corresponding to each article.

  3. Uppercase to lowercase. In this step, we converted the uppercase letters in each extracted word to lowercase letters. For example, both the terms Salmonella and salmonella convey the same meaning, so they were converted to a single term salmonella.

  4. Removal of stop words. In the final step, we removed all the stop words such as in, by, of, at, all, etc. from the set of unique words or phrases extracted from each article.
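Steps 2 to 4 above can be sketched in a few lines. This is an illustration only: the study used the RLP tools for tokenization and lemmatization, whereas the sketch substitutes a regular-expression tokenizer and a tiny hand-written lemma lookup. `STOP_WORDS`, `LEMMAS` and `preprocess` are hypothetical names, and both word lists are illustrative subsets.

```python
import re

# Illustrative subsets only; a real pipeline would use full resources.
STOP_WORDS = {"in", "by", "of", "at", "all", "and", "the", "a"}
LEMMAS = {
    "traveled": "travel", "travelled": "travel", "travels": "travel",
    "traveling": "travel", "travelling": "travel",
}

def preprocess(text):
    """Return the set of unique lowercase, lemmatized, non-stop-word tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)           # step 2: tokenization
    lemmas = [LEMMAS.get(t.lower(), t) for t in tokens]  # step 2: toy lemmatization
    lowered = [t.lower() for t in lemmas]                # step 3: lowercase
    return {t for t in lowered if t not in STOP_WORDS}   # step 4: stop words

words = preprocess("Travelling by bus: Salmonella and salmonella")
```

In this example `words` contains only `travel`, `bus` and `salmonella`: the two spellings of salmonella collapse into one term and the stop words are dropped.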

Generative process of the supervised topic model

In this section, we discuss the generative process of the supervised topic model in detail. We first define the notion of a topic in the supervised topic model. In unsupervised topic models [blei2003latent, ghosh2015rare, matsubara2012fast, rosen2004author], each topic is defined as a discrete probability distribution over all the words in the vocabulary. In the supervised topic model, the notion of a topic is extended and defined as the convex combination of two discrete probability distributions: the seed topic distribution and the regular topic distribution [jagarlamudi2012incorporating]. The seed topic distribution can only generate words from the seed set, and is therefore defined as a discrete probability distribution over only the words in the seed set. The regular topic distribution, on the other hand, is free to generate any word, including the seed words, so a regular topic is defined as a discrete probability distribution over all the words in the vocabulary. We assume that each regular topic is associated with exactly one seed topic, i.e., there is a one-to-one correspondence between seed and regular topics.

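for each topic do
    Draw the seed topic distribution
    Draw the regular topic distribution
    Draw the sampling probability
    Draw the distribution over time points
for each location do
    Draw the distribution over topics
    for each entry do
        Draw topic
        Draw indicator variable
        Draw word
        Draw timestamp

Algorithm 2 Generative process of the supervised topic model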

The generative process of the supervised topic model is described in Algorithm 2. Given the disease topics, the locations and the entries observed for each location, the supervised topic model uses location-specific and topic-specific discrete probability distributions to model the generation of the word and the time point in each entry. To generate an entry for a location, we first sample a topic from the location-specific discrete probability distribution over disease topics. To generate the word, we choose either the seed topic distribution or the regular topic distribution corresponding to the sampled topic. An indicator variable sampled from a Bernoulli distribution decides whether the word is drawn from the seed topic distribution or the regular topic distribution; the parameter of this Bernoulli distribution is called the sampling probability for the topic. Once the distribution is chosen, the word is generated from it. Finally, the time point is drawn from the topic-specific discrete probability distribution over the time points.
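The process described above can be simulated with a minimal sketch. It makes several assumptions that are not taken from the paper: every Dirichlet prior is flat (all parameters equal to 1), whereas the model uses asymmetric priors for the time and location distributions; the sampling probability is fixed at 0.7 for every topic; and all names (`dirichlet`, `generate`, the toy vocabulary and seed sets) are hypothetical.

```python
import random

def dirichlet(alpha, rng):
    """Sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate(n_entries, vocab, seed_sets, locations, n_times, pi=0.7, seed=0):
    """Simulate entries (location, word, time, topic, drawn_from_seed)."""
    rng = random.Random(seed)
    n_topics = len(seed_sets)
    # Topic-level draws (flat priors are an assumption of this sketch).
    seed_dist = [dirichlet([1.0] * len(s), rng) for s in seed_sets]
    regular_dist = [dirichlet([1.0] * len(vocab), rng) for _ in range(n_topics)]
    time_dist = [dirichlet([1.0] * n_times, rng) for _ in range(n_topics)]
    entries = []
    for loc in locations:
        theta = dirichlet([1.0] * n_topics, rng)  # location-specific topic mix
        for _ in range(n_entries):
            k = rng.choices(range(n_topics), weights=theta)[0]
            from_seed = rng.random() < pi         # Bernoulli indicator
            if from_seed:
                word = rng.choices(seed_sets[k], weights=seed_dist[k])[0]
            else:
                word = rng.choices(vocab, weights=regular_dist[k])[0]
            t = rng.choices(range(n_times), weights=time_dist[k])[0]
            entries.append((loc, word, t, k, from_seed))
    return entries

vocab = ["dengue", "mosquito", "rabies", "bite", "fever", "dog"]
seed_sets = [["dengue", "mosquito"], ["rabies", "bite"]]  # one seed set per topic
sample = generate(5, vocab, seed_sets, ["delhi", "mumbai"], n_times=4)
```

Words drawn through the seed branch always belong to the seed set of the sampled topic, while the regular branch can emit any vocabulary word, including seed words.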

Choice of priors.

The topic-specific distribution over time points is drawn from an asymmetric Dirichlet prior [rubin2012statistical, wallach2009rethinking] parameterized by a vector with one component per time point, as defined below in equation (4).

(4)

where the count term is the sum of the count variable across those tuples in which the word is a seed word related to the disease topic and the time point equals the time point in equation (4), for any location in the set of locations. In other words, the count term accounts for the occurrence of seed words related to the topic at that time point. A higher occurrence of seed words indicates higher prominence of the topic at that time point, and vice versa. The asymmetric prior is therefore used to incorporate prior information into the supervised topic model regarding the prominence of the disease topic at different time points. The hyperparameter in equation (4) is an additional smoothing parameter that contributes a flat pseudocount to each component of the prior vector. Additive smoothing assigns non-zero probabilities to those time points for which we have no prior information (zero occurrence of seed words) related to the topic.
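The construction of the temporal prior can be sketched as follows. The sketch assumes that each tuple has the form (word, time index, location, count) and that the prior component for a topic and time point is simply the seed-word count plus the flat pseudocount, with no further normalization; both points are assumptions, as are the function name `temporal_prior` and the smoothing value 0.5.

```python
def temporal_prior(tuples, seed_sets, n_times, smoothing):
    """Build one prior vector over time points per topic.

    Each component starts at the flat pseudocount `smoothing` and grows by
    the counts of tuples whose word is a seed word of that topic.
    """
    prior = [[smoothing] * n_times for _ in seed_sets]
    for word, t, _location, count in tuples:
        for k, seeds in enumerate(seed_sets):
            if word in seeds:
                prior[k][t] += count  # summed over all locations
    return prior

tuples = [
    ("dengue", 0, "delhi", 3),
    ("mosquito", 0, "mumbai", 2),
    ("malaria", 1, "mumbai", 4),
    ("cricket", 1, "delhi", 9),   # not a seed word, so it is ignored
]
seed_sets = [{"dengue", "mosquito"}, {"malaria", "mosquito"}]
prior = temporal_prior(tuples, seed_sets, n_times=3, smoothing=0.5)
```

The last time point has no seed-word occurrences for either topic, yet it keeps a non-zero weight of 0.5, which is the role of additive smoothing described above.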

The location-specific distribution over disease topics is also associated with an asymmetric Dirichlet prior, parameterized by a vector as defined below in equation (5).

(5)

where the count term is the sum of the count variable across those tuples in which the word is a seed word related to the disease topic and the location equals the location in equation (5), for any time point in the study range. In other words, the count term accounts for the occurrence of seed words related to the topic at that location. The hyperparameter is the additional smoothing parameter that contributes a non-zero pseudocount to each component of the prior vector. Additive smoothing assigns non-zero probabilities to those locations for which we have no prior information (zero occurrence of seed words) related to the topic.

Finally, the seed topic distribution and the regular topic distribution are drawn from symmetric Dirichlet priors [wallach2009rethinking], where every component of each parameter vector takes the value of the corresponding hyperparameter.

Choice of hyperparameters.

A hyperparameter is defined as the parameter of a prior distribution. The hyperparameters are set to fixed values that were chosen heuristically; improved performance of the supervised topic model could be achieved via efficient hyperparameter optimization [wallach2009rethinking]. As suggested in Jagarlamudi et al. [jagarlamudi2012incorporating], we set the sampling probability to a constant value of 0.7 for each topic.

Inference via collapsed Gibbs sampling

The key problem in the supervised topic model is posterior inference. This amounts to reversing the defined generative process and inferring the output (latent) parameters given the observed tuples. A standard approach to posterior inference in topic models is collapsed Gibbs sampling [griffiths2004finding], a Markov Chain Monte Carlo (MCMC) method.

To estimate the model parameters via collapsed Gibbs sampling, we need to compute the conditional probability distribution of the topic assignment for each tuple, or entry, given the topic assignments for all other entries. We have three scenarios, as shown below.

  • If the word in the entry is a regular word and the candidate topic is a regular topic, then the conditional probability distribution is defined below in equation (6).

    (6)
  • If the word in the entry is a regular word and the candidate topic is a seed topic, then the conditional probability is zero, since a regular word cannot be generated from any of the seed topic distributions.

  • If the word in the entry is a seed word, then it can be generated from either the seed topic or the regular topic. If the word is generated from a seed topic, then the conditional probability distribution is defined below in equation (7). On the other hand, if the word is generated from a regular topic, then the conditional probability distribution is defined below in equation (8).

    (7)
    (8)

Equations (6), (7) and (8) are expressed in terms of four counts, each computed across all entries except the current one: the number of times the word is assigned to the regular topic, the number of times the seed word is assigned to the seed topic, the number of times the time point is assigned to the topic, and the number of times the location is associated with the topic. The equations also use the component of the location prior vector corresponding to the topic and the component of the temporal prior vector corresponding to the time point.

Implementing the collapsed Gibbs sampler.

The collapsed Gibbs sampler for the supervised topic model is straightforward to implement. It involves setting up the required count variables and randomly initializing them; the sampler then runs iteratively, and on each iteration a topic is sampled for each entry according to equation (6) if the word in the entry is a regular word, or according to equations (7) and (8) if it is a seed word. The required count variables are the four counts that appear in equations (6), (7) and (8). For simplicity and efficiency, we also keep running totals: the total number of times any word in the vocabulary is assigned to a given topic, the total number of times any word in the set of seed words is assigned to the corresponding seed topic, the total number of times any time point is assigned to a given topic, and the total number of times any topic is associated with a given location. Finally, in addition to these count variables, we require an array that holds the topic assignment for each entry or tuple. Once we choose a topic for a particular entry, the chosen topic is stored in the array and the count variables are incremented at the positions relevant to the entry. After the Gibbs iterations, the count variables can be used to compute the output (latent) parameters, as shown below in equation (9).

(9)

where the four estimated quantities are the probability of a topic given a location, the probability of a word given a topic, the probability of a seed word given a seed topic, and the temporal trend value of a topic at a time point. We ran the Gibbs sampler for 300 iterations.
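The bookkeeping described above can be sketched as a small sampler. It is a simplified illustration rather than the authors' implementation: the location and time priors are symmetric (`alpha`, `eta`) instead of the seed-count-based asymmetric priors, every entry is counted once (the count variable is ignored), and all names and hyperparameter values are hypothetical except the sampling probability of 0.7 and the default of 300 iterations.

```python
import random

def gibbs(entries, vocab, seed_sets, n_locs, n_times, iters=300,
          alpha=0.1, eta=0.1, beta_r=0.01, beta_s=0.01, pi=0.7, seed=0):
    """Collapsed Gibbs sketch; entries are (location_index, word, time_index)."""
    rng = random.Random(seed)
    K, V = len(seed_sets), len(vocab)
    n_reg = [dict.fromkeys(vocab, 0) for _ in range(K)]  # word -> regular topic
    n_seed = [dict.fromkeys(s, 0) for s in seed_sets]    # seed word -> seed topic
    n_reg_tot, n_seed_tot = [0] * K, [0] * K             # running totals
    n_time = [[0] * n_times for _ in range(K)]           # time point -> topic
    n_time_tot = [0] * K
    n_loc = [[0] * K for _ in range(n_locs)]             # topic -> location
    z = [None] * len(entries)                            # (topic, is_seed) per entry

    def update(i, k, is_seed, delta):
        loc, word, t = entries[i]
        if is_seed:
            n_seed[k][word] += delta
            n_seed_tot[k] += delta
        else:
            n_reg[k][word] += delta
            n_reg_tot[k] += delta
        n_time[k][t] += delta
        n_time_tot[k] += delta
        n_loc[loc][k] += delta

    def options(i):
        loc, word, t = entries[i]
        opts, weights = [], []
        for k in range(K):
            shared = ((n_loc[loc][k] + alpha) * (n_time[k][t] + eta)
                      / (n_time_tot[k] + n_times * eta))
            opts.append((k, False))                      # regular topic k
            weights.append(shared * (1 - pi) * (n_reg[k][word] + beta_r)
                           / (n_reg_tot[k] + V * beta_r))
            if word in seed_sets[k]:                     # seed topic k: seed words only
                opts.append((k, True))
                weights.append(shared * pi * (n_seed[k][word] + beta_s)
                               / (n_seed_tot[k] + len(seed_sets[k]) * beta_s))
        return opts, weights

    for i in range(len(entries)):                        # random initialization
        z[i] = rng.choice(options(i)[0])
        update(i, *z[i], 1)
    for _ in range(iters):
        for i in range(len(entries)):
            update(i, *z[i], -1)                         # exclude the current entry
            opts, weights = options(i)
            z[i] = rng.choices(opts, weights=weights)[0]
            update(i, *z[i], 1)

    # Smoothed temporal trend of each topic, computed from the final counts.
    trend = [[(n_time[k][t] + eta) / (n_time_tot[k] + n_times * eta)
              for t in range(n_times)] for k in range(K)]
    return z, trend

vocab = ["dengue", "mosquito", "rabies", "bite", "fever", "dog"]
seed_sets = [{"dengue", "mosquito"}, {"rabies", "bite"}]
entries = [(0, "dengue", 0), (0, "mosquito", 0), (0, "fever", 0), (0, "dengue", 1),
           (1, "rabies", 2), (1, "bite", 2), (1, "dog", 2), (1, "rabies", 1)]
z, trend = gibbs(entries, vocab, seed_sets, n_locs=2, n_times=3, iters=20)
```

Because a regular word never receives a seed-topic option, the second scenario above (zero probability) is enforced by construction, and each row of `trend` sums to 1.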

Baseline methods for case count estimation

We compared EpiNews-ARNet with two baseline methods, namely Casecount-ARMA and EpiNews-ARMAX. In Casecount-ARMA, we fitted an autoregressive-moving-average model (ARMA [box2011time]) over past disease case counts to generate case count estimates, as shown below in equation (10).

(10)

where the two order parameters are the orders of the autoregressive (AR) and moving average (MA) components, respectively, and the error terms represent white noise. For further details, including the boundary conditions of ARMA, please refer to Box et al. [box2011time]. Casecount-ARMA does not use any information related to temporal topic trends. In EpiNews-ARMAX, however, we used an autoregressive-moving-average model with external input variables (ARMAX [box2011time]). As shown below in equation (11), ARMAX incorporates information from both past case counts and temporal topic trends in order to estimate case counts. As in EpiNews-ARNet, the external input variables are represented by the temporal topic trends.

(11)

where the two order parameters are the orders of the autoregressive (AR) and moving average (MA) components, respectively. For further details, please refer to Box et al. [box2011time].
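For orientation, the textbook forms of the two baseline models are written out below. These are the standard ARMA and ARMAX definitions, not a reconstruction of equations (10) and (11) as typeset by the authors. All symbols are assumptions: $y_t$ is the case count at time $t$, $x_t$ is an external input (a temporal topic trend or sampled case count), $p$ and $q$ are the AR and MA orders, $\varepsilon_t$ is white noise, and the number of external lags $b$ is a hypothetical choice.

```latex
% ARMA(p, q) over past case counts (textbook form, assumed symbols)
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i}
        + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t

% ARMAX(p, q): the same model with lagged external inputs added
y_t = c + \sum_{i=1}^{p} \phi_i \, y_{t-i}
        + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
        + \sum_{m=1}^{b} \gamma_m \, x_{t-m} + \varepsilon_t
```

The only difference between the two is the external-input sum, which is how the temporal topic trends enter EpiNews-ARMAX while Casecount-ARMA sees past case counts alone.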

Figure 1: Correlation between disease case counts and temporal topic distributions or trends extracted by EpiNews for (a) whooping cough, (c) rabies, (e) salmonellosis, and (g) E. coli infection in U.S. Along with the temporal topic trends, we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) whooping cough, (d) rabies, (f) salmonellosis, and (h) E. coli infection. Note that the sampled case counts and the disease case counts share a similar numerical range, whereas the temporal topic trend values lie in a different numerical range (from 0 to 1).
Figure 2: Correlation between disease case counts and temporal topic distributions or trends extracted by EpiNews for (a) H7N9, (c) HFMD, and (e) dengue in China. Along with the temporal topic trends, we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) H7N9, (d) HFMD, and (f) dengue. Note that the sampled case counts and the disease case counts share a similar numerical range, whereas the temporal topic trend values lie in a different numerical range (from 0 to 1).
Figure 3: Correlation between disease case counts and temporal topic distributions or trends extracted by EpiNews for (a) ADD, (c) dengue, and (e) malaria in India. Along with the temporal topic trends, we also show the correlation between disease case counts and sampled case counts (generated by multinomial sampling from temporal topic trends) for (b) ADD, (d) dengue, and (f) malaria. Note that the sampled case counts and the disease case counts share a similar numerical range, whereas the temporal topic trend values lie in a different numerical range (from 0 to 1).
Figure 4: Temporal correlation between actual case counts and case counts estimated by the methods Casecount-ARMA, EpiNews-ARMAX and EpiNews-ARNet for (a) dengue and (b) HFMD in China. In (a) and (b), EpiNews-ARMAX-topic and EpiNews-ARNet-topic use temporal topic trends as external variables, whereas EpiNews-ARMAX-sample and EpiNews-ARNet-sample use sampled case counts as external variables. In (c), we show the temporal correlation between actual case counts and case counts estimated by EpiNews-ARNet-sample for ADD in India.

Seed words

| Whooping cough topic | Rabies topic | Salmonellosis topic | E. coli infection topic |
|---|---|---|---|
| child 0.1498 | animal 0.1596 | food 0.2056 | coli 0.2265 |
| school 0.1068 | rabies 0.1191 | salmonella 0.1031 | boil 0.0887 |
| cough 0.0828 | rabid 0.0718 | product 0.1013 | cell 0.0745 |
| pertussis 0.0701 | bite 0.0695 | recall 0.0878 | toxin 0.0628 |
| whoop 0.0691 | rabie 0.0674 | drug 0.0712 | escherichia 0.0617 |
| whooping 0.0679 | virus 0.0649 | consumer 0.0705 | clinical 0.0573 |
| infant 0.0596 | wild 0.0585 | contamination 0.0598 | chemical 0.0557 |
| student 0.0557 | bat 0.0472 | fda 0.0579 | kidney 0.0414 |
| contagious 0.0454 | raccoon 0.0471 | contaminate 0.0567 | microbiology 0.0402 |
| booster 0.0406 | skunk 0.0424 | abdominal 0.0351 | reaction 0.0397 |
| cold 0.0395 | fox 0.0422 | egg 0.0277 | hemolytic 0.0376 |
| coughing 0.0309 | wildlife 0.0379 | chicken 0.0275 | lettuce 0.0366 |
| nose 0.0304 | domestic 0.0323 | poultry 0.025 | uremic 0.036 |
| respiratory 0.0284 | saliva 0.0247 | arthritis 0.0145 | physical 0.0342 |
| mild 0.0269 | scratch 0.0237 | peanut 0.0139 | gene 0.0339 |
| tdap 0.0231 | quarantine 0.0213 | cantaloupe 0.01 | shiga 0.0202 |
| immunize 0.0212 | horse 0.0192 | shell 0.0086 | expression 0.0162 |
| runny 0.0198 | viral 0.0190 | typhimurium 0.0083 | chemistry 0.0149 |
| tetanus 0.0175 | livestock 0.0166 | newport 0.0082 | stec 0.0125 |
| breathe 0.0144 | mammal 0.0156 | enteritidis 0.0074 | biochemistry 0.0094 |

Regular words with higher probabilities

| Whooping cough topic | Rabies topic | Salmonellosis topic | E. coli infection topic |
|---|---|---|---|
| contact 0.0037 | pet 0.0037 | eat 0.0019 | transmit 0.0014 |
| young 0.0023 | contact 0.0037 | diarrhea 0.0019 | massachusetts 0.0013 |
| adult 0.0022 | cat 0.0028 | nausea 0.0013 | surface 0.0012 |
| vaccination 0.0019 | vaccination 0.0024 | foodborne 0.0013 | body 0.0012 |
| vaccine 0.0019 | florida 0.0015 | package 0.0012 | pennsylvania 0.0012 |
| california 0.0019 | vaccine 0.0014 | contaminated 0.0011 | blood 0.0012 |
| vaccinate 0.0018 | shot 0.0014 | meat 0.0011 | pathogen 0.0011 |
| parent 0.0017 | street 0.0013 | restaurant 0.0010 | resistant 0.0011 |
| woman 0.0015 | clinic 0.0012 | vomit 0.0010 | drinking 0.0011 |
| baby 0.0014 | texas 0.0010 | products 0.0008 | agricultural 0.0011 |
| immunization 0.0011 | park 0.0010 | cook 0.0008 | hygiene 0.0010 |
| kid 0.0009 | york 0.0010 | beef 0.0008 | raw 0.0009 |
| air 0.0008 | wound 0.0009 | raw 0.0007 | apple 0.0009 |
| weather 0.0007 | virginia 0.0008 | temperature 0.0006 | sandwich 0.0009 |
| pregnant 0.0006 | ferret 0.0007 | honey 0.0005 | milk 0.0008 |
| mother 0.0006 | brain 0.0007 | pepper 0.0004 | stool 0.0008 |
| dose 0.0006 | coyote 0.0005 | weather 0.0003 | parasite 0.0005 |
| antibiotic 0.0005 | nervous 0.0005 | salad 0.0003 | acs 0.0002 |
| pneumonia 0.0003 | canine 0.0002 | mango 0.0002 | receptor 0.0001 |

Table 5: Four disease topics (whooping cough, rabies, salmonellosis and E. coli infection) discovered by the supervised topic model from the HealthMap corpus for U.S. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

Seed words

| H7N9 topic | HFMD topic | Dengue topic |
|---|---|---|
| flu 0.1229 | hand 0.1573 | fever 0.2269 |
| bird 0.1225 | child 0.1384 | dengue 0.1586 |
| avian 0.1053 | mouth 0.1127 | mosquito 0.1052 |
| influenza 0.1051 | school 0.1016 | october 0.0826 |
| human 0.1031 | foot 0.0916 | water 0.0682 |
| virus 0.0832 | class 0.0734 | breeding 0.0559 |
| poultry 0.0786 | hfmd 0.0557 | street 0.0481 |
| market 0.0610 | parent 0.0546 | bite 0.0330 |
| animal 0.0360 | nursery 0.0343 | aedes 0.0317 |
| chicken 0.0303 | kindergarten 0.0294 | pain 0.0294 |
| respiratory 0.0230 | oral 0.0192 | breed 0.0280 |
| spring 0.0227 | intestinal 0.0185 | park 0.0269 |
| farm 0.0224 | infant 0.0178 | sanitation 0.0179 |
| farmer 0.0213 | mumps 0.0174 | borne 0.0175 |
| slaughter 0.0194 | measles 0.0172 | albopictus 0.0168 |
| winter 0.0179 | herpes 0.0140 | rain 0.0139 |
| egg 0.0125 | enterovirus 0.0135 | hemorrhagic 0.0125 |
| pandemic 0.0117 | encephalitis 0.0124 | vector 0.0115 |
| h7n9 0.0012 | dysentery 0.0117 | larva 0.0089 |
| h5n1 0.0000 | ulcer 0.0093 | aegypti 0.0066 |

Regular words with higher probabilities

| H7N9 topic | HFMD topic | Dengue topic |
|---|---|---|
| zhejiang 0.0034 | shandong 0.0028 | guangdong 0.0071 |
| beijing 0.0034 | hunan 0.0025 | guangzhou 0.0056 |
| shanghai 0.0030 | care 0.0015 | site 0.0013 |
| agriculture 0.0015 | rash 0.0008 | temperature 0.0010 |
| pneumonia 0.0013 | meningitis 0.0007 | weather 0.0009 |
| temperature 0.0011 | viral 0.0007 | muscle 0.0008 |
| food 0.0010 | hepatitis 0.0007 | blood 0.0006 |
| eat 0.0009 | body 0.0006 | urban 0.0005 |
| duck 0.0008 | tuberculosis 0.0006 | bleed 0.0004 |
| pigeon 0.0008 | childhood 0.0005 | diarrhea 0.0004 |
| cook 0.0006 | palm 0.0004 | medicine 0.0004 |
| vaccine 0.0006 | organ 0.0003 | stagnant 0.0004 |
| tamiflu 0.0005 | skin 0.0003 | spray 0.0003 |
| meat 0.0004 | buttock 0.0003 | rainy 0.0003 |
| strain 0.0004 | childcare 0.0003 | climate 0.0003 |
| raw 0.0003 | blister 0.0002 | cough 0.0002 |
| pig 0.0003 | kidney 0.0002 | tank 0.0002 |

Table 6: Three disease topics (H7N9, HFMD and dengue) discovered by the supervised topic model from the HealthMap corpus for China. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.

Seed words

| ADD topic | Dengue topic | Malaria topic |
|---|---|---|
| fall 0.1284 | dengue 0.2090 | malaria 0.1504 |
| child 0.1148 | fever 0.0978 | mosquito 0.1166 |
| school 0.0949 | municipal 0.0759 | site 0.0994 |
| student 0.0868 | breeding 0.0658 | water 0.0893 |
| food 0.0837 | borne 0.0586 | awareness 0.0826 |
| consume 0.0611 | mosquito 0.0555 | lead 0.0735 |
| eat 0.0588 | september 0.0491 | vector 0.0678 |
| vomit 0.0549 | august 0.0429 | breed 0.0567 |
| meal 0.0525 | water 0.0408 | monsoon 0.0484 |
| stomach 0.0412 | rain 0.0385 | blood 0.0414 |
| diarrhea 0.0315 | aedes 0.0382 | construction 0.0331 |
| nausea 0.0304 | ward 0.0382 | camp 0.0316 |
| vomiting 0.0300 | platelet 0.0330 | drug 0.0228 |
| poisoning 0.0249 | breed 0.0300 | rainfall 0.0175 |
| poison 0.0241 | larva 0.0268 | typhoid 0.0148 |
| midday 0.0237 | blood 0.0264 | tribal 0.0133 |
| contaminated 0.0183 | bite 0.0246 | falciparum 0.0114 |
| cook 0.0179 | chikungunya 0.0206 | economic 0.0110 |
| lunch 0.0117 | vector 0.0199 | anopheles 0.0099 |
| contaminate 0.0105 | monsoon 0.0084 | plasmodium 0.0084 |

Regular words with higher probabilities

| ADD topic | Dengue topic | Malaria topic |
|---|---|---|
| village 0.0032 | civic 0.0038 | mumbai 0.0025 |
| bihar 0.0023 | delhi 0.0026 | virus 0.0015 |
| inflammatory 0.0020 | virus 0.0018 | maharashtra 0.0011 |
| sample 0.0018 | temperature 0.0014 | stagnant 0.0011 |
| odisha 0.0017 | fogging 0.0014 | insect 0.0009 |
| ache 0.0011 | haryana 0.0009 | garbage 0.0008 |
| sick 0.0010 | spray 0.0009 | flu 0.0008 |
| pain 0.0008 | stagnant 0.0008 | spraying 0.0007 |
| iron 0.0008 | infection 0.0008 | aegypti 0.0007 |
| rice 0.0006 | aegypti 0.0008 | parasite 0.0006 |
| pesticide 0.0005 | drain 0.0007 | tank 0.0006 |
| flood 0.0004 | larval 0.0006 | leptospirosis 0.0005 |
| drink 0.0004 | stagnate 0.0005 | urban 0.0004 |
| sanitation 0.0004 | gutter 0.0003 | drainage 0.0003 |
| stale 0.0003 | rainwater 0.0002 | rainwater 0.0002 |
| drinking 0.0001 | urbanization 0.0001 | waterlog 0.0002 |

Table 7: Three disease topics (ADD, dengue and malaria) discovered by the supervised topic model from the HealthMap corpus for India. For each disease topic, we show the seed words and their corresponding probabilities in the seed topic distribution. Along with the seed words, we also show some of the regular words (having higher probabilities in the regular topic distribution) discovered by the supervised topic model related to these input seed words.