Chronic diseases (such as diabetes, cancer, and heart disease) are a leading cause of death in the United States every year, even though many of those diseases are preventable. The goal of holding health awareness events is to draw attention to diseases and educate the public about them. Take National Breast Cancer Awareness Month as an example: the National Breast Cancer Foundation devotes its efforts to educating women on early detection to reduce the risk of breast cancer, helping those diagnosed with breast cancer, and raising funds to support research. Companies also join National Breast Cancer Awareness Month; for example, Estée Lauder Companies Inc. releases exclusive Pink Ribbon products to help improve awareness of breast cancer and raise funds for medical research.
It is estimated that the large majority of the information flowing through two-way telecommunication was carried by the Internet by 2007. The number of Internet users has increased enormously and surpassed 3 billion in 2014. Google has led the U.S. core search market for the past decade, and millions of people worldwide use it to search for health topics every day. In particular, it held about three quarters of the search engine market in 2017.
We want to determine whether health awareness events are effective in raising public awareness of their health topics, as reflected in higher Google search frequencies. The results could benefit a variety of parties; for instance, the Department of Public Health and public interest groups could reallocate resources among events more effectively.
1.2 Related Work
Using Internet statistics to explain and predict quantities has been popular among researchers. Bollen et al. classified tweets into different moods to quantify the daily public mood and used it to predict the stock market with several models. The idea was that people intentionally or unintentionally disclose their thinking online, including on social media such as Twitter, and that this might be a factor in stock price variation. What was interesting was that the authors used tweets, which are not traditionally considered an economic factor, unlike classical factors such as interest rates, GDP, and unemployment rates.
Previous studies showed that Google Trends (GT) data could predict current influenza-like activity levels 1-2 weeks earlier than the conventional Centers for Disease Control and Prevention surveillance systems, by comparing GT data with actual disease counts across different case studies. The search frequency increased dramatically before and during a disease outbreak. Similarly, Cook et al. studied H1N1 disease cases. The increasing search frequency could be useful in identifying the presence of diseases and the media effect on web users' search behaviors.
GT data have also proven effective for modeling in other areas, such as marketing and information security. Youn et al. used GT data and Autoregressive Integrated Moving Average (ARIMA) models to nowcast the TV market for a few brands and were able to reveal the correlation. An accurate prediction for the near future of the market was obtained. Rech used GT data to analyze the attention that products received and the cause-effect relations among a few factors in software engineering. Kuo et al. demonstrated that the lifecycles of internet security systems follow the same pattern, with four stages (zero day, publicity, cooldown, and silence) at different scales. The authors discovered that GT data showed an interesting correlation with the lifecycle and claimed the reason was that when a vulnerability attracted a lot of attention, the risk became large and the lifecycle turned to the decay stages. Choi et al. conducted time series analysis on GT data to forecast some economic indicators and showed that some GT data were fitted very well by ARIMA models. In their case, they focused on the intrinsic structure of the data sets without incorporating any explanatory variables or time series. Mondal et al. used a transfer function noise model to study the effect of monthly rainfall on the Ganges River flow, with both data sets being time series. In our case, we use an impulse series as the explanatory variable.
Seifter et al. showed that GT data were highly related to public attention on diseases, according to a study on Lyme disease. Jacobsen and Jacobsen analyzed the number of articles published and the number of early detections of disease in the event month for breast cancer and concluded that the event did promote public attention. The study quantitatively indicated that a successful event actually educated the public and encouraged early detection. Here we want to identify the effective events from a pool of events. Ayers et al. studied the Great American Smokeout health awareness event using a number of data sets, such as the numbers of news stories, tweets, and Wikipedia visits. Their proposed evaluation method for event effectiveness was to first fit counterfactual data by assuming the event had not occurred, then compare them with the actual data. Although their approach was quantitative, they used the percent change, for which the threshold of significance is unclear.
2 Datasets and Preprocessing
We focus on monthly health awareness events in the US and select a set of 46 disease-related events. Since GT data are based on the search frequency of one or a few words, which we call a query, we select a query for each event and present them in Appendix A. For events with more than one meaningful query, we picked the one with the highest frequency.
On the Google Trends webpage, users can track the search popularity of queries in different languages across regions starting from January 2004. Weekly or monthly GT data may be downloaded as a CSV file, depending on the total time range. Since the raw counts of queries can be huge numbers, Google rescales them to a range from 0 to 100, with the highest frequency being 100. Four options, namely Region, Time, Category, and Search Type, are needed to specify a search; in this work they are set to United States, 2004-2017, Health, and Web Search, respectively.
For example, Figure 1 shows the query Breast Cancer as a time series plot. Google Trends also provides a graph of popularity by region, as shown in Figure 2. The top three subregions by search popularity were Pennsylvania, Maryland, and Alabama.
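The rescaling described above can be illustrated with a minimal sketch. Google's exact procedure is not documented in this paper, so the function below (hypothetical name `rescale_to_100`) only assumes the stated behavior: the highest frequency maps to 100 and values are reported as integers. The raw counts are made up.

```python
def rescale_to_100(raw_counts):
    """Scale raw counts so that the largest maps to 100 (assumed integer rounding)."""
    peak = max(raw_counts)
    if peak == 0:
        # Nothing was searched at all: every point stays at zero.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]


raw = [1200, 3400, 6800, 5100]  # made-up raw search counts
scaled = rescale_to_100(raw)
print(scaled)
```

Because only relative values survive this rescaling, series for different queries are not directly comparable unless they are downloaded together.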
2.2 Data Preprocessing
Monthly data from 2004 to 2017 are collected for the 46 selected queries. All data points are integers between 0 and 100, with no missing data. We rescale every month to an equal length of 30 days to reduce the variation caused by the uneven number of days. In particular, the January, March, May, July, August, October, and December data points are multiplied by \(30/31\), and the February data points are multiplied by \(30/28\).
We provide three different quantitative methods to evaluate effectiveness, with their thresholds clearly stated. The main method is transfer function noise modeling with an impulse series as input. Inferences based on the Wilcoxon rank-sum test and the binomial distribution are then used to corroborate the results.
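The month-length adjustment can be sketched as below. This is a minimal illustration rather than the authors' code: the function name `rescale_to_30_days` is hypothetical, and using the true calendar length (so that a leap-year February has 29 days) is an assumption, since the text does not say how leap years are handled.

```python
import calendar


def rescale_to_30_days(value, year, month):
    """Scale a monthly value to an equivalent 30-day month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    days_in_month = calendar.monthrange(year, month)[1]
    return value * 30 / days_in_month


# Made-up values: a 31-day month, a 28-day February, and a 30-day month.
print(rescale_to_30_days(62, 2010, 1))  # scaled by 30/31
print(rescale_to_30_days(56, 2010, 2))  # scaled by 30/28
print(rescale_to_30_days(50, 2010, 4))  # unchanged
```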
3.1 Transfer Function Noise Model
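The (seasonal) autoregressive integrated moving average models (ARIMA or SARIMA) support interpretation and forecasting by modeling the intrinsic pattern of a single response time series \(y_t\). A general SARIMA model has the form:
\[\phi(B)\,\Phi(B^s)\,(1-B)^d\,(1-B^s)^D\, y_t = \theta(B)\,\Theta(B^s)\, a_t,\]
where \(B\) is the backshift operator, \(B y_t = y_{t-1}\), \(s\) is the seasonal period, \(d\) and \(D\) are the orders of regular and seasonal differencing,
\[\phi(B) = 1 - \sum_{i=1}^{p}\phi_i B^{i},\quad \theta(B) = 1 - \sum_{j=1}^{q}\theta_j B^{j},\quad \Phi(B^s) = 1 - \sum_{k=1}^{P}\Phi_k B^{ks},\quad \Theta(B^s) = 1 - \sum_{l=1}^{Q}\Theta_l B^{ls},\]
\(a_t\) is a white noise, and \(\phi_i\), \(\theta_j\), \(\Phi_k\), and \(\Theta_l\) are constant coefficients. This model can be expressed in a more compact notation as:
\[y_t = \frac{\vartheta(B)}{\varphi(B)}\, a_t,\]
with \(\varphi(B) = \phi(B)\,\Phi(B^s)\,(1-B)^d\,(1-B^s)^D\) and \(\vartheta(B) = \theta(B)\,\Theta(B^s)\).
Suppose there is another series, say \(x_t\), which we call an input series, that has a relationship with \(y_t\). The transfer function noise model is built to describe this situation as
\[y_t = \nu(B)\, x_t + \frac{\vartheta(B)}{\varphi(B)}\, a_t. \tag{3.1}\]
Intuitively, \(\nu(B)\) is determined by the structure of the input and measures the effect of \(x_t\) on \(y_t\), and \(\vartheta(B)/\varphi(B)\) captures the intrinsic pattern of \(y_t\) itself.
We construct an impulse time series \(x_t\) with \(x_t = 0\) if month \(t\) is a non-event month and \(x_t = 1\) if it is an event month. We want to analyze the effect of \(x_t\) on \(y_t\). (3.1) is called the intervention model, whose operator \(\nu(B)\) usually has a fairly simple form. We let \(\nu(B) = \omega_0\), since we are interested in how much the impulse contributes to the current response, which results in:
\[y_t = \omega_0\, x_t + \frac{\vartheta(B)}{\varphi(B)}\, a_t. \tag{3.2}\]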
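The impulse input described above can be built directly from the calendar. The sketch below is a minimal illustration that assumes the usual 0/1 coding (1 in the event month, 0 otherwise); the function name `impulse_series` and the October example are illustrative and not taken from the authors' code.

```python
def impulse_series(start_year, end_year, event_month):
    """Return (year, month, x) triples with x = 1 in the event month and 0 otherwise."""
    series = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            x = 1 if month == event_month else 0
            series.append((year, month, x))
    return series


# Example: an October event over the 2004-2017 study window.
october = impulse_series(2004, 2017, event_month=10)
indicator = [x for (_, _, x) in october]
print(len(indicator), sum(indicator))
```

The resulting indicator has one entry per month and can be passed as the exogenous input to whichever time series package is used to fit the intervention model.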
We first determine whether there is a seasonality in each data set, that is whether an ARIMA model or a SARIMA model should be used and then fit the best ARIMA/SARIMA model.
Secondly, we fit a transfer function noise model. The input series is simply an impulse function, so there is no prewhitening step. To determine the orders of the autoregressive and moving-average operators of the noise term in (3.2), we make two attempts and choose the better one:
The first attempt simply uses the same orders as the ARIMA/SARIMA model. In the second attempt, we first replace each event-month data point with the average of the previous and the next month. The idea is that, after this replacement, the new data are our best guess for what the data would have been if no event had happened. We use the new data to determine the orders of the ARIMA/SARIMA model and use them in (3.2). The better attempt is chosen as the final transfer function noise model.
We conclude that the event contributes to the number of searches if the transfer function noise model fits better than the ARIMA/SARIMA model and the coefficient of the impulse input is significant at the chosen significance level.
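The replacement step used in the second attempt above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `replace_event_months` is hypothetical, the values are made up, and an event month at either end of the series is replaced by its single available neighbor, a case the text does not address.

```python
def replace_event_months(values, is_event):
    """Replace each event-month value with the mean of the previous and next month."""
    smoothed = list(values)
    last = len(values) - 1
    for t, flag in enumerate(is_event):
        if not flag:
            continue
        neighbours = []
        if t > 0:
            neighbours.append(values[t - 1])
        if t < last:
            neighbours.append(values[t + 1])
        if neighbours:
            # Average of whichever neighbours exist (both, except at the series ends).
            smoothed[t] = sum(neighbours) / len(neighbours)
    return smoothed


values = [10, 40, 20, 30]   # made-up monthly values
is_event = [0, 1, 0, 0]     # the second month is the event month
print(replace_event_months(values, is_event))
```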
3.2 Wilcoxon Rank Sum Test
Blair and Higgins showed that the Wilcoxon test usually holds large power advantages over the t-test and is asymptotically more efficient than the t-test. In our case, the sample sizes are unequal and the sample distributions are unclear, so we believe the Wilcoxon rank-sum test is more appropriate than the t-test.
Data points are split into two groups: event months and non-event months. The question then becomes whether the event-month group has larger values. The null hypothesis is that the two groups of observations come from the same population. The Wilcoxon test is based on ranking the data points of the combined sample. Numeric ranks are assigned to all observations, with 1 for the smallest value. If a group of observations is tied, each of them receives the average of the ranks they span. The Wilcoxon rank-sum test statistic is the sum of the ranks for observations from one of the samples, and the corresponding statistics are calculated as:
\[U_1 = n_1 n_2 + \frac{n_1(n_1+1)}{2} - R_1, \qquad U_2 = n_1 n_2 + \frac{n_2(n_2+1)}{2} - R_2,\]
where \(n_1\) and \(n_2\) are the two sample sizes, and \(R_1\) and \(R_2\) are the sums of the ranks in samples 1 and 2, respectively. The smaller of \(U_1\) and \(U_2\) is the one used to consult significance tables to estimate the p-value.
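The ranking procedure described above can be sketched in plain Python. This is a minimal illustration that assumes the common convention \(U_i = n_1 n_2 + n_i(n_i+1)/2 - R_i\); the function names and the sample values are made up, and a statistics package would normally be used to obtain the p-value.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_statistics(sample1, sample2):
    """Return (R1, R2, U1, U2) for two samples ranked together."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    r2 = sum(ranks[n1:])
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return r1, r2, u1, u2


event = [90, 85, 85]          # made-up event-month values
non_event = [60, 70, 85, 50]  # made-up non-event-month values
r1, r2, u1, u2 = rank_sum_statistics(event, non_event)
print(r1, r2, min(u1, u2))
```

A useful sanity check is that \(U_1 + U_2 = n_1 n_2\) and \(R_1 + R_2 = N(N+1)/2\) for combined sample size \(N\).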
3.3 Inference by Binomial Distribution
We use the null hypothesis that the search frequencies are completely random, implying that the event has no effect. Under the null hypothesis, every month has equal probability \(p = 1/12\) of being the yearly peak, since none of the selected diseases is seasonal in the way an influenza-like illness is. Let \(X\) be the number of yearly peaks that fall in the event month over 14 years. Among 14 years, the probability that a certain month is the peak \(k\) times is
\[P(X = k) = \binom{14}{k}\left(\frac{1}{12}\right)^{k}\left(\frac{11}{12}\right)^{14-k}.\]
In particular, \(k = 4\) is the smallest value for which the tail probability \(P(X \ge k)\) is less than 0.05, with \(P(X \ge 4) \approx 0.024\). Therefore, an event month that is the yearly peak at least 4 times is evidence that the event-month data are significantly different from the other months.
Health awareness events that show evidence of significance in all three methods described above are defined as effective health awareness events. Health awareness events with insignificant results for all three tests are defined as ineffective health awareness events. Events with inconsistent results across the methods are defined as unclear.
Details for two selected events are presented in this section as case studies. All 46 selected queries have been analyzed, and the results are presented in Appendix B.
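The threshold of four peaks can be checked numerically. The sketch below is a minimal illustration that assumes the one-sided tail probability \(P(X \ge k)\) under a Binomial(14, 1/12) null; the function names are hypothetical.

```python
from math import comb


def peak_tail_probability(k, years=14, p=1 / 12):
    """P(X >= k) when X ~ Binomial(years, p)."""
    return sum(
        comb(years, j) * p**j * (1 - p) ** (years - j)
        for j in range(k, years + 1)
    )


def peak_threshold(alpha=0.05, years=14, p=1 / 12):
    """Smallest k whose tail probability P(X >= k) falls below alpha."""
    for k in range(years + 1):
        if peak_tail_probability(k, years, p) < alpha:
            return k
    return None


print(peak_threshold())                     # 4
print(round(peak_tail_probability(3), 3))   # 0.105, not below 0.05
print(round(peak_tail_probability(4), 3))   # 0.024
```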
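The three-way classification above amounts to a simple decision rule, sketched here. The function and argument names are hypothetical; each argument is assumed to be a boolean stating whether the corresponding method found a significant event-month effect.

```python
def classify_event(tfn_significant, wilcoxon_significant, binomial_significant):
    """Combine the three test outcomes into effective / ineffective / unclear."""
    outcomes = (tfn_significant, wilcoxon_significant, binomial_significant)
    if all(outcomes):
        return "effective"
    if not any(outcomes):
        return "ineffective"
    # Any mix of significant and insignificant results is left undecided.
    return "unclear"


print(classify_event(True, True, True))
print(classify_event(False, False, False))
print(classify_event(True, False, True))
```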
4.1 Case 1: National Breast Cancer Awareness Month
One out of eight women in the USA is diagnosed with breast cancer, and breast cancer is the top cause of cancer death for women 40 to 50 years of age and the second leading cause of cancer death for women in the USA. The National Breast Cancer Awareness Event is dedicated to drawing public attention to prevention and early detection, supporting patients, and fundraising for scientific research.
The time series plot in Figure 3 shows a slightly declining trend, with peaks at the event month, October. Three different tests (periodogram, auto-correlation function, and linear model comparison) are conducted to check for seasonality. For the breast cancer data, two of the three tests indicate that there is no seasonality; therefore we choose an ARIMA model instead of SARIMA and obtain the best ARIMA model and transfer function model as described in Section 3.1.
The results are shown in Table 1. We see that the adjusted \(R^2\) is about 0.41 for the ARIMA model and about 0.58 for the transfer function noise model, and the p-value for the parameter "eventmonth" is below the significance level. Therefore we conclude that the event has a significant effect on the number of searches for breast cancer.
|Orders||Adjusted R square||p value of event coefficient|
Next, to conduct the Wilcoxon rank-sum test, we split the data into an event-month subset and a non-event-month subset. The resulting p-value indicates that we should reject the null hypothesis that the two groups of observations come from the same population. Further, we notice that the mean of the event months is greater than that of the non-event months; thus, during event months the search frequencies are higher than in the rest of the year.
For the binomial approach, among 14 years of Google Trends data for the query breast cancer, we found that all 14 yearly peaks occur in October (see Figure 4), which is greater than the threshold of 4. This is evidence that event-month frequencies are greater than those of the other months.
In sum, all our results consistently indicate that the National Breast Cancer Awareness event is effective in increasing the search frequency of breast cancer in October.
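Counting the yearly peaks that fall in the event month can be sketched as follows. The data layout (a dict mapping each year to twelve monthly values), the function name, and the numbers are all assumptions made for illustration; ties are resolved toward the earliest month, which the text does not specify.

```python
def count_event_month_peaks(monthly_values, event_month):
    """Count the years whose largest monthly value falls in the event month (1-12)."""
    count = 0
    for values in monthly_values.values():
        # list.index returns the first maximum, so ties go to the earliest month.
        peak_month = values.index(max(values)) + 1
        if peak_month == event_month:
            count += 1
    return count


# Made-up data for three years, January through December.
data = {
    2004: [10, 10, 10, 10, 10, 10, 10, 10, 10, 50, 20, 10],
    2005: [10, 10, 10, 10, 60, 10, 10, 10, 10, 10, 10, 10],
    2006: [5, 5, 5, 5, 5, 5, 5, 5, 5, 40, 5, 5],
}
print(count_event_month_peaks(data, event_month=10))
```

The count returned here is the quantity compared against the threshold of four peaks.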
4.2 Case 2: American Stroke Awareness Month
Stroke is one of the leading causes of death and serious long-term disability in the USA. More than 795,000 Americans have a stroke every year, and about 130,000 people die from stroke in the USA each year. To gain insight into public awareness of American Stroke Awareness Month, Google Trends data for the query stroke were obtained.
The time series plot in Figure 5 (a) shows a slightly declining trend before 2011 and an upward trend after 2011. We use an R function that applies three different tests (periodogram, auto-correlation function, and linear model comparison) to check for seasonality. For the stroke data, all three tests indicate that there is seasonality, meaning that a SARIMA model should be used. The outputs for the SARIMA model and the transfer function noise model are presented in Table 2. We see that the adjusted \(R^2\) is about 0.62 for the transfer function noise model, which is no better than the value of about 0.68 for the SARIMA model, and the p-value for the parameter "eventmonth" is not significant. Therefore we do not have evidence to conclude that the event has a significant effect on the number of searches for stroke.
|Orders||Adjusted R square||p value of event coefficient|
According to the one-sided Wilcoxon rank-sum test, the p-value is not significant, which means we have no compelling evidence of a higher search frequency for the query "stroke" in the event month of May.
From 2004 to 2017, only one yearly peak occurs in May (see Figure 6), which is less than the threshold of four peaks. In sum, all our results consistently indicate that there is no evidence that the Stroke Awareness event is effective in increasing the search frequency of stroke in May.
Ten events are concluded to be effective in raising public search frequency about the related diseases: Alcohol Awareness, Autism, Breast Cancer, Colon Cancer, Dental Health, Heart Disease, Immunization, National Nutrition, Ovarian Cancer, and SIDS. Eight events are unclear because of inconsistent results, and the other 28 are ineffective.
5 Conclusion and Discussion
According to the analysis of all 46 data sets, we found that 10 health awareness events are effective, showing strong evidence of significant seasonal patterns with peaks matching the event month; 28 events are classified as ineffective, and the rest are classified as unclear. Although a lack of attention is clearly undesirable, overheated events may consume too many public resources and diminish the perceived severity of other health topics.
One might suspect that effective events have higher search frequencies than the others, or the opposite. In fact, we checked the relative frequencies across effective, unclear, and ineffective events and found no relationship. There are effective events with high search frequencies and with low search frequencies, and the same holds for ineffective events.
Another interesting observation is that Diabetes was classified as unclear, which is somewhat counterintuitive. We compared all eight unclear events and found that the frequency for Diabetes is by far the largest, while the other seven events are relatively close to each other but far from Diabetes. We suspect that Diabetes is so prominent that considerable attention is paid to it during many months of the year, which makes the event month insignificant. Therefore, a possible future study is to examine whether some of the unclear and ineffective events are similar to the case of Diabetes. We may also consider the prevalence and severity of these diseases, since it is not practical to make every disease as well known as heart disease or breast cancer.
The classification in this study can benefit public health management and health awareness for public welfare. The Department of Public Health and public interest groups could reallocate resources between effective and ineffective health awareness events, especially to improve awareness of the topics of ineffective events. Corporate partners could take the opportunity to promote products or services related to effective health awareness events, such as pink-ribbon brooches and exclusive pink-ribbon products, and clinics need to be prepared for increased demand for health screening appointments.
-  CDC, National Prevention Strategy: America’s Plan for Better Health and Wellness, 2014. [Online]. Available: https://www.surgeongeneral.gov/priorities/prevention/strategy/report.pdf (last accessed on July 30, 2018)
-  Centers for Disease Control and Prevention, Update on Overall Prevalence of Major Birth Defects–Atlanta, Georgia, 1978-2005, 2008. [Online]. Available: http://www.cdc.gov/mmwr/preview/mmwrhtml/mm5701a2.htm (last accessed on July 30, 2018)
-  M. Hilbert and P. Lopez, The World’s Technological Capacity to Store, Communicate, and Compute Information, Science, vol. 332, no. 6025, pp. 60-65, 2011.
-  Internet Society, Internet Society Global Internet Report 2014, 2014. [Online]. Available: https://www.internetsociety.org/globalinternetreport/2014/ (last accessed on March 30, 2018)
-  comScore, comScore Search Engine Rankings [Online]. Available: https://www.statista.com/statistics/267161/market-share-of-search-engines-in-the-united-states/ (last accessed on July 30, 2018)
-  H. A. Johnson, M. M. Wagner, W. R. Hogan, W. Chapman, R. T. Olszewski, J. Dowling and G. Barnas, Analysis of web access logs for surveillance of influenza, Stud Health Technol Inform, Vols. 107:1202-6, 2004.
-  H. A. Carneiro and E. Mylonakis, Google Trends: A Web-Based Tool for Real-Time Surveillance of Disease Outbreaks, Clinical Infectious Diseases, Vols. 49:1557-64, 2009.
-  J. Bollen, H. Mao and X Zeng, Twitter Mood Predicts the Stock Market, Journal of Computational Science, Vol 2, pp. 1-8, 2011.
-  J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolinski and L. Brilliant, Detecting influenza epidemics using search engine query data, Nature, Vol. 457, 2009.
-  J. A. Doornik, Improving the Timeliness of Data on Influenza-like Illnesses using Google Search Data, University of Oxford, Technical report, pp. 1-21, 2009.
-  H.A. Carneiro and E Mylonakis, Google Trends: a Web-Based Tool for Real-Time Surveillance of Disease Outbreaks, Clinical Infectious Diseases 49(10):1557-64
-  S. Cook, C. Conrad, A. L. Fowlkes and M. H. Mohebbi, Assessing Google Flu Trends Performance in the United States during the 2009 Influenza Virus A (H1N1) Pandemic, 2011. [Online]. Available: DOI: 10.1371/journal.pone.0023610.
-  G. Eysenbach, Infodemiology: tracking flu-related searches on the web for syndromic surveillance, AMIA Annu Symp Proc, pp. 244-248, 2006.
-  S. Youn and H. Cho, Nowcast of TV Market using Google Trend Data, Journal of Electrical Engineering and Technology, Vol 11, pp 227-233, 2016
-  J. Rech, Discovering trends in software engineering with google trend, ACM SIGSOFT Software Engineering Notes, Vol 21, pp 1-2, 2007
-  C. Kuo, H. Ruan and S Chen, An Analysis of Security Patch Lifecycle Using Google Trend Tool, Seventh Asia Joint Conference on Information Security, 2012.
-  H Choi and H Varian, Predicting the Present with Google Trends, the Economic Record, vol 88, pp 2-9, 2012
-  M.S. Mondal and S.A. Wasimi, Periodic Transfer Function-Noise Model for Forecasting, Journal of Hydrologic Engineering, vol 10, 2005
-  A. Seifter, A. Schwarzwalder, K. Geis and J. Aucott, the Utility of Google Trends for Epidemiological Research: Lyme Disease as an Example, Geospatial Health vol 4, pp 135-137, 2010
-  G.D. Jacobsen and K.H. Jacobsen, Health Awareness Campaigns and Diagnosis Rates: Evidence from National Breast Cancer Awareness Month, Journal of Health Economics, vol 30, pp 55-61, 2011
-  J.W. Ayers and B.M. Althouse, Leveraging Big Data to Improve Health Awareness Campaigns: A Novel Evaluation of the Great American Smokeout, JMIR Public Health and Surveillance, vol 2, 2016
-  F. Wilcoxon, Individual Comparisons by Ranking Methods, Biometrics Bulletin, vol 1, pp 80-83, 1945
-  R.C. Blair and J.J. Higgins, A Comparison of the Power of Wilcoxon’s Rank-Sum Statistic to that of Student’s t Statistic under Various Nonnormal Distributions, Journal of Educational Statistics, vol 5, pp 309-335, 1980
-  ACS, Breast Cancer Facts and Figures 2011-2012.
-  SEER, Cancer Statistics Review 1975-2008-table 4.12, [Online]. Available: http://seer.cancer.gov/csr/1975_2008/results_single/sect_04_table.12.pdf (last accessed on July 30, 2018)
-  Centers for Disease Control and Prevention, Breast Cancer Statistics, 2014.
-  D. Mozaffarian et al., "Heart disease and stroke statistics--2015 update: a report from the American Heart Association," 2015.
-  Centers for Disease Control and Prevention and NCHS, "Underlying Cause of Death 1999-2013 on CDC WONDER Online Database," 2015.
A National Health Awareness Events with corresponding Selected Queries
|Health Awareness Event/Month||Query|
|National Birth Defects Prevention Month||Birth Defects|
|Cervical Health Awareness Month||Cervical|
|National Glaucoma Awareness Month||Glaucoma|
|Thyroid Awareness Month||Thyroid|
|American Heart Month||Heart Disease|
|National Children’s Dental Health Month||Dental Health|
|National Colorectal Cancer Awareness Month||Colon Cancer|
|National Endometriosis Awareness Month||Endometriosis|
|National Nutrition Month||National Nutrition|
|Multiple Sclerosis Education Month||Sclerosis|
|Alcohol Awareness Month||Alcohol Awareness|
|National Autism Awareness Month||Autism|
|Irritable Bowel Syndrome Month||Ibs|
|American Stroke Awareness Month||Stroke|
|Arthritis Awareness Month||Arthritis|
|National Asthma and Allergy Awareness Month||Asthma Allergy|
|National Celiac Disease Awareness Month||Celiac|
|Hepatitis Awareness Month||Hepatitis|
|National High Blood Pressure Education Month||High Blood Pressure|
|Lupus Awareness Month||Lupus|
|Mental Health Month||Mental Health|
|National Osteoporosis Awareness Month||Osteoporosis|
|Skin Cancer Detection and Prevention Month||Skin Cancer|
|National Aphasia Awareness Month||aphasia|
|Scoliosis Awareness Month||scoliosis|
|Eye Injury Prevention Month||eye injury|
|Amblyopia Awareness Month||amblyopia|
|National Immunization Awareness Month||immunization|
|Psoriasis Awareness Month||psoriasis|
|National Alcohol and Drug Addiction Recovery Month||alcohol drug addiction|
|National Cholesterol Education Month||cholesterol|
|Leukemia and Lymphoma Awareness Month||Leukemia|
|National Menopause Awareness Month||menopause|
|Ovarian Cancer Awareness Month||ovarian cancer|
|Prostate Awareness Month||prostate|
|National Breast Cancer Awareness Month||breast cancer|
|National Dental Hygiene Month||dental hygiene|
|National Depression and Mental Health Screening Month||depression|
|National Down Syndrome Awareness Month||down syndrome|
|SIDS Awareness Month||Sids|
|Spina Bifida Awareness Month||spina bifida|
|National Alzheimer’s Disease Awareness Month||alzheimer|
|American Diabetes Month||diabetes|
|National Epilepsy Awareness Month||epilepsy|
|Lung Cancer Awareness Month||lung cancer|
|Pancreatic Cancer Awareness Month||pancreatic cancer|
B The results of three methods for all 46 query data
|High Blood Pressure||0.8289||0||No||0.0038*||Ineffective|
C ARIMA and Transfer model comparison in JMP for all 46 events
Each pair of pictures shows one event, with the left being ARIMA/SARIMA and right one being Transfer Function model.