Surveys and empirical studies have long been a cornerstone of psychological, sociological and medical research, but these traditional methods pose challenges for researchers: they are time-consuming and costly, and may introduce biases or suffer from poor experimental design.
With the advent of big data and the increasing popularity of the internet and social media, larger amounts of data are now available to researchers than ever before. This offers strong promise for new avenues of research using automated analytic procedures, yielding a more fine-grained and at the same time broader picture of communities and populations as a whole Salathé (2018). Such methods allow for faster and more automated investigation of demographic variables. It has been shown that Twitter data can predict atherosclerotic heart disease (AHD) risk at the community level more accurately than traditional demographic data Eichstaedt et al. (2015). The same method has also been used to capture and accurately predict patterns of excessive alcohol consumption Curtis et al. (2018).
In this study, we utilize Twitter data to predict various health target variables (AHD, diabetes, and various types of cancer) to see how well language patterns on social media reflect the geographic variation of those targets. Furthermore, we propose a new method to study social media content by characterizing disease-related correlates of language, leveraging available demographic and disease information at the community level. In contrast to Eichstaedt et al. (2015), our method does not rely on word-based topic models, but instead leverages modern state-of-the-art text representation methods, in particular sentence embeddings, which have seen increasing use in the Natural Language Processing, Information Retrieval and Text Analytics fields in recent years. We demonstrate that our approach helps capture the semantic meaning of tweets, as opposed to features based merely on word frequencies, which come with robustness problems Brown and Coyne (2018); Schwartz et al. (2018). We examine the effectiveness of sentence embeddings in modeling language correlates of the medical target variables (disease outcomes).
We are given a large quantity of text (sentences or tweets) in the form of social media messages by individuals. Each individual—and therefore each sentence—is assigned to a predefined category, for example a geographic region or a population subset. We assume the number of sentences to be significantly larger than the number of communities. Furthermore, we assume that the target variable of interest, for example disease mortality or prevalence rate, is available for each community (but not for each individual). Our system consists of two subsystems:
The predictive subsystem makes predictions of target variables (e.g. AHD mortality rate) based on aggregated language features. The resulting linear model is trained using $K$-fold cross-validated Ridge regression, and its predictions are applicable at the community level (e.g. counties) or at the individual level.
The averaged regression weights from the prediction system allow for interpretation of the system: we use a fixed clustering (obtained from all sentences without any target information), and then rank each topic cluster with respect to the prediction weight vector from the predictive subsystem above. The top- and bottom-ranked topic clusters for each target variable give insights into known and potentially novel correlations of topics with the target medical outcome.
In summary, the community association is used as a proxy or weak labelling to correlate individual language with community-level target variables. The following subsections give a more detailed description of the two subsystems.
2.1 System Description
Let $\mathcal{S} = \{s_1, \ldots, s_n\}$ be the set of sentences (e.g. tweets), with their total number denoted as $n$. Each sentence is associated to exactly one of the $m$ communities $\mathcal{C} = \{c_1, \ldots, c_m\}$ (e.g. geographic regions); the function $g : \mathcal{S} \to \mathcal{C}$ defines this mapping. Let $\mathbf{y} \in \mathbb{R}^m$ be the target vector for an arbitrary target variable, so that each community $c_j$ has a corresponding target value $y_j$.
Preprocessing and Embeddings.
The complete linguistic preprocessing pipeline of a sentence is incorporated by the function $P$, which represents an arbitrary sentence as a sequence of tokens. Each sentence $s_i$ is then represented by a $d$-dimensional embedding vector $\mathbf{e}_i := \mathrm{emb}(P(s_i)) \in \mathbb{R}^d$, providing a numerical representation of the semantics of the given short text.
While our method is generic for any text representation method, here Sent2Vec Pagliardini et al. (2018) was chosen for its computational efficiency and scalability to large datasets.
2.2 Feature Aggregation
We average the sentence embedding vectors over each community to obtain that community's language features. Formally, the complete feature matrix of all sentences is denoted as $\mathbf{E} \in \mathbb{R}^{n \times d}$, with row $i$ holding the embedding $\mathbf{e}_i$. The sentence embedding features are then averaged over each community $c_j$: feature $k$ of the averaged embedding for a given community $c_j$ is defined as
$$\bar{e}_{j,k} := \frac{1}{n_j} \sum_{i \,:\, g(s_i) = c_j} e_{i,k},$$
where $n_j$ is the number of sentences belonging to community $c_j$. Consequently, the aggregated community-level embedding matrix is given by $\bar{\mathbf{E}} \in \mathbb{R}^{m \times d}$.
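The aggregation step can be sketched in a few lines of NumPy; the embeddings, dimensions and community assignments below are toy values chosen purely for illustration:

```python
import numpy as np

# Toy setup: n = 5 sentence embeddings of dimension d = 3,
# each assigned to one of m = 2 communities via the mapping g.
E = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [1., 1., 0.],
              [0., 0., 2.],
              [0., 0., 4.]])
g = np.array([0, 0, 0, 1, 1])  # community index for each sentence

m = g.max() + 1
# Average the sentence embeddings within each community to obtain
# the community-level feature matrix (one row per community).
E_bar = np.stack([E[g == j].mean(axis=0) for j in range(m)])
```

Each row of `E_bar` is the mean of the embeddings of that community's sentences, matching the averaging formula above.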
2.3 Train-Test Split
Leveraging the targets available for each community, our regression method is applied to the aggregated features $\bar{\mathbf{E}}$ and the target $\mathbf{y}$. We employ $K$-fold cross-validation: the previously defined set $\mathcal{C}$ is split into $K$ pairwise disjoint subsets $\mathcal{C}_1, \ldots, \mathcal{C}_K$ of as equal size as possible, such that $\mathcal{C}_1 \cup \cdots \cup \mathcal{C}_K = \mathcal{C}$ and $\mathcal{C}_k \cap \mathcal{C}_l = \emptyset$ for $k \neq l$. The training set for a fold $k$ is $\mathcal{C}_k^{\text{train}} := \mathcal{C} \setminus \mathcal{C}_k$, with the corresponding test set $\mathcal{C}_k^{\text{test}} := \mathcal{C}_k$. For each split $k$, the train and test embedding matrices $\bar{\mathbf{E}}_k^{\text{train}}$ and $\bar{\mathbf{E}}_k^{\text{test}}$ consist of the rows of $\bar{\mathbf{E}}$ belonging to the respective communities. Accordingly, we define the target vectors $\mathbf{y}_k^{\text{train}}$ and $\mathbf{y}_k^{\text{test}}$.
2.4 Ridge Regression
For each train-test split $k$, we perform linear regression from the community-level textual features $\bar{\mathbf{E}}_k^{\text{train}}$ to the health target variable $\mathbf{y}_k^{\text{train}}$. We employ Ridge regression Hoerl and Kennard (1970). In our context, Ridge regression is defined as the following optimization problem:
$$\mathbf{w}_k := \operatorname*{arg\,min}_{\mathbf{w} \in \mathbb{R}^d} \; \big\| \bar{\mathbf{E}}_k^{\text{train}} \mathbf{w} - \mathbf{y}_k^{\text{train}} \big\|_2^2 + \lambda \, \|\mathbf{w}\|_2^2,$$
where the optimal solution is
$$\mathbf{w}_k = \Big( \big(\bar{\mathbf{E}}_k^{\text{train}}\big)^{\top} \bar{\mathbf{E}}_k^{\text{train}} + \lambda \mathbf{I} \Big)^{-1} \big(\bar{\mathbf{E}}_k^{\text{train}}\big)^{\top} \mathbf{y}_k^{\text{train}}.$$
Within each fold we tune the regularization parameter $\lambda$.
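The cross-validated Ridge pipeline can be sketched on synthetic data; all sizes are toy values, and a single fixed regularization strength stands in for the per-fold tuning described above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, K = 40, 5, 10                              # communities, feature dim, folds
E_bar = rng.normal(size=(m, d))                  # community-level features
w_true = rng.normal(size=d)
y = E_bar @ w_true + 0.1 * rng.normal(size=m)    # synthetic targets

lam = 1.0                                        # fixed here; tuned per fold in the paper
folds = np.array_split(rng.permutation(m), K)    # K pairwise disjoint community subsets

preds = np.empty(m)
weights = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(m), test_idx)
    E_tr, y_tr = E_bar[train_idx], y[train_idx]
    # Closed-form Ridge solution: w = (E^T E + lam*I)^(-1) E^T y
    w = np.linalg.solve(E_tr.T @ E_tr + lam * np.eye(d), E_tr.T @ y_tr)
    preds[test_idx] = E_bar[test_idx] @ w        # out-of-fold predictions
    weights.append(w)

# Fold-averaged weight vector, later reused for interpretation.
w_bar = np.mean(weights, axis=0)
```

Concatenating the per-fold test predictions yields one out-of-fold prediction per community, which is what the evaluation metrics are computed on.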
2.5 Prediction Subsystem
Let $\hat{\mathbf{y}}_k^{\text{test}} := \bar{\mathbf{E}}_k^{\text{test}} \mathbf{w}_k$ be the predicted values for the test set of split $k$. The concatenated prediction vector for all splits is $\hat{\mathbf{y}} := \big(\hat{\mathbf{y}}_1^{\text{test}}, \ldots, \hat{\mathbf{y}}_K^{\text{test}}\big)$. Accordingly, we define the concatenated true target vector as $\tilde{\mathbf{y}} := \big(\mathbf{y}_1^{\text{test}}, \ldots, \mathbf{y}_K^{\text{test}}\big)$, i.e., the set of individual scalars in $\tilde{\mathbf{y}}$ is identical to the entries of the original target vector $\mathbf{y}$. The predictive performance of the system can be assessed through the following metrics:
Pearson Correlation Coefficient
Mean Absolute Error of prediction (MAE)
Classification Accuracy for Quantile Prediction
The first two metrics are evaluated with the vectors $\hat{\mathbf{y}}$ and $\tilde{\mathbf{y}}$ from all folds. In the quantile-based assessment, we independently bin the true values and the predicted values into $Q$ quantiles, so that each individual true and predicted value is assigned to a quantile $q \in \{1, \ldots, Q\}$. These assignments can be used to visually compare results on a heat-map, or as regular evaluation scores in terms of accuracy.
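The three metrics can be sketched on toy values; the binning via `np.quantile`/`np.digitize` is one possible implementation of the independent quantile assignment described above:

```python
import numpy as np

# Toy true and predicted community-level values.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y_pred = np.array([1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2])

r = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation
mae = np.mean(np.abs(y_true - y_pred))   # mean absolute error

# Quantile accuracy: bin true and predicted values independently
# into Q quantiles and compare the two assignments.
Q = 4
def quantile_bins(v, q):
    edges = np.quantile(v, np.linspace(0, 1, q + 1)[1:-1])
    return np.digitize(v, edges)

acc = np.mean(quantile_bins(y_true, Q) == quantile_bins(y_pred, Q))
```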
2.5.1 Ridge-Weight Aggregation
For the final prediction model, the regression weights from Ridge regression are averaged over the $K$ folds, i.e. $\bar{\mathbf{w}} := \frac{1}{K} \sum_{k=1}^{K} \mathbf{w}_k$. For every sentence embedding $\mathbf{e}_i$, the prediction is then computed as $\hat{y}_i := \mathbf{e}_i^\top \bar{\mathbf{w}}$.
2.6 Interpretation Subsystem: Cluster Ranking
We employ predefined textual topic clusters—which are independent of any target values—in order to enable interpretation of the textual correlates. Each cluster is a collection of sentences and should, intuitively, be interpretable as a topic, e.g. separate topics about indoor and outdoor activities as shown in Fig. 4. For each cluster, a ranking score can be computed with respect to a linear prediction model $\bar{\mathbf{w}}$ as defined above. Let $\mathcal{S}_t \subseteq \mathcal{S}$ be the set of sentences assigned to cluster $t$. The score $r_t$ for the cluster is the average of all predictions within cluster $t$:
$$r_t := \frac{1}{|\mathcal{S}_t|} \sum_{s_i \in \mathcal{S}_t} \mathbf{e}_i^\top \bar{\mathbf{w}}.$$
By ordering the scores $r_t$ of all clusters, we obtain the final ranking sequence of all clusters with respect to the target-specific model $\bar{\mathbf{w}}$.
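The cluster-ranking step can be sketched as follows, with random embeddings, weights and cluster assignments standing in for the real ones (all sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, T = 100, 8, 5                      # sentences, embedding dim, clusters
E = rng.normal(size=(n, d))              # sentence embeddings
w_bar = rng.normal(size=d)               # averaged Ridge weight vector
assign = rng.integers(0, T, size=n)      # fixed, target-independent cluster assignment

scores = E @ w_bar                       # per-sentence predictions
# Score each cluster by the mean prediction over its sentences.
cluster_scores = np.array([scores[assign == t].mean() for t in range(T)])
ranking = np.argsort(cluster_scores)[::-1]   # cluster indexes, highest score first
```

The top and bottom of `ranking` correspond to the topic clusters most positively and most negatively associated with the target.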
Clustering Preprocessing. To obtain the fixed clustering: as $\mathbf{E}$ is a very large matrix, clustering might require subsampling to reduce computational complexity. Hence, $n' \ll n$ of the embeddings in $\mathbf{E}$ are randomly subsampled, by selecting $n'$ row indexes of $\mathbf{E}$ uniformly at random. We define the resulting subsampled data matrix as $\mathbf{E}' \in \mathbb{R}^{n' \times d}$.
$\mathbf{E}'$ is clustered with the Yinyang K-Means algorithm Ding et al. (2015). We use $T$ centroids and the cosine similarity as distance function. The cluster assignment vector $\mathbf{a} \in \{1, \ldots, T\}^{n'}$ assigns one cluster to each embedding in $\mathbf{E}'$. Accordingly, the operator $a(s_i)$ indicates the assigned cluster for a given sentence in $\mathbf{E}'$ (see cluster ranking above). The cluster centers are collected in a matrix in $\mathbb{R}^{T \times d}$.
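The paper uses the GPU-based Yinyang K-Means implementation (libKMCUDA); as an illustrative stand-in, the sketch below implements a minimal spherical K-Means in NumPy, which realizes cosine-similarity clustering by keeping all points and centroids on the unit sphere. The data and cluster count are toy values:

```python
import numpy as np

def spherical_kmeans(X, T, iters=20, seed=0):
    """Minimal K-Means under cosine similarity: normalize all vectors
    to the unit sphere, then alternate assignment (max cosine similarity)
    and update (mean of members, renormalized)."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centers = X[rng.choice(len(X), T, replace=False)]
    for _ in range(iters):
        sims = X @ centers.T                  # cosine similarity to each center
        assign = sims.argmax(axis=1)
        for t in range(T):
            members = X[assign == t]
            if len(members):
                c = members.mean(axis=0)
                centers[t] = c / np.linalg.norm(c)
    return assign, centers

rng = np.random.default_rng(2)
E_sub = rng.normal(size=(200, 16))            # toy stand-in for the subsampled matrix
assign, centers = spherical_kmeans(E_sub, T=10)
```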
3 Data sources
We apply the method described in Section 2 to the following setting: the pool of sentences consists of geotagged Tweets whose assigned locations lie in the United States. The geotags are categorized into US counties, which represent the set of communities. The target variables are health-related variables, for example normalized mortality or prevalence rates; we focus on cancer and AHD mortality as well as on diabetes prevalence. Hence, the quantile-based predictions give a categorization of the Ridge regression predictions on a US-county level, and the ranked topics indicate what language might relate to higher or lower rates of the corresponding disease. Table 1 provides an overview of the size of the data sources, the year the data was collected in, and the mean of the target variables. Not all counties are covered in the publicly available datasets, which are usually limited to more populous counties. The collected Tweets are from 2014 and 2015. The target variables are the union-averaged values from 2014 and 2015: if the target variable is available for both years, the two values are averaged; if a county data point is only available for one of the two years, we use this standalone value.
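The union-averaging of the two yearly values can be sketched with `np.nanmean`, where NaN marks a county missing from a year's dataset (the county values are hypothetical toy numbers):

```python
import numpy as np

# Hypothetical per-county target values for 2014 and 2015;
# NaN marks a county not covered in that year's dataset.
y_2014 = np.array([10.0, np.nan, 30.0])
y_2015 = np.array([12.0, 20.0, np.nan])

# Union-average: mean over the available years; a value present
# in only one year is used as-is.
target = np.nanmean(np.vstack([y_2014, y_2015]), axis=0)
```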
3.1 Datorium Tweets
Tweets are short messages of no more than 140 characters (Twitter increased the limit to 280 characters in 2017, which does not affect our data) published by users of the Twitter platform. They reflect discussions, thoughts and activities of its users. We use a dataset of approximately 144 million tweets collected from 1 June 2014 to 1 June 2015 Datorium (2017). Each tweet was geotagged by the submitting user with exact GPS coordinates, and all tweets are from within the US, allowing accurate county-level mapping of individual tweets.
3.2 AHD & Cancer Mortality
Our source of the statistical county-level target variables for AHD and cancer is the CDC WONDER database CDC (2018) (Wide-ranging Online Data for Epidemiologic Research, US Centers for Disease Control and Prevention). Values are given as deaths per 100,000 population.
3.3 Diabetes Prevalence
We use county-wise age-adjusted diabetes prevalence data from the year 2013 CDC (2016), provided as percent of the population afflicted with type II diabetes. The data is available for almost all the 3144 US counties, making it a valuable target to use.
The results of our method for the various target variables are listed in Table 2, along with the performance of the baseline model outlined in Section 4.1. We provide the Pearson correlation ($r$) and the mean absolute error (MAE) of our system, along with the baseline model's Pearson correlation.
4.1 LDA Baseline Model
We reimplemented the approach proposed by Eichstaedt et al. (2015) as a baseline for comparison, and were able to reproduce their findings about AHD with recent data: similar results were obtained with the Datorium Twitter dataset Datorium (2017) and CDC AHD data from 2014 and 2015. Their approach averages topics generated with Latent Dirichlet Allocation (LDA) over the tweets of each county and uses them as features for Ridge regression. We do not use any hand-curated emotion-specific dictionaries, as these did not impact performance in our experiments. We used the predefined Facebook LDA coefficients of Eichstaedt et al. (2015) and updated them with the word frequencies of our collected Twitter data Datorium (2017). Our results are computed with 10-fold cross-validation and without any feature selection.
4.2 Detailed Results
In this section we discuss a selection of our results in detail, with additional information available in Appendix A.1.
Diabetes has a strong demographic bias, with a higher prevalence in the south-east of the US, the so-called diabetes belt. Compared to the national average, the African-American population in the diabetes belt has a more than two-fold higher risk of diabetes Barker et al. (2011), and the south-east of the US has a large African-American population. Therefore, linguistic features common in African-American English Green (2002) are a strong predictor of diabetes rates. The model learns these linguistic features, as seen in Figure 3, and its predictions closely match the actual geographic distribution, as seen in Figure 2. Moderate alcohol consumption is linked to a lower risk of type II diabetes compared to no or excessive consumption Koppes et al. (2005); the most strongly negatively correlated word clouds in Figure 3 support this finding.
In this paper, we introduced a novel approach for language-based prediction and correlation of community-level health variables. For various health-related demographic variables, our approach outperforms similar models based on traditional demographic data in most cases (Table 2), using only geolocated tweets. Our approach provides a method for discovering novel correlations between open-vocabulary topics and health variables, allowing researchers to discover yet unknown contributing factors based on large collections of data with minimal effort.
Our findings, when applying our method to AHD risk, diabetes prevalence and the risk of various types of cancers, using geolocated tweets from the US only, show that a large variety of health-related variables can be predicted with surprisingly high precision based solely on social media data. Furthermore, we show that our model identifies known and novel risk or protective factors in the form of topics. Both aspects are of interest to researchers and policy makers. Our model proved to be robust for the majority of targets it was applied to.
For AHD risk, we show that our approach significantly outperforms previous models based on topic models such as LDA or traditional statistical models Eichstaedt et al. (2015), achieving an $r$-value of 0.46, an increase of 0.09 over previous approaches. For diabetes prevalence, our model correctly predicts the geographic distribution by identifying, among other features, linguistic features common in high-prevalence areas, with an $r$-value of 0.73. For melanoma risk, it finds a high correlation with the popularity of outdoor activities, corresponding to exposure to sunlight being one of the main risk factors for skin cancer, with an overall $r$-value of 0.72.
One of the main limitations of our approach is the need for a large collection of sentences for each community, as well as a large number of communities with target variables. Results may be unreliable when this is not the case, for example for social media posts by single individuals, or when modeling target values that are only available for a few counties. Further research is needed to ascertain whether significant results can also be achieved in such scenarios, and whether the robustness of our approach is improved compared to bag-of-words-based baselines Eichstaedt et al. (2015); Brown and Coyne (2018); Schwartz et al. (2018). Furthermore, all mentioned approaches rely on correlation, and thus provide no way to determine causation or to rule out potential underlying factors not captured by the model. Even though using social media data introduces a non-negligible bias towards users of social media, our approach was able to predict target variables tied to very different age groups, which is encouraging and supports the robustness of our approach.
Our method captures language features on a community scale. This raises the question of how these findings can be translated to the individual person. In principle, a community-based model as described above could be used to rank the social media posts or messages of an individual user with respect to specific health risks. However, as we currently do not have ground-truth values on the individual level, and since an individual user's social media history has very high variance, this is left for future investigation.
Future research should also address the applicability of our model to textual data other than Twitter and potentially from non-social media sources, to communities that are not geography based, to the time evolution of topics and health/lifestyle statistics, as well as to targets that are not health related. The general methodology offers promise for new avenues for data-driven discovery in fields such as medicine, sociology and psychology.
We would like to thank Ahmed Kulovic and Maxime Delisle for valuable input and discussions.
- Barker et al. (2011) Lawrence E. Barker, Karen A. Kirtland, Edward W. Gregg, Linda S. Geiss, and Theodore J. Thompson. 2011. Geographic distribution of diagnosed diabetes in the us: a diabetes belt. American journal of preventive medicine, 40(4):434–439.
- Brown and Coyne (2018) Nicholas J. L. Brown and James C. Coyne. 2018. Does Twitter language reliably predict heart disease? A commentary on Eichstaedt et al. (2015a). PeerJ, 6:e5656.
- CDC (2016) CDC. 2016. County data. National Center for Chronic Disease Prevention and Health Promotion, Division of Diabetes Translation.
- CDC (2018) CDC. 2018. CDC WONDER. WONDER – Wide-ranging Online Data for Epidemiologic Research.
- Curtis et al. (2018) Brenda Curtis, Salvatore Giorgi, Anneke EK. Buffone, Lyle H. Ungar, Robert D. Ashford, Jessie Hemmons, Dan Summers, Casey Hamilton, and H. Andrew Schwartz. 2018. Can Twitter be used to predict county excessive alcohol consumption rates? PloS one, 13(4):e0194290.
- Datorium (2017) Datorium. 2017. Geotagged Twitter posts from the United States: A tweet collection to investigate representativeness [online].
- Ding et al. (2015) Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Yinyang K-means: A drop-in replacement of the classic K-means with consistent speedup. In ICML'15 - Proceedings of the 32nd International Conference on Machine Learning.
- Eichstaedt et al. (2015) Johannes C. Eichstaedt, Hansen Andrew Schwartz, Margaret L. Kern, Gregory Park, Darwin R. Labarthe, Raina M. Merchant, Sneha Jha, Megha Agrawal, Lukasz A. Dziurzynski, Maarten Sap, et al. 2015. Psychological language on Twitter predicts county-level heart disease mortality. Psychological science, 26(2):159–169.
- Elwood et al. (1985) J. Mark Elwood, Richard P. Gallagher, G. B. Hill, and J. C. G. Pearson. 1985. Cutaneous melanoma in relation to intermittent and constant sun exposure—the Western Canada Melanoma Study. International journal of cancer, 35(4):427–433.
- Green (2002) Lisa J. Green. 2002. African American English: a linguistic introduction. Cambridge University Press.
- Hoerl and Kennard (1970) Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
- Koppes et al. (2005) Lando LJ. Koppes, Jacqueline M. Dekker, Henk FJ. Hendriks, Lex M. Bouter, and Robert J. Heine. 2005. Moderate alcohol consumption lowers the risk of type 2 diabetes: a meta-analysis of prospective observational studies. Diabetes care, 28(3):719–725.
- Pagliardini et al. (2018) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2018. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In NAACL 2018 - Conference of the North American Chapter of the Association for Computational Linguistics.
- Salathé (2018) Marcel Salathé. 2018. Digital epidemiology: what is it, and where is it going? Life sciences, society and policy, 14(1):1.
- Schwartz et al. (2018) H. Andrew Schwartz, Salvatore Giorgi, Margaret L. Kern, Gregory Park, Maarten Sap, Darwin R. Labarthe, Emily E. Larson, Martin Seligman, Lyle H. Ungar, et al. 2018. More evidence that Twitter language predicts heart disease: a response and replication.
Appendix A Appendices
A.1 Additional Figures
A.2 Implementation Details
Tweets were collected according to the provided Datorium IDs using the Tweepy library (https://www.tweepy.org/). The tweets were then imported into Google BigQuery (https://cloud.google.com/bigquery/) and processed using Apache Beam (https://beam.apache.org/). The sentence embeddings were computed using the official Sent2Vec source code and the provided 700-dimensional pre-trained model for tweets, using bigrams (https://github.com/epfml/sent2vec). Clustering was performed by libKMCUDA (https://github.com/src-d/kmcuda). Scikit-learn (https://scikit-learn.org/stable/) was used for 10-fold cross-validation, Ridge regression, calculating the correlation, and hyperparameter search.