Analysis of Twitter has become a widespread approach for geo-spatial studies of human behavior, such as alcohol consumption [Kershaw, Rowe, and Stacey2014, Culotta2013] and exercise [Young2010], and human latent states, such as sickness [Paul and Dredze2011, Sadilek, Kautz, and Silenzio2012a, Sadilek et al.2013] and depression [Dos Reis and Culotta2015, Nambisan et al.2015, Tsugawa et al.2015]. However, nearly all prior work, with the notable exception of [Lamb, Paul, and Dredze2013], does not attempt to distinguish mere mentions of activities or states from self-reports of activity. Moreover, no attempt has been made to distinguish reports about future or past activities from in-the-moment reports, which provide finer details when geo-tagged tweets are used to map specific locations of activities. Further insights into the geo-location of activities can be obtained by inferring the home locations of the subjects involved. Home location helps analyze the number of members of a community engaging in an activity, the kinds of places where the activity occurs (e.g., home, commercial establishment, public place, etc.), and the distance people travel from home to participate in it. Prior research has used simple heuristics for predicting a social media user’s home location, such as the place from which the user most frequently tweets, or the most common last location of the day for the user’s posts [Pontes et al.2012a, Pontes et al.2012b, Cho, Myers, and Leskovec2011, Scellato et al.2011]. But such heuristics are inaccurate for a large percentage of users, e.g., in cases when users frequently visit multiple places.
We apply machine learning techniques on Twitter content to identify in-the-moment reports of user behaviors and to accurately predict users’ home locations within 100 meters. Using these tools, we develop new methods for a task of critical interest in public health: discovering patterns of alcohol use in urban and suburban settings. Such methods can help us better understand the occurrence, frequency, and settings of alcohol consumption, a health-risk behavior, and can lead to actionable information in prevention and public health.
Excessive alcohol use has a tremendous negative impact on health and communities. Drinking directly results in about 75,000 deaths annually in the US, making it the nation’s third leading cause of preventable death [Centers for Disease Control and Prevention and others2004]. Previous research [Kuntsche et al.2005, Naimi et al.2003] shows that social factors play an important role in developing drinking patterns over time. While social media such as Twitter is both ubiquitous and publicly available, little research has investigated the relationship between virtual social contexts and the alcohol referencing or alcohol-linked behaviors found there in various real-world community settings.
In this paper, we aim to predict where Twitter users are when they report on drinking. We report on several stages of work to accomplish this research objective. First, we collected geo-tagged tweets from urban, suburban, and rural areas of New York State. Using human computation, we created a training set that captures important details such as whether the tweet mentions drinking alcohol, the user drinking, or the user drinking at the time of tweeting. We created a hierarchy of three support vector machine (SVM) classifiers [Burges1998] to distinguish tweets up to these fine details. Each of these SVMs achieves an F-score above (value missing in the source) and is used to classify tweets from New York City and from Monroe County, a predominantly suburban area in upstate New York containing one medium-sized city (Rochester), in order to develop methods that can perform in “big city” as well as “small city” contexts of social media use.
We also performed fine-grained home location inference of Twitter users to generate community descriptions, such as to calculate the proportion of “social media drinkers” drinking at home, and to analyze how far people travel from home to drink-and-tweet. Existing home inference methods either rely on continuous and expensive GPS data, covering a small number of users, or suffer from poor accuracy. We trained an SVM classifier to predict home location for active users (users with as few as 5 geo-tagged tweets) within 100 by 100 meter grids. Considering the sparse and noisy nature of Twitter data that poses serious challenges in pinpointing where people live, our classifier achieves a high accuracy of 70%, covering 71% of active users in New York City. We also investigated ways to balance granularity and coverage. Prior work on home location has been limited to localizing at the city level; ours is the first to achieve block-level accuracy.
Latent States & Activities from Social Media
Most prior work on using Twitter data about users’ online behavior has estimated aggregate disease trends in a large geographic area, typically at the level of a state or a large city. Researchers have examined influenza tracking[Culotta2010, Achrekar et al.2012, Sadilek and Kautz2013, Broniatowski and Dredze2013, Brennan, Sadilek, and Kautz2013], mental health and depression [Golder and Macy2011, De Choudhury et al.2013], as well as general public health across a broad range of diseases [Brownstein, Freifeld, and Madoff2009, Paul and Dredze2011]. Some researchers have begun modeling health and contagion across individuals [Ugander et al.2012, White and Horvitz2008, De Choudhury et al.2013]. For example, [Sadilek, Kautz, and Silenzio2012b] showed that Twitter users who exhibit symptoms of influenza can be accurately detected using a language model based on word trigrams. A detailed epidemiological model can be subsequently built by following the interactions between sick and healthy individuals in a population, where physical encounters are estimated by spatio-temporal collocated tweets. nEmesis [Sadilek et al.2013] scored restaurants in New York City by the number of Twitter users who posted status updates from a restaurant and within the next several days posted self-reports of symptoms of food poisoning. Our hierarchical classifiers use the same kind of word-trigram features at each level.
Little prior work has attempted to distinguish true in-the-moment self-reports on Twitter from more general discussion of a condition or activity. A notable exception is [Lamb, Paul, and Dredze2013], which explored language models that could distinguish discussion of the flu from self-reports. This work enriched the set of n-gram language features by including manually-specified sets of words, features for hashtags and retweets, and various syntactic patterns. For separating general discussion from reports of some particular person being sick, n-grams were most important, followed by the manually-specified word classes. For separating reports of the user being sick from reports of others being sick, n-grams were again most important, followed by the hashtag/retweet features. The overall success of n-grams supports our n-gram based approach for latent activity detection. The authors did not use hierarchical classifiers or attempt to distinguish in-the-moment reports from those about the past or future.
Despite the huge public health costs exacted by alcohol use, commercial interests and individuals (for example, teens [Moreno et al.2009, Egan and Moreno2011]) do post about alcohol and drinking in social media. Alcohol-related posts are seen as credible reports by teens, and thus such posts can influence perceived social norms, a factor linked to the uptake of drinking behaviors [Litt and Stock2011].
In the case of alcohol use, social context certainly matters. For instance, survey research shows that having close friends who drink heightens alcohol use and perceptions about alcohol use in teen life in general [Jackson et al.2014, Polonec, Major, and Atwood2006]. Peer alcohol consumption behavior in one’s social network, particularly that of relatives and friends (not immediate neighbors and co-workers), is a risk factor for alcohol use, especially among adolescents [Rosenquist et al.2010, Ali and Dwyer2010].
When the geography of one’s daily life creates proximity to alcohol (i.e., greater spatial/temporal availability of on-premise or off-premise alcohol outlets, etc.), a well-documented risk factor for alcohol use and its array of related adverse public health consequences emerges [Campbell et al.2009, Weitzman et al.2003, Holmes et al.2014, Scribner et al.1999, Scribner et al.2008, Livingston2008a, Livingston2008b, Livingston2011, Kypri et al.2008, Chen, Grube, and Gruenewald2010, Scribner, MacKinnon, and Dwyer1994, Zhu, Gorman, and Horel2004, Britt et al.2005, Liang and Chikritzhs2011]. Modifying proximity is often explored as a public health policy means to reduce alcohol use, for instance, in neighborhoods [Sparks, Jernigan, and Mosher2011]. However, the association between neighborhood alcohol outlet density and percentage of alcohol consumers may be more complex due to variation in travel patterns and neighborhood styles, and mediated by proximity to one’s home (e.g., within one-mile) [Schonlau et al.2008].
Social media is a new, ubiquitous source of real-time information on community and individual public-health related behaviors. When seeking to apply social media to detect the social media ecology of health behaviors such as alcohol use, it is important to identify not only whether but where (the settings in which) the mentions or posts are occurring. As both geo-physical and virtual access to rapidly diffused messages about alcohol and its use may heighten risky drinking and related behaviors, methods are needed to permit the study of these potentially interacting influences. Such methods can reveal different risk patterns associated with different locations that were not previously known, and help inform more localized or targeted intervention strategy development. For instance, as social network structures are observable in social media, and as “neighbor” attributes can influence drinking behavior among online friends or followers, studying network influence in social media settings like Twitter may illuminate drinking risk patterns not previously known.
However, current methods for examining these influences are very limited. Methods for detecting problematic alcohol use in communities are typically opportunity or survey based (e.g., driver check-points, community surveys, ED admissions, or health care-based screenings), not often scalable to population levels due to resource restrictions. Research on how to vivify a community’s raft of social media posts to detect its alcohol use patterns is only now starting to emerge. For instance, [Tamersoy, De Choudhury, and Chau2015] distinguished long-term versus short-term drinking/smoking abstinence from the social media site Reddit. These researchers were able to use linguistic features from content posted, and social interaction features derived from users’ network structure through the application of supervised learning. In this paper, we propose new automated methods for identifying both whether and where self-reports of drinking are occurring among Twitter users in two major metropolitan regions of New York State.
Home Location Detection
With the knowledge of home locations, we can gain a better insight to human mobility patterns, as well as lifestyle in general. In [Scellato et al.2011, Cho, Myers, and Leskovec2011, Scellato, Noulas, and Mascolo2011], home location is the key origin to calculate the distance that people travel and to estimate the distance between social network users in a pairwise fashion. Home location has also been used to model individuals’ living conditions and lifestyles [Sadilek and Kautz2013]. We organize the discussion of related work on home location prediction by the type of data used.
There has been much prior work on using language features in non-geotagged social media posts to predict the home locations of users at a coarse grain, at the level of a city or state. In [Mahmud, Nichols, and Drews2012], linguistic features and place names from tweets were used to create a classifier that infers home locations at city, state and time zone levels in the top 100 most populated US cities with accuracies of 58%, 66%, and 78% respectively. This suggests that language models are not good for fine-grained home localization (in our case, within 100 meters). Similar results, accurate at most to several kilometers, appear in [Pontes et al.2012a]. In [Cheng, Caverlee, and Lee2010], the authors used a content-based method to detect Twitter users’ home cities, placing (percentage missing in the source) of active users within 100 miles of their actual home locations.
Others developed “single-attribute” models based on different social network features, for example, taking the value of users’ “Employment” as their home cities in Google+, or using geo-tags in FourSquare, Google+, and Twitter posts to predict the city. Geo-tagged Foursquare data was used in [Pontes et al.2012b] to infer home cities within 50 kilometers with 78% of user coverage. A dataset containing the traces of 2 million mobile phone users from a European country was used in [Cho, Myers, and Leskovec2011] to estimate home locations based on the places with most check-ins. The paper reported that by manual checking, the most check-ins method achieved 85% accuracy when the area was divided into 25 by 25 km cells.
Other researchers used simple heuristics to select the home location from the set of locations in a user’s geo-tagged posts. The most popular heuristics are to assume that the location with the most check-ins is home [Scellato et al.2011], or to assume that the common last location of the day from which one tweets is home [Sadilek and Kautz2013]. The accuracy and coverage of such heuristic approaches was not reported. We discovered that these prior methods individually suffered from low accuracy and/or coverage. For example, the most check-ins approach performs poorly when a user visits several places with similar frequencies.
Wearable GPS and Diary Data
GPS and diary data make home detection more precise and easier because they are denser and more continuous than social network location data, but they are more expensive to obtain, resulting in low population coverage when used in locating homes. In [Krumm2007], a device installed in each of 172 subjects’ vehicles recorded location coordinates every several seconds while the car was moving. The subjects reported the ground truth of their homes. The authors then used 4 heuristic algorithms to compute the coordinates of each subject’s home, and found that the best one was “last destination of a day”. The median distance error of their best algorithm was 60.7 meters. In [Hoh et al.2006], the researchers performed agglomerative clustering on the GPS traces of users until the clusters reached an average size of 100 meters. Next they manually eliminated clusters with no recorded points between 4PM and midnight and those falling outside the residential areas.
Semantically labeling places is another important topic related to home location detection. In [Sadilek and Krumm2012], the authors used GPS data from 307 people and 396 vehicles, then divided the world into 400 by 400 meter grids, and assigned each GPS reading to the nearest cell. They found that the top 10 frequently visited locations can usually be semantically labeled as “home”, “work”, “favorite restaurant” and so on. Other researchers [Krumm and Rouhana2013] performed experiments using two diary datasets — American Time Use Survey and the Puget Sound Regional Council Household Activity Survey — where each location had a semantic label such as “home” or “school”. They extracted several features of a location and trained place classifiers using machine learning, reporting a classification accuracy above 90% on locations labeled as “home”.
Alcohol Usage Detection
We now describe our methods for detecting geo-temporal alcohol consumption via Twitter. We discuss the data preparation steps, the hierarchical classification approach, the strategies we employed to reduce classifier overfitting and the results.
We collected geo-tagged tweets from urban, suburban and rural areas in New York State from July 2013 to July 2014. Similar to the approach used in [Paul and Dredze2011], we began the process of creating a training dataset by first keeping only the tweets that included a mention of alcohol, defined by the inclusion of any one of several drinking-related keywords (e.g., “drunk”, “beer”, “party”) and their variants. The word set was reviewed and modified with local community member input from our social media analytic advisory group, the Big Data Docents.
We were interested in labeling each tweet that passed this filter by applying a hierarchy of three yes/no feature questions, as follows:
Q1. Does the tweet make any reference to drinking alcoholic beverages?
Q2. If so, is the tweet about the tweeter him or herself drinking alcoholic beverages?
Q3. If so, is it likely that the tweet was sent at the time and place the tweeter was drinking alcoholic beverages?
We labeled this Alcohol dataset (dataset and keywords available at cs.rochester.edu/u/nhossain/icwsm-16-data.zip) using Amazon Mechanical Turk (http://www.mturk.com). Given a tweet, a turker was asked Q1, and only if the turker answered “yes” was he/she asked Q2, and so on. Each question was passed to three turkers and the answer choices were “yes”, “no” and “not sure”. Tweets that did not receive consensus in turker ratings (yes/no agreement among fewer than two turkers) were discarded from the dataset. The remaining tweets were labeled ‘1’ if two or more turkers answered “yes”; otherwise they were labeled ‘0’ for each feature question. Since for each tweet the questions were asked hierarchically, the approach resulted in a smaller ground truth for deeper questions, as Table 1 shows.
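The consensus rule above can be sketched as follows. This is a minimal illustration rather than the labeling code actually used; the function name `consensus_label` is hypothetical, and it assumes that "consensus" means at least two of the three turkers gave the same yes/no answer.

```python
from collections import Counter


def consensus_label(answers):
    """Map three turker answers to 1, 0, or None (tweet discarded).

    answers: list of "yes" / "no" / "not sure" strings.
    Assumption: consensus means at least two identical yes/no answers.
    """
    counts = Counter(answers)
    if counts["yes"] >= 2:
        return 1
    if counts["no"] >= 2:
        return 0
    return None  # fewer than two turkers agree on yes/no: discard


print(consensus_label(["yes", "yes", "not sure"]))
print(consensus_label(["yes", "no", "not sure"]))
```

Because Q2 is only asked when Q1 is answered "yes" (and Q3 only after Q2), applying this rule question by question naturally yields fewer labeled tweets at deeper levels.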
|Question||Q1||Q2||Q3|
|Class size (0, 1)||2321, 3238||579, 2044||642, 934|
Tweet texts are usually conversational, noisy and unstructured, making it difficult to create a good feature set from them. We applied several pre-processing techniques to reduce lexical variation in tweets. These include converting hyperlinks to “url”, mentions to “mention”, emoticons to positive and negative emoticon features, using hashtags as distinct features, and truncating three or more consecutive occurrences of a character in a word to two consecutive occurrences (e.g., “druuuuuuunk” → “druunk”). Using the pre-processed tweets and their labels, we created separate trigram linguistic feature sets for the three questions. In order to reduce overfitting, we kept only the top $N$ most-frequent features, where $N$ is chosen relative to the size of the training set for the corresponding question.
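A minimal sketch of these pre-processing steps is given below. The regular expressions, the function names, and the whitespace tokenization are assumptions made for illustration; emoticon handling is omitted, and the use of unigrams through trigrams is inferred from the feature tables rather than stated explicitly.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # assumed URL pattern
MENTION_RE = re.compile(r"@\w+")                # assumed mention pattern
REPEAT_RE = re.compile(r"(.)\1{2,}")            # 3+ repeats of one character


def normalize_tweet(text):
    """Replace links and mentions with placeholders and squeeze repeats."""
    text = URL_RE.sub("url", text)
    text = MENTION_RE.sub("mention", text)
    return REPEAT_RE.sub(r"\1\1", text)


def ngrams(tokens, max_n=3):
    """All word n-grams of length 1..max_n, as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def top_features(tweets, top_n):
    """Keep only the top_n most frequent n-grams over a set of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(normalize_tweet(tweet).split()))
    return [feature for feature, _ in counts.most_common(top_n)]


print(normalize_tweet("so druuuuuuunk @bob http://t.co/xyz"))
```

Hashtags are left untouched here, so they survive as distinct tokens in the n-gram vocabulary.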
|neg. features||weights||pos. features||weights|
|drunk in love||-0.593||alcoholic||0.749|
|in love||-0.52||get wasted||0.715|
For each of the three questions, we trained a linear support vector machine (SVM) to predict the answer. As shown in Figure 1, these SVMs are hierarchical [Koller and Sahami1997]. For example, the data for SVM-2 (SVM for question Q2) include only the tweets labeled by SVM-1 as “yes” and for which consensus was reached by turkers for Q2. This restricts the dataset distribution as we go down the hierarchy. Compared to a single flattened multi-class classifier, hierarchical classifiers are easier to optimize, and because they have a restricted feature set, we can build more complex models without overfitting. This way of classifying tweets is also more intuitive and suits our purposes. In other words, SVM-1 will be specialized to filter drinking-related tweets, while SVM-3 assumes that the input tweet is about drinking and particularly the tweeter drinking, and decides whether the tweeter was drinking when he/she posted the tweet.
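The way the three classifiers are chained at prediction time can be sketched as a simple cascade. The callables below are hypothetical stand-ins for the trained SVM-1, SVM-2 and SVM-3; the toy keyword rules exist only to make the example runnable and are not the learned models.

```python
def drinking_level(text, svm1, svm2, svm3):
    """Run three yes/no classifiers as a cascade.

    Each svm* argument is any callable mapping tweet text to True/False.
    Returns how many questions were answered "yes" in a row (0 to 3).
    """
    if not svm1(text):
        return 0  # not about drinking alcohol
    if not svm2(text):
        return 1  # about drinking, but not the tweeter drinking
    if not svm3(text):
        return 2  # tweeter drinking, but not at the time of tweeting
    return 3      # tweeter drinking at the time of tweeting


# Toy keyword rules standing in for trained models (illustration only).
toy1 = lambda t: "beer" in t or "drunk" in t
toy2 = lambda t: " i " in f" {t} " or "my " in t
toy3 = lambda t: "now" in t

print(drinking_level("i am having a beer right now", toy1, toy2, toy3))
```

A tweet reaches a deeper classifier only if every earlier one answered "yes", which mirrors how the training data for each level was restricted.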
For each SVM, we used one portion of the labeled data for training and the remaining portion for testing. We applied 5-fold cross validation to reduce overfitting and used the F-score for model selection. The F-score, ranging between 0 and 1, is the harmonic mean of precision and recall; the higher the score, the lower the classification error.
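For reference, the F-score described above can be computed from raw counts as in the following sketch. The function name and the convention of returning 0 when precision or recall is undefined are assumptions of this example.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from confusion counts.

    tp, fp, fn: true positives, false positives, false negatives.
    Returns 0.0 when precision and recall are both zero or undefined.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f_score(tp=8, fp=2, fn=2))
```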
|neg. features||weights||pos. features||weights|
|my mom||-0.623||get drunk||0.301|
The results in Table 1 show high precision and recall for each question. They also suggest that the more detailed the question becomes, the harder it gets for the classifier to predict correctly. This is not unexpected because intuitively we expect Q3 to be a harder question to answer compared to Q1. More importantly, our hierarchical classification approach shrinks the training data as we go down to deeper questions, most likely making it difficult for the classifiers down the hierarchy to learn from the smaller data. However, we believe that this approach is better than a multi-class SVM approach which, although it would use the full training data to answer each question, does not have the advantage of restricting the data distribution to focus on the question. For example, Table 2 shows that SVM-1 mainly uses features related to alcoholic drinks to determine whether the tweet is related to drinking alcoholic beverages. SVM-2 distinguishes self-reports of drinking from general drinking discussion by using pronouns and implicit references to drinking, as Table 3 suggests. Table 4 shows that, given that the tweet is related to the user drinking alcohol, SVM-3 identifies drinking in-the-moment using temporal features (e.g., “hangover”, “last night”, “now”) and features related to the urge to drink (e.g., “need”, “want”).
|neg. features||weights||pos. features||weights|
|when||-0.617||bottle of wine||0.387|
Home Location Prediction
Existing home inference methods suffer from either low coverage (GPS & diary data) or coarse granularity and low accuracy (language models and prior work on geo-tagged data), making them inadequate for problems that require both high coverage and fine granularity. Our more sophisticated machine learning based algorithm combines a number of different features describing each user’s daily trajectories as determined from geo-tagged tweets, thus predicting users’ home locations from sparse tweets with high accuracy and coverage. We now describe our method for home location prediction of Twitter users, the creation of labeled training data, the feature set, and our results, and we evaluate our system.
Dataset & Pre-Processing
We collected geo-tagged tweets sent from the greater New York City area during July 2012 and from the Bay Area from 06/01/2013 to 08/31/2013. A typical geo-tagged tweet contains the ID of the poster, the exact coordinates from where the tweet was sent, time stamp, and the text content. Due to the inherent noise in the geo-tags, we split the areas into 100 by 100 meter grids and treat the center of each grid as the target of home detection. Each tweet is assigned to its closest grid, and every time a user’s tweet appears in a grid we say the user has a check-in in this grid. Similar to previous work [Song et al.2010, Smith et al.2014, Lin, Hsu, and Lee2012], we only focus on users who have sent at least 5 geo-tagged tweets, and we call them active users. Also following these studies, we take each user’s hourly traces (only one location for each hour in our sampling duration) instead of using every single check-in. Thus, if a user appears in several unique grids in an hour, we take the grid with the highest number of check-ins as the user’s location for the hour (ties are broken by random selection). If a user’s location is not observed in an hour, the location for that hour is set to “Null”. Typically, the hourly traces of a user form a sparse vector $T$ in which most entries are “Null”, and the size of $T$ is the number of hours in the sampling period. We provide a snapshot of our dataset in Table 5.
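The grid assignment and hourly-trace construction can be sketched as below. The flat-earth conversion from degrees to meters, the reference origin `(lat0, lon0)`, and the function names are assumptions made for illustration; `None` plays the role of “Null”.

```python
import math
import random
from collections import Counter, defaultdict

CELL_M = 100.0               # grid cell size in meters
M_PER_DEG_LAT = 111_320.0    # rough meters per degree of latitude (assumed)


def grid_cell(lat, lon, lat0, lon0):
    """Index of the 100 m cell containing (lat, lon), relative to an origin.

    Uses a local flat-earth approximation, adequate only for small areas.
    """
    dy = (lat - lat0) * M_PER_DEG_LAT
    dx = (lon - lon0) * M_PER_DEG_LAT * math.cos(math.radians(lat0))
    return (math.floor(dx / CELL_M), math.floor(dy / CELL_M))


def hourly_trace(checkins, n_hours, rng=None):
    """Build one location per hour from (hour_index, cell) check-ins.

    The most frequent cell wins within each hour; ties are broken at
    random. Hours without any check-in stay None ("Null").
    """
    rng = rng or random.Random(0)
    per_hour = defaultdict(Counter)
    for hour, cell in checkins:
        per_hour[hour][cell] += 1
    trace = [None] * n_hours
    for hour, counts in per_hour.items():
        best = max(counts.values())
        candidates = sorted(c for c, k in counts.items() if k == best)
        trace[hour] = rng.choice(candidates)
    return trace


print(hourly_trace([(0, "a"), (0, "a"), (0, "b"), (2, "c")], n_hours=4))
```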
|Dataset||NYC||Bay Area|
|No. of tweets||2,636,437||3,633,712|
|Total no. of active users||55,237||53,314|
|No. of tweets annotated by AMT||5,000||5,000|
|No. of ground-truth homes||1,063||987|
Obtaining fine-grained ground truth is challenging because it involves identifying a Twitter user’s home from several locations the user checked-in without being told by the user. Some researchers relied on information from user profiles [Pontes et al.2012a, Pontes et al.2012b, Mahmud, Nichols, and Drews2012], others manually inspected the detection results [Cho, Myers, and Leskovec2011]. However, the location information in user profiles is coarse (at city level), while manual inspection is not scalable. Reading a tweet that says “Enjoying the beautiful conference room view!”, a human can tell that it was sent from a workplace. Tweets such as “finally home!” or “home sweet home” are most likely sent from home. Thus, we relied on tweet content and human intelligence to build the ground truth for home location.
We asked faithful Twitter users what they would like to post when at home. Based on their answers, we selected a set of 50 keywords (e.g., “home”, “bath”, “sofa”, “TV”, “sleep”, etc.) and their variants which are likely to be mentioned in tweets sent from home. Next, we filtered tweets that contained at least one of these keywords. Then, we relied on Amazon Mechanical Turk to find the tweets sent from home. Each turker was given a questionnaire containing 5 tweets to answer. For each tweet we asked: “is this tweet sent from home?”, and the options were “yes”, “no” and “not sure”. Each questionnaire was answered by three unique turkers. We only retained the tweets which, all three turkers believed, were sent from home.
Features Based on Human Mobility
Previous work using linguistic features from tweet content [Mahmud, Nichols, and Drews2012, Cheng, Caverlee, and Lee2010] did not achieve good accuracy in granular settings, and even in coarse-grained conditions these methods required over a few hundred tweets per user to obtain reasonable accuracy. Our goal is to predict homes for users with as few as 5 tweets to increase coverage. Therefore, instead of using linguistic features, we extract features that capture temporal and spatial properties of homes. Although some of these features alone (e.g. check-in frequency, PageRank score) can be used as reasonable baseline methods to detect homes, we show that combining features appropriately using a machine learning method brings significant gain in both accuracy and coverage. We now discuss how we obtain these features from a user’s hourly traces and how they capture inherent properties of home.
As we discussed earlier, taking the location of most check-ins as home is a popular method. Throughout the paper, we refer to this method and the corresponding feature as “Most Check-in”. Although check-in based methods for home detection work well on GPS data [Krumm2007], they perform much worse on Twitter data. This is because GPS devices keep recording locations every few seconds, whereas the frequency of a user’s geo-tagged tweets is low and varies widely with the type of user. The location with most check-ins is certainly important to a user, but that does not necessarily mean it is the home.
For user $u$, we define the margin between two check-in locations $l_i$ and $l_j$ as $m_u(l_i, l_j) = p_i - p_j$, where $p_i$ and $p_j$ are the percentages of $u$’s check-ins at $l_i$ and $l_j$ respectively. Figure 2 shows that for a user, the lower the margin between the most check-in location and the second most check-in location, the less effective is the Most Check-in feature as an accurate predictor of home. For instance, the accuracy is 70% only when this margin is 50 or higher. Figure 3 shows that only a small number of users have large margins between most check-in and second most check-in locations (e.g., only about 20% of the users have margins above 70, which means that home detection accuracy for these users using Most Check-in method is about 80%, according to Figure 2).
These observations explain why the Most Check-in method performs poorly in fine-grained settings. For example, as the grid with most check-ins shrinks from a 1 by 1 kilometer block to many 100 by 100 meter grids, the most check-in percentage spreads over many of these smaller grids, lowering the margin between the new most check-in location and the new second most check-in location. To circumvent this problem, we extract 3 features for each location $l$ checked-in by a Twitter user $u$:
the percentage of check-ins of $u$ at location $l$
the margins between $l$ and its immediately higher and immediately lower ranked check-in locations
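These three check-in features can be sketched as follows, taking the margin to be the difference in check-in percentages between neighbors in the frequency ranking. The handling of the first and last ranked locations (a zero margin above the top location, and the location's own percentage below the bottom one) is an assumption of this example, as is the function name.

```python
from collections import Counter


def checkin_features(trace):
    """Per-location check-in percentage and margins to ranked neighbors.

    trace: hourly trace with None for unobserved hours.
    Returns {location: (pct, margin_to_higher, margin_to_lower)}.
    """
    counts = Counter(loc for loc in trace if loc is not None)
    total = sum(counts.values())
    # Rank by descending count; break ties by name for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], str(kv[0])))
    pcts = [100.0 * c / total for _, c in ranked]
    feats = {}
    for i, (loc, _) in enumerate(ranked):
        higher = pcts[i - 1] - pcts[i] if i > 0 else 0.0
        lower = pcts[i] - pcts[i + 1] if i + 1 < len(pcts) else pcts[i]
        feats[loc] = (pcts[i], higher, lower)
    return feats


trace = ["h"] * 5 + [None] * 4 + ["w"] * 3 + ["g"] * 2
print(checkin_features(trace))
```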
Check-in Frequency During Late Night
Intuitively, the places people check-in at late night are probably their homes. For example, [Sadilek and Kautz2013] estimated a person’s home by taking the mean of a two-dimensional Gaussian fitted to the person’s check-ins between 1AM and 6AM. This method potentially alleviates the biases caused by other frequently visited places during daytime. Thus, for each location visited by a user, we take the check-in percentage of that location computed over a restricted time period of 12AM - 7AM as a feature, which we define as the late night feature of that location.
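A minimal sketch of the late night feature is shown below. The input format of `(hour_of_day, location)` pairs and the half-open window (12AM inclusive, 7AM exclusive) are assumptions of this example.

```python
from collections import Counter


def late_night_pct(checkins, start_hour=0, end_hour=7):
    """Percentage of a user's late-night check-ins at each location.

    checkins: iterable of (hour_of_day, location), hour_of_day in 0..23.
    Only check-ins with start_hour <= hour_of_day < end_hour are counted.
    """
    night = Counter(loc for hod, loc in checkins
                    if start_hour <= hod < end_hour)
    total = sum(night.values())
    if total == 0:
        return {}
    return {loc: 100.0 * c / total for loc, c in night.items()}


print(late_night_pct([(1, "h"), (2, "h"), (3, "b"), (14, "w")]))
```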
Last Destination of a Day
According to research using GPS data [Krumm2007], the last destination of a person on a day (no later than 3AM) is most likely the home, highlighting that people’s daily movements end at their homes. Based on this assumption we extract a mobility feature, which we call the last destination feature. For each location visited by a user, we count the number of times the location had been the last destination of the day, and we add this count to our feature set.
Last Destination with Inactive Late Night
Since “last destination” might suffer from check-ins sent from non-home places (e.g., when the night was spent outside), we add to our feature set a variant of last destination. We only consider tweets sent on the days when people were inactive during late night (12AM - 7AM). We exclude the days with active late night and, for each place visited by the user, we count the number of times the place had been the last destination in the remaining days.
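The last destination feature and its inactive-late-night variant can be sketched together. For simplicity this example groups check-ins by calendar day and does not implement the 3AM cut-off mentioned above; that simplification, the input format, and the function name are assumptions.

```python
from collections import Counter


def last_destination_counts(days, skip_active_late_night=False):
    """Count how often each location is the last check-in of a day.

    days: list of per-day lists of (hour_of_day, location), each in
    chronological order. When skip_active_late_night is True, days with
    any check-in between 12AM and 7AM are ignored.
    """
    counts = Counter()
    for day in days:
        if not day:
            continue
        if skip_active_late_night and any(0 <= h < 7 for h, _ in day):
            continue
        counts[day[-1][1]] += 1
    return counts


days = [
    [(9, "work"), (22, "home")],
    [(2, "bar"), (23, "bar")],   # active late night
    [(20, "home")],
]
print(dict(last_destination_counts(days)))
print(dict(last_destination_counts(days, skip_active_late_night=True)))
```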
The original check-in feature has limitations in obtaining a broader coverage in detecting homes. The above three features introduce extra human behaviour information to the simple check-in feature and help reduce this limitation.
According to [Krumm2007], the probability of being at home varies over time. For each place checked-in by a user, we compute the check-in percentages in that place at each hour of the day over the sampling period, and we add these 24 values (which sum to 100%) to our feature set. These time related features help us capture the property of home in terms of temporal patterns.
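The 24 hourly values for one location can be computed as in this sketch; the `(hour_of_day, location)` input format and the all-zero result for a location with no check-ins are assumptions of the example.

```python
def hourly_distribution(checkins, location):
    """Percentage of a location's check-ins falling in each hour of day.

    checkins: iterable of (hour_of_day, location), hour_of_day in 0..23.
    Returns 24 values summing to 100, or all zeros if never visited.
    """
    hist = [0] * 24
    for hod, loc in checkins:
        if loc == location:
            hist[hod] += 1
    total = sum(hist)
    if total == 0:
        return [0.0] * 24
    return [100.0 * c / total for c in hist]


dist = hourly_distribution([(1, "h"), (1, "h"), (22, "h"), (12, "w")], "h")
print(dist[1], dist[22])
```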
Home is a crucial start/end point of many of our movements. Thus, for each place we add 2 more features (weighted PageRank [Xing and Ghorbani2004] and Reversed PageRank scores) to model how importantly a place behaves as an origin and a destination. To apply PageRank, we first transform a Twitter user’s trace into a directed graph called the movement graph, in which the vertices are the locations visited by the user and a directed edge from vertex $l_i$ to $l_j$ represents that location $l_j$ is visited directly from $l_i$. To quantify the certainty and importance of transitions between locations, we assign a weight to each edge. The weight should be proportional to the number of times a transition appears in the user’s trace, and inversely proportional to the number of idle hours during the transition. Thus, assuming that $T$ is the set of hourly traces of a user over the sampling period, the weight $w(l_i, l_j)$ is the ratio of the total number of transitions from $l_i$ to $l_j$ in $T$ to the total number of idle hours during all these transitions.
After constructing a user’s movement graph, we apply PageRank to calculate, for each visited location, the importance of that location as a destination. To study the importance of that location as an origin, we calculate the Reversed PageRank score by reversing each edge direction in the movement graph (edge weights remain unchanged), and then applying weighted PageRank. The PageRank and Reversed PageRank scores describe the spatial characteristics of movements.
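The movement graph and the two scores can be sketched as follows. This example uses a standard edge-weighted PageRank computed by power iteration, which is not necessarily the exact formulation of [Xing and Ghorbani2004]. The damping factor, the iteration count, skipping self-transitions, and flooring the idle-hour total at 1 to avoid division by zero are all assumptions of the sketch.

```python
from collections import Counter


def movement_graph(trace):
    """Edge weights {(src, dst): transitions / idle hours} from a trace.

    trace: hourly trace with None for unobserved hours.
    """
    counts, idle = Counter(), Counter()
    prev_hour = prev_loc = None
    for hour, loc in enumerate(trace):
        if loc is None:
            continue
        if prev_loc is not None and loc != prev_loc:
            counts[(prev_loc, loc)] += 1
            idle[(prev_loc, loc)] += hour - prev_hour - 1
        prev_hour, prev_loc = hour, loc
    # Assumption: treat a zero idle-hour total as 1.
    return {e: counts[e] / max(1, idle[e]) for e in counts}


def pagerank(weights, damping=0.85, iters=100):
    """Edge-weighted PageRank by power iteration over {(src, dst): w}."""
    nodes = sorted({n for edge in weights for n in edge})
    if not nodes:
        return {}
    n = len(nodes)
    out_sum = {v: 0.0 for v in nodes}
    for (src, _), w in weights.items():
        out_sum[src] += w
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Mass on nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if out_sum[v] == 0)
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for (src, dst), w in weights.items():
            new[dst] += damping * rank[src] * w / out_sum[src]
        rank = new
    return rank


def reversed_pagerank(weights, **kwargs):
    """PageRank on the graph with every edge direction flipped."""
    return pagerank({(d, s): w for (s, d), w in weights.items()}, **kwargs)


graph = movement_graph(["h", None, "w", "h", None, None, "w"])
print(graph)
print(pagerank(graph), reversed_pagerank(graph))
```

A high PageRank score marks a location that is frequently arrived at, while a high Reversed PageRank score marks one that is frequently departed from.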
SVM Training and Home Location Evaluation
We trained a linear SVM classifier using all these features to capture important feature combinations that better distinguish homes. Each training datapoint is a tweet identified uniquely by user ID and location ID, labeled “home” or “not home”, having 32 feature values calculated from the user’s hourly traces. For each Twitter user, the classifier outputs a score for all the places the user checked-in from. If the place with the highest score exceeds a threshold, it is marked as the user’s home. Otherwise, the user’s home is marked “unknown”, which decreases our home detection coverage. Table 6 shows the most significant SVM features.
|Feature||Weight|
|Margin between top two check-ins||0.19|
|Last destination with inactive late night||0.12|
|Reversed PageRank score||0.09|
|Margin below next higher check-in||-0.30|
|Margin under next higher PageRank||-0.28|
|Margin under next higher Reversed PageRank||-0.21|
|Rank of Reversed PageRank||-0.07|
|Rank of PageRank||-0.07|
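The selection rule described above, in which the highest-scoring place is accepted as the home only if its score exceeds a threshold, can be sketched as follows. The function name, the dictionary input, and the example scores are hypothetical and serve only as an illustration.

```python
def pick_home(place_scores, threshold):
    """Return the predicted home for one user, or "unknown".

    place_scores: dict mapping a place ID to the classifier score for
    that place (hypothetical input format).
    """
    if not place_scores:
        return "unknown"
    best = max(place_scores, key=place_scores.get)
    # Accept the top-scoring place only if it clears the threshold.
    return best if place_scores[best] > threshold else "unknown"

confident = pick_home({"cell_17": 0.9, "cell_42": 0.2}, threshold=0.5)
uncertain = pick_home({"cell_17": 0.3, "cell_42": 0.2}, threshold=0.5)
```

Raising the threshold makes the rule more conservative: fewer users receive a predicted home, but the predictions that remain are those the classifier scores most highly.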
Accuracy vs. Coverage
To guarantee the practicality of our home detection method, we need to balance granularity and coverage. Because of the natural trade-off between granularity and detection accuracy, we fix the granularity to a 100 by 100 meter grid and explore the relationship between accuracy and coverage. Accuracy can be adjusted by varying the threshold, which also affects coverage.
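The trade-off can be made concrete with a small sketch. We assume here, for illustration only, that coverage is the fraction of users who receive a predicted home and that accuracy is the fraction of those predictions that match the ground-truth home; the function name, data layout, and toy scores are hypothetical.

```python
def accuracy_and_coverage(users, threshold):
    """Compute (accuracy, coverage) for one threshold value.

    users: list of (place_scores, true_home) pairs, where place_scores
    maps place IDs to classifier scores (hypothetical layout).
    """
    predicted = 0
    correct = 0
    for scores, true_home in users:
        if not scores:
            continue
        best = max(scores, key=scores.get)
        if scores[best] > threshold:
            predicted += 1
            correct += int(best == true_home)
    coverage = predicted / len(users) if users else 0.0
    accuracy = correct / predicted if predicted else 0.0
    return accuracy, coverage

users = [
    ({"a": 0.9, "b": 0.1}, "a"),
    ({"c": 0.6, "d": 0.5}, "d"),
    ({"e": 0.2}, "e"),
]
loose = accuracy_and_coverage(users, threshold=0.5)
strict = accuracy_and_coverage(users, threshold=0.8)
```

In this toy example the stricter threshold keeps only the most confident prediction, so accuracy rises while coverage falls.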
Figure 4 shows how our method compares with three single-feature methods in terms of accuracy and coverage. The tuning parameter for the PageRank (and Reversed PageRank) scores was the extent to which the highest PageRank score was larger than the second highest one, and for Most Check-ins it was the check-in count. Homes were not predicted using Most Check-ins when the highest check-in count was less than 3. At every accuracy level, our method covers more homes than the other methods, suggesting that a combined model significantly increases coverage over single-feature models. In particular, when we set the accuracy of each method to 70% (which we think is acceptable for urban computing), our classifier obtains 71% and 76% coverage in NYC and the Bay Area respectively, significantly higher than the coverage achieved using individual features.
Since we performed home detection on 100 by 100 meter grids, the resolution of this grid-based method is around 70 meters ($100/\sqrt{2} \approx 70.7$ m). We explore how resolution affects our method’s accuracy by setting coverage at 80% and varying the resolution from 100 meters to 1000 meters. Figure 5 shows that coarsening the resolution increases the accuracy, although the rate of increase slows down and the accuracy peaks at around 80%. Compared to previous work [Pontes et al.2012a], our method provides higher resolution with similar accuracy (approximately 80%).
Analysis of Alcohol Consumption via Twitter
In this section, we discuss the results obtained by applying our SVMs on geo-tagged tweets from New York City (dataset range: 11/19/2012 - 03/31/2013) and from Monroe County in upstate New York (dataset range: 07/03/2014 - 04/27/2015). We specifically chose these datasets to study alcohol consumption in urban (NYC) vs suburban (Monroe) settings. We analyze drinking at home vs. away from home, and we investigate the relationship between the density of tweets sent from different regions while intoxicated and the density of alcohol outlets in those regions. The following terms will be used throughout this section:
drinking-mention: SVM-1 predicts “yes”
user-drinking: SVM-2 predicts “yes”
user-drinking-now: SVM-3 predicts “yes”
We ran the NYC and Monroe tweets through the classifiers in the order shown in Figure 1. The results in Table 7 show that for each drinking-related question, NYC has a higher proportion of tweets marked positive than Monroe County. One possible explanation is that a crowded city such as NYC, with a high density of alcohol outlets and many people socializing, is likely to have a higher rate of drinking at any given time than a suburban area such as Monroe County, with its lower population and alcohol outlet density.
|Measure||NYC||Monroe County|
|No. of geo-tagged tweets||1,931,662||1,537,979|
|Passed keyword filter||51,321||26,858|
|Correlation with outlet density||0.390||0.237|
Figure 6 shows the zoomed geographic distributions (obtained using CartoDB, http://cartodb.com/) of user-drinking-now tweets via normalized heat maps. These maps were constructed by splitting the geographic area for each dataset into 100 by 100 meter grids, then computing the proportion of tweets in each grid that were user-drinking-now (excluding grids that had fewer than 5 user-drinking-now tweets), and using these values as the degree of “heat”. That is, the grids with “more heat” are those where the proportion of in-the-moment drinking tweets relative to the total number of geo-tagged tweets is much higher. We believe that such grids are regions of unusual drinking activity.
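The heat values described above can be sketched as follows. This is an illustrative sketch, not the code used to produce Figure 6: the equirectangular approximation used to bin coordinates into cells of roughly 100 meters, the constant for meters per degree of latitude, the input tuple format, and the function names are all assumptions.

```python
import math
from collections import defaultdict

CELL_M = 100.0
M_PER_DEG_LAT = 111_320.0  # approximate meters per degree of latitude

def grid_cell(lat, lon, ref_lat):
    """Map a coordinate to a (row, col) cell roughly 100 m on a side."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    row = math.floor(lat * M_PER_DEG_LAT / CELL_M)
    col = math.floor(lon * m_per_deg_lon / CELL_M)
    return row, col

def drinking_heat(tweets, ref_lat, min_positive=5):
    """Return {cell: proportion of user-drinking-now tweets}.

    tweets: iterable of (lat, lon, is_drinking_now) tuples.
    Cells with fewer than min_positive positive tweets are excluded.
    """
    total = defaultdict(int)
    positive = defaultdict(int)
    for lat, lon, is_drinking_now in tweets:
        cell = grid_cell(lat, lon, ref_lat)
        total[cell] += 1
        if is_drinking_now:
            positive[cell] += 1
    return {c: positive[c] / total[c]
            for c in total if positive[c] >= min_positive}

tweets = ([(40.7500, -73.9900, True)] * 6
          + [(40.7500, -73.9900, False)] * 4
          + [(40.8000, -73.9500, True)] * 3)
heat = drinking_heat(tweets, ref_lat=40.75)
```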
We also computed the alcohol outlet densities (obtained from NYS LAMP, lamp.sla.ny.gov/) for the grids and then calculated the correlation between the alcohol outlet density and the density of user-drinking-now tweets. As Table 7 shows, the density of user-drinking-now tweets in both our datasets exhibits a positive correlation with alcohol outlet density, with $p$-values less than . Although correlation does not necessarily imply causation, these results agree with several prior studies [Campbell et al.2009, Sparks, Jernigan, and Mosher2011, Weitzman et al.2003, Scribner et al.2008, Kypri et al.2008, Chen, Grube, and Gruenewald2010] which claim that alcohol outlet density influences drinking.
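For readers who wish to reproduce a per-grid correlation of this kind, the following sketch computes a Pearson correlation coefficient between two equal-length lists of grid-level densities. The text above does not state which correlation coefficient was used, so the choice of Pearson is an assumption, as are the function name and the toy values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy per-grid values: outlet density and drinking-tweet density.
outlet_density = [0.0, 1.0, 2.0, 4.0]
tweet_density = [0.1, 0.2, 0.2, 0.5]
r = pearson(outlet_density, tweet_density)
```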
The ability to detect homes and the locations where user-drinking-now tweets are generated enables us to compare drinking at home vs. not at home. For this purpose, we only used homes predicted with at least 90% accuracy, which resulted in some loss of coverage (see Figure 4). We selected all Twitter users with predicted homes in our datasets and extracted all the user-drinking-now tweets posted by these users. For these tweets, we plotted the histogram of distance from home, shown in Figure 7. We see that NYC has a larger proportion of user-drinking-now tweets posted from home (within 100 meters of home), whereas in Monroe County a higher proportion of these tweets are generated at driving distance (more than 1000 meters from home).
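The distance-from-home comparison can be sketched as follows. The three bins mirror the 100-meter and 1000-meter cut-offs mentioned above, but the actual bins of Figure 7, the use of the haversine formula, the Earth radius constant, and the function names are assumptions made for this example.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two coordinates."""
    radius_m = 6_371_000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))

def distance_bins(pairs):
    """Count tweets per distance bin.

    pairs: iterable of (tweet_lat, tweet_lon, home_lat, home_lon).
    """
    bins = {"<=100 m": 0, "100-1000 m": 0, ">1000 m": 0}
    for tweet_lat, tweet_lon, home_lat, home_lon in pairs:
        d = haversine_m(tweet_lat, tweet_lon, home_lat, home_lon)
        if d <= 100:
            bins["<=100 m"] += 1
        elif d <= 1000:
            bins["100-1000 m"] += 1
        else:
            bins[">1000 m"] += 1
    return bins

home = (43.1566, -77.6088)
pairs = [
    (43.1566, -77.6088, *home),  # posted at home
    (43.1616, -77.6088, *home),  # roughly 550 m north of home
    (43.2066, -77.6088, *home),  # roughly 5.5 km north of home
]
histogram = distance_bins(pairs)
```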
Discussion and Future Work
We proposed a machine learning based model for detecting latent activities and user states via Twitter at a level of detail that had not previously been distinguished. The model not only distinguishes people discussing an activity from people discussing themselves performing the activity, but also determines whether they are performing it at the moment or in the past or future. We showed the strength of our model by applying it to the detection of alcohol consumption as an example application. Coupled with our other contribution of home location prediction, the model allows us to study Twitter users’ drinking behavior from several community or ecological viewpoints built from the fine-grained location information extracted.
Models that permit the fine-grained study of alcohol consumption in social media can reveal important real-time information about users and the influences they have on each other. We can begin to evaluate the merits of these data for public health research. Such analyses can teach us who is and isn’t referencing alcohol on Twitter, and in what settings, so that we can evaluate the degree of self-reporting bias. They can also help to create a tool for improving a community’s health, given that social networks can become a resource to spread positive health behaviour. For instance, the peer social network “Alcoholics Anonymous” (http://www.aa.org/) is designed to develop social network connections that encourage abstinence among its members and establish helpful ties.
Although we apply home localization here to describe a geographical community portrait of drinking-reference patterns among social media users, people spend a large portion of their time at home, so our model enables a wide range of applications that were previously impractical. For instance, we can analyze human mobility patterns, and we can study the relationship between demographics, neighborhood structure and health conditions in different zip codes, thus understanding many aspects of urban life and environments. Research in these areas and in alcohol consumption is mainly based on surveys and census data, which are costly and often incur a delay that hampers real-time analysis and response. Our results demonstrate that tweets can provide powerful and fine-grained cues of activities going on in cities.
While Twitter use is ubiquitous, its users are not a representative sample of the general population; it is known to include more young and minority users [Smith2011]. Bias, however, is a problem in any sampling method. For example, surveys under-represent the segment of the population that is unwilling to respond to surveys, such as undocumented immigrants. Statistics estimated from Twitter (or any other source) can be adjusted to account for known biases by weighting data appropriately. While addressing Twitter’s bias is beyond the scope of this paper, our methods can permit further work in this area by locating users in communities with fine-grained detail, meaning more fine-grained demographic data becomes available for linkage. We also note that the average sampling rate of US Census in each state is about 3% [U.S. Census Bureau2011], which is similar to the percentage of users we covered out of all the Twitter users.
Our future work will perform a comprehensive study of alcohol consumption in social media around features such as user demographics and the settings where people go to drink and tweet (e.g., a friend’s house, stadium, or park). We can explore the social network of drinkers to find out how social interactions and peer pressure in social media influence the tendency to reference drinking. Another interesting study is to compare the rate of in-flow and out-flow of drinkers in adjacent neighborhoods. All these analyses will help us understand the merits of these methods for analyzing drinking behavior via social media at large scale and very little cost, which can lead to new ways of reducing alcohol consumption, a global public health concern. Finally, our models are broadly applicable to various latent activities and make way for future work in many other domains.
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM108337, the National Science Foundation under Grant No. 1319378 and the Intel ISTCPC. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and the NSF. The authors thank members of the Big Data Docents, our community collaborative research board, for their guidance in this scientific work.
- [Achrekar et al.2012] Achrekar, H.; Gandhe, A.; Lazarus, R.; Yu, S.; and Liu, B. 2012. Twitter improves seasonal influenza prediction. Fifth Annual International Conference on Health Informatics.
- [Ali and Dwyer2010] Ali, M. M., and Dwyer, D. S. 2010. Social network effects in alcohol consumption among adolescents. Addictive behaviors 35(4):337–342.
- [Brennan, Sadilek, and Kautz2013] Brennan, S.; Sadilek, A.; and Kautz, H. 2013. Towards understanding global spread of disease from everyday. Twenty-Third International Conference on Artificial Intelligence (IJCAI).
- [Britt et al.2005] Britt, H. R.; Carlin, B. P.; Toomey, T. L.; and Wagenaar, A. C. 2005. Neighborhood level spatial analysis of the relationship between alcohol outlet density and criminal violence. Environmental and Ecological Statistics 12(4):411–426.
- [Broniatowski and Dredze2013] Broniatowski, D. A., and Dredze, M. 2013. National and local influenza surveillance through twitter: An analysis of the 2012-2013 influenza epidemic. PLoS ONE 8(12).
- [Brownstein, Freifeld, and Madoff2009] Brownstein, J. S.; Freifeld, B. S.; and Madoff, L. C. 2009. Digital disease detection - harnessing the web for public health surveillance. N Engl J Med 260(21):2153–2157.
- [Burges1998] Burges, C. J. 1998. A tutorial on support vector machines for pattern recognition. Data mining and knowledge discovery 2(2):121–167.
- [Campbell et al.2009] Campbell, C. A.; Hahn, R. A.; Elder, R.; Brewer, R.; Chattopadhyay, S.; Fielding, J.; Naimi, T. S.; Toomey, T.; Lawrence, B.; Middleton, J. C.; et al. 2009. The effectiveness of limiting alcohol outlet density as a means of reducing excessive alcohol consumption and alcohol-related harms. American journal of preventive medicine 37(6):556–569.
- [Centers for Disease Control and Prevention and others2004] Centers for Disease Control and Prevention and others. 2004. Alcohol-attributable deaths and years of potential life lost–united states, 2001. MMWR: Morbidity and mortality weekly report 53(37):866–870.
- [Chen, Grube, and Gruenewald2010] Chen, M.-J.; Grube, J. W.; and Gruenewald, P. J. 2010. Community alcohol outlet density and underage drinking. Addiction 105(2):270–278.
- [Cheng, Caverlee, and Lee2010] Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, 759–768.
- [Cho, Myers, and Leskovec2011] Cho, E.; Myers, S. A.; and Leskovec, J. 2011. Friendship and mobility: user movement in location-based social networks. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data mining, 1082–1090.
- [Culotta2010] Culotta, A. 2010. Towards detecting influenza epidemics by analyzing Twitter messages. In Proceedings of the First Workshop on Social Media Analytics, 115–122. ACM.
- [Culotta2013] Culotta, A. 2013. Lightweight methods to estimate influenza rates and alcohol sales volume from twitter messages. Language resources and evaluation 47(1):217–238.
- [De Choudhury et al.2013] De Choudhury, M.; Gamon, M.; Counts, S.; and Horvitz, E. 2013. Predicting depression via social media. AAAI Conference on Weblogs and Social Media.
- [Dos Reis and Culotta2015] Dos Reis, V. L., and Culotta, A. 2015. Using matched samples to estimate the effects of exercise on mental health from twitter. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
- [Egan and Moreno2011] Egan, K. G., and Moreno, M. A. 2011. Alcohol references on undergraduate males’ facebook profiles. American journal of men’s health 1557988310394341.
- [Golder and Macy2011] Golder, S., and Macy, M. 2011. Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures. Science 333(6051):1878–1881.
- [Hoh et al.2006] Hoh, B.; Gruteser, M.; Xiong, H.; and Alrabady, A. 2006. Enhancing security and privacy in traffic-monitoring systems. Pervasive Computing, IEEE 5(4):38–46.
- [Holmes et al.2014] Holmes, J.; Guo, Y.; Maheswaran, R.; Nicholls, J.; Meier, P. S.; and Brennan, A. 2014. The impact of spatial and temporal availability of alcohol on its consumption and related harms: A critical review in the context of uk licensing policies. Drug and alcohol review 33(5):515–525.
- [Jackson et al.2014] Jackson, N.; Denny, S.; Sheridan, J.; Fleming, T.; Clark, T.; Teevale, T.; and Ameratunga, S. 2014. Predictors of drinking patterns in adolescence: a latent class analysis. Drug and alcohol dependence 135:133–139.
- [Kershaw, Rowe, and Stacey2014] Kershaw, D.; Rowe, M.; and Stacey, P. 2014. Towards tracking and analysing regional alcohol consumption patterns in the uk through the use of social media. In Proceedings of the 2014 ACM conference on Web science, 220–228. ACM.
- [Koller and Sahami1997] Koller, D., and Sahami, M. 1997. Hierarchically classifying documents using very few words.
- [Krumm and Rouhana2013] Krumm, J., and Rouhana, D. 2013. Placer: semantic place labels from diary data. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and ubiquitous computing, 163–172.
- [Krumm2007] Krumm, J. 2007. Inference attacks on location tracks. In Pervasive Computing. Springer. 127–143.
- [Kuntsche et al.2005] Kuntsche, E.; Knibbe, R.; Gmel, G.; and Engels, R. 2005. Why do young people drink? a review of drinking motives. Clinical psychology review 25(7):841–861.
- [Kypri et al.2008] Kypri, K.; Bell, M. L.; Hay, G. C.; and Baxter, J. 2008. Alcohol outlet density and university student drinking: a national study. Addiction 103(7):1131–1138.
- [Lamb, Paul, and Dredze2013] Lamb, A.; Paul, M. J.; and Dredze, M. 2013. Separating fact from fear: Tracking flu infections on twitter. In HLT-NAACL, 789–795.
- [Liang and Chikritzhs2011] Liang, W., and Chikritzhs, T. 2011. Revealing the link between licensed outlets and violence: counting venues versus measuring alcohol availability. Drug and alcohol review 30(5):524–535.
- [Lin, Hsu, and Lee2012] Lin, M.; Hsu, W.-J.; and Lee, Z. Q. 2012. Predictability of individuals’ mobility with high-resolution positioning data. In UbiComp, 381–390.
- [Litt and Stock2011] Litt, D. M., and Stock, M. L. 2011. Adolescent alcohol-related risk cognitions: The roles of social norms and social networking sites. Psychology of Addictive Behaviors 25(4):708.
- [Livingston2008a] Livingston, M. 2008a. Alcohol outlet density and assault: a spatial analysis. Addiction 103(4):619–628.
- [Livingston2008b] Livingston, M. 2008b. A longitudinal analysis of alcohol outlet density and assault. Alcoholism: Clinical and Experimental Research 32(6):1074–1079.
- [Livingston2011] Livingston, M. 2011. A longitudinal analysis of alcohol outlet density and domestic violence. Addiction 106(5):919–925.
- [Mahmud, Nichols, and Drews2012] Mahmud, J.; Nichols, J.; and Drews, C. 2012. Where is this tweet from? inferring home locations of twitter users. In ICWSM.
- [Moreno et al.2009] Moreno, M. A.; Parks, M. R.; Zimmerman, F. J.; Brito, T. E.; and Christakis, D. A. 2009. Display of health risk behaviors on myspace by adolescents: prevalence and associations. Archives of pediatrics & adolescent medicine 163(1):27–34.
- [Naimi et al.2003] Naimi, T. S.; Brewer, R. D.; Mokdad, A.; Denny, C.; Serdula, M. K.; and Marks, J. S. 2003. Binge drinking among us adults. Jama 289(1):70–75.
- [Nambisan et al.2015] Nambisan, P.; Luo, Z.; Kapoor, A.; Patrick, T. B.; Cisler, R.; et al. 2015. Social media, big data, and public health informatics: Ruminating behavior of depression revealed through twitter. In System Sciences (HICSS), 2015 48th Hawaii International Conference on, 2906–2913. IEEE.
- [Paul and Dredze2011] Paul, M. J., and Dredze, M. 2011. You are what you tweet: Analyzing twitter for public health. In ICWSM, 265–272.
- [Polonec, Major, and Atwood2006] Polonec, L. D.; Major, A. M.; and Atwood, L. E. 2006. Evaluating the believability and effectiveness of the social norms message “most students drink 0 to 4 drinks when they party”. Health communication 20(1):23–34.
- [Pontes et al.2012a] Pontes, T.; Magno, G.; Vasconcelos, M.; Gupta, A.; Almeida, J.; Kumaraguru, P.; and Almeida, V. 2012a. Beware of what you share: Inferring home location in social networks. In Data Mining Workshops (ICDMW), 2012 IEEE 12th International Conference on, 571–578.
- [Pontes et al.2012b] Pontes, T.; Vasconcelos, M.; Almeida, J.; Kumaraguru, P.; and Almeida, V. 2012b. We know where you live: Privacy characterization of foursquare behavior. In UbiComp, 898–905.
- [Rosenquist et al.2010] Rosenquist, J. N.; Murabito, J.; Fowler, J. H.; and Christakis, N. A. 2010. The spread of alcohol consumption behavior in a large social network. Annals of Internal Medicine 152(7):426–433.
- [Sadilek and Kautz2013] Sadilek, A., and Kautz, H. 2013. Modeling the impact of lifestyle on health at scale. In WSDM, 637–646.
- [Sadilek and Krumm2012] Sadilek, A., and Krumm, J. 2012. Far out: Predicting long-term human mobility. In AAAI.
- [Sadilek et al.2013] Sadilek, A.; Brennan, S.; Kautz, H.; and Silenzio, V. 2013. nemesis: Which restaurants should you avoid today? In First AAAI Conference on Human Computation and Crowdsourcing.
- [Sadilek, Kautz, and Silenzio2012a] Sadilek, A.; Kautz, H. A.; and Silenzio, V. 2012a. Modeling spread of disease from social interactions. In ICWSM.
- [Sadilek, Kautz, and Silenzio2012b] Sadilek, A.; Kautz, H. A.; and Silenzio, V. 2012b. Predicting disease transmission from geo-tagged micro-blog data. In AAAI.
- [Scellato et al.2011] Scellato, S.; Noulas, A.; Lambiotte, R.; and Mascolo, C. 2011. Socio-spatial properties of online location-based social networks. ICWSM 11:329–336.
- [Scellato, Noulas, and Mascolo2011] Scellato, S.; Noulas, A.; and Mascolo, C. 2011. Exploiting place features in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 1046–1054.
- [Schonlau et al.2008] Schonlau, M.; Scribner, R.; Farley, T. A.; Theall, K. P.; Bluthenthal, R. N.; Scott, M.; and Cohen, D. A. 2008. Alcohol outlet density and alcohol consumption in los angeles county and southern louisiana. Geospatial health 3(1):91.
- [Scribner et al.1999] Scribner, R.; Cohen, D.; Kaplan, S.; and Allen, S. H. 1999. Alcohol availability and homicide in new orleans: conceptual considerations for small area analysis of the effect of alcohol outlet density. Journal of studies on alcohol 60(3):310–316.
- [Scribner et al.2008] Scribner, R.; Mason, K.; Theall, K.; Simonsen, N.; Schneider, S. K.; Towvim, L. G.; and DeJong, W. 2008. The contextual role of alcohol outlet density in college drinking. Journal of Studies on Alcohol and Drugs 69(1):112–120.
- [Scribner, MacKinnon, and Dwyer1994] Scribner, R. A.; MacKinnon, D. P.; and Dwyer, J. H. 1994. Alcohol outlet density and motor vehicle crashes in los angeles county cities. Journal of studies on alcohol 55(4):447–453.
- [Smith et al.2014] Smith, G.; Wieser, R.; Goulding, J.; and Barrack, D. 2014. A refined limit on the predictability of human mobility. In Pervasive Computing, 88–94.
- [Smith2011] Smith, A. 2011. Twitter update 2011. http://pewresearch.org/pubs/2007/twitter-users-cell-phone-2011-demographics.
- [Song et al.2010] Song, C.; Qu, Z.; Blumm, N.; and Barabási, A.-L. 2010. Limits of predictability in human mobility. Science 327(5968):1018–1021.
- [Sparks, Jernigan, and Mosher2011] Sparks, M.; Jernigan, D. H.; and Mosher, J. F. 2011. Regulating alcohol outlet density: An action guide. Community Anti-Drug Coalitions of America.
- [Tamersoy, De Choudhury, and Chau2015] Tamersoy, A.; De Choudhury, M.; and Chau, D. H. 2015. Characterizing smoking and drinking abstinence from social media. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, HT ’15, 139–148. New York, NY, USA: ACM.
- [Tsugawa et al.2015] Tsugawa, S.; Kikuchi, Y.; Kishino, F.; Nakajima, K.; Itoh, Y.; and Ohsaki, H. 2015. Recognizing depression from twitter activity. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 3187–3196. ACM.
- [Ugander et al.2012] Ugander, J.; Backstrom, L.; Marlow, C.; and Kleinberg, J. 2012. Structural diversity in social contagion. Proceedings of the National Academy of Sciences 109(16):5962–5966.
- [U.S. Census Bureau2011] U.S. Census Bureau. 2011. 2010 Census. U.S. Department of Commerce.
- [Weitzman et al.2003] Weitzman, E. R.; Folkman, A.; Folkman, M. K. L.; and Wechsler, H. 2003. The relationship of alcohol outlet density to heavy and frequent drinking and drinking-related problems among college students at eight universities. Health & place 9(1):1–6.
- [White and Horvitz2008] White, R., and Horvitz, E. 2008. Cyberchondria: Studies of the escalation of medical concerns in web search. Technical Report MSR-TR-2008-177, Microsoft Research. Appearing in ACM Transactions on Information Systems, 27(4), Article 23, November 2009, DOI 10.1145/1629096.1629101.
- [Xing and Ghorbani2004] Xing, W., and Ghorbani, A. 2004. Weighted pagerank algorithm. In Communication Networks and Services Research, 2004. Proceedings. Second Annual Conference on, 305–314.
- [Young2010] Young, M. M. 2010. Twitter me: using micro-blogging to motivate teenagers to exercise. In Global Perspectives on Design Science Research. Springer. 439–448.
- [Zhu, Gorman, and Horel2004] Zhu, L.; Gorman, D. M.; and Horel, S. 2004. Alcohol outlet density and violence: a geospatial analysis. Alcohol and alcoholism 39(4):369–375.