Collecting big behavioral data for measuring behavior against obesity

by Vasileios Papapanagiotou, et al.

Obesity currently affects very large portions of the global population. Effective prevention and treatment start at an early age and require objective knowledge of population-level behavior at the region/neighborhood scale. To this end, we present a system for extracting and collecting individual-level behavioral information objectively and automatically. The behavioral information relates to physical activity, types of visited places, and the transportation mode used between them. The system employs indicator-extraction algorithms from the literature, which we evaluate on publicly available datasets. The system has been developed and integrated in the context of the EU-funded BigO project, which aims at preventing obesity in young populations.




I Introduction

Obesity affects a very large portion of the population worldwide, including children and teenagers. According to the World Health Organization, over million children were obese or overweight in 2016. Many “blanket policies” have been applied so far, but success rates are low while relapse is quite high. To effectively treat obesity and curb its prevalence, detailed behavioral profiles on the population level are needed [1]. Creating such profiles can answer questions such as “what is the average physical activity level of a specific age group in a specific region” or “does visiting fast-food shops and similar food outlets relate to low physical activity level or obesity”.

Creating population profiles requires collecting large volumes of individual-level behavioral information. Traditional ways of acquiring such information are questionnaires, time-use surveys, etc. [2]; however, their accuracy has long been challenged [3]. On the other hand, technological methods, such as mobile phones and smart watches that capture wearable-sensor signals, combined with relevant signal-processing and machine-learning algorithms, can be a better replacement [4]. These methods offer many advantages over traditional surveys. From the participant’s view, they reduce the need for user feedback and also eliminate any personal bias/subjectiveness. From the surveyor’s view, they reduce the effort for gathering the data as well as the need for data curation (since the gathered data are completely structured). Finally, these methods and tools can be applied to larger populations and for longer durations, and are thus able to gather much bigger volumes of data (big data).

In this work we present a system for collecting big behavioral data from large populations in order to facilitate large-scale analysis of behavior related to the prevalence and prevention of obesity. This system has been designed, implemented, evaluated, and integrated within the context of the EU-funded BigO project [4]. It aims at extracting behavior related to physical activity, types of places that are visited (e.g. parks, gyms, or fast-food outlets), and how these places are accessed (i.e. transportation mode). We propose a set of basic behavioral indicators on the individual level which can then be used to build population-level behavioral profiles. We implement algorithms from the literature that extract these individual-level indicators and evaluate them on publicly available datasets in order to demonstrate the feasibility of such a system.

II Extracting behavioral indicators

Our system aims at collecting data in three main areas of behavior, specifically physical activity, type of visited places, and transportation mode. Each one of these has a strong connection to obesity or obesogenic behavior. In particular, a low level of physical activity is a well-established risk factor for obesity [5, 6]. Thus, we aim to estimate the number of steps walked, as well as the type of physical activity that is performed by the individual.

Knowing what kind of places one visits is also targeted by our system, as it can be correlated with healthy and non-healthy behavior. For example, there is evidence of connection between fast-food consumption and obesity [7] that needs to be further explored, while there is evidence suggesting that availability of certain park facilities plays an important role in promoting physical activity and healthy weight status [8]. Additionally, behavior within parks and recreational facilities can steer design decisions for new parks or facilities to increase physical activity levels of visitors [9]. Thus, we aim at accurately detecting the locations that an individual visits, based on the location data from his/her mobile phone. The co-ordinates of the visited locations can then be cross-checked using publicly available map repositories to derive the type of facility that was visited, i.e. fast-food, restaurant, park, gym, etc.
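As an illustration, the cross-checking of visited co-ordinates against a map repository can be sketched as a nearest-neighbor lookup. This is a minimal sketch under stated assumptions, not the BigO implementation: the `POIS` list, the `place_type` helper, and the 50 m search radius are all hypothetical, and a real map repository would be queried through its own interface.

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Hypothetical extract from a map repository: (name, lat, lon, type).
POIS = [
    ("Central Park", 37.9800, 23.7300, "park"),
    ("Burger Corner", 37.9830, 23.7280, "fast-food"),
    ("City Gym", 37.9900, 23.7400, "gym"),
]

def place_type(lat, lon, pois=POIS, max_dist_m=50.0):
    """Return the type of the nearest POI within max_dist_m, or None."""
    best, best_d = None, max_dist_m
    for _name, plat, plon, ptype in pois:
        d = haversine_m(lat, lon, plat, plon)
        if d <= best_d:
            best, best_d = ptype, d
    return best
```

A visited location that has no entry within the radius is left untyped (`None`) rather than forced onto the closest facility.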

In addition to visited locations, our system also aims at detecting the transportation mode used when individuals move between locations, since evidence exists that the way one commutes is correlated with overweight and obesity [10].

We have selected algorithms for behavioral indicator extraction based on their reported effectiveness, the clarity of the relevant publications, and the results from small-scale comparisons with other counterparts from literature.

II-A Physical activity

For physical activity, we focus on counting steps and on recognizing the type of physical activity performed at each point in time; we also implement our own algorithm for activity counts.

The step counting algorithm is presented in [11]. It detects local maxima of the acceleration magnitude, as measured by a mobile phone’s accelerometer. The local maxima are then filtered against three criteria; the maxima that satisfy all three criteria are counted as steps.

The first criterion is periodicity: steps are required to occur at a rate approximately in the range of to Hz. The second criterion is similarity: subsequent steps should exhibit approximately the same level of acceleration. The algorithm takes into account the steps corresponding to each foot separately, and therefore requires similar acceleration magnitude for the 1st, 3rd, 5th and so forth steps separately from the 2nd, 4th, 6th, and so forth steps. Finally, the third criterion is continuity, which requires that steps occur in groups, and not on their own. We have selected a threshold of at least steps in order to count them.
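The three criteria can be illustrated with a short sketch. It is a simplified reading of the description above and not the implementation of [11]; the rate range, the similarity tolerance, and the minimum group size (`min_hz`, `max_hz`, `sim_tol`, `min_run`) are placeholder values chosen for the example.

```python
def local_maxima(times, mags):
    """Return (time, magnitude) pairs at strict local maxima of the signal."""
    return [(times[i], mags[i])
            for i in range(1, len(mags) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]

def count_steps(peaks, min_hz=1.0, max_hz=3.0, sim_tol=2.0, min_run=4):
    """Count peaks that satisfy periodicity, similarity and continuity.

    All thresholds are illustrative assumptions, not values from [11].
    """
    min_dt, max_dt = 1.0 / max_hz, 1.0 / min_hz
    runs, run = [], []
    for t, m in peaks:
        if run:
            # Periodicity: the interval to the previous peak must be plausible.
            periodic = min_dt <= t - run[-1][0] <= max_dt
            # Similarity: compare with the previous step of the same foot.
            similar = len(run) < 2 or abs(m - run[-2][1]) <= sim_tol
            if not (periodic and similar):
                runs.append(run)
                run = []
        run.append((t, m))
    if run:
        runs.append(run)
    # Continuity: only groups of at least min_run peaks count as steps.
    return sum(len(r) for r in runs if len(r) >= min_run)
```

Comparing each peak with the peak two positions earlier is what separates the left-foot and right-foot sequences in this sketch.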

While smart watches are also equipped with accelerometers, the signals that they capture are different from the mobile phone. In particular, we can take advantage of the fixed location of the smart watch and use specialized step-counting algorithms. The algorithm we have chosen is presented in [12] and uses a delayed and filtered replica of the acceleration magnitude in order to extract “armed” segments. Each such segment contains a local maximum, which is counted as a step if it satisfies a series of pre-defined thresholds.

The physical activity recognition algorithm we use is presented in [13]. It also operates on the acceleration magnitude of an accelerometer: it extracts short, overlapping windows ( s length and s step) and estimates several time-domain features (such as mean, standard deviation, peak-to-peak distance, and signal-magnitude area) and frequency-domain features (such as power spectral density). The features are then standardized and used in a multi-class (one-vs-one) SVM classifier in order to infer a single physical activity class per window. We also apply a majority voting filter of min length on the predicted classes to account for some misclassifications.
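The windowing and label-smoothing stages of such a pipeline can be sketched as follows. This is a minimal sketch that covers only a few time-domain features and the majority voting filter; the SVM classifier and the frequency-domain features are omitted, and the window length, step, and filter width are left as parameters because they are assumptions of the caller here.

```python
import statistics
from collections import Counter

def sliding_windows(samples, length, step):
    """Yield overlapping windows of `length` samples, advancing by `step`."""
    for start in range(0, len(samples) - length + 1, step):
        yield samples[start:start + length]

def time_features(window):
    """Mean, standard deviation and peak-to-peak distance of one window."""
    return (statistics.fmean(window),
            statistics.pstdev(window),
            max(window) - min(window))

def majority_filter(labels, width):
    """Replace each label by the most common label in a centred window."""
    half = width // 2
    out = []
    for i in range(len(labels)):
        neighbourhood = labels[max(0, i - half):i + half + 1]
        out.append(Counter(neighbourhood).most_common(1)[0][0])
    return out
```

The filter removes isolated predictions that disagree with their neighbors, which is the kind of misclassification the majority vote is meant to absorb.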

II-B Visited-locations detection

To detect visited locations from a set of location (GPS) co-ordinates we follow the approach of [14]. The algorithm introduces a metric called move-ability which is the ratio of actual distance someone moved over the total traveled distance. Move-ability is then used to compute a density value which is used with a modified version of the DBSCAN algorithm. By applying the modified DBSCAN we obtain a set of clusters which correspond to actual visited locations. The original data (GPS co-ordinates) can then be cross-referenced against the detected locations in order to obtain the arrival and departure time-stamps from each location.
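One reading of the move-ability definition above is the straight-line displacement between the first and last point of a trajectory segment, divided by the length of the path actually traveled. The sketch below follows that reading; it is an assumption about the metric and not the exact formulation of [14], and the density computation and modified DBSCAN are not shown.

```python
import math

EARTH_RADIUS_M = 6371000.0

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) tuples in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlmb = math.radians(q[1] - p[1])
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def move_ability(points):
    """Ratio of net displacement to traveled path length, in [0, 1]."""
    path = sum(haversine_m(a, b) for a, b in zip(points, points[1:]))
    if path == 0.0:
        return 0.0  # no movement at all
    return haversine_m(points[0], points[-1]) / path
```

Under this reading, a value close to 1 indicates purposeful travel, while a value close to 0 indicates that the person wandered around the same spot, which is what a stop looks like in noisy GPS data.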

II-C Transportation-mode recognition

Transportation-mode recognition is based on the algorithm of [15]. However, we first need to detect the trips. We consider as valid trips all the segments from one visited location to the next (from departure from the first until arrival at the next) during which we have no missing accelerometer data and the average location-speed is at least km/h.
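The trip-detection rule can be sketched as a filter over consecutive visits. This is a minimal sketch under stated assumptions: the dictionary keys, the way missing accelerometer data is detected (`max_gap_s`), and the speed threshold (`min_speed_kmh`) are hypothetical, and timestamps are plain numbers in seconds.

```python
def extract_trips(visits):
    """Pair each departure with the next arrival: [(start, end), ...]."""
    return [(a["departure"], b["arrival"]) for a, b in zip(visits, visits[1:])]

def has_no_gaps(accel_times, start, end, max_gap_s=1.0):
    """True if accelerometer samples cover [start, end] without long gaps."""
    inside = [t for t in accel_times if start <= t <= end]
    edges = [start] + inside + [end]
    return all(b - a <= max_gap_s for a, b in zip(edges, edges[1:]))

def mean_speed_kmh(location_samples, start, end):
    """Average of the speeds reported by location samples within [start, end]."""
    speeds = [v for t, v in location_samples if start <= t <= end]
    return sum(speeds) / len(speeds) if speeds else 0.0

def valid_trips(visits, accel_times, location_samples,
                min_speed_kmh=1.0, max_gap_s=1.0):
    """Keep trips with complete accelerometer data and sufficient speed."""
    return [(s, e) for s, e in extract_trips(visits)
            if has_no_gaps(accel_times, s, e, max_gap_s)
            and mean_speed_kmh(location_samples, s, e) >= min_speed_kmh]
```

Only the trips returned by `valid_trips` would then be passed to the transportation-mode classifier.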

The transportation-mode recognition algorithm of [15] extracts overlapping windows ( min length and s step) of the acceleration magnitude and computes time-domain and frequency-domain features (similarly to [13]). We have enhanced the set of features with power spectral density features. We also use a multi-class (one-vs-one) SVM classifier instead of the random forest classifiers used in [15]. Finally, we perform majority voting on the detected transportation-mode labels.

III Evaluation datasets

To evaluate the selected algorithms we use three different datasets. The first one is used for step-counting validation and is published as part of [16]. It contains time-annotated sensor signals obtained from smart phones in typical, unconstrained use while walking. In total, participants were asked to walk a route at three different walk paces (normal, fast, and slow). Each participant walked the same distance and changed her/his speed at markers installed on the path and carried one or two phones placed at varying positions (in a front or back trouser pocket, in a backpack/handbag, or in a hand with or without simultaneous typing). This dataset has been used for a fair, quantitative comparison of standard algorithms for walk detection and step counting.

The second dataset is the PAMAP2 activity type detection dataset [13] that contains activities. Activity types were selected to include basic activities (walking, running, traversing stairs), postures (lying, sitting and standing), common household activities (ironing, vacuum cleaning), and fitness activities (rope jumping). Each of the subjects had to follow this protocol, performing all defined activities in the way most suitable for the subject. Most of the activities from the protocol were performed over approximately minutes, except ascending/descending stairs and rope jumping. Nearly hours of data were collected altogether.

The dataset used for validating the visited-locations and transportation-mode algorithms is the Sussex-Huawei locomotion (SHL) dataset [17]. The SHL dataset was collected by the Wearable Technologies Lab at the University of Sussex as part of a research project funded by Huawei. It is a versatile annotated dataset of modes of locomotion and transportation of mobile users. It includes recordings by 3 participants over 3 days in 2017, engaging in 8 different modes of transportation in a real-life setting in the United Kingdom.

IV Preliminary evaluation

This section presents the evaluation results for the different individual-level behavioral indicator extraction algorithms on the selected datasets. Table I presents the evaluation of the step-counting algorithm of [11] on the dataset of [16]. We evaluate the algorithm on all of the available positions of the mobile phone. Out of all the different positions, hand-held (with and without using the mobile phone) and placed in a backpack are the ones with the most available recordings; the error is quite low in these cases, with less than steps of absolute error on average. The handbag placement yields the highest error, which can be attributed to the decreased “sensitivity” of the mobile phone’s accelerometer when it is placed in a bag full of other things that hangs from the subject’s shoulder.

Position #
error (%)
Hand-held & using
  back pocket
  front pocket
Shirt pocket
TABLE I: Step counting results of [11] on [16] (mean ± standard deviation).

Figure 1 presents a heat-map of the confusion matrix for the task of physical activity type recognition, based on the algorithm of [13]. The SVM classifiers have been trained in a typical leave-one-subject-out (LOSO) fashion. In general, the number of misclassifications is relatively low, and the actual misclassifications occur between similar classes. For example, looking at the second and third row and column of the matrix, significant misclassification is observed. However, these two classes correspond to sitting and standing, which have very similar footprints on the mobile phone’s accelerometer.
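For readers unfamiliar with the LOSO protocol, the splitting logic can be sketched in a few lines. This is a generic illustration of leave-one-subject-out splitting and not the evaluation code used here; the `leave_one_subject_out` helper and its inputs are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx) for each distinct subject.

    subject_ids[i] is the subject that produced window i.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test
```

In each fold the classifier is trained on the windows of all other subjects and tested on the held-out subject, so no subject contributes data to both sides of a split.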

Fig. 1: Confusion matrix heat-map for physical activity type recognition of [13]. Rows correspond to actual class and columns to predicted class. The class labels are 1: lying, 2: sitting, 3: standing, 4: walking, 5: running, 6: cycling, 7: Nordic walking, 8: ascending stairs, 9: descending stairs, 10: vacuum cleaning, 11: ironing, 12: rope jumping.

In Table II we present evaluation results for the algorithm of [14] for detecting visited locations based on location data of [17]. We present as TP the number of visited locations that have been correctly identified by the algorithm using a distance threshold of m. We also count the number of visited locations that the algorithm failed to detect as FN, and the number of erroneous detections as FP. The algorithm achieves an overall recall rate of more than with a precision rate of .

Subject TP FP FN Precision Recall F1-score
TABLE II: Classification results for visited locations detection using the algorithm of [14] on the subjects of the SHL dataset [17]. Results are shown individually per subject, and across all subjects (aggregated as a single subject, not averaged).
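The precision, recall, and F1-score columns follow from the TP, FP, and FN counts in the standard way, and the aggregated row pools the counts of all subjects before computing the metrics instead of averaging the per-subject scores. A minimal sketch of both computations is given below; the function names are hypothetical and no values from Table II are used.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1-score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

def aggregate(counts):
    """Pool (tp, fp, fn) tuples as if all subjects were a single subject."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return prf(tp, fp, fn)
```

Pooling weights each subject by the number of locations they visited, so a subject with many visits influences the aggregated score more than one with few.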

Finally, Figure 2 presents the per-class recall and precision for transportation-mode detection using the algorithm of [15]. Both are higher than for walk/run, bike, and train/subway, and in some cases considerably higher (precision for walk/run is over ). We have observed that most misclassifications (approximately of them) are between the car and bus classes, and between the bus and train/subway classes (approximately ). This can be attributed to the similarities between the “movement” of these transportation modes. A very encouraging result, however, is the accuracy of the vehicle vs. non-vehicle distinction, which is (recall and precision are and respectively), since this distinction has a significant impact on the overall physical activity level.

Fig. 2: Classification results for transportation mode recognition using the algorithm of [15] on the subjects of the SHL dataset [17]. Note that “vehicle” vs. “non-vehicle” yields recall of and precision of .

V Examples of data views

The presented system has been integrated in the BigO project and is actively used. So far, more than students have contributed data by using an application for Android smart phones and watches. No directly identifiable information is collected (such as names and e-mail addresses). Students participate voluntarily (opt-in) with written consent from their parents or legal guardians where necessary. Ethical approvals have been received for all pilot studies of the project.

To illustrate the different aggregations and data views our system can provide, we present two examples with data from users of one of the pilot sites, the Biomedical Research Foundation of the Academy of Athens (BRFAA). Table III shows aggregate “profiles” for two users with very different behavior. Our proposed system can also extract more detailed information, as well as other types of information. Profiles collected at large scale can be used to develop explanatory and predictive models related to obesity [4]. Table IV groups users of BRFAA based on their most frequent destination after school. We also show the average (and standard deviation) of BMI per group. Such information may be related to obesogenic behavior and can be objectively, automatically, and unobtrusively collected by our system.

user 1 user 2
Gender male female

BMI z-score

Average hours of daily monitoring
Daily steps
Daily average activity counts per minute
Weekly visits to cafes
Weekly visits to food retailers
TABLE III: Example profiles of two users of the system.
Destination Number of users BMI
TABLE IV: Most frequent destination after school.

VI Conclusions & future work

In this work we have presented a system for collecting big behavioral data from individuals regarding their physical activity, the types of places they visit, and how they move (transportation). This kind of data can enable the creation of population-level behavioral profiles per region/neighborhood in order to study the relationship of such behavior with obesity and the risk of developing it. To this end, we select, implement, and evaluate algorithms from the literature that measure physical activity level (number of steps, type of physical activity), that detect visited places based on location data, which can then be cross-referenced against publicly available map sources, and that recognize the transportation mode used to move from one visited place to another. We evaluate these algorithms on publicly available datasets and present the results, demonstrating the feasibility of our proposed system.


  • [1] M. Blüher, “Obesity: global epidemiology and pathogenesis,” Nature Reviews Endocrinology, vol. 15, no. 5, pp. 288–298, 2019.
  • [2] J.-S. Shim, K. Oh, and H. C. Kim, “Dietary assessment methods in epidemiologic studies,” Epidemiol Health, vol. 36, p. e2014009, 2014.
  • [3] A. Schatzkin et al., “A comparison of a food frequency questionnaire with a 24-hour recall for use in an epidemiological cohort study: results from the biomarker-based Observing Protein and Energy Nutrition (OPEN) study,” International Journal of Epidemiology, vol. 32, no. 6, pp. 1054–1062, 12 2003.
  • [4] C. Diou et al., “A methodology for obtaining objective measurements of population obesogenic behaviors in relation to the environment,” Statistical Journal of the IAOS, 2019.
  • [5] A. P. Hills, N. A. King, and T. P. Armstrong, “The contribution of physical activity and sedentary behaviours to the growth and development of children and adolescents,” Sports Medicine, vol. 37, no. 6, pp. 533–545, 2007.
  • [6] A. P. Hills, A. D. Okely, and L. A. Baur, “Addressing childhood obesity through increased physical activity,” Nature Reviews Endocrinology, vol. 6, no. 10, pp. 543–549, 2010.
  • [7] R. Rosenheck, “Fast food consumption and increased caloric intake: a systematic review of a trajectory towards weight gain and obesity risk,” Obesity Reviews, vol. 9, no. 6, pp. 535–547, 2008.
  • [8] L. R. Potwarka, A. T. Kaczynski, and A. L. Flack, “Places to play: Association of park space and facilities with healthy weight status among children,” Journal of Community Health, vol. 33, no. 5, pp. 344–350, 10 2008.
  • [9] G. M. Besenyi et al., “Demographic variations in observed energy expenditure across park activity areas,” Preventive Medicine, vol. 56, no. 1, pp. 79–81, 2013.
  • [10] M. Lindström, “Means of transportation to work and overweight and obesity: A population-based study in southern Sweden,” Preventive Medicine, vol. 46, no. 1, pp. 22–28, 2008.
  • [11] F. Gu et al., “Robust and accurate smartphone-based step counting for indoor localization,” IEEE Sensors Journal, vol. 17, no. 11, pp. 3453–3460, 6 2017.
  • [12] V. Genovese, A. Mannini, and A. M. Sabatini, “A smartwatch step counter for slow and intermittent ambulation,” IEEE Access, vol. 5, pp. 13028–13037, 2017.
  • [13] A. Reiss and D. Stricker, “Creating and benchmarking a new dataset for physical activity monitoring,” in Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments, ser. PETRA ’12. New York, NY, USA: Association for Computing Machinery, 2012.
  • [14] T. Luo et al., “An improved DBSCAN algorithm to detect stops in individual trajectories,” ISPRS International Journal of Geo-Information, vol. 6, no. 3, p. 63, 2 2017.
  • [15] M. A. Shafique and E. Hato, “Travel mode detection with varying smartphone data collection frequencies,” Sensors (Switzerland), vol. 16, no. 5, p. 716, 5 2016.
  • [16] A. Brajdic and R. Harle, “Walk detection and step counting on unconstrained smartphones,” in Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ser. UbiComp ’13. New York, NY, USA: Association for Computing Machinery, 2013, pp. 225–234.
  • [17] L. Wang et al., “Summary of the Sussex-Huawei locomotion-transportation recognition challenge,” in Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, ser. UbiComp ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 1521–1530.