According to the World Health Organization (WHO), smoking is the leading public health problem worldwide: it results in millions of preventable deaths each year and is responsible for a number of serious chronic diseases (e.g., hypertension, atherosclerosis, cancer).
Globally, male and female smokers have their life expectancy reduced by and years, respectively. Moreover, at least half of all smokers worldwide die prematurely from smoking. It is important to emphasize that smoking is not only harmful to smokers themselves, but is also a major risk factor for passive smokers.
The modernization of societies has ignited a recent trend that promotes a lifestyle in which smoking is considered an outdated habit. More and more people are taking up a sport or beginning to follow a healthy eating regime. Constantly evolving technology, which is increasingly integrated into everyday life, plays an important role in engaging people with these healthy habits. The rapid growth of portable and wearable devices has brought with it a great increase in applications that help people develop and maintain a healthy lifestyle. From tracking meals and calories to measuring physical activity or sleep, the applications now available to users are numerous, easy to use, and unobtrusive. However, the objective monitoring of smoking behavior is still an open research problem. Research shows that tailored feedback to the smoker can greatly facilitate the reduction or even permanent cessation of smoking.
Several works in the literature approach the problem of smoking behavior monitoring using body-worn sensors. One of them suggests a method that combines the data from a wrist-mounted inertial measurement unit (IMU) sensor and a chest-worn respiratory inductive plethysmography (RIP) sensor for the detection of smoking gestures. Evaluation using data from daily smokers reveals a recall of . It should be mentioned, however, that devices as bulky as the RIP sensor are too obtrusive for users to smoke as they normally would. The work of M. Shoaib et al. proposes a two-step algorithm for the detection of smoking events. In the first stage, the data are coarsely classified, while in the second stage a rule-based correction of the first-stage classification is applied. The classifiers tested by the authors are random forest (RF), decision tree (DT) and support vector machine (SVM). For each classifier, a total of features are extracted from the 3D accelerometer and gyroscope measurements. According to the authors, the second step of their algorithm corrects up to % of the misclassified samples. Evaluation is performed using their dataset of participants with a total duration of hours, where the authors achieved an F-score of -.
In our work, we propose the use of a smoking behavior model that is based on two fundamental components: a) the puff (also referred to as smoking gesture in the literature), defined as the series of hand movements that bring an active cigarette to the mouth with the purpose of smoking it and then back to rest, and b) the smoking session, defined as the act of consuming a cigarette. In particular, we model smoking behavior as a series of smoking sessions that occur during the day. Subsequently, each smoking session is modeled as a series of puffs. Figure 1 illustrates the adopted smoking behavior model. Furthermore, we suggest a two-step, bottom-up method towards the objective and automatic monitoring of smoking behavior using all-day, free-living IMU recordings from an off-the-shelf smartwatch. In the first step, we use an artificial neural network (ANN) with convolutional and recurrent layers to detect puffs during a smoking session. In the second step, we use the distribution of the detected puffs to localize the smoking sessions throughout the day.
II Detection of puffs
II-A Data pre-processing
Let a six-element vector contain the 3D acceleration and orientation velocity measurements for a given moment. A complete recording can then be represented by the signal formed by the sequence of these vectors, whose length in samples equals the duration of the recording in seconds multiplied by the sampling frequency in Hz.
Smoking a cigarette can be completed by using either hand (right or left) or, in some cases, a combination of both. In order to achieve uniformity among data from different participants, we consider the right hand as the reference and transform all left-handed smoking sessions using the hand mirroring process proposed by Kyritsis et al. In particular, every recording collected with the participant wearing the smartwatch on the left wrist is transformed into its right-hand equivalent by changing the direction (i.e., reversing the sign) of its first, fifth and sixth channels.
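The mirroring step can be illustrated with a minimal sketch. It assumes that a recording is stored as a list of six-value samples with the three acceleration channels first and the three gyroscope channels last; the function name `mirror_left_to_right` is hypothetical and not taken from the original work.

```python
def mirror_left_to_right(recording):
    """Return a mirrored copy of a left-wrist recording.

    Assumption: each sample is a list of six values, acceleration
    channels first, gyroscope channels last.
    """
    flipped = (0, 4, 5)  # zero-based indices of the 1st, 5th and 6th channels
    return [
        [-value if channel in flipped else value
         for channel, value in enumerate(sample)]
        for sample in recording
    ]


left_wrist = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
right_equivalent = mirror_left_to_right(left_wrist)
```

Recordings from the right wrist are left untouched, so the function would only be applied to left-wrist data.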
Furthermore, accelerometer measurements also include the influence of the Earth’s gravitational field. To attenuate this undesirable effect, a high-pass finite impulse response (FIR) filter is applied independently to each of the acceleration streams (i.e., the first, second, and third channels of the recording). Experimentally, we obtained satisfactory results with a cut-off frequency of Hz and a filter length equal to samples (which corresponds to seconds).
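A possible way to realize such a gravity-removal filter is sketched below. The design method (Hamming-windowed sinc with spectral inversion), the odd filter length and all numeric values are assumptions made for illustration; the original text does not state how the filter was designed.

```python
import math


def highpass_fir(num_taps, cutoff_hz, fs_hz):
    """Design a linear-phase high-pass FIR filter (assumed design method)."""
    if num_taps < 3 or num_taps % 2 == 0:
        raise ValueError("this sketch needs an odd num_taps >= 3")
    fc = cutoff_hz / fs_hz  # cut-off in cycles per sample
    mid = (num_taps - 1) // 2
    lowpass = []
    for n in range(num_taps):
        k = n - mid
        ideal = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        lowpass.append(ideal * window)
    gain = sum(lowpass)
    lowpass = [h / gain for h in lowpass]  # unit gain at 0 Hz
    # Spectral inversion: a unit impulse minus the low-pass response.
    return [(1.0 if n == mid else 0.0) - h for n, h in enumerate(lowpass)]


def filter_channel(samples, taps):
    """Causal FIR filtering of one channel; output length equals input length."""
    out = []
    for i in range(len(samples)):
        acc = 0.0
        for j, h in enumerate(taps):
            if i - j >= 0:
                acc += h * samples[i - j]
        out.append(acc)
    return out


# Hypothetical values: 101 taps, 1 Hz cut-off, 50 Hz sampling rate.
taps = highpass_fir(101, 1.0, 50.0)
gravity_only = [9.81] * 300
filtered = filter_channel(gravity_only, taps)  # settles near zero
```

Because the taps sum to zero, a constant offset such as gravity on a motionless wrist is suppressed once the filter has seen a full window of samples.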
II-B Training the puff detection model
Given a recording that corresponds to a smoking session, we extract training examples using a sliding window. More specifically, the sliding window has a length that corresponds to seconds ( samples) and a step that corresponds to seconds ( samples). We selected a window length equal to seconds as it approximates the median puff duration in the SED dataset (Table I). Each extracted window has dimensions .
|Number of instances||20||276||10||39|
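The sliding-window extraction described above can be sketched as follows. The window length and step in the usage example are placeholders, and the function name `extract_windows` is hypothetical.

```python
def extract_windows(recording, window_len, step):
    """Split a recording (list of samples) into overlapping windows.

    Returns (windows, right_ends), where right_ends[i] is the index of
    the last sample that belongs to the i-th window.
    """
    windows, right_ends = [], []
    for start in range(0, len(recording) - window_len + 1, step):
        windows.append(recording[start:start + window_len])
        right_ends.append(start + window_len - 1)
    return windows, right_ends


# Placeholder values: 10 samples, windows of 4 samples, step of 2 samples.
windows, right_ends = extract_windows(list(range(10)), window_len=4, step=2)
```

Keeping the right-end index of every window is convenient because the labeling step refers to the timestamp at the right end of each window.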
In order to train the network, each window needs to be associated with a label that would indicate if the window corresponds to a puff or not (, respectively). We use the following formula to perform the labeling process:
where is the moment at which the -th puff ends (hand has returned to rest) according to ground truth (GT) and is the timestamp associated with the right end of the -th extracted window. Moreover, we select to be equal to seconds. Figure 2 showcases the window labeling process.
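As an illustration only, the sketch below labels a window as a puff when the timestamp of its right end lies within a tolerance of the end of any ground-truth puff. This rule, the use of 1 and 0 as labels, and the function name `label_windows` are assumptions inferred from the description of the symbols; they should not be read as the exact labeling formula.

```python
def label_windows(window_end_times, puff_end_times, tolerance_s):
    """Assumed rule: 1 if a window ends close to the end of a GT puff, else 0."""
    labels = []
    for t_window in window_end_times:
        near_puff_end = any(
            abs(t_window - t_puff) <= tolerance_s for t_puff in puff_end_times
        )
        labels.append(1 if near_puff_end else 0)
    return labels


# Placeholder timestamps in seconds and a placeholder tolerance.
labels = label_windows([1.0, 2.0, 3.0, 4.0], puff_end_times=[2.1], tolerance_s=0.2)
```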
The next step is to artificially augment the training set by simulating different positions of the smartwatch that may occur involuntarily. Each simulated position is described by two numbers, which represent the angles by which the smartwatch has rotated around the axis parallel to the subject’s arm and around the axis perpendicular to the screen of the smartwatch. The transformation for each window is selected to be one of the following (with equal probability): a) rotation around the first axis, b) rotation around the second axis, c) rotation around the first and then around the second axis, or d) rotation around the second and then around the first axis. The motivation behind the augmentation step was the significant increase in performance reported in prior work.
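A minimal sketch of this rotation-based augmentation is given below. It assumes that the arm-parallel axis is x and the screen-perpendicular axis is z, that angles are given in radians, and that the same rotation is applied to the accelerometer and gyroscope triplets. How the two angles are drawn is not recoverable here, so they are passed in as arguments; all function names are hypothetical.

```python
import math
import random


def rot_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


def rot_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def rotate_window(window, rotation):
    """Apply one 3x3 rotation to the accelerometer and gyroscope triplets."""
    rotated = []
    for s in window:
        acc = [sum(rotation[r][c] * s[c] for c in range(3)) for r in range(3)]
        gyr = [sum(rotation[r][c] * s[3 + c] for c in range(3)) for r in range(3)]
        rotated.append(acc + gyr)
    return rotated


def augment_window(window, theta_x, theta_z, rng=random):
    """Pick one of the four transformations with equal probability."""
    rx, rz = rot_x(theta_x), rot_z(theta_z)
    options = [
        rx,              # a) around x
        rz,              # b) around z
        matmul(rz, rx),  # c) around x, then around z
        matmul(rx, rz),  # d) around z, then around x
    ]
    return rotate_window(window, rng.choice(options))


window = [[1.0, 0.0, 0.0, 0.0, 1.0, 0.0]]
augmented = augment_window(window, math.radians(10), math.radians(-5))
```

Since rotations preserve vector length, the magnitude of each acceleration and angular velocity sample is unchanged; only its distribution across the three axes differs.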
The proposed model is a tuned-down version of the renowned VGG architecture. In particular, our network includes a convolutional and a recurrent part. The convolutional part contains three 1D convolutional layers, with each of the first two followed by a max pooling layer with a decimation factor of . The convolutional layers have , and filters, with a size of , and , respectively. All convolutional layers use a unit stride and the rectified linear unit (ReLU) as the non-linearity. The recurrent part of the network consists of a single long short-term memory (LSTM) layer with cells and the sigmoid function as the activation of the recurrent steps. The output of the LSTM is propagated to a fully connected layer with a single neuron and the sigmoid activation function. In order to avoid overfitting, we apply dropout to the inputs of the fully connected layer with a probability of 50%. The network is trained by minimizing the binary cross-entropy loss with the RMSProp optimizer, using a learning rate of , a batch size of and a number of epochs. In compact notation, the network can be written as Conv()-Pool()-Conv()-Pool()-Conv()-LSTM()-FC(), where Conv() represents a convolutional layer with filters and a filter size of , Pool() is a pooling layer with a decimation factor of , LSTM() is an LSTM layer with hidden cells and FC() is a fully connected layer with a single neuron.
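To make the layer stacking more concrete, the sketch below traces how the time axis of a window shrinks through a Conv-Pool-Conv-Pool-Conv stack before it reaches the LSTM. The window length, kernel sizes and pooling factor are placeholders, and the absence of padding is an assumption; none of these values are taken from the original text.

```python
def conv_out_len(length, kernel_size, stride=1):
    """Output length of an unpadded 1D convolution (assumed 'valid' padding)."""
    return (length - kernel_size) // stride + 1


def pool_out_len(length, factor):
    """Output length of max pooling with the given decimation factor."""
    return length // factor


def trace_lengths(window_len, kernel_sizes, pool_factor):
    """Time-axis length after every layer of the convolutional part."""
    trace = [("input", window_len)]
    length = window_len
    for i, kernel in enumerate(kernel_sizes):
        length = conv_out_len(length, kernel)
        trace.append((f"conv{i + 1}", length))
        if i < len(kernel_sizes) - 1:  # every convolution except the last is pooled
            length = pool_out_len(length, pool_factor)
            trace.append((f"pool{i + 1}", length))
    return trace


# Placeholder hyper-parameters, for illustration only.
for name, length in trace_lengths(100, kernel_sizes=[5, 3, 3], pool_factor=2):
    print(name, length)
```

The last value in the trace is the number of time steps that the LSTM layer would process for each window under these assumptions.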
II-C Puff detection
By forwarding windows from a recording to the trained puff detection network (Section II-B), we obtain the predictions vector with length . Essentially is the probability that the -th window is a puff and represents the total number of extracted windows of length and step .
Puff detection is achieved by initially performing a local maxima search in , with a minimum distance between successive peaks equal to samples. The next step is to discard peaks that are associated with a probability that is lower than a threshold set to . Both the minimum distance between peaks and were selected by experimenting with a small part of the SED dataset. As a result, we obtain the set of detected puffs, , where is the timestamp of -th detected puff and the total number of detected puffs. The process is illustrated in Figure 3.
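The peak-picking procedure can be sketched as below. The way in which competing peaks are resolved (higher peaks are kept first) is an assumption, since only the minimum distance and the probability threshold are described; the values in the usage example are placeholders and the function name `detect_puffs` is hypothetical.

```python
def detect_puffs(probs, min_distance, threshold):
    """Return the indices of the windows that are reported as puffs."""
    # Step 1: candidate local maxima of the prediction vector.
    candidates = [
        i for i in range(1, len(probs) - 1)
        if probs[i - 1] < probs[i] >= probs[i + 1]
    ]
    # Step 2: enforce the minimum distance, keeping higher peaks first.
    kept = []
    for i in sorted(candidates, key=lambda idx: probs[idx], reverse=True):
        if all(abs(i - j) >= min_distance for j in kept):
            kept.append(i)
    # Step 3: discard peaks with a probability below the threshold.
    return sorted(i for i in kept if probs[i] >= threshold)


probs = [0.1, 0.9, 0.2, 0.8, 0.1, 0.3, 0.1, 0.95, 0.2]
puff_indices = detect_puffs(probs, min_distance=3, threshold=0.5)
```

Each returned index can be converted to a timestamp through the right end of the corresponding window.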
III Temporal localization of smoking sessions
The second step of the proposed algorithm aims at the temporal localization of smoking sessions that occur during a day. In our early experiments we observed that, in all-day recordings, the density of detected puffs is high during a smoking session and low everywhere else. The second step of our algorithm takes advantage of this observation and groups the detected puffs into smoking session clusters using the density-based spatial clustering of applications with noise (DBSCAN) algorithm.
More specifically, let be an all-day, in-the-wild recording with dimensions , where . Next, we use the trained puff detection model (Section II-B) to produce the set of puff detection estimates. Subsequently, we apply clustering using DBSCAN on the set using a minimum distance between clusters that corresponds to seconds (as this is the minimum distance between consecutive smoking sessions in the SED-FL dataset) and a minimum number of points per cluster set to .
Each cluster that DBSCAN produces is then associated with the first and last timestamps of the detected puffs that belong to that specific cluster. This pair of timestamps corresponds to the start and end moments of a smoking session. Formally, the final output of the algorithm is the set , where represents the start and end timestamps of the -th detected smoking session. An example depicting the temporal localization of smoking sessions can be found in Figure 4.
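The clustering and session extraction steps can be sketched with a small, self-contained DBSCAN over one-dimensional timestamps. The neighborhood radius and the minimum number of points in the usage example are placeholders, and the function names are hypothetical; in practice, an existing implementation such as the one provided by scikit-learn could be used instead.

```python
def dbscan_1d(points, eps, min_pts):
    """Return one cluster id per point (-1 marks noise)."""
    n = len(points)
    neighbours = [
        [j for j in range(n) if abs(points[i] - points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # only core points expand
                queue.extend(neighbours[j])
    return labels


def localize_sessions(puff_times, eps, min_pts):
    """Return (start, end) timestamps of each cluster of detected puffs."""
    sessions = {}
    for t, c in zip(puff_times, dbscan_1d(puff_times, eps, min_pts)):
        if c == -1:
            continue
        start, end = sessions.get(c, (t, t))
        sessions[c] = (min(start, t), max(end, t))
    return sorted(sessions.values())


# Placeholder puff timestamps in seconds; the puff at 900 s is isolated.
puffs = [10, 25, 40, 55, 70, 900, 2000, 2020, 2035, 2050]
sessions = localize_sessions(puffs, eps=30, min_pts=4)
```

Isolated detections, which are likely to be false positives of the puff detector, end up as noise and do not produce a session.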
IV Experiments and evaluation
In order to fine-tune and evaluate our method we collected two datasets. The SED dataset was captured in semi-controlled environments (e.g., private residences or cafes) and contains a single smoking session per recording. On the other hand, the SED-FL dataset was captured under in-the-wild conditions and contains all-day recordings that include smoking sessions and other daily activities (e.g., working, eating). Inertial data were collected using a Mobvoi TicWatch E smartwatch at a sampling rate equal to Hz.
The SED dataset consists of subjects performing smoking sessions, with a total duration of hours. The SED-FL dataset consists of all-day sessions from subjects, with a total duration of hours (Table I). Three of the subjects participate in both datasets. It should be emphasized that we asked the subjects to smoke naturally; as a result, they were free to engage in a discussion or perform additional activities (two instances are depicted in Figure 5). All subjects were already smokers and signed an informed consent form prior to their participation. In order to label the data in SED, we recorded the smoking sessions using the camera of a typical smartphone. To produce the GT for the all-day, in-the-wild sessions of SED-FL, a smartwatch application was developed that enabled subjects to easily note the start and end timestamps of their smoking sessions. It is worth noting that both datasets deal with the consumption of tobacco using cigarettes; no electronic cigarettes (also known as vaping devices), pipes or heated tobacco products were used. Both datasets are publicly available at https://mug.ee.auth.gr/smoking-event-detection/.
We conducted two series of experiments. In the first experiment (EX-I), we evaluate the puff detection performance using the SED dataset. Moreover, we compare the performance of the proposed puff detection approach with the method proposed in . In the second experiment (EX-II), we evaluate the smoking session temporal localization performance using the SED-FL dataset. Both EX-I and EX-II are performed in a leave-one-subject-out (LOSO) fashion.
In order to measure the puff detection performance (EX-I), we apply the strict evaluation scheme presented in . An example of the evaluation scheme is presented in Figure 3. Essentially: a) only the first detected puff within the duration of a GT interval is considered a true positive (TP), while all subsequent ones count as false positives (FP), b) GT intervals without detections count as false negatives (FN) and c) predictions outside GT intervals are considered FP. It should be noted that the evaluation scheme of cannot calculate true negatives (TN). However, at a window level we can effectively measure TP, FP, FN and TN by comparing the label of each extracted window with the GT. As a result, we can calculate the weighted accuracy metric, defined as , using a weight equal to (total time spent during smoking sessions divided by the total time spent during puffs).
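The three rules of the strict evaluation scheme can be sketched as follows. Ground-truth puffs are assumed to be given as (start, end) intervals and detections as timestamps; the interval values are placeholders and the function names are hypothetical. The F-score is computed with its standard definition.

```python
def strict_puff_counts(gt_intervals, detections):
    """Count TP, FP and FN under the strict scheme."""
    tp = fp = 0
    hit = [False] * len(gt_intervals)
    for t in sorted(detections):
        idx = next(
            (k for k, (start, end) in enumerate(gt_intervals) if start <= t <= end),
            None,
        )
        if idx is None:
            fp += 1  # rule c: detection outside every GT interval
        elif hit[idx]:
            fp += 1  # rule a: repeated detection inside the same GT interval
        else:
            hit[idx] = True
            tp += 1  # rule a: first detection inside a GT interval
    fn = hit.count(False)  # rule b: GT intervals without any detection
    return tp, fp, fn


def f_score(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


gt = [(0.0, 2.0), (5.0, 7.0), (10.0, 12.0)]
tp, fp, fn = strict_puff_counts(gt, detections=[1.0, 1.5, 6.0, 8.0])
score = f_score(tp, fp, fn)
```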
Regarding EX-II, a detected smoking session is considered a TP if its middle timestamp (calculated as ) is within the duration of a GT interval; otherwise, it is considered an FP. In addition, GT intervals without detections are considered FN. Figure 4 illustrates the aforementioned evaluation scheme. Similar to EX-I, we also calculated the weighted accuracy for EX-II using a weight equal to . Finally, we calculated the Jaccard Index (JI), defined as , where and are the intervals of the true and the predicted smoking sessions, respectively.
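A minimal sketch of the session-level evaluation is given below. It assumes that the middle timestamp is the average of the start and end timestamps, that the JI is the duration of the intersection divided by the duration of the union of the true and predicted sessions, and that the intervals within each list do not overlap. Whether the JI is aggregated per recording or over the whole dataset is not specified here, and the function names are hypothetical.

```python
def session_counts(gt_sessions, detected_sessions):
    """Count TP, FP and FN using the middle timestamp of each detection."""
    tp = fp = 0
    matched = [False] * len(gt_sessions)
    for start, end in detected_sessions:
        middle = (start + end) / 2
        idx = next(
            (k for k, (s, e) in enumerate(gt_sessions) if s <= middle <= e),
            None,
        )
        if idx is None:
            fp += 1
        else:
            tp += 1
            matched[idx] = True
    fn = matched.count(False)
    return tp, fp, fn


def jaccard_index(true_sessions, pred_sessions):
    """Intersection over union of two lists of (start, end) intervals."""
    intersection = sum(
        max(0.0, min(t_end, p_end) - max(t_start, p_start))
        for t_start, t_end in true_sessions
        for p_start, p_end in pred_sessions
    )
    total_true = sum(end - start for start, end in true_sessions)
    total_pred = sum(end - start for start, end in pred_sessions)
    union = total_true + total_pred - intersection
    return intersection / union if union > 0 else 0.0


true_sessions = [(0.0, 10.0), (20.0, 30.0)]
pred_sessions = [(5.0, 15.0), (40.0, 50.0)]
counts = session_counts(true_sessions, pred_sessions)
ji = jaccard_index(true_sessions, pred_sessions)
```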
The obtained results showcase the high potential of our approach, both for the detection of individual puffs (upper part of Table II) and for the temporal localization of smoking events in-the-wild (lower part of Table II). More specifically, regarding EX-I, the proposed approach achieves a weighted accuracy of and an F-score of using the stricter evaluation scheme of (against and obtained by ). Concerning EX-II, our approach achieves an F-score/weighted accuracy/JI equal to // which indicates that smoking sessions can be effectively detected under in-the-wild conditions.
In this paper we present a two-step, bottom-up method towards the in-the-wild monitoring of smoking behavior. LOSO experimental results using our realistic SED and SED-FL datasets reveal the high potential of our approach towards the detection of puffs and the localization of smoking sessions during the day, under in-the-wild conditions.
The work leading to these results has received funding from the EU Commission under Grant Agreement No. 965231, the REBECCA H2020 project (https://rebeccaproject.eu/).
-  (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231. Cited by: §III.
-  (2002) Annual smoking-attributable mortality, years of potential life lost, and economic costs–United States, 1995-1999. MMWR. Morbidity and Mortality Weekly Report 51 (14), pp. 300–303. Cited by: §I.
-  (2019) Wearable sensors for monitoring of cigarette smoking in free-living: a systematic review. Sensors. Cited by: §I.
-  (2018) Global wellness economy monitor. Author Miami. Cited by: §I.
-  (2020) A data driven end-to-end approach for in-the-wild monitoring of eating behavior using smartwatches. IEEE Journal of Biomedical and Health Informatics 25 (1), pp. 22–34. Cited by: §II-A, §II-B, §IV-C, §V.
-  (2005) Self-help interventions for smoking cessation. Cochrane database of systematic reviews (3). Cited by: §I.
-  (2017) WHO report on the global tobacco epidemic, 2017: monitoring tobacco use and prevention policies. World Health Organization. Cited by: §I, §I.
-  (2001) Acute effects of passive smoking on the coronary circulation in healthy young adults. JAMA 286 (4). Cited by: §I.
-  (2015) PuffMarker: a multi-sensor approach for pinpointing the timing of first lapse in smoking cessation. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 999–1010. Cited by: §I.
-  (2016) A hierarchical lazy smoking detection algorithm using smartwatch sensors. In 2016 IEEE 18th International Conference on e-Health Networking, Applications and Services (Healthcom), pp. 1–6. Cited by: §I, §IV-B, TABLE II, §V.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II-B.
-  (2012) There’s an app for that: content analysis of paid health and fitness apps. Journal of medical Internet research 14 (3), pp. e72. Cited by: §I.