A Study of Machine Learning Models in Predicting the Intention of Adolescents to Smoke Cigarettes

10/28/2019 ∙ by Seung Joon Nam, et al.

The use of electronic cigarettes (e-cigarettes) is increasing among adolescents. This is problematic because consuming nicotine at an early age can harm the developing teenage brain and overall health. Additionally, e-cigarette use may lead to the use of conventional cigarettes, which is more harmful. Many studies of e-cigarettes and cigarettes have focused on identifying and analyzing the causes of smoking using conventional statistics. However, there is a lack of research on prediction models, which are more directly applicable to anti-smoking campaigns. In this paper, we study prediction models that estimate an individual's (whether or not they use e-cigarettes) intention to smoke cigarettes, so that one can be informed early about the risk of going down the path of smoking cigarettes. To construct the prediction models, five machine learning (ML) algorithms are applied and tested for their accuracy in predicting the intention to smoke cigarettes among never-smokers, using data from the 2018 National Youth Tobacco Survey (NYTS). In our investigation, the Gradient Boosting Classifier shows the highest accuracy of all the models. Using the best prediction model, we also built a public website that lets users enter their information and receive a prediction of their intention to smoke cigarettes.


1 Introduction

Between 2017 and 2018, the number of high school students who use e-cigarettes increased by 78% to 3.05 million, and the number of middle school students who use e-cigarettes increased by 48% to 570,000 2018NYTSData. According to the 2018 National Youth Tobacco Survey (NYTS), this brings the total to about 3.6 million middle and high school students using e-cigarettes 2018NYTSData. The problem with youth e-cigarette use is that exposure to nicotine at an early age can cause addiction and harm the developing brain ECIGFACT. Also, Bunnell et al. bunnell2015intentions found that e-cigarette use is associated with smoking cigarettes. Because e-cigarette use is a relatively recent issue, only a small number of studies have examined e-cigarette use among adolescents dawkins2013vaping; palazzolo2013electronic; miech2017cigarette; raloff2015dangers; dutra2014electronic; kosmider2014carbonyl; littlefield2015electronic; pepper2017risk; huey2017escape; schneider2015vaping; mccabe2017associations; penzes2018bidirectional; arnold2014vaping; miech2017kids; schripp2013does; westling2017electronic; mccabe2017smoking; tsai2018reasons. Dawkins et al. dawkins2013vaping discuss how vaping is used to decrease dependence on cigarettes and argue that e-cigarettes compare favorably with nicotine replacement therapy. Reported side effects of vaping included throat and mouth irritation, but only a small percentage of users experienced them. In another study, Palazzolo palazzolo2013electronic notes that the harmful effects of vaping remain unclear due to a lack of empirical data, although some cases of adverse effects have been reported. McCabe et al. mccabe2017smoking state that students who use e-cigarettes early on tend to smoke as they get older, suggesting that e-cigarettes act as a gateway to other, more harmful drugs. Chapman chapman2014cigarettes states that vaping can have a positive impact when smokers transition from cigarettes to vaping, helping them move away from smoking; a negative impact, however, is that non-smokers may take up vaping, which could lead to other, more harmful drugs. Another study by Miech et al. miech2017cigarette reports that people who vaped were four times more likely to smoke cigarettes, and that vaping does not predict a decrease in smoking. Dutra et al. dutra2014electronic report that adolescents who have vaped have a higher chance of smoking cigarettes, and that most vapers also smoked cigarettes. According to huey2017escape, more than 250,000 teens who had never smoked a regular cigarette had vaped, and those who did are twice as likely to smoke regular cigarettes in the future.

These studies suggest that vaping can lead to smoking cigarettes later on. However, they share a methodological limitation: most of them used conventional statistical methods to analyze the relationship between causes and effects of e-cigarette use and/or smoking habits. For example, one study bunnell2015intentions discusses the association between e-cigarette use and smoking intention among US youths who have never smoked a traditional cigarette. The article used chi-squared tests, multivariate logistic regression, and other models to analyze the data. Although analyzing the data with a logistic regression model was appropriate for their study, it may not be the most practical approach for anti-smoking campaigns that aim to prevent smoking habits. Prediction models built with machine learning (ML) algorithms would be more useful, because they can predict whether a person will have the intention to smoke cigarettes based on his or her e-cigarette use as well as race, ethnicity, gender, and environment. Additionally, a prediction can directly help individuals stay away from the path of smoking cigarettes, which is more helpful than simply analyzing data.

In this paper, we use the 2018 NYTS results and construct multiple prediction models (e.g., Gradient Boosting Classifier and Decision Tree Classifier) that can predict whether a person will have an intention to smoke cigarettes. After the analysis, the Gradient Boosting Classifier had the highest accuracy, 93%, of all the models tested. We divide the data into two sets: cigarette never-smokers and ever-smokers. The never-smoker group was analyzed to find the best-fitting model for predicting the intention to smoke cigarettes for both e-cigarette users and non-users. In addition, we create a website built around the Gradient Boosting Classifier model that allows the public to input factors (e.g., sex, race, and age) and receive a prediction of whether they will have a high intention to smoke cigarettes. This gives the general public more awareness of their risk of smoking and may help them steer away from the path of smoking cigarettes.

Consequently, this paper makes two contributions: (i) finding the best-fitting model for predicting smoking intention from the NYTS data, and (ii) creating a website to help students, especially e-cigarette users, recognize the risk that e-cigarette use leads to smoking cigarettes.

This paper is organized as follows: Section 2 introduces the background of the ML techniques; Section 3 presents the analysis methods; Section 4 explains the prediction system for future smoking; Section 5 presents the conclusion and future work.

2 Background of Machine Learning Techniques

In this section, we introduce the CRISP-DM (Cross-Industry Standard Process for Data Mining) process model wirth2000crisp. The CRISP-DM process model is a comprehensive and simple method for data mining and analysis. Throughout Section 2.1, we describe the six steps of the CRISP-DM process model. In addition, five machine learning algorithms used in this research are introduced in Section 2.2.

2.1 CRISP-DM

CRISP-DM wirth2000crisp (see Figure 1) is a cross-industry standard process for data mining and analysis. CRISP-DM helps plan a solid method for running a data mining project and aims to make data mining and analysis projects more efficient, cheaper, and faster. CRISP-DM comprises six steps: (1) business understanding, (2) data understanding, (3) data preparation, (4) modeling, (5) evaluation, and (6) deployment.

Figure 1: Process of CRISP-DM

2.1.1 Business Understanding

Business Understanding introduces the objectives of the project and establishes what its goals will be. Given these goals and objectives, a data mining problem is formulated and a plan for achieving the goals is made. Figure 1 shows the CRISP-DM process, and its first box covers the business understanding step.

2.1.2 Data Understanding

Data Understanding includes finding an initial data set, becoming familiar with its contents, and making observations from it in order to form a hypothesis. The initial data set must be reliable and accurate, meaning that it is not outdated and is internally consistent. The second box in Figure 1 indicates the data understanding step.

2.1.3 Data Preparation

Data Preparation is the process of converting the initial, or raw, data set into the final data set. This step may alter the data to make it more applicable to the goal. The third box in Figure 1 covers the data preparation step.

2.1.4 Modeling

In Modeling, artificial intelligence models are created using various machine learning algorithms. The data set prepared in the data preparation step is used in this modeling step.

2.1.5 Evaluation

Evaluation is reviewing the performance (e.g., accuracy) of the models and deciding on the best one. This is the process in which the models are improved toward a target performance. Success criteria can be based on the speed of the algorithm, memory usage, or prediction accuracy. Different performance metrics are used depending on the purpose of the analysis. For example, for a regression problem, the $R^2$ score can be used. For a classification problem, the accuracy ($\mathrm{Acc}$) of Equation 1, the number of correct classifications divided by the total number of classifications, can be used. Since this paper addresses a classification problem, some criteria for classification are introduced in the following.

$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{C} n_{ii}$    (1)

where $C$ denotes the number of classes, $n_{ii}$ denotes the number of cases in which the values of the $i$-th prediction and the $i$-th observation are identical, and $N$ denotes the total number of classifications.

In this study, we used the four performance metrics Acc (Equation 1), Precision (Equation 2), Recall (Equation 3), and F1-score (Equation 4). These metrics can be calculated easily from the following four indicators.

  • True Positive (TP): The number of observed positive cases that were correctly predicted as positive

  • False Positive (FP): The number of observed negative cases that were wrongly predicted as positive

  • False Negative (FN): The number of observed positive cases that were wrongly predicted as negative

  • True Negative (TN): The number of observed negative cases that were correctly predicted as negative

These four indicators can be used to define the equations of Precision and Recall as shown.

$\mathrm{Precision} = \frac{TP}{TP + FP}$    (2)
$\mathrm{Recall} = \frac{TP}{TP + FN}$    (3)

Precision is commonly used to measure the influence of False Positives, while Recall is used to measure the influence of False Negatives. F1-score is defined as the harmonic mean of Precision and Recall.

$\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$    (4)

Precision, Recall, and F1-score equal one when the prediction is perfect, and they yield zero for total prediction failure.
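As a brief illustration (a minimal sketch on made-up labels, not the evaluation code used in this paper), the four metrics of Equations 1-4 can be computed directly with Scikit-Learn:

```python
# Minimal sketch: computing the metrics of Equations 1-4 with Scikit-Learn.
# The label vectors are illustrative only, not taken from the NYTS data.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # observed classes (1 = "Yes", 0 = "No")
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted classes

print("Accuracy :", accuracy_score(y_true, y_pred))    # Equation 1
print("Precision:", precision_score(y_true, y_pred))   # Equation 2: TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # Equation 3: TP / (TP + FN)
print("F1-score :", f1_score(y_true, y_pred))          # Equation 4
```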

2.1.6 Deployment

Deployment is the process of organizing the information gained, such as the model, so that it is understandable for the model user. This step is carried out by the user rather than the analyst, so it is essential for the user to understand how to use the models. The result of this step is a final report. The last step in Figure 1 represents the deployment step.

2.2 Machine Learning Models

In this subsection, we introduce the machine learning models that this paper utilizes. Since the Linear Regression model is a basic model in machine learning, we start with it. Then the other models (i.e., Logistic Regression Classifier, Gaussian Naive Bayes Classifier, Decision Tree Classifier, Random Forest Classifier, and Gradient Boosting Classifier) are introduced.

2.2.1 Linear Regression Model

The Linear Regression model represents a linear trend in the data and is used to predict a continuous response $y$ from predictor variables $x_1, \dots, x_p$. We define the Linear Regression model formally as Equation 5.

$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_p x_p$    (5)

where $\mathbf{w}$ denotes the weights (or coefficients) $\{w_0, w_1, \dots, w_p\}$, $\mathbf{x}$ denotes the features $\{x_1, x_2, \dots, x_p\}$, and $p$ denotes the number of features.

In this setting, machine learning for the Linear Regression model means estimating the weights (or parameters) that best fit the data. A simple approach is the least squares approach, in which the parameters are found by minimizing the residual sum of squares (RSS).

$\mathrm{RSS}(\mathbf{w}) = \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} \bigl(y_i - \hat{y}_i\bigr)^2$    (6)

where $e_i = y_i - \hat{y}_i$ denotes the residual error for the $i$-th data point (or row), consisting of the pair $(\mathbf{x}_i, y_i)$, and $N$ denotes the number of data points.

Using the RSS (Equation 6), a gradient descent algorithm can be applied to find the weights that best fit the training data. The gradient descent algorithm proceeds as follows: (1) initial weights are randomly selected; (2) the weights are updated according to

$w_j^{\mathrm{new}} = w_j^{\mathrm{old}} - \eta \frac{\partial\, \mathrm{RSS}}{\partial w_j}$    (7)

where $w_j^{\mathrm{old}}$ denotes the previous value of the weight $w_j$ and $\eta$ is a learning rate; (3) if $|w_j^{\mathrm{new}} - w_j^{\mathrm{old}}| < \epsilon$ for all $j$, where $\epsilon$ denotes a threshold for convergence, then stop.
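For concreteness, the following is a minimal NumPy sketch of this gradient descent loop on synthetic data; the learning rate, threshold, and data are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of the gradient descent update of Equation 7 applied to the
# RSS of Equation 6, on synthetic data (illustrative values throughout).
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.random((100, 2))]          # bias column plus two features
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.standard_normal(100)

w = rng.standard_normal(3)        # (1) randomly selected initial weights
eta, eps = 0.1, 1e-6              # learning rate and convergence threshold
for _ in range(100_000):          # iteration cap for safety
    grad = -2 * X.T @ (y - X @ w) / len(y)   # gradient of the (mean) RSS
    w_new = w - eta * grad                   # (2) update step (Equation 7)
    if np.max(np.abs(w_new - w)) < eps:      # (3) stop once the change is below eps
        w = w_new
        break
    w = w_new
print(w)    # should be close to [1.0, 2.0, -3.0]
```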

2.2.2 Logistic Regression Classifier

The Logistic Regression Classifier is an extension of the Linear Regression model. In particular, the response in Logistic Regression is discrete (e.g., True or False). In other words, given predictor variables $\mathbf{x}$, a discrete value $y$ is predicted. A simple case of Logistic Regression is the binary Logistic Regression model, whose output is Boolean. The binary Logistic Regression model can be defined formally as shown in Equation 8.

$P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-f(\mathbf{x})}}$    (8)

where $e$ is the natural exponential function and $f(\mathbf{x})$ is identical to Equation 5.

For general Logistic Regression, whose output can take more than two values, a softmax function can be applied to represent a probability distribution over multiple classes. Equation 9 shows the general Logistic Regression model.

$P(y = k \mid \mathbf{x}) = \frac{e^{f_k(\mathbf{x})}}{\sum_{j=1}^{K} e^{f_j(\mathbf{x})}}$    (9)

where $f_k(\mathbf{x})$ is a linear function of the form of Equation 5 for class $k$ and $K$ denotes the number of classes.

For machine learning, the Logistic Regression model requires a loss function that measures how well the learned model fits the data, analogous to the residual sum of squares in the Linear Regression model. Equation 10 shows the loss function of the general Logistic Regression.

$L(\mathbf{w}) = -\sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_i = k] \log P(y_i = k \mid \mathbf{x}_i)$    (10)

where $\mathbb{1}[\cdot]$ is an indicator function that returns one if its argument is true and zero otherwise.

This loss function is used in the gradient descent algorithm (see Equation 7) to find the best weights, by substituting the loss function for the RSS in Equation 7.
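A minimal NumPy sketch of Equations 8-10 is shown below; the per-class scores are illustrative only and this is not the paper's implementation.

```python
# Minimal sketch of Equations 8-10 in NumPy (illustrative only).
import numpy as np

def sigmoid(z):
    """Binary Logistic Regression (Equation 8): P(y = 1 | x) for the score z = f(x)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """General Logistic Regression (Equation 9): a probability over K classes."""
    e = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return e / e.sum()

def loss_one_sample(probs, y):
    """Contribution of one sample to the loss of Equation 10, with true class index y."""
    return -np.log(probs[y])

p = softmax(np.array([2.0, 1.0, 0.1]))     # per-class scores f_k(x)
print(sigmoid(2.0), p, loss_one_sample(p, 0))
```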

2.2.3 Gaussian Naive Bayes Classifier

The Naive Bayes (NB) Classifier maron1961automatic is a classification method based on probability theory, by which one can predict class labels from (discrete or continuous) inputs. Basically, NB represents a joint distribution over the input random variables $X_1, \dots, X_n$ and a class random variable $Y$ under the assumption that the inputs are conditionally independent given $Y$. NB can be written as Equation 11.

$P(Y = y_k \mid X_1, \dots, X_n) \propto P(Y = y_k) \prod_{i=1}^{n} P(X_i \mid Y = y_k)$    (11)

where $n$ is the number of input random variables and $y_k$ denotes a class label of $Y$.

Gaussian Naive Bayes Classifier uses continuous inputs under the Gaussian (or Normal) distribution assumption.

$P(X_i = x \mid Y = y_k) = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}} \exp\!\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)$    (12)

where $\mu_{ik}$ and $\sigma_{ik}$ denote the mean and standard deviation of the input $X_i$ for the class label $y_k$, respectively.

For machine learning of Gaussian Naive Bayes Classifier, Maximum Likelihood Estimation (MLE), Maximum a Posteriori Probability (MAP), and Bayesian approach can be used murphy2012machine.
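As an illustration (a minimal sketch on synthetic two-dimensional data, not the NYTS features), Scikit-Learn's GaussianNB estimates the per-class means and variances of Equation 12 and classifies with Equation 11:

```python
# Minimal sketch of a Gaussian Naive Bayes Classifier (Equations 11-12) with
# Scikit-Learn on synthetic data; illustrative only.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),    # class 0 samples
               rng.normal(3.0, 1.0, (50, 2))])   # class 1 samples
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)       # estimates a mean and variance per feature and class
print(model.theta_)                  # class-conditional means (the mu_ik of Equation 12)
print(model.predict([[0.2, 0.1], [2.8, 3.1]]))
```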

2.2.4 Decision Tree Classifier

The Decision Tree Classifier quinlan1986induction; morgan1963problems; safavian1991survey; magerman1995statistical consists of a tree structure containing a set of hierarchical nodes (a root node, internal nodes, and leaf nodes). The root node and the internal nodes represent features or variables, while the leaf nodes denote values of the target variable. The main challenge of machine learning for decision trees is to construct these nodes and their hierarchy so that the tree can effectively classify instances from the predictor variables. The basic approach to building a decision tree, called ID3 (Iterative Dichotomiser 3), was introduced in quinlan1986induction. For example, suppose that the target variable has two classes (Positive and Negative); the expected information for this variable can be written as Equation 13. Note that the following equations are taken from quinlan1986induction.

$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$    (13)

where $p$ and $n$ denote the numbers of positive and negative cases, respectively. The expected information after branching on an attribute $A$ (a parent node of the target variable) can be derived as the weighted average.

$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$    (14)

where $v$ denotes the number of values of the attribute $A$ and $I(p_i, n_i)$ denotes the expected information for the $i$-th value of the attribute. The information gained by branching on $A$ can be obtained from the following equation.

$\mathrm{gain}(A) = I(p, n) - E(A)$    (15)

Thus, machine learning for a decision tree seeks a tree that maximizes the information gain at the root and at each internal node.
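A minimal sketch of these quantities for a binary target ($p$ positive and $n$ negative cases) follows; the split counts are illustrative only.

```python
# Minimal sketch of Equations 13-15 for a binary target; illustrative counts only.
import math

def info(p, n):
    """Expected information I(p, n) of Equation 13."""
    total = p + n
    result = 0.0
    for c in (p, n):
        if c:                                   # skip empty classes (0 * log 0 = 0)
            result -= (c / total) * math.log2(c / total)
    return result

def gain(p, n, splits):
    """Information gain of Equation 15: I(p, n) minus the weighted average E(A) of
    Equation 14, where `splits` lists (p_i, n_i) for each value of attribute A."""
    expected = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
    return info(p, n) - expected

# 9 positive and 5 negative cases, split by a three-valued attribute.
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))     # about 0.247
```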

2.2.5 Random Forest Classifier

A set of ML models can often perform better than a single ML model. Such an integration of ML models is called ensemble learning. The Random Forest Classifier ho1995random; ho2002data uses ensemble learning by forming a set of decision trees and producing an output voted on by each decision tree. Random Forest draws random samples from the training data and learns a decision tree from each sample, so that it obtains a set of decision trees (i.e., a forest). After learning, in the prediction (or application) stage, the class voted for by the majority of the learned decision trees is chosen as the final result. Equation 16 shows such majority voting.

$\hat{y} = \mathrm{mode}\{h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_T(\mathbf{x})\}$    (16)

where each $h_t$ is a single decision tree, $T$ is the number of trees, and the mode function yields as output the class label that appears most frequently among the individual classification results.

2.2.6 Gradient Boosting Classifier

The Gradient Boosting Classifier breiman1996arcing uses an ensemble model consisting of a set of simple models (e.g., decision tree stumps, trees containing only a root and its immediately connected leaf nodes). By adding such simple models sequentially, the resulting ensemble model is improved step by step and finally fitted to the data. In other words, after a simple model is applied, the samples it classifies are reused to fit another simple model, and this process is repeated until convergence (or until the desired predictive performance is achieved). The Gradient Boosting Classifier generalizes boosting (e.g., schapire1990strength; freund1996experiments) by using the gradient of a loss function.
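As a brief illustration of this idea (a minimal sketch on synthetic data with assumed hyperparameters, not the configuration used in Section 3), Scikit-Learn's GradientBoostingClassifier adds shallow trees one at a time:

```python
# Minimal sketch: boosting decision stumps with GradientBoostingClassifier on
# synthetic data (hyperparameters are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_depth=1 makes every weak learner a stump; n_estimators stumps are added
# sequentially, each fitted to the gradient of the loss of the current ensemble.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=1, learning_rate=0.1)
gb.fit(X, y)
print(gb.score(X, y))
```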

3 Analysis Methods

In this section, we introduce the specific processes of our analysis and the results from the analysis.

3.1 Business Understanding

In this paper, our goal is twofold: (1) to determine the most accurate machine learning model among Logistic Regression, Gaussian Naive Bayes, Decision Tree, Random Forest, and Gradient Boosting; and (2) to construct a public website that allows adolescents to learn whether they are likely to have the intention to smoke.

We utilize the data from NYTS for 2018 in order to construct ML models, and the models are analyzed to choose the most accurate one in predicting the intention for a non-smoker to smoke cigarettes.

3.2 Data Understanding

We use data from the 2018 National Youth Tobacco Survey 2018NYTS_DataSet, a nationwide US survey of middle school (grades 6-8) and high school (grades 9-12) youths' tobacco-related beliefs, attitudes, behaviors, and exposure to pro- and anti-tobacco influences. NYTS uses a three-stage cluster sampling design to obtain data that are nationally representative of students in grades 6-12 in all 50 states and the District of Columbia NYTS. However, the data cover only youths who are currently attending middle or high school, so the results may not apply to youths who are not attending school.

The survey contains a total of 88 questions (Table 8). We examined all of them and selected the questions necessary for building a model that predicts whether a person has the intention to smoke. Table 1 describes the format of the questions and the answer choices as written on the questionnaire. For example, in the first row of Table 1, Q1 represents question 1, the next column gives the actual question, "How old are you?", and the last column lists the possible answer choices, which range from 9 to 19 years old.

| Question Number | Question | Answers |
| Q1 | How old are you? | 9, 10, …, 19 years old |
| Q2 | What is your sex? | Male/Female |
| Q3 | What grade are you in? | 6, 7, …, 12, ungraded or other grade |
| Q4 | Are you Hispanic, Latino, Latina, or of Spanish origin? | Select one or more |
| Q88 | Because of a physical…making decisions? | No/Yes |
Table 1: Illustrative Example of Questions

3.3 Data Preparation

Our original data were taken from the 2018 National Youth Tobacco Survey. The data set consists of questions, represented as Q followed by the question number, and answers represented by numbers (e.g., 1, 2, 3, and 4) or words (e.g., Yes and No). We processed every question to prepare the data for machine learning. For example, a null answer meant that the question was not answered; we replaced all nulls with 0s, which represent unanswered choices.

The next step involved selecting the relevant data. The data set was originally 20189 rows x 195 columns. As mentioned in Section 1, our goal is to construct a prediction model and a public website. To achieve this goal, we divided the data into two groups: respondents who have never smoked a cigarette and respondents who have. All rows for individuals who have ever smoked were then deleted, because we only analyze the youths who never smoked cigarettes; the purpose of this first split between ever-smokers and never-smokers is to form the predictive model. Next, we extracted our target questions and the questions pertaining to them. There are 88 questions in the survey, but not every question is related to our prediction; after examining all of them, we chose specific questions, since some were redundant or only indirectly correlated with our goal. Out of the 88 questions, 47 were used. Questions 15, 16, 17, 43, 44, and 45, namely "do you think that you will try a cigarette soon?", "do you think you will smoke a cigarette in the next year?", "if one of your best friends were to offer you a cigarette, would you smoke it?", "do you think that you will try smoking tobacco in a hookah or waterpipe soon?", "do you think you will smoke tobacco in a hookah or waterpipe in the next year?", and "if one of your best friends were to offer you a hookah or waterpipe with tobacco, would you try it?", respectively, were our target questions, because they all relate to someone's intention to smoke, no matter how small that intention. We decided that the four answer choices for the target questions (definitely yes, probably yes, probably no, definitely no) could be simplified to two by treating definitely yes and probably yes as "yes" and definitely no and probably no as "no".
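A minimal pandas sketch of these steps is shown below; the file name, column names, and answer codings are assumptions for illustration and may differ from the actual NYTS file layout.

```python
# Minimal sketch of the data preparation described above. The file name "nyts2018.csv",
# the column names "Q7" and "Q16", and the answer codings are hypothetical.
import pandas as pd

df = pd.read_csv("nyts2018.csv")
df = df.fillna(0)                                   # unanswered questions become 0

# Keep only respondents who never smoked a cigarette (assuming Q7 == 2 codes "No").
never_smokers = df[df["Q7"] == 2].copy()

# Collapse the four answer choices of a target question (assuming 1 = definitely yes,
# 2 = probably yes, 3 = probably no, 4 = definitely no) into a binary label.
never_smokers["intention"] = never_smokers["Q16"].isin([1, 2]).astype(int)
```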

3.4 Modeling

We used five ML algorithms to generate models from the training data and evaluated the accuracy of each model. The models are (1) Decision Tree Classifier, (2) Gaussian NB Classifier, (3) Logistic Regression Classifier, (4) Gradient Boosting Classifier, and (5) Random Forest Classifier. From the data set prepared in Subsection 3.3, we assigned 80 percent to the training set and 20 percent to the test set. The training set is used to learn each prediction model, while the test set is used to evaluate the accuracy of the learned model. The model with the highest accuracy is identified in Subsection 3.5.
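The following is a minimal Scikit-Learn sketch of this step. The feature matrix is replaced by synthetic stand-in data with 47 columns, and default hyperparameters are assumed; it is not the exact training script used for the paper.

```python
# Minimal sketch of the 80/20 split and the five classifiers compared in this paper.
# The data below are a synthetic stand-in for the prepared NYTS features of Section 3.3.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=47, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Gaussian NB": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)                     # learn on the training set
    print(name, model.score(X_test, y_test))        # accuracy on the held-out test set
```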

3.5 Evaluation

Tables 2 to 6 show the prediction results of the five machine learning models. Note that the terms precision, recall, and F1-score used in the tables are explained in Subsection 2.1.5. The first column of each table lists the answer choices for question 16, "Do you think you will smoke a cigarette in the next year?" This step shows how the factors relate to our target question, Q16, and we used the machine learning models to find which factors affect how people answer Q16, i.e., their intention to smoke cigarettes.

|              | Precision | Recall | F1-Score | Cases |
| Yes          | 0.60      | 0.42   | 0.49     | 201   |
| No           | 0.94      | 0.97   | 0.96     | 2049  |
| Macro Avg    | 0.77      | 0.70   | 0.72     | 2250  |
| Weighted Avg | 0.91      | 0.92   | 0.92     | 2250  |
Table 2: Decision Tree (DT)

|              | Precision | Recall | F1-Score | Cases |
| Yes          | 0.25      | 0.77   | 0.38     | 201   |
| No           | 0.97      | 0.77   | 0.86     | 2049  |
| Macro Avg    | 0.61      | 0.77   | 0.62     | 2250  |
| Weighted Avg | 0.91      | 0.77   | 0.82     | 2250  |
Table 3: Gaussian Naive Bayes (NB)

|              | Precision | Recall | F1-Score | Cases |
| Yes          | 0.66      | 0.30   | 0.41     | 201   |
| No           | 0.93      | 0.98   | 0.96     | 2049  |
| Macro Avg    | 0.80      | 0.64   | 0.69     | 2250  |
| Weighted Avg | 0.91      | 0.92   | 0.91     | 2250  |
Table 4: Logistic Regression (LR)

|              | Precision | Recall | F1-Score | Cases |
| Yes          | 0.65      | 0.35   | 0.46     | 201   |
| No           | 0.94      | 0.98   | 0.96     | 2049  |
| Macro Avg    | 0.79      | 0.67   | 0.71     | 2250  |
| Weighted Avg | 0.91      | 0.92   | 0.91     | 2250  |
Table 5: Random Forest (RF)

|              | Precision | Recall | F1-Score | Cases |
| Yes          | 0.71      | 0.37   | 0.49     | 201   |
| No           | 0.94      | 0.99   | 0.96     | 2049  |
| Macro Avg    | 0.83      | 0.68   | 0.73     | 2250  |
| Weighted Avg | 0.92      | 0.93   | 0.92     | 2250  |
Table 6: Gradient Boosting (GB)

The precision, recall, F1-score, and case counts are shown in the tables above. The Gradient Boosting Classifier produced the most accurate results of all the models. Although the Gradient Boosting Classifier's F1-score for the answer choice "Yes" (0.49) is among the highest across the models, it is still not a high value. One possible reason for the low F1-score is that the NYTS data were insufficient and too limited to create an accurate model. In addition, the chosen ML models might not have been the best choices, and other algorithms might fit the NYTS data better. Finally, the questions, or x variables, given and chosen might not be the most fitting ones, and better options may exist. Potential variables that could be added to the NYTS data are a detailed family history, the respondent's location, and possibly the respondent's personality or personal qualities.

Table 7 and Figure 2 show the training and test accuracy of each machine learning model.

|                | Decision Tree | Gaussian NB | Logistic Regression | Random Forest | Gradient Boosting |
| Training Score | 0.9298        | 0.7695      | 0.9352              | 0.9298        | 0.9366            |
| Test Score     | 0.9226        | 0.7728      | 0.9235              | 0.9248        | 0.9306            |
Table 7: Accuracy Results for ML Models in terms of Training and Test

The training scores resulted from Cross Validation (CV). The training scores for Decision Tree, Gaussian NB, Logistic Regression, Random Forest, and Gradient Boosting were 0.9298, 0.7695, 0.9352, 0.9298, and 0.9366, respectively. Also, the test scores were 0.9226, 0.7728, 0.9235, 0.9248, and 0.9306, respectively.
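A minimal sketch of how such cross-validated training scores can be computed is shown below; the number of folds and the stand-in data are assumptions, since the paper does not state the CV settings.

```python
# Minimal sketch of cross-validated training scores (5 folds assumed; synthetic
# stand-in data in place of the NYTS training set).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=2000, n_features=47, random_state=0)

scores = cross_val_score(GradientBoostingClassifier(), X_train, y_train, cv=5)
print("CV training score:", scores.mean())
```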

In Figure 2, the x-axis represents the ML models and the y-axis represents the training score, test score, macro average, and weighted average. The figure shows that the Gradient Boosting Classifier has the highest accuracy among the models.

Figure 2: Accuracy Results for the Five Algorithms

3.6 Deployment

Our deployment takes the form of a public web page in which users (especially adolescents) enter their information to predict their likelihood of smoking cigarettes. The public web page (http://nyts.pythonanywhere.com) gives users instant access and a quick response with the predicted result. The architecture and structure of the web page are introduced in Section 4.

4 Prediction System for Future Smoking

The best prediction model (i.e., Gradient Boosting) from Section 3 is used in the public web page (http://nyts.pythonanywhere.com) for the purpose of an anti-smoking campaign for teenagers. The website aims to inform teenagers of their likelihood of future smoking, which may help prevent harmful consequences for adolescents' health. In this section, we introduce the website we developed.

Figure 3: Website Sequence Diagram

Figure 3 shows a sequence diagram representing how the user (e.g., an e-cigarette-using or non-smoking teenager) interacts with the website. The boxes represent the main entities: the user, the web server, and the ML model reasoner. The ML model reasoner contains the Gradient Boosting model learned in Section 3 from the NYTS data. The timeline of each entity runs from top to bottom, and the vertical rectangles denote the entities' life cycles. The arrows across the vertical rectangles represent event flows, which the following list summarizes.

  1. A user enters the web server.

  2. The web server shows 47 questions to the user.

  3. The user inputs all answers for the questions.

  4. The web server receives the answers from the user and sends them to the ML model reasoner.

  5. The ML model reasoner performs prediction using the answers and sends the results back to the web server.

  6. The web server displays the prediction results using a chart.

The web server was developed in Python. We used Flask v1.1.1, a Python-based web framework, and "pythonanywhere.com" for web hosting. The Gradient Boosting model was learned using Scikit-Learn v0.21.3 and stored with the Python pickle module, which serializes and de-serializes the learned model. Scikit-Learn v0.21.3 was also used to run the model for predictions. An example of the web page can be found in Figure 4.
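A minimal sketch of this deployment is shown below; the route name, request format, and pickle file name are hypothetical and only illustrate how a pickled Scikit-Learn model can serve predictions behind Flask.

```python
# Minimal sketch of serving the pickled Gradient Boosting model with Flask.
# The route, request format, and file name are hypothetical.
import pickle
from flask import Flask, jsonify, request

app = Flask(__name__)
with open("gradient_boosting_model.pkl", "rb") as f:   # hypothetical file name
    model = pickle.load(f)                              # de-serialize the learned model

@app.route("/predict", methods=["POST"])
def predict():
    answers = request.get_json()["answers"]             # the 47 encoded answers
    probability = model.predict_proba([answers])[0][1]  # probability of the "Yes" class
    return jsonify({"intention_to_smoke": float(probability)})

if __name__ == "__main__":
    app.run()
```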

5 Conclusion

E-cigarette use has increased among adolescents. This is a worldwide problem because, as many of the studies cited in the introduction state, e-cigarette use can lead to future cigarette use. Since e-cigarettes are a recent and rising issue, little research has been done on this topic compared with cigarette smoking, and among the existing studies there is a lack of work on prediction models, which are more practical for preventing adolescents from using (e-)cigarettes. Thus, we used the 2018 NYTS data to develop multiple prediction models for an adolescent's intention to smoke cigarettes.

The most accurate prediction model was the Gradient Boosting Classifier, with an overall accuracy of 93%. This model was applied in the website we designed, which allows the public to input their information with respect to tobacco products, including e-cigarettes, cigarettes, and cigars. With this information, the model predicts the respondent's probability of future smoking. This can help the public become more aware of certain factors in their lives and be attentive to how their drug use or environment can affect their intention to smoke cigarettes.

Further research could include a wider range of ages, since our research is mainly focused on adolescents rather than adults. In order to improve the accuracy of the prediction model, it is essential to increase the amount of data or choose better, more fitting, variables.


Appendix A

A.1

The following table lists the questions that were used for the data analysis and what they were used for. The "Number" column specifies the question number from the 2018 NYTS questionnaire, the "Question" column gives the exact question from the questionnaire, and the "Where to Use" column specifies the purpose of the question. "Predictor Variable" means that the question was used as a factor, or x variable, in predicting a person's intention to smoke. "Data Selection for Non-Smoker" means that the question was used to separate the data into non-smoker and smoker groups. "Data Selection for Smoking Intention" means that the question was used as a target question, or y variable, because the goal is to predict the intention to smoke cigarettes. "Data Selection for Non-E-Smoker" means that the question was used to separate non-e-cigarette users from e-cigarette users.

| Number | Question | Where to Use |
| Q1 | How old are you? | Predictor Variable |
| Q2 | What is your sex? | Predictor Variable |
| Q3 | What grade are you in? | Predictor Variable |
| Q4 | Are you Hispanic, Latino, Latina, or of Spanish origin? | Predictor Variable |
| Q5 | What race or races do you consider yourself to be? | Predictor Variable |
| Q6 | Have you ever been curious about smoking a cigarette? | Predictor Variable |
| Q7 | Have you ever tried cigarette smoking, even one or two puffs? | Data Selection for Non-Smoker |
| Q15 | Do you think that you will try a cigarette soon? | Data Selection for Smoking Intention |
| Q16 | Do you think you will smoke a cigarette in the next year? | Data Selection for Smoking Intention |
| Q17 | If one of your best friends were to offer you a cigarette, would you smoke it? | Data Selection for Smoking Intention |
| Q18 | Have you ever been curious about smoking a cigar, cigarillo, or little cigar? | Predictor Variable |
| Q19 | Have you ever tried smoking cigars, cigarillos, or little cigars even one or two puffs? | Data Selection for Non-Smoker |
| Q23 | Have you ever been curious about using chewing tobacco, snuff, or dip? | Predictor Variable |
| Q24 | Have you ever used chewing tobacco, snuff, or dip, such as Redman, Levi Garrett, Beechnut, Skoal, Skoal Bandits, or Copenhagen, even just a small amount? | Data Selection for Non-Smoker |
| Q27 | Have you ever been curious about using an e-cigarette? | Predictor Variable |
| Q28 | Have you ever used an e-cigarette, even once or twice? | Data Selection for Non-E-Smoker |
| Q29 | How old were you when you first tried using an e-cigarette, even once or twice? | Predictor Variable |
| Q30 | In total, on how many days have you used e-cigarettes in your entire life? | Predictor Variable |
| Q31 | During the past 30 days, on how many days did you use e-cigarettes? | Predictor Variable |
| Q32 | During the past 30 days, where did you get or buy the e-cigarettes that you have used? | Predictor Variable |
| Q33 | What are the reasons you have used e-cigarettes? | Predictor Variable |
| Q34 | Have you ever used marijuana, marijuana concentrates, marijuana waxes, THC, or hash oils in an e-cigarette? | Predictor Variable |
| Q35 | Do you think that you will try an e-cigarette soon? | Predictor Variable |
| Q36 | Do you think you will use an e-cigarette in the next year? | Predictor Variable |
| Q37 | If one of your best friends were to offer you an e-cigarette, would you use it? | Predictor Variable |
| Q38 | Have you ever been curious about smoking tobacco in a hookah or waterpipe? | Predictor Variable |
| Q39 | Have you ever tried smoking tobacco in a hookah or waterpipe, even one or two puffs? | Data Selection for Non-Smoker |
| Q43 | Do you think that you will try smoking tobacco in a hookah or waterpipe soon? | Data Selection for Smoking Intention |
| Q44 | Do you think you will smoke tobacco in a hookah or waterpipe in the next year? | Data Selection for Smoking Intention |
| Q45 | If one of your best friends were to offer you a hookah or waterpipe with tobacco, would you try it? | Data Selection for Smoking Intention |
| Q59 | During the past 30 days, did anyone refuse to sell you any tobacco products because of your age? | Data Selection for Non-Smoker |
| Q61 | During the past 30 days, how often did you see a warning label on a cigar, cigarillo, or little cigar package? | Predictor Variable |
| Q62 | During the past 30 days, how often did you see a warning label on an e-cigarette package? | Predictor Variable |
| Q63 | During the past 30 days, how often did you see a warning label on a package of hookah tobacco? | Predictor Variable |
| Q64 | In the past 12 months, have you seen or heard The Real Cost, on television, the internet, social media, or radio as part of ads about tobacco? | Predictor Variable |
| Q65 | How much do you think people harm themselves when they smoke cigarettes some days but not every day? | Predictor Variable |
| Q66 | How much do you think people harm themselves when they use chewing tobacco, snuff, dip, or snus, some days but not every day? | Predictor Variable |
| Q67 | Do you believe that chewing tobacco, snuff, dip, or snus is (LESS ADDICTIVE, EQUALLY ADDICTIVE, or MORE ADDICTIVE) than cigarettes? | Predictor Variable |
| Q68 | How much do you think people harm themselves when they use e-cigarettes some days but not every day? | Predictor Variable |
| Q69 | Do you believe that e-cigarettes are (LESS ADDICTIVE, EQUALLY ADDICTIVE, or MORE ADDICTIVE) than cigarettes? | Predictor Variable |
| Q70 | How much do you think people harm themselves when they smoke tobacco in a hookah or waterpipe some days but not every day? | Predictor Variable |
| Q71 | Do you believe that smoking tobacco in a hookah or waterpipe is (LESS ADDICTIVE, EQUALLY ADDICTIVE, or MORE ADDICTIVE) than cigarettes? | Predictor Variable |
| Q72 | How strongly do you agree with the statement ‘All tobacco products are dangerous’? | Predictor Variable |
| Q73 | Not including the vapor from e-cigarettes, do you think that breathing smoke from other people’s cigarettes or other tobacco products causes | Predictor Variable |
| Q74 | When you are using the Internet, how often do you see ads or promotions for cigarettes or other tobacco products? | Predictor Variable |
| Q75 | When you read newspapers or magazines, how often do you see ads or promotions for cigarettes or other tobacco products? | Predictor Variable |
| Q76 | When you go to a convenience store, supermarket, or gas station, how often do you see ads or promotions for cigarettes or other tobacco products? | Predictor Variable |
| Q77 | When you watch TV or go to the movies, how often do you see ads or promotions for cigarettes or other tobacco products? | Predictor Variable |
| Q78 | When you are using the Internet, how often do you see ads or promotions for e-cigarettes? | Predictor Variable |
| Q79 | When you read newspapers or magazines, how often do you see ads or promotions for e-cigarettes? | Predictor Variable |
| Q80 | When you go to a convenience store, supermarket, or gas station, how often do you see ads or promotions for e-cigarettes? | Predictor Variable |
| Q81 | When you watch TV, how often do you see ads or promotions for e-cigarettes? | Predictor Variable |
| Q82 | During the past 7 days, on how many days did someone smoke tobacco products in your home while you were there? | Predictor Variable |
| Q83 | During the past 7 days, on how many days did you ride in a vehicle when someone was smoking a tobacco product? | Predictor Variable |
| Q84 | During the past 30 days, on how many days did you breathe the smoke from someone who was smoking tobacco products in an indoor or outdoor public place? | Predictor Variable |
| Q85 | During the past 30 days, on how many days did you breathe the vapor from someone who was using an e-cigarette in an indoor or outdoor public place? | Predictor Variable |
| Q86 | Does anyone who lives with you now…? | Predictor Variable |
| Q87 | Do you speak a language other than English at home? | Predictor Variable |
| Q88 | Because of a physical, mental, or emotional condition, do you have serious difficulty concentrating, remembering, or making decisions? | Predictor Variable |
Table 8: Questions Used for the Data Analysis

A.2

This is a glimpse of the website. After the user answers all 47 questions and clicks submit, the bubble fills up and shows the probability that the user will smoke in the future.

Figure 4: Website Example

References
