Stroke, an acute cerebrovascular disease, is caused by damage to brain tissue resulting from an abnormal blood supply to the brain, such as cerebrovascular blockage. It includes hemorrhagic stroke and ischemic stroke. According to the Global Burden of Disease, Injuries and Risk Factors Study and other research [liu2007stroke, roth2018global, zhou2019mortality], stroke is the third leading cause of death in the world and the first in China. Recent results from the National Epidemiological Survey of Stroke in China (NESS-China) [wang2017prevalence] show the prevalence of stroke in China during 2012-2013:
| Region | Prevalence (per 100,000) | Incidence (per 100,000) | Mortality (per 100,000) |
| --- | --- | --- | --- |
Investigation into the risk factors of stroke is essential for its prevention. Research shows that risk factors fall into two categories: reversible factors and irreversible factors.
Reversible factors mainly refer to unhealthy lifestyles such as smoking, excessive alcohol consumption, and physical inactivity, while irreversible factors mainly refer to chronic diseases such as hypertension, diabetes, and hyperlipidemia. A number of studies on stroke risk analysis have been conducted for European and American populations [vartiainen2016predicting, lumley2002stroke]. However, they cannot be directly applied to the Chinese population due to racial differences.
In China, stroke-related research mostly focuses on risk prediction models built from pathogenic factors. The most widely used is the 10-year risk prediction model for cardiovascular and cerebrovascular diseases, which estimates the probability of stroke and coronary heart disease incidence. The CHINA-PAR project (Prediction for ASCVD Risk in China) led by Gu Dongfeng's team [yang2016predicting] proposed a revised model that considers not only the 10-year risk but also a lifetime risk assessment. By analyzing stroke incidence data from 32 of the 34 provincial regions of China, Xu et al. [xu2013there] concluded that there is a stroke belt in north and west China.
In recent years, machine learning methods have been applied to stroke prediction. In 2010, a combination of the Support Vector Machine and the Cox Proportional Hazards Model was proposed by Khosla et al. [khosla2010integrated]. Letham et al. [letham2015interpretable] implemented an interpretable method using decision lists with Bayesian analysis to quantify the probability of stroke. Chi-Chun Lee's team [hung2017comparing, hung2019development] compared multiple methods, including Deep Neural Networks, for stroke prediction with Electronic Health Records (EHR), focusing on patients' 3-year and 8-year stroke rates. However, few of these studies modeled the early screening and prevention of stroke.
Evaluating the risk of stroke is important for its prevention and treatment in China. The China National Stroke Prevention Project (CSPP) proposed "8+2" main risk factors for identifying Chinese residents' risk level of stroke [yu2016csdc, li2019using, chao2021stroke]:
Heart disease (including atrial fibrillation and valvular heart disease)
Family history of stroke
History of stroke
History of Transient Ischemic Attack (TIA)
With the above "8+2" main risk factors, the risk level of stroke can be classified as:
High risk: having at least three of factors 1 to 8, or either of factors a and b;
Medium risk: having fewer than three of factors 1 to 8, with at least one being factor 1, 2, or 3;
Low risk: having fewer than three of factors 4 to 8 (and none of factors 1 to 3).
However, the ranking of the risk factors may differ across provinces. Based on census data from both communities and hospitals in Shanxi Province, this paper investigates different stroke risk factors and their ranking. It shows that hypertension, physical inactivity (lack of exercise), and overweight rank as the top three stroke risk factors in Shanxi. The probability of stroke is also estimated through our interpretable machine learning methods. The study provides theoretical support for stroke prevention and control in Shanxi Province.
2 Materials and Methods
2.1 Dataset and Preprocessing
Our data is composed of two survey datasets collected from 2017 to 2020:
Census in hospital: 2000 hospitalized stroke patients in 2018;
Census in community: 27583 residents surveyed from 2017 to 2020. This dataset is categorized and labeled with the CSPP's taxonomy: low risk (11739), medium risk (7630), and high risk (8214).
Each record in both datasets contains 177 features, providing not only information on the "8+2" risk factors but also other patient information:
| Class | Examples |
| --- | --- |
| Demographic information | Sex, Ethnicity, etc. |
| Lifestyle information | Smoking, Alcohol consumption, etc. |
| Medical measurement | Blood pressure, fasting blood glucose, etc. |
| Surgery information | History of surgery (PCI, CABG, CEA, CAS) |
| Chronic disease information | Number of diagnoses, type of treatment |
Data cleansing is a preparatory step in data analysis that removes or corrects corrupt or inaccurate records. The raw data in the above datasets needs to be cleaned: incompleteness, inconsistency, and unstructured formats could otherwise cause feature engineering to fail. In this paper, missing values of a feature are filled with -1 as an abnormal class. If a column has too high a proportion of missing values, we delete it, since such a column cannot provide much information. Inconsistent values are found and corrected with prior medical knowledge; for instance, diastolic blood pressure should be lower than systolic blood pressure.
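The cleaning steps above can be sketched with pandas. The column names, the example data, and the 50% missing-rate threshold are illustrative assumptions, not the study's actual values:

```python
import pandas as pd

def clean(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Sketch of the cleaning pipeline described above (threshold is assumed)."""
    # Delete columns whose proportion of missing values is too high.
    df = df.loc[:, df.isna().mean() <= max_missing]
    # Remove records that violate prior medical knowledge:
    # diastolic pressure must be lower than systolic pressure.
    if {"diastolic", "systolic"} <= set(df.columns):
        # NaN comparisons evaluate to False, so incomplete rows survive this check.
        df = df[~(df["diastolic"] >= df["systolic"])]
    # Flag the remaining missing values as an "abnormal" class (-1).
    return df.fillna(-1)

demo = pd.DataFrame({
    "systolic": [120, 150, None],
    "diastolic": [80, 160, 70],
    "mostly_missing": [None, None, 1.0],
})
cleaned = clean(demo)
```

Here the `mostly_missing` column is dropped, the physiologically inconsistent second row is removed, and the remaining missing systolic value is recoded as -1.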
After data cleansing, a total of 23289 records (low: 9718, medium: 6742, high: 5610) with 32 features remain, as shown in Table 2.
| Class | Feature Name | Data Type |
| --- | --- | --- |
| Lifestyle Information | Frequency of Vegetables | Categorical |
| Lifestyle Information | Frequency of Fruits | Categorical |
| Lifestyle Information | Meat and Vegetables | Categorical |
| Lifestyle Information | Medical Payment Method | Categorical |
| Demographic Information | Marital Status | Categorical |
| Demographic Information | Education Level | Categorical |
| Medical Measurement | Systolic Blood Pressure | Numerical |
| Medical Measurement | Diastolic Blood Pressure | Numerical |
| "8+2" Factor and Lifestyle Information | Smoking | Categorical |
| "8+2" Factor and Lifestyle Information | Physical Inactivity | Categorical |
| "8+2" Factor and Medical Information | Heart Disease | Categorical |
| "8+2" Factor and Medical Information | Hypertension | Categorical |
| "8+2" Factor and Medical Information | Hyperlipidemia | Categorical |
| "8+2" Factor and Medical Information | History of Stroke | Categorical |
| "8+2" Factor and Medical Information | Diabetes Mellitus | Categorical |
| "8+2" Factor and Medical Information | Family History of Stroke | Categorical |
| "8+2" Factor and Medical Information | History of Transient Ischemic Attack | Categorical |
Decision Tree is a classic non-parametric machine learning algorithm. A tree is built by learning decision rules inferred from the data features. Starting from the root node, data are split into internal nodes according to cutoff values on features, finally arriving at terminal leaf nodes which give the classification result. ID3 [quinlan1986induction] and CART [breiman1984classification] are classic decision-tree algorithms that employ Information Gain and Gini Impurity, respectively, from entropy theory [pal1991entropy] as measures for choosing the best splitting rules.
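A minimal sketch of such a tree classifier with scikit-learn, on synthetic stand-in data rather than the study's survey records; the `criterion` argument switches between entropy-based information gain (as in ID3) and Gini impurity (as in CART):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey data (not the actual dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# criterion="entropy" uses information gain (ID3-style);
# criterion="gini" uses Gini impurity (CART-style).
tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=0)
tree.fit(X, y)
print("depth:", tree.get_depth(), "train accuracy:", round(tree.score(X, y), 2))
```

Limiting `max_depth` is one common way to keep the learned rules shallow and readable.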
Random Forest is a machine learning algorithm proposed by Leo Breiman [breiman2001random] in 2001. Instead of using one decision tree, which is non-unique and may exhibit high variance, a random forest generates a number of individual decision trees operating as a committee. Bootstrapping is used to train the individual trees in parallel on different sub-datasets and feature subsets drawn by random sampling with replacement. The final classification decision is aggregated by voting or averaging. With this wisdom of the crowd, a random forest can mitigate overfitting and reduce the model bias caused by data imbalance, and thus shows good generalization.
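The ensemble described above can be sketched as follows (synthetic data, illustrative hyperparameters). The out-of-bag score uses the samples left out of each bootstrap draw as a built-in validation set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=0)

# Each tree is trained on a bootstrap sample (random sampling with
# replacement) and a random subset of features; the final prediction
# is aggregated by majority vote across the committee of trees.
forest = RandomForestClassifier(n_estimators=200, bootstrap=True,
                                oob_score=True, random_state=0)
forest.fit(X, y)
print("out-of-bag accuracy:", round(forest.oob_score_, 2))
```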
The logistic model is a generalized linear model widely used in data mining. It assumes that the dependent variable $y_i \in \{0, 1\}$ represents a binary outcome for sample $i$, with feature vector $x_i = (x_{i1}, \dots, x_{in})$, where $n$ is the number of features, and $\beta_j$ is the coefficient of feature $j$ [peduzzi1996simulation]. The coefficients in logistic regression are log odds:

$$\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n$$

They are used in the logistic regression equation to predict the dependent variable from the independent variables: letting $z = \beta_0 + \sum_{j=1}^{n} \beta_j x_j$,

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-z}}$$
In practice, logistic regression is used in many areas, such as advertising and disease diagnosis, as it can estimate the probability of a user buying a certain product or of a patient suffering from a certain disease. In our case, we use the "8+2" risk factors and residents' lifestyle factors as input to output the probability of stroke incidence, providing a forward-looking prediction.
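The sigmoid relationship between the linear score and the predicted probability can be checked directly with scikit-learn (synthetic data; `decision_function` returns the linear score $z$):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the survey features.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The coefficients are log odds; the sigmoid maps the linear score z
# to a probability in (0, 1).
z = model.decision_function(X[:1])        # z = beta_0 + beta^T x
p_manual = 1.0 / (1.0 + np.exp(-z))       # sigmoid applied by hand
p_api = model.predict_proba(X[:1])[0, 1]  # same probability via the API
```

The hand-computed probability and `predict_proba` agree, which is exactly the relation between the log-odds equation and the prediction equation above.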
2.3 Model’s Interpretation
The model's interpretability and explanations are crucial for medical data analysis: a medical diagnosis system must be transparent, understandable, and explainable, so that the doctor and the patient can see how the model makes decisions, which features are important, and how the features affect the model's output [ahmad2018interpretable, molnar2020interpretable]. In this section, we introduce feature importance, permutation importance, and the SHAP value, which help interpret the model.
Feature importance, also called Gini importance or Mean Decrease in Impurity (MDI) [breiman2001random, archer2008empirical, scikit-learn], is the average decrease in node impurity attributable to each variable, weighted by the probability of a sample reaching that node. For a Random Forest with $M$ trees $T_1, \dots, T_M$, the average importance of feature $X_j$ is

$$\mathrm{Imp}(X_j) = \frac{1}{M} \sum_{m=1}^{M} \sum_{\substack{t \in T_m \\ v(s_t) = X_j}} p(t)\, \Delta i(s_t, t)$$

where $\Delta i(s_t, t)$ is the impurity decrease achieved by the split $s_t$ at node $t$, $p(t)$ is the probability of a sample reaching node $t$, and $v(s_t)$ is the variable used in split $s_t$; the inner sum thus runs over all nodes whose split is made on variable $X_j$. For the Decision Tree model, which contains only one tree ($M = 1$), the feature importance reduces to

$$\mathrm{Imp}(X_j) = \sum_{\substack{t \in T \\ v(s_t) = X_j}} p(t)\, \Delta i(s_t, t)$$
Permutation importance [breiman2001random, altmann2010permutation, fisher2019all, scikit-learn] answers how a given feature influences the overall prediction by evaluating the change in the model's accuracy after permuting that feature's values. Let $s$ denote the model accuracy on the full dataset $D$; the permutation importance of feature $j$ is

$$PI_j = s - \frac{1}{K} \sum_{k=1}^{K} s_{k,j}$$

where $K$ is the number of shuffling repetitions for feature $j$ and $s_{k,j}$ is the model accuracy on the modified dataset with feature $j$ shuffled in repetition $k$. The average change in accuracy before and after shuffling thus measures the importance of the feature.
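This computation is available directly in scikit-learn; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# For each feature, shuffle its column n_repeats times and record the
# drop in accuracy; the mean drop is the permutation importance PI_j.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.round(3))
```

Evaluating on a held-out set, as here, avoids overstating the importance of features the model merely memorized.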
These two importance measures show which features matter more; however, they do not tell us whether a feature affects the output positively or negatively.
SHAP (SHapley Additive exPlanations) provides not only the importance of the features but also shows how much each feature contributes, positively or negatively, to the target variable. It is a method for explaining each individual prediction. The idea of the SHAP value comes from the Shapley value in game theory, which tells how to fairly distribute the contributions among the features, i.e., the marginal contribution of each feature [shap1953definition].
The goal of SHAP is to explain the prediction for an instance by computing the contribution of each feature to the prediction model. SHAP is an additive feature attribution linear model; the Shapley value of feature $j$ is

$$\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f(S \cup \{j\}) - f(S) \right]$$

With this method, we calculate how the feature contributes to each coalition of each decision tree model and sum these up to obtain the total contribution to the whole prediction model. In this equation, $F$ is the full feature set, $S$ ranges over all possible subsets not containing feature $j$, $f(S \cup \{j\})$ is the model output (precision, recall, accuracy, etc.) after feature $j$ is added to subset $S$, and $f(S)$ is the model output using subset $S$ alone. Multiplying the occurrence probability of each subset without the feature by the output difference with and without the feature yields the marginal contribution of each feature.
SHAP has three properties: local accuracy, missingness, and consistency [shap2017interpretation]. Local accuracy requires that, when approximating the original model for a specific input $x$, the explanation model at least matches the output of the model for the simplified input $x'$. Missingness means that a feature missing in the sample does not affect the output of the model. Consistency means that when the model changes so that the marginal contribution of a feature increases, the corresponding Shapley value also increases. These three properties make SHAP an accurate and principled way to interpret machine learning models.
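For intuition, the Shapley formula above can be evaluated exactly for a toy three-feature "model" whose outputs per coalition are made-up illustrative numbers (in practice, libraries such as `shap` compute this efficiently for tree ensembles via TreeSHAP rather than by brute force):

```python
from itertools import combinations
from math import factorial

# Toy value function: model output f(S) for each feature coalition S
# (illustrative numbers, not from the study's data).
v = {(): 0.0, (0,): 0.2, (1,): 0.1, (2,): 0.0,
     (0, 1): 0.5, (0, 2): 0.25, (1, 2): 0.15, (0, 1, 2): 0.6}

def shapley(j, n=3):
    """Exact Shapley value of feature j: weighted marginal contributions."""
    total = 0.0
    others = [k for k in range(n) if k != j]
    for size in range(n):
        for S in combinations(others, size):
            # |S|! (n - |S| - 1)! / n!  -- probability weight of coalition S
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            S_with_j = tuple(sorted(S + (j,)))
            total += weight * (v[S_with_j] - v[S])
    return total

phi = [shapley(j) for j in range(3)]
print([round(p, 3) for p in phi])
```

By the efficiency property of Shapley values, the three contributions sum to $f(F) - f(\emptyset)$, here 0.6.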
3.1 Main Risk Factors Ranking
Due to geographic and cultural differences, the same disease may manifest differently in different regions. We hope to find the most influential factors in Shanxi Province. Table 3 shows each risk factor's exposure rate and risk attribution (RA) based on our data:
| Feature | Exposure rate | Risk attribution |
| --- | --- | --- |
| Family history of stroke | 0.0999 | 1.590 |
| History of stroke | 0.0483 | NA |
| History of TIA | 0.0023 | NA |
To assess the ranking of the main risk factors, in the first experiment we used dataset 2 with the "8+2" factors as features and applied the Decision Tree model. Table 4 shows the classification results:
Figure 1 shows the feature importance and permutation importance based on the Decision Tree model, and hence the ranking of the main risk factors: both evaluation methods confirm hypertension, physical inactivity, and hyperlipidemia as the top three informative features in the Decision Tree model.
3.2 Lifestyle and Medical Measurement Ranking
For the second experiment, we identify more risk factors for Shanxi Province besides the "8+2" risk factors by using dataset 2 with features such as lifestyle habits and medical measurements. Table 5 shows the classification results, and Figure 4 shows the feature and permutation importance:
The results in Figure 4 confirm that systolic blood pressure, diastolic blood pressure, physical inactivity, BMI, smoking, FBG, TG, HDL, family history of stroke, and weight are the top ten factors when we consider only lifestyle habits, demographic information, and medical measurements. Medically, these factors are strongly associated with chronic diseases [levy2009genome, decode2001glucose, wu2007cut].
To detail how each feature contributes to each individual prediction, we calculate the SHAP values in the Random Forest model and use a summary plot to show their importance. The ordered mean sample SHAP value for each feature is shown in Figure 5. It shows the distribution of each factor's contribution to the risk of stroke. The color represents the feature value (red represents high, blue represents low). The greater the difference between the distributions at high and low feature values, the better the feature separates patients with different risk levels. Figure 5 shows that diastolic blood pressure, physical inactivity, systolic blood pressure, BMI, smoking, FBG, and TG are positively correlated with the risk of stroke, and HDL is negatively correlated with it.
3.3 Quantitative Prediction of Stroke’s Incidence
For the third experiment, a logistic model is established to quantify the probability of stroke incidence. To achieve this, we combine datasets 1 and 2 and relabel the data: the original low-risk and medium-risk records become class 0, and the high-risk and stroke records become class 1. The features contain lifestyle information, demographic information, and the "8+2" factors.
Logistic regression is feature-sensitive, so feature selection is performed before modeling. To address multicollinearity [alin2010multicollinearity], highly correlated features are removed first; for example, we keep BMI and remove height and weight. In addition, a Variance Threshold [guyon2003introduction] is used to remove low-variance features. This simple feature selection method deletes all features whose variance does not meet a given threshold; for example, most respondents in our survey are Han Chinese, so we remove ethnicity.
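The variance-threshold step can be sketched as follows; the toy matrix, its column meanings, and the 0.01 cutoff are illustrative assumptions, not the study's actual values:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: the last column is nearly constant (analogous to
# ethnicity in a predominantly Han Chinese sample) and should be removed.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(size=100),          # continuous measurement, varies
    rng.integers(0, 2, size=100),  # binary factor, varies
    np.r_[np.zeros(99), 1.0],      # near-constant column
])

# Drop features whose variance does not exceed the (assumed) threshold.
selector = VarianceThreshold(threshold=0.01)
X_sel = selector.fit_transform(X)
print(X.shape, "->", X_sel.shape)
```

Only the two varying columns survive; the near-constant one carries almost no information for the classifier and is dropped.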
The logistic model results are shown in Table 6, including the features' coefficients, standard errors, and confidence intervals. According to the coefficients, history of stroke, physical inactivity, hypertension, hyperlipidemia, smoking, diabetes mellitus, BMI, family history of stroke, and heart disease are positively correlated with stroke incidence; education level, frequency of vegetables, and occupation are negatively correlated with it.
| y = 1 | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| History of stroke | 2.6779 | 0.247 | 10.853 | 0.000 | 2.194 | 3.162 |
| Family history of stroke | 0.7447 | 0.024 | 30.522 | 0.000 | 0.697 | 0.793 |
| Frequency of fruit | 0.1507 | 0.026 | 5.817 | 0.000 | 0.100 | 0.202 |
| Frequency of vegetables | -0.2262 | 0.025 | -9.118 | 0.000 | -0.275 | -0.178 |
The trained model is implemented on the testing set to estimate the probability of stroke incidence for each category. The results are in Table 7:
| Risk Level | Average Probability of Stroke |
| --- | --- |
Compared with the qualitative ranking method, we quantify the risk factors of stroke and convert the scoring grades into probabilities, making stroke risk prediction more intuitive. Moreover, our logistic model uses the current actual circumstances to predict incidence promptly, making it more time-sensitive.
4.1 The Risk Factors in Shanxi Province
Based on the TreeSHAP values and the feature and permutation importance of lifestyle and medical measurements, we have found the most important factors for stroke:
Hypertension (diastolic and systolic blood pressure)
Hyperlipidemia (mostly according to HDL and TC)
Diabetes Mellitus (according to FBG)
The TreeSHAP dependence plot is applied to compare the contributions of two features. Figure 7 shows that diastolic blood pressure (threshold 90 mmHg) is more suitable for diagnosing a patient's risk of stroke than systolic blood pressure (threshold 140 mmHg). Similarly, from the comparison between HDL and LDL (see Figure 7), we find that high-density lipoprotein is better at identifying non-stroke patients among those with low HDL levels.
4.2 Feature Validity
Missing data due to technical errors (such as typos and equipment faults) is a common problem in census analysis. To find out how such erroneous data in the datasets might influence the final results, we conducted an experiment on missing data in individual features. The Random Forest classifier was adopted to predict stroke risk with different proportions of a single feature missing, looped 100 times at random locations. Moreover, to prevent the precision score from remaining unchanged because of strongly correlated features, some specific feature pairs were cleaned up. The result is shown in Figure 8:
In Figure 8, the curve for each feature shows how the average weighted precision score changes with an increasing proportion of that feature missing, and the shadows are the 95% confidence areas over the 100 runs for each feature. Based on the result, diastolic blood pressure, physical inactivity, BMI, smoking, alcohol, HDL, and FBG are, in that order, important factors when identifying the risk of stroke, while the other factors do not influence the whole model. Interestingly, HDL appears to be a strongly influential factor even when only a small proportion of it is missing, compared to most of the other influential factors.
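The masking procedure can be sketched as follows; the synthetic data, the choice of feature 0, and the tested proportions are illustrative, and missing values are flagged as -1 to match the cleaning convention above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

def precision_with_missing(feature, proportion, fill=-1.0, seed=0):
    """Mask a random fraction of one feature in the test set and re-score."""
    rng = np.random.default_rng(seed)
    X_mod = X_te.copy()
    mask = rng.random(len(X_mod)) < proportion
    X_mod[mask, feature] = fill  # missing values flagged as -1, as in cleaning
    return precision_score(y_te, model.predict(X_mod), average="weighted")

# One curve in the style of Figure 8: one feature, increasing missingness.
scores = [precision_with_missing(0, p) for p in (0.0, 0.25, 0.5, 1.0)]
```

Averaging such scores over many random seeds, as the experiment does over 100 runs, yields the curves and confidence bands described above.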
Moreover, a Recursive Feature Elimination (RFE) process was also conducted to evaluate the specific number of factors needed for analyzing patients' stroke risk level. The RFE procedure is as follows. First, the estimator is trained on the initial feature set, and the importance of each feature is obtained from a specific or callable attribute. Then, the least important features are removed from the current feature set. This process is repeated recursively on the pruned set until the desired number of features is reached. Based on Figure 9, we found that approximately 7 features allow the Random Forest model to reach a stable precision for the different risk levels. Therefore, the validity of those features is confirmed.
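The recursive procedure above maps directly onto scikit-learn's `RFE`; the synthetic data and the target of 7 features (taken from the finding above) are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the survey data.
X, y = make_classification(n_samples=300, n_features=20, n_informative=7,
                           random_state=0)

# Recursively drop the least important feature (ranked by the forest's
# Gini importance) one at a time, until 7 features remain.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=7, step=1)
rfe.fit(X, y)
print("selected features:", rfe.support_.sum())
```

`rfe.support_` is a boolean mask over the original features; `rfe.ranking_` additionally records the order in which the discarded features were eliminated.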