Obstructive sleep apnea (OSA), a form of sleep-disordered breathing characterized by recurrent episodes of partial or complete airway obstruction during sleep, is a serious health problem, affecting an estimated 1-5% of elementary school-aged children OSA1; OSA2
. Even mild forms of untreated pediatric OSA may cause high blood pressure, behavioral challenges, or impeded growth. Compared to adults, the symptoms of childhood-onset OSA are more varied and change continuously with development, making diagnosis a difficult challenge. The complexity of the data from surveys, biomedical measurements, 3D facial photos, and time-series data calls for state of the art techniques from mathematics and data science.
Clinical data, including that considered in confirming or ruling out a diagnosis of pediatric OSA, consist of high-dimensional multi-mode data with mixtures of variables of disparate types (e.g., nominal and categorical data of different scales, interval data, time-to-event and longitudinal outcomes) also called mixed or non-commensurate data. These data obtained from multiple sources are commonplace in modern statistical applications in medicine and health, with thousands, even millions, of features recorded simultaneously from each object or individual.
In this paper, we analyze symptom data provided by patients and clinicians, while in related work, we analyze polysomnography data (physiological time series data). These are case studies in a larger project in building an algorithm to aid clinicians in their treatment decisions for pediatric OSA patients. To overcome the difficulties in analyzing high-dimensional multi-mode data from multiple sources, we propose to adopt a hybrid approach that interactively combines statistics, computational topology and deep learning to take advantage of their strengths and mitigate their weaknesses. Statistics provides a suite of tools for model specification and identification, including model estimation and inference, which oftentimes entail a high computational cost. Computational topology via persistent homology aims to detect ‘true’ signals in high-dimensional data with respect to a varying model parameter (Edelsbrunner2002Topological, Zomorodian2005ComputingHomology), which contrasts with the conventional statistical approach of estimating one or more model parameters that, at best, yield signals; however, the lack of a coherent approach to statistical inference is a serious drawback. Deep learning, on the one hand, has been successfully used in speech recognition 8683453
and owes part of its practical appeal to its computational efficiency; on the other hand, it can be difficult to intuitively justify why deep neural networks work. Integrative ensemble methods would thus ideally blend statistical theory, topology and deep learning in a seamless fashion, all three working in concert, as in a musical ensemble, fully exploiting the amount of available information across multiple sources.
Here, we illustrate the first step toward to building a hybrid approach by comparing several methods in statistics, computational topology, and machine learning. In Section 2, we describe data collected from pediatric obstructive sleep apnea (OSA) study Pro00057638, at the University of Alberta. Survey and craniofacial data are easier and cheaper to collect than polysomnography (PSG) data, so for maximum clinical impact we focus this article on analyses of those data sets. For the analyses of PSG time series data, we refer the reader to WiSDM_PSG_2019.
The rest of the paper is organized as follows: we outline initial findings of our data explorations in Section 3, which provides a basis for the methods we applied to our data, described in Section 4. We compare methods in statistics and machine learning for classifying OSA patients using survey data and craniofacial data and discuss results in Section 5. We complete our article with conclusion and future research steps in Section 6.
2 Pediatric Obstructive Sleep Apnea and Data
As in adults, OSA in children is associated with cardiovascular dysfunction, neurocognitive dysfunction, behavioral issues, and metabolic consequences. OSA is also believed to negatively influence school performance and learning potential in children. The gold standard for diagnosis of pediatric OSA is polysomnography (PSG) PSG. However, in many countries, access to PSG is severely limited and many children do not have an appropriate diagnosis before treatment. Consequently, children with OSA may not be treated or some children without OSA may undergo unnecessary surgery. A simple and accessible way to identify children with OSA is needed. Finding insights within inexpensive data is a crucial first step.
The patients at risk of OSA underwent PSG, filled out questionnaires, and had 3D photos taken, which were assessed by orthodontists for craniofacial index. Normative patients (who were considered not at risk of OSA) did not undergo PSG, but filled out questionnaires and were assessed by orthodontists in the same way as patients at risk of OSA. Our analysis of PSG data can be found in WiSDM_PSG_2019.
Broadly speaking, data types can be distinguished as structured and unstructured. Examples of the former include metabolite concentrations, medical records, and survey questionnaires, while more complex high-dimensional data such as digital images (e.g., photos, CT scans, MRI), text, time series, audio, and DNA sequences are examples of the latter. Methodologies for analysis differ based on whether data are structured or unstructured. In following sections we demonstrate methods in statistics, computational topology and deep learning to classify or cluster OSA patients using survey questionnaires and craniofacial data (both structured). Our current research aims to build a foundation which will be useful for combining analytic methods for both structured and unstructured data in predicting severity of OSA.
2.1 Survey Data
Once a clinician suspects that OSA may cause troublesome symptoms in a child, OSA-specific surveys can be administered to the affected parents and child. The questionnaires analyzed here encompass the Child’s Sleep Habits questionnaire, the OSA-18 Quality of Life survey, a Health Screening Questionnaire, the Pediatric Sleep Questionnaire, and the PedsQL Pediatric Quality of Life Inventories for child and parent. In the case of pediatric surveys, many children are not old enough to read or respond to such a survey, leaving parents to report observations about symptoms as best as they can. Missing data may result from survey-takers not knowing an answer to a question or feeling uncomfortable answering a question truthfully. However, even with their shortcomings, surveys are far easier and less costly to obtain and analyze than a PSG exam. Usually, PSG exams are not covered by insurance, resulting in out of pocket costs; they require a separate appointment and overnight evaluation; and the results can take anywhere between six months and two years to come back. Conversely, surveys can be completed at the clinic during a visit, are of no additional costs to patients or their families, and can be evaluated immediately. The primary aim of employing statistical, topological, and deep learning techniques in analysis of survey data is to identify features of the data that would enable more accurate analysis of OSA in children. In addition to the “subjective” patient- and parent-provided data about symptoms and quality of life, we include dentist-gathered data about craniofacial characteristics of children where noted below.
2.2 Craniofacial Data
Of particular interest to clinicians is craniofacial data, which is a series of measurements taken to capture the shape of the face and mouth. There are two reasons for a potential preference for craniofacial data instead of survey or PSG data. First, craniofacial data is inexpensive and takes minutes to measure, and therefore takes up far fewer resources than those needed for a PSG. The accessibility of this data is demonstrated in our data set, which is complete for all craniofacial measurements. Second, craniofacial data consist of quantitative measurements, which reduces some bias that may come up in qualitative survey questions.
There are nine craniofacial measurements we consider in our analysis. Figure 9 in the appendix depicts the first eight features, while the ninth, the Craniofacial Index, is a sum of these first eight measurements. The measurements are defined as follows CFindex2014:
Profile is a measurement for the angle of the shape made by the line from the brow to the base of the nose and a line from the base of the nose to the chin when viewing the patient from the side.
Midface Deficiency quantifies the projection of the malar area below the eyes (the bones which form the eye socket and cheekbones) relative to the rest of the face.
Lower Face Height is the proportion of the length from the brow to the base of the nose to the length from the base of the nose to the bottom of the chin.
Lip Strain scores the amount of effort a patient uses to close their lips, measured by observing the muscle contractions from the front view.
Palate scores the depth of the palate (the roof of the mouth) and the arch of the palate.
Overjet is the horizontal distance between the upper incisors and the lower incisors when the patient bites down. Any measurement above 5 mm is considered severe.
Overbite is the length of vertical overlap between the upper and lower incisors.
Posterior Bite is the transverse relationship between the molars and premolars, assessed by observing the relationship between upper posterior teeth and lower posterior teeth on both sides of the mouth.
Craniofacial Index is the summation of the previous eight scores and gives a summary statistic of the craniofacial data.
Measurements 1-8 are scored on a scale from 0 to 2, where 0 indicates a normal measurement and 2 indicates a severely abnormal measurement. As a result, the Craniofacial Index can range from 0 to 16. To analyze the craniofacial data, we first looked at the overall distributions of the complete data set (187 subjects, with 76 controls and 111 patients), visualized as histograms (see Figures 10 and 11) to spot any glaring differences in distributions and calculated the Earth Mover’s Distance between each distribution to quantify those differences (see Table 9). We then conducted our analysis as described in Sections 3 and 5.
2.3 Cleaning Data
Our data consist of survey responses and craniofacial data from 200 subjects from two different clinics, with 172 observed variables. After removing subjects which did not have an OSA classification listed, we had 187 subjects remaining. At the early stage of recruiting normative children, some subjects didn’t fill out questionnaires. The reason for this was a miscommunication between the principal investigator and research collaborators. Other subjects were too young to be able to fill out any child surveys and thus had too much missing data. As a result, we first removed any subjects with no OSA classification or with more than 50% of the data variables missing. This brought our total subjects down to 173 (67 controls, 106 patients). We excluded text responses from this analysis, such as lists of medications or descriptions of pain. Yes-no questions were encoded in binary. All clock times for bed time and waking time were removed, and we encoded total sleep time in one numeric column. Finally, we removed gender, height, weight, and body mass index (BMI) due to the number of missing values. After these steps, we were left with 157 input variables.
Many of the questions in the surveys were on a Likert scale. For example, patients were asked to rank on a scale of 1-7 how much they agree with certain statements such as “I have a hard time getting out of bed," or “I feel sleepy during the day." To aid the sorting algorithms, we standardized encoding so that higher numbers indicate the presence OSA symptoms and low numbers indicated absence of OSA symptoms.
For our learning methods, imputed all missing values using theMissForest command from the MissingPy package for Python missForest
; these numbers were rounded to the nearest integer to be consistent with the raw data. We split the data such that 70% was contained in the training set and 30% was in the test set. To demonstrate stability, we applied each learning algorithm to 10 different training/test splits with the same 70/30 ratio. We used the same test and train data sets for each supervised learning algorithm unless otherwise stated. The algorithms were run on three subsets of data: survey questions only, craniofacial measurements only, and combined survey and craniofacial data.
3 Data Exploration
In this section we present superlevel sets of correlation networks CorrNet2020 and visualizations from the Mapper algorithm from topological data analysis KeplerMapper2019, which we compare to visualizations from singular value decomposition trefethen1997numerical. These explorations of the data illustrate the heterogeneity of symptoms experienced by children and shed light on the limitations of the classification algorithms explored later.
3.1 Correlation Networks
As an initial exploration of the data, we plotted super-level sets of correlation networks for questionnaire responses of the patient and control groups. Networks allow for visualizing the relationships between input variables and for comparison of those relationships across data sets. Correlation network analyses have been successfully applied in computational biology corrNetwork16, neuroscience BrainCorrNetwork, and finance FinanceNet2020. We use the most basic correlation network as outlined below. By employing correlation networks, we hope to observe which input variables interact differently in the patient group versus the control group and to use this information to inform our more rigorous analyses. For this particular method, we exclude the children’s Pediatric Quality of Life survey, as those values likely correlate strongly to the answers of the parents’ Quality of Life survey.
We plot one network representing the patient data and one network representing the control data. Each node or vertex in the network is a symptom/survey response. For each pair of survey questions and , we calculate the Pearson correlation between questionnaire responses across all respondents. If the correlation between survey responses and is greater than the threshold , then we place an edge between and . If we consider the threshold parameter , the networks obtained by varying form a filtration of the simplicial complex given by the complete graph on all nodes. The topology of these super-level sets gives information about what symptoms are more co-incident in both pediatric OSA and control patients. In Figure 1(a), the threshold is , and in Figure 1(b), the threshold is . These values were chosen as they were the values for which the graphs were sparse enough to make visual observations but also still be able to see some edges. No two variables had a correlation above 0.8, which we attribute to the size of the data set and the variability of symptom expression in OSA patients.
To make the plots easier to examine, we only show nodes with a degree of one or more; that is, no isolated points were plotted, so any variables not shown can be assumed to not meet the correlation threshold with any other variable.
In the network graphs with threshold , there are some symptom correlations which seem obvious. For example, a patient having seen an orthodontist correlates highly with a patient having received orthodontic treatment, as one is a prerequisite for the other. However, Figure 1
(a), we see that although the patient and control graphs seem to have the same number of nodes, the control group seems to have fewer connected components and more nodes per connected component. We hypothesize that the variables measured in the control group are perhaps more regularly distributed than in the patient group, that is, some subsets of variables the control group have a more concentrated joint distribution than the same subsets of variables in the patient group. Our hypothesis is further supported by Figure1(b), where the connectivity appears stronger in the control group than in the patient group. In particular, we notice that five of the craniofacial variables are in a connected component in the control group, whereas all craniofacial variables are isolated nodes in the patient group. Given the value of craniofacial measures as discussed in Section 2.2, we note that the subset of craniofacial measures and their relationship to OSA diagnoses is worth exploring separately from the survey data.
3.2 Mapper Algorithm
The Mapper algorithm from topological data analysis (Singh2007TopologicalMF, implemented as KeplerMapper2019), is a tool for abstracting high dimensional data so that one can recover the underlying topological structure (the topological nerve of the data). We here refer to the algorithm as the K-mapper algorithm to indicate both the method and the implementation. Whereas the other methods give classifications (a subject does have OSA or does not have OSA), K-mapper reveals the shape of data via a simplicial complex representation. Given the high dimensionality of the data and the suite of topological and geometric classification methods available, running the K-Mapper algorithm on our data may indicate whether such geometric and topological methods are worth implementing, or whether we might gain information from those techniques that we would not gain from traditional statistical tools. This methodology has gained some traction elsewhere in exploring medical data (diabetespaper, adhdpaper).
The mapper algorithm works as follows: first project the data set into space based on the feature coordinates. Next, using a lens function and an unsupervised clustering method set by the analyst, cover the projection with overlapping hypercubes and then cluster points within the intervals defined by the hypercubes; these clusters become nodes of the graph. Since a single point can appear in multiple nodes, draw an edge between two nodes if there is more than 60% overlap.
We chose to run the K-mapper algorithm with five intervals (hypercubes) and Euclidean distance as our lens function. We also used unsupervised -means clustering with for clustering and colored nodes by the percentage of patients with OSA. See Figure 2
(a). Given there were three clusters in five hypercubes, meaning the structure could have up to fifteen nodes and over a hundred edges, the nine nodes and eight edges of the resulting simplicial complex was surprising. We note that there are two connected components, one with a node which is less than 40% OSA and two nodes with more than 80% OSA, and one with the six other nodes, each of which have at least 70% OSA. Additionally, the homological structure is simple, with only one generator for the first homology group evident from the one loop seen in the simplicial complex. Overall, the simplicial complex raises questions about the structure of the data. Is there a manifold or subset of space where data points are more than 70% likely to be at risk for OSA? Why is the only overlap for the lowest risk cluster only above 60% with a cluster for which more than 90% of the subjects are at risk for OSA? Such questions cannot be answered by traditional statistical classification methods; we must invoke more topological based methods such as manifold learning, singular value decomposition, and density based clustering for answers.
The survey data (Figure 2(b)) produced the exact same simplicial complex as the combined data did; this may be a signal of the dominance of the survey data in the combined set (survey data had 148 features, as opposed to craniofacial data’s 8 features) or a feature of data scaling (we did not scale the data, and used Euclidean distance). This indicates that survey data do and should contribute to diagnosis decisions.
In the simplicial complexes derived from craniofacial data, there was a stark difference in the structure from the simplicial complexes derived from the combined data and the survey data alone. (See Figure 2(c), which contains the simplicial complex derived from the craniofacial subset of the combined cleaned data, and 2(d), which contains the simplicial complex derived from the complete set of craniofacial data.) First, the two craniofacial graphs have one connected component, as opposed to the two components of the other graphs. Second, both craniofacial networks had 15 nodes compared to the 8 nodes, potentially demonstrating greater heterogeneity in craniofacial characteristics. Finally, the connectivity of the craniofacial networks is far higher than the first two networks, with the first simplicial complex having 29 edges and the second simplicial complex having 24 edges. As an edge in the K-mapper algorithm indicates a 60% overlap or more between clusters, we see more evidence that the underlying manifold for craniofacial data is harder to dissect and separate compared to survey data.
3.3 Singular Value decomposition
Singular value decomposition (SVD) is a commonly used matrix factorization method that decomposes a matrix into the product of three matrices: trefethen1997numerical
. SVD can be used as a dimension reduction technique that allows identification of directions of high variance via the left and right singular vectorsand
. The SVD is related to the well-known Principal Component Analysis (PCA). While the PCA is a linear projection of data via singular vectors, the SVD offers a full matrix factorization. For the matrix of respondents and responses, the SVD can suggest survey questions that are significant in confirming or discarding a diagnosis of obstructive sleep apnea. It also suggests ways in which presentation may be divergent among distinct groups of children. Here we examine survey data only, discarding the craniofacial data for the moment.
To apply SVD, we scale the survey data using scikit-learn’s MinMaxScaler, which scales each feature to the interval . The SVD is sensitive to the scale of variables, and without scaling, the answer to a question like “minutes of night waking” will appear as the feature that exhibits the most variance, because it takes values between 0 and 75 minutes, in contrast with all the 0-1 binary response questions on the surveys.
Visualization of the projection to the first two singular components given by singular value decomposition demonstrates the range of symptoms experienced by participants and illuminates the difficulty that we’ll see in using supervised learning methods like decision trees and support vector machines to create accurate classifiers. The projections to the first two singular components in respondent space is in Figure3(a), with respondents "no OSA" and "OSA", and projection to symptom space is in Figure 3(b). As demonstrated in this projection, control patients and symptomatic patients are not obviously separable, nor do they cluster cleanly. This observation foreshadows the poor results we will see from classification algorithms.
The projection to the first two singular components in symptom space show that just a few questions exhibit the largest variance in responses: the frequency with which a child falls asleep spontaneously while playing alone, riding in the car, eating meals, or watching television. The surprise is that while respondents with OSA tend to exhibit more of these symptoms, these symptoms do not cleanly differentiate between children with and without OSA. Pediatric OSA is a complex diagnosis that cannot be easily reduced to appearance of a subset of symptoms.
It is worth comparing the visualizations from singular value decomposition with those coming from kernel PCA, shown in Figure 6.
4 Statistical Learning Methods
Because of the high dimensionality of our data and the findings in our data exploration, we apply a variety of machine learning techniques. In this section, we discuss each of the methods used and the justification for applying them to our data sets. The performance results of those classification methods are in Section 5. We aim to classify recruited patients into two categories: No OSA and risk of OSA. In clinical practice, there are four categories for OSA - no OSA, mild OSA, moderate OSA, and severe OSA - but to establish initial diagnostic results, we start with presence or absence of OSA. Further discussion about classification of severity, using all four categories, is in Section 6.
4.1 Supervised Learning
We applied various supervised methods to see how accurately one can infer whether a subject has OSA or does not have OSA.
Supervised learning methods applied to the data include Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Logistic Regression (LR), Decision Trees (DT), Random Forests (RF), Neural Networks (NNET), and Support Vector Machines (SVM) and K-nearest Neighbour (KNN). These techniques are well-established and we provide only a cursory overview of results so that contrast with other methods can be established.
Linear Discriminant Analysis (LDA) proposed by Fisher in 1936 is the most basic and traditional classification method. Although LDA may seem too simple to handle large and complex data, it can give insight on how much improvement those more advanced machine learning methods can achieve. We use LDA as a benchmark method. Compared to LDA, Quadratic Discriminant Analysis (QDA) assumes different covariance matrices for different classes, which result in non-linear decision boundaries between classes. Logistic regression (LR) is also very popular method for classification, especially for binary classification. R packages mass Rmass and “glmnet" Rglmnet were used in the calculation.
Decision tree (DT) is a non-parametric method which classifies based on how strongly each individual feature performs in predicting ultimate diagnosis. A major benefit of DT is their interpretability, which allows the user to see which variables are most significant in the classification. The DecisionTreeClassifier from the Scikit-learn library for Python was used scikit-learn. A grid search to find the optimal cost-complexity pruning parameter was performed, using the model selection library from Scikit-learn. The optimized parameters for DT and other machine learning methods can be found in Table 1.
In addition to DT, we used random forests (RF) to select feature importance. Random forests consist of an ensemble of decision trees. At each step of growing the component decision trees, there is a different set of variables randomly selected from the whole set of variables. The purpose of this randomization is to grow many trees, which are not similar to each other. This allows the algorithm to explore the explanatory variable space as widely as possible, which often gives a more robust outcome than decision trees. In this paper, we used the default choice for the number of variables selected for each step, i.e., , the square root of the total number of variables. R package randomForest was used in these calculations RrandomForest.
Neural Networks (NNET) is the basic component of modern deep learning methods. NNET iterates between forward and backward propagation to update the weights and makes a final prediction with the non-linear function (usually, the logit function) of the linear combinations of variables. There are a number of parameters that need to be tuned in order to achieve optimal results. These tuning parameters are the number of hidden layers and the number of hidden units. Usually the tuning parameters are chosen by cross-validation method. R packagennet was used to implement NNET Rnnet.
In Support Vector Machines (SVM), a decision boundary between the two classifications is chosen by solving a convex optimization problem. Here we optimized penalty parameter and scaling parameter for both linear and exponential kernels using a cross validation grid search. SVM has given promising results in some diagnostic contexts, for instance in diagnosing Lyme disease (SVMLyme).
|Method||CF Data||Survey Data||Combined Data|
|NNET||size = 2||size = 2||size = 2|
|decay = 1.0||decay = 1.0||decay = 1.0|
|SVM||kernel - exp.||kernel - linear||kernel - linear|
|C = 1||C=0.001||C=1|
Last, we used -nearest neighbors classification (KNN). We used uniform weights as we do not wish to introduce feature importance in a benchmark measure. We used a cross validation grid search to find the optimal over values . The scikit-learn KNeighborsClassifier toolbox for nearest neighbors was used in these calculations scikit-learn.
4.2 Bayesian Classifiers
Bayesian networks are directed acyclic graph models that represent the joint distribution of a set of random variables. The random variables are represented as nodes and the dependencies between them are represented by the directed edges. There are a few advantages of Bayesian networks (see more details in sucar and uusitalo): they are (1) suitable for small and incomplete data sets, (2) combine known knowledge/priors with data (so one can use an expert’s knowledge if necessary), and (3) take into account subtle relationships between variables, while avoiding a parametric approach that makes strong assumptions on the data structure. The main disadvantage of these networks, however, is the requirement of discretization of continuous variables.
Bayesian classifiers are a particular application of Bayesian networks; they estimate the probability of each class given the predictor variables. We apply two basic classifiers, Naïve Bayes (NB) classifier, Tree augmented Bayesian (TAN) classifier, and a more recent approach, Semi-Hierarchical Bayesian (SHNB), described inshnb. NB assumes that all the attributes are independent given the class variable. This classifier can be effective due to the low number of required parameters and the low computational cost for inference and learning. However, in many applications to real life data, conditional independence may not be valid. TAN constructs a directed tree among the attributes to incorporate dependencies between attributes. That is, in TAN, each attribute depends on both the class and other attributes. We used R packages; bnlearn bnlearn and gRain gRain
to obtain posterior probabilities (inference) and the model (learning) for the NB and TAN classifiers.
As the number of attributes increases, the number of possible structures becomes larger, requiring a large data set to obtain good estimates. To overcome high dimensionality (large number of attributes), the authors in Zhang2004LatentVD introduce the Hierarchical Naive Bayesian (HNB) classier, which integrates latent (‘hidden’) variables. We applied SHNB, a variation of HNB, to our data analysis because the computations are straightforward with two steps: create latent variables using NB, then use TAN with the latent variables and the remaining attributes. To create latent variables we calculated conditional mutual information (CMI), a similarity between two variables given the class. We then obtained an undirected graph connecting all the attributes based on the CMI between every pair of variables. Lastly, we created latent variable from the variables that form maximal clique. For example, among 8 variables of craniofacial data, two maximal cliques were chosen with the threshold similarity measure: overjet & overbite and midface deficiency & lower face height. We created latent variables for overjet & overbite and midface deficiency & lower face height, termed as anterior teeth coupling and maxillomandibular facial proportion, respectively. All four attributes consisted of three levels, see Table 2. Thus, both latent variables consisted of nine categories. We combined the levels “deep bite" and “open bite" in overbite and called it not-normal (similarly the non-normal category was used for both “increase" and “reverse" in overjet). The SHNB structures were formed by the TAN classifier with two latent variables and four remaining attributes that were not involved in cliques.
We modeled NB, TAN and SHNB for two additional data sets: (1) all survey questionnaires (total of 149 variables) and (2) both survey and craniofacial index data (total of 157 variables). Five continuous variables in survey questionnaires were discretized into two groups and cut off at values: age (8 years old), # of hours of sleep (9 hours), how long waking lasts at night (12 min), # of days (last week) participating in physical activity (5 days), and OSA overall (7). For both data sets, the undirected graphs based on CMI between variables showed 3 clusters formed by Pediatric Quality of Life Inventory (PedsQ, 23 variables), PedsQL by children (23 variables) and OSA-18 (19 variables). We created each latent variable (total three) by choosing the most frequent answer for each patient. For example, OSA-18 consisted of 19 variables, so each patient had 19 answers. The most frequent answer of 19 questions was chosen. We now compare three different classifiers, NB, TAN and SHNB modeled for craniofacial index, survey data, and all variables (survey and craniofacial index), the results for which are in Section 5.
4.3 Manifold learning
In addition to the supervised learning methods, Bayesian classifiers, and correlation networks described in the previous sections, we considered manifold learning as an unsupervised learning method for classification.
Manifold learning is an approach that sees data as a sample from some low dimensional manifold even if the points themselves are embedded in a high dimension. From this perspective, clustering can be seen as an estimation of the connected components of the underlying manifold. Manifold learning approaches have been shown to have very accurate clustering results in highly non-linear scenarios vazquez2018; berry; papanikolaou2019spatial. Given the non-linearity of our data embedded in 157 dimensions, we would like to apply manifold learning to cluster the survey respondents into a control and a patient group. However, since there are only 173 useful respondents and the data itself have a mixture of continuous and categorical features, we must be careful how this approach is applied and how the results are interpreted.
For this purpose, we explored the application of manifold learning into this new setting of mixed data types. We first discuss the the challenges that need to be solved and how we have addressed them (Section 4.3). We then discuss four different density-based manifold learning methods that were applied to our data (Section 4.3).
Challenges to be Addressed
As mentioned previously, manifold learning has potential in being useful to the clustering of our mixed data. However, the nature of our data brings some questions on how appropriate such a method is for our mixed data scenario. The first challenge is that the theoretical guarantees of these types of methods usually require a large number of points diffusion; nadler2006diffusion; berry. This is because the accuracy depends on taking limits to a small scale, which for categorical data means that every point will be its own connected component. It is important to mention that although theoretical results do require a large number of points, experimental results have shown that good results are still possible with some methods vazquez2018; papanikolaou2019spatial. We conjecture that this is due to the data being close to a manifold, i.e., a manifold could be a good approximation to the given data. Therefore, we wanted to investigate if this was the case with our data.
Another issue with having a small sample size is the curse of dimensionality in manifold learningberry; vazquez2018
. This means that the higher the dimension of points to be clustered is, the exponentially more data that will be needed for accuracy. One way to get around the inaccuracies caused by the curse of dimensionality is to employ a dimension reduction method. While dimension reduction methods for categorical variables exist, we would also like to consider a different approach. It is common in manifold learning to embed the data, raw or projection, in a graph in order to increase efficiency. This step makes use of meaningful similarity measures to construct a representation of the data in graph form. We hope to get a representative graph construction directly from the categorical points by an appropriate choice of distance metric.
We now proceed to describe the manifold learning approaches used to analyze the survey data.
Density Based Clustering
A well defined way to use manifold learning for clustering purposes is using the density-based approach chaudhuri2014consistent; berry. In this approach, clusters are defined as high density areas. This method seems reasonable for our data given some of the patterns in the distance matrices, and thus density, of our data seen in Figure 4. In this figure, we looked at all the features in the data, and one can see some of the block-like patterns that arise, which correspond to groupings of points. Note that the Kernel Principal Component Analysis (Kernel PCA) projection of the data in Figure 4(a) is for visualization purposes only. We use this projection in this section in order to take into the account the non-linear characteristics of our data, something that PCA and SVD are not able to do. However, we are not able to conclude anything from them since the Kernel PCA used only applies to continuous data. We are able to conjecture that there might be a distinction of densities between the patient class and the control class. That is, the control class seems to be found mainly in the high density region, while the patient class seems to be spread throughout the space.
To cluster the survey respondents in the "patient" and "control" classes, we applied four different approaches: (1) Density-based spatial clustering of applications with noise (DBSCAN) ester1996density
, (2) Spectral clusteringvon2007tutorial, (3) Continuous k-Nearest Neighbor approach (CkNN) berry, and (4) Cut-Cluster-Classify (CCC) vazquez2018
. The tuning of each method’s parameters was done comparing the clustering F-score (harmonic mean of precision and sensitivity), i.e., the optimal parameters were chosen by which gave the highest F-score measure.
DBSCAN is a clustering method that assumes clusters differ in the density within each individual cluster. The method first finds “core points” for each cluster, which are points in the center of the cluster, and then connects close-by points to the clusters. It also labels some points as outliers. Given the fact we only want 2 classes, we placed all outliers in the class that gave the largest F-score. The two main parameters for this method are
min_sampleswhich refer to the minimum sample points to define core points. Given that we are dealing with the same number of points in all the experiments, we found
min_samples=2 to give the best F-score. The other parameter is , which deals with the distance needed to define a neighborhood. This highly depends on the distance matrix, so we studied the distribution of distance values in each distance matrix, which can be seen in Figure 5. This gave us a good range of values to try, and we chose the one that gave the optimal F-score. The optimal values can be seen in Table 3.
The main idea behind spectral clustering is constructing a graph Laplacian matrix and using its eigenvalue decomposition to cluster the data. The classical algorithm starts by constructing a normalized graph Laplacian matrix from similarity measures, calculating a few eigenvectors, and then using a k-means search on these vectors. The only parameter needed is the number of clusters to find, which for all our experiments is 2.
CkNN clustering uses a continuous scale to construct a representative graph of the data. At each scale, it finds a clustering, and then uses persistent homology to choose the best scale. The two parameters it needs are the number of points needed to define a neighborhood and the value to define the sample density :
where denotes the th nearest neighbor of point . The values with the best F-scores are given in Table 4(a).
|(a) CkNN Optimal Parameters||(b) CCC Optimal Parameters|
The Cut-Cluster-Classify method divides the labeling task into three steps. It starts by picking points that pass a density threshold to separate clusters, then clusters the remaining points, and finishes by classifying the remaining points. The two parameters it needs are the number of points to sample and the value to define the sample density defined in Equation 1. The values with the best F-scores can be found in Table 4(b).
As mentioned in the previous section, one of the most important parameters is which distance metric to use. We compared the F-scores of five different metrics for discrete points: Euclidean, Cosine, Correlation, City block (or Manhattan), and 1 minus the absolute value of the Pearson coefficient. The results can be seen in Tables 5-7. Note that the metric giving the best results vary for the different clustering methods. We choose to report the quality metrics for the Manhattan distance in Tables 8,10, and 11 since this distance seems to be consistently good for all the methods. Figure 6 shows some of the labeling results.
Given the hypothesis that the classes differ by sample density, we also estimated the sample density and found a threshold to cluster the data and compare with the other methods. The sample density we used is with a 2-norm and we used Otsu’s method otsu1979threshold to do the thresholding. This method analyzes the distribution of values and finds the modes for a threshold value.
For each method and each data subset, we assess classification success by measuring the accuracy, positive predictive value (PPV), negative predictive value (NPV), sensitivity, and specificity. These measures are standard in clinical research literature. To assess the stability of each method, we used ten different training/test splits of the cleaned data. All measures for each method are reported in the form of mean standard deviation in Tables 8, 10, and 11. (Manifold learning methods had only one measure to report as they do not have training and test splits; see Section 4.3 for more information.) We note that, when we applied QDA to the combined the survey and craniofacial data, some classes are too small to estimate the covariance matrices. Therefore, we did not obtain any results for QDA.
5.1 Survey Data
Of all the classification techniques applied to the survey data, Random forests(RF) and threshold density clustering performed the most consistently. In addition to having one of the higher accuracy scores (mean 0.77 0.04), RF had the best NPV (0.80 0.06) and the best specificity (0.84 0.05). As NPV is the proportion of controls correctly classified and the specificity is the probability a person will test positive for OSA given they do not have the condition, random forests will be a strong source for correctly ruling out OSA as a potential underlying cause of harmful symptoms. Threshold density had high accuracy (0.77457) and had a high sensitivity (0.86792) but lacked the most in specificity (0.62687), which would make it a possible complement to RF.
In PPV and sensitivity, RF fell short compared to some other methods. The highest PPV score, 0.87 0.05 came from -Nearest Neighbors, while the highest sensitivity, 0.88
0.05, came from Naive Bayes (NB). For comparison, the PPV for RF was 0.70.15 and the sensitivity was 0.65 0.10. Not only was RF less consistent in these categories than the other two methods mentioned but it was also less stable, as shown by the higher standard deviations. If one were to go on survey data alone, finding a combination of the random forests, threshold density clustering, -nearest neighbors, and naive Bayes would deliver the best results by all measures. All performance values for techniques as applied to survey data can be found in Table 8.
|Performance Measures of Classification Methods on Survey Data|
|LDA||0.61 0.09||0.49 0.10||0.71 0.12||0.58 0.15||0.64 0.08|
|LR||0.54 0.10||0.43 0.12||0.66 0.13||0.58 0.12||0.52 0.16|
|DT||0.71 0.04||0.76 0.09||0.64 0.15||0.79 0.14||0.56 0.19|
|RF||0.77 0.04||0.70 0.15||0.80 0.06||0.65 0.10||0.84 0.05|
|NNET||0.72 0.04||0.63 0.12||0.79 0.07||0.68 0.12||0.75 0.11|
|SVM||0.76 0.03||0.83 0.04||0.66 0.12||0.77 0.06||0.73 0.09|
|k-NN||0.75 0.08||0.87 0.05||0.62||0.69 0.08||0.83 0.07|
|NB||0.77 0.06||0.78 0.08||0.74 0.11||0.88 0.05||0.58 0.14|
|TAN||0.76 0.06||0.77 0.07||0.74 0.15||0.88 0.09||0.57 0.12|
|SHNB||0.75 0.05||0.78 0.08||0.70 0.13||0.85 0.07||0.60 0.11|
5.2 Craniofacial Data
Results: CF Distributions
In comparing the distributions of craniofacial variables, we found that the three variables with the largest difference between patients and controls were Palate Score, Lower Face Height, and Overjet Score. The other variables may have some difference in distribution shapes (see Table 9 and Figures 10 and 11), but overall do not appear as different when comparing control subjects and OSA subjects. This is not to say that all other craniofacial variables should be ignored; rather, clinicians should pay extra attention to variables Palate Score, Lower Face Height, and Overjet Score when evaluating a patient.
|Frequencies and Earth Mover’s Distance for Craniofacial Data Measures|
|Metric||Group||Score 0||Score 1||Score 2||EMD|
|Lower Face Height||P||0.5766||0.3604||0.0630||0.1507|
Results: Classification with Craniofacial Data
See Table 10 for the results of various classification methods on the craniofacial data set that was a subset of the combined data set.
Overall, unlike in survey data, there was no one dominant method that performed consistently in classification using craniofacial data. In fact, in every measure, a different method achieved the highest score. Support vector machines and SHNB both had a mean accuracy of about 0.72 0.05, Naive Bayes had the best PPV of 0.79 0.06, neural nets and random forests tied for NPV at 0.79 0.09, manifold learning had the best sensitivity of 0.88, and decision trees had the best specificity of 0.73 0.09. Compared to the best performances in these measures in survey data, the means are lower and the standard deviations are larger. In other words, the performance is worse and less stable.
We can conclude that, while the craniofacial data can aid in diagnosis, it cannot be used alone. On the one hand, the poor results of classification in craniofacial data may be a self-fulfilling prophesy: OSA diagnosis in the studies is based on both the questions and craniofacial measurements. Therefore, there exists points with all normal craniofacial measurements but that were still diagnosed with OSA. On the other hand, the existence of OSA patients and controls having the exact same craniofacial measurements indicates craniofacial data by itself cannot be used to distinguish the two groups.
|Performance Measures of Classification Methods on Craniofacial Data|
|LDA||0.63 0.07||0.49 0.22||0.74 0.11||0.59 0.26||0.70 0.18|
|LR||0.66 0.06||0.59 0.18||0.76 0.09||0.63 0.16||0.71 0.16|
|DT||0.70 0.06||0.77 0.09||0.61 0.13||0.76 0.10||0.73 0.09|
|RF||0.66 0.07||0.56 0.15||0.77 0.09||0.67 0.15||0.67 0.12|
|NNET||0.67 0.08||0.58 0.17||0.77 0.09||0.67 0.13||0.69 0.15|
|SVM||0.72 0.05||0.77 0.08||0.64 0.14||0.80 0.09||0.60 0.14|
|k-NN||0.65 0.07||0.76 0.09||0.54 0.19||0.67 0.13||0.64 0.19|
|NB||0.70 0.06||0.79 0.06||0.60 0.16||0.72 0.11||0.69 0.10|
|TAN||0.69 0.07||0.77 0.07||0.59 0.18||0.72 0.12||0.64 0.13|
|SHNB||0.72 0.05||0.79 0.08||0.62 0.16||0.75 0.10||0.68 0.13|
5.3 Combined Survey and Craniofacial Data
We finally analyze the combined survey and craniofacial data to see how the two work together in classification. About 58% of survey respondents whose responses were used in testing or training sets had diagnosed pediatric OSA; this is our no-information rate.
Overall, the performance of each method was similar to each method’s respective performance for survey data. (See Tables 11 and 8 respectively for comparison of combined data and survey data.) Algorithms applied to the combined data generally outperformed algorithms applied only to craniofacial data.
Random forests, as in the survey data, performed the best in the categories of accuracy, NPV, and specificity. Threshold density clustering performed at about the same level as in the survey data, once again having a high accuracy of 0.78035 and a high sensitivity of 0.87736. Naive Bayes also once again delivered the best sensitivity score, and -nearest neighbors again yielded the highest PPV score. Interestingly, -nearest neighbors improved both in average and stability for PPV, up to 0.89 0.02 in the combined data from 0.87 0.05. All the other metrics for performance either stayed about the same or slightly improved. We observe that incorporating cranio-facial data and the survey data together do give the best performance in every metric. While craniofacial data does not perform well on its own, it does contribute to OSA classification when combined with the survey data.
|Performance Measures of Classification Methods on Combined Data|
|LDA||0.65 0.09||0.53 0.12||0.77 0.12||0.70 0.15||0.63 0.09|
|LR||0.52 0.09||0.41 0.11||0.64 0.12||0.56 0.11||0.50 0.15|
|DT||0.68 0.06||0.75 0.07||0.56 0.14||0.72 0.07||0.59 0.16|
|RF||0.78 0.03||0.72 0.10||0.81 0.06||0.67 0.08||0.85 0.04|
|NNET||0.75 0.04||0.67 0.12||0.80 0.08||0.69 0.11||0.80 0.07|
|SVM||0.74 0.03||0.82 0.07||0.65 0.12||0.76 0.09||0.73 0.09|
|k-NN||0.75 0.8||0.89 0.02||0.62 0.15||0.69 0.09||0.85 0.06|
|NB||0.77 0.08||0.77 0.09||0.77 0.12||0.90 0.06||0.57 0.13|
|TAN||0.77 0.05||0.76 0.07||0.77 0.15||0.89 0.08||0.55 0.09|
|SHNB||0.76 0.05||0.78 0.08||0.72 0.14||0.86 0.06||0.60 0.12|
|Cut-Cluster-Classify (CCC)||Threshold sample density|
Tables 8, 10, and 11 show that thresholding the sample density gives higher quality clustering. Furthermore, Figure 6 illustrates that clustering by a threshold on density gives more meaningful labels than the more complex methods. For example, some methods put all points in one of the groups, which explains some of the “0,” “1,” and “nan” values reported in Tables 8, 10, and 11. From these results we can only conclude that more points are needed in order to apply a manifold learning approach, but we can not completely disregard it since there are so few points. However, these results support the hypothesis that points in the control group tend to be in a high density region and the patient class tends to spread out in the data cloud showing low density.
Figures 7 and 8 shows the important variables identified by RF for the combined and Survey data sets respectively. There is only one craniofacial variable identified as important variable, i.e., “palate score". This information is especially valuable given that random forests was one of the top performing classification techniques.
6 Conclusion and Future Research
Our results demonstrate that using inexpensive data can still make strong predictions for a binary classification of not having OSA vs being at risk for OSA. In particular, random forests, threshold-based density clustering, naïve Bayes Bayesian classifiers, and -nearest neighbors performed the best. Of those successful methods, random forests and naïve Bayes are interpretable in that we can see which variables were prioritized most in sorting. This information is valuable for clinicians in obtaining a differential diagnosis.
Using survey data and craniofacial data, we merely attempted to classify whether a subject was diagnosed with OSA or not. However, this is a simplification of the classification of OSA as mild, strong, or severe depending on the apnea hypopnea index. The next natural step in our work is to expand classification techniques to these four possible outcomes (including no diagnosis of OSA) and identify which measures indicate OSA severity.
Another interesting future direction is to repeat the algorithms but instead of combining multiple surveys, isolate each survey and run each classification method separately. In doing so, we may lend evidence to which survey is the most effective in diagnosing OSA. We would do this for both the binary OSA classification and the classification by the apnea hypopnea index.
We hope to use this information to inform an algorithm which can aid clinicians in diagnosis and personalized treatment of a child with OSA. This algorithm would be updated as we track children through treatment and follow up in their progress towards healthy sleeping patterns.