Comparing sleep studies in terms of the Apnea-Hypopnea Index

by   Marcel Młyńczak, et al.

The Apnea-Hypopnea Index (AHI) is one of the most-used parameters from the sleep study that allows assessing both the severity of obstructive sleep apnea and the reliability of new devices and methods. However, in many cases, it is compared with a reference only via a correlation coefficient, or this value is at least the most emphasized. In this paper, we discuss the limitations of such an approach and list several alternative quantitative and qualitative techniques, along with their interpretations. We also propose using the ranking function which enables consideration of various AHI values with different weights. This extends the analysis with a clinical significance assessment, which can be reliable for both adult-related and pediatric sleep studies. The "Shiny" web application, written in R, was developed to enable quick analysis for both physicians and statisticians.


1 Introduction

Obstructive sleep apnea (OSA) is the most severe form of sleep-disordered breathing. It is a common health problem, moderate to severe forms affecting of men and of women, according to the latest epidemiological data [1]. This condition affects all age groups, with a prevalence of in children, with peak prevalence occurring at 2-8 years of age [2, 3, 4, 5, 6].

The gold standard study for diagnosing OSA is polysomnography (PSG), and one of the main parameters is the Apnea-Hypopnea Index (AHI). It allows assessment of the degree of sleep-disordered breathing, particularly considering the number of apneas and hypopneas within a single “average” hour during sleep. It is generally accepted and widely used. Based on the medical guidelines, there are four qualitative subranges of sleep apnea for adults [7]:

  • Normal: AHI < 5,

  • Mild: 5 ≤ AHI < 15,

  • Moderate: 15 ≤ AHI < 30, and

  • Severe: AHI ≥ 30.
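The adult classification can be sketched in a few lines of Python. This is only an illustration: the cut-offs 5, 15 and 30 are the adult hotspots named in Section 2.3, the function name `classify_ahi` is hypothetical, and the choice that a value exactly on a cut-off falls into the more severe class is an assumption.

```python
from bisect import bisect_right

# Adult cut-offs (events per hour); the same values serve as "hotspots" later on.
ADULT_CUTOFFS = (5.0, 15.0, 30.0)
LABELS = ("normal", "mild", "moderate", "severe")


def classify_ahi(ahi, cutoffs=ADULT_CUTOFFS):
    """Return the severity label for a single AHI value.

    Assumption: a value equal to a cut-off belongs to the more severe class.
    """
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    # bisect_right counts how many cut-offs are less than or equal to the value.
    return LABELS[bisect_right(cutoffs, ahi)]


for value in (3.0, 14.0, 16.0, 28.0, 42.0):
    print(value, classify_ahi(value))
```

Passing a different `cutoffs` tuple would adapt the same function to the pediatric thresholds.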

Different thresholds are used for the same classifications in the pediatric population:

  • Normal: ,

  • Mild: ,

  • Moderate: , and

  • Severe: .

PSG is a complex and costly test. Therefore, many abbreviated sleep studies (usually home sleep tests, HST) are being developed, and all of these need to be validated against full polysomnography. As a result, new devices and methods for evaluating obstructive sleep apnea (and sleep-disordered breathing in general) undergo comparative studies in which they are used to measure AHI and are assessed against reference equipment.

However, despite the guidelines, the exact (raw) values are still compared, very often using correlation coefficients, as in a recent meta-analysis of peripheral arterial tonometry diagnostics [8] or an assessment of a portable wireless sleep monitor [9].

There is nothing inherently wrong with analyzing numbers instead of diagnostic subranges; however, it appears that the meaning of the correlation coefficient was overlooked in those cases. In the context of AHI comparison, the correlation coefficient raises several issues:

  • it does not evaluate the clinical significance of the method, i.e., how often subjects are correctly classified (as normal, mild, moderate, or severe);

  • it is susceptible to outliers and influential observations;

  • Pearson’s version should be used only for data with a normal distribution, a condition usually unfulfilled due to a limited range of physiologically relevant values and the spectrum of the invited study group;

  • its interpretation is connected with a linear model - many regression lines, with different slopes and intercepts, may produce high correlation coefficients, even if not clinically relevant (slope of 45 degrees without the intercept); and

  • there is no insight into whether the slope is slightly greater than or (to the same extent) less than 1, which might otherwise drive important conclusions on the reliability of the tested device.
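The points about slope and intercept can be illustrated with a small numeric sketch. The values below are invented for illustration only: a hypothetical device doubles every reference AHI and adds 10 events per hour, yet the Pearson coefficient remains (numerically) equal to 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical reference AHI values (events per hour).
reference = [2.0, 8.0, 12.0, 20.0, 35.0, 50.0]
# Hypothetical device: slope 2 and intercept 10 relative to the reference.
device = [2.0 * r + 10.0 for r in reference]

r_value = pearson_r(reference, device)
mean_abs_error = sum(abs(d - r) for d, r in zip(device, reference)) / len(reference)

print("Pearson r:", r_value)
print("Mean absolute error:", mean_abs_error)
```

The correlation is perfect although every single reading is off by at least 12 events per hour, which is why the coefficient alone cannot establish agreement.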

Therefore, the main aims of this work are:

  • to propose and discuss other approaches to possible AHI comparison, and their mathematical and clinical interpretations, and

  • to present the Shiny web application, which implements selected methods and enables extended analysis of data.

The motivation here is simple - to show how to quantify the performance of methods for automatically estimating the AHI.

2 Materials and Methods

2.1 Quantitative approaches

Parameters of the possible approaches that still work with raw values, but do not (or to a much lesser extent) introduce the aforementioned misinterpretations, are listed below (in arbitrary order):

  • the intercept of the linear model that best fits data points;

  • the slope of another linear model with the intercept value forced to 0;

  • the p-value (or test statistic) from a Wilcoxon signed-rank test for paired data (or from a paired t-test in the case of normally distributed data);

  • the rho value (along with its p-value) from Spearman’s rank correlation test (not assuming a normal distribution);

  • Lin’s Concordance Correlation Coefficient (with its confidence interval limits), measuring how far the data deviate from the line of perfect concordance (line at 45 degrees) [10, 11];

  • the bias correction factor (from Lin’s analysis), that determines how far the best-fit line deviates from a line at 45 degrees (no deviation from the 45 degree line occurs when factor equals 1) [10, 11, 12];

  • mean difference of AHI scores and of AHI score differences on a Bland-Altman plot (means of AHI scores on the X-axis, and differences of AHI scores on the Y-axis) [13];

  • slope and intercept of the linear model that best fits data points on the modified Bland-Altman plot (in which reference values are used instead of the mean AHI scores on the X-axis) - this technique enables one to assess whether the nature of the differences’ distribution depends strictly on the reference value;

  • simple heuristic ratio - the number of data points above the Y=X line divided by the number of points below it; and

  • mean absolute error (MAE).
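Several of the listed quantities can be sketched with the Python standard library alone. The function names are hypothetical, the variances use the 1/n form from Lin's original definition [10], and the bias correction factor is taken as the ratio of the concordance coefficient to Pearson's r; confidence intervals are omitted.

```python
import math


def _moments(x, y):
    """Means, population (1/n) variances and covariance of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, sxx, syy, sxy


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (point estimate only)."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2.0 * sxy / (sxx + syy + (mx - my) ** 2)


def bias_correction_factor(x, y):
    """Ratio of Lin's coefficient to Pearson's r; equals 1 with no bias."""
    _, _, sxx, syy, sxy = _moments(x, y)
    pearson = sxy / math.sqrt(sxx * syy)
    return lins_ccc(x, y) / pearson


def slope_through_origin(reference, device):
    """Least-squares slope of device = slope * reference (intercept forced to 0)."""
    return sum(r * d for r, d in zip(reference, device)) / sum(r * r for r in reference)


def mean_absolute_error(reference, device):
    return sum(abs(d - r) for r, d in zip(reference, device)) / len(reference)


def above_below_ratio(reference, device):
    """Points above the Y=X line divided by points below it (ties ignored)."""
    above = sum(d > r for r, d in zip(reference, device))
    below = sum(d < r for r, d in zip(reference, device))
    return above / below if below else float("inf")


# Hypothetical data: the device reads 10 events per hour too high everywhere.
reference = [2.0, 8.0, 12.0, 20.0, 35.0, 50.0]
device = [r + 10.0 for r in reference]
print("CCC:", lins_ccc(reference, device))
print("Bias correction factor:", bias_correction_factor(reference, device))
print("MAE:", mean_absolute_error(reference, device))
```

With a constant offset, Pearson's r stays at 1 while Lin's coefficient drops below 1, so the bias correction factor carries the whole penalty.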

Another technique, the relative-deviation Bland-Altman plot, may also be used to complete the analysis with regard to the visual distribution of points [14].

Other visual/analytical methods are possible (like finding the distribution of absolute or square differences, Bayesian reasoning, etc.); due to their poorer interpretability from the physician’s perspective, they were not considered further.
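The Bland-Altman quantities from the list above can be sketched as follows. The function names are hypothetical, differences are taken as device minus reference, and the spread is expressed through the conventional limits of agreement at 1.96 standard deviations [13], which is an assumption about the measure of spread.

```python
from statistics import mean, stdev


def bland_altman(reference, device):
    """Mean difference (bias), SD of differences and 1.96*SD limits of agreement."""
    diffs = [d - r for r, d in zip(reference, device)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    return {
        "bias": bias,
        "sd": sd,
        "loa_low": bias - 1.96 * sd,
        "loa_high": bias + 1.96 * sd,
    }


def modified_bland_altman_fit(reference, device):
    """Least-squares line through (reference, device - reference) points.

    A slope far from zero suggests that the error depends on the reference value.
    """
    diffs = [d - r for r, d in zip(reference, device)]
    mean_ref, mean_diff = mean(reference), mean(diffs)
    s_xx = sum((r - mean_ref) ** 2 for r in reference)
    s_xy = sum((r - mean_ref) * (d - mean_diff) for r, d in zip(reference, diffs))
    slope = s_xy / s_xx
    intercept = mean_diff - slope * mean_ref
    return slope, intercept


# Hypothetical data: the device overestimates by 50% of the reference value.
reference = [10.0, 20.0, 30.0, 40.0]
device = [1.5 * r for r in reference]
print(bland_altman(reference, device))
print(modified_bland_altman_fit(reference, device))
```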

2.2 Qualitative approaches

While the techniques listed above are mathematically correct, all of them use the numerical data directly, without supplying the right context (established subranges of AHI and their clinical interpretations). Therefore, we would like to emphasize the importance of the qualitative techniques, which can estimate “clinical significance” along with “statistical significance”. The introductory concept is presented in Fig. 1.

Figure 1: While assessing the clinical significance of the results compared to the reference, one needs to check whether the points are inside the right squares, even if the numerical differences are relatively large.

The main parameters - accuracy, sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), or Cohen’s Kappa - may then be estimated for the specific case. The accuracy is the ratio of the number of correctly classified data points to the total number of data points. Even if there is a relatively big difference, as between 16 and 28, the assignment to a group will be the same, unlike for 14 and 16, which could result in the patient receiving different treatment. The sensitivity is the proportion of actual positives that are correctly identified (true positives divided by the sum of true positives and false negatives); it can be extended to the multi-class case by making the positive state the one being analyzed, and grouping all others into the negative. The extended specificity uses the same construction, as the proportion of actual negatives that are correctly identified.

PPV and NPV are the proportions of true positive and true negative results among all results that the test predicted as positive and negative, respectively. Cohen’s Kappa is a more robust value, which describes the agreement after removing the effect of chance and takes into account possible imbalances in the data; values greater than 0 mean that the method agrees with the reference better than chance would (the maximum is 1, the same as for accuracy).
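A minimal sketch of these qualitative measures, using only the Python standard library, is given below. The function names are hypothetical, the adult cut-offs 5, 15 and 30 are assumed, and the AHI values are invented for illustration (they include the 16 vs. 28 and 14 vs. 16 pairs discussed above).

```python
from bisect import bisect_right

LABELS = ("normal", "mild", "moderate", "severe")
CUTOFFS = (5.0, 15.0, 30.0)  # assumed adult thresholds


def classify(ahi):
    """Map an AHI value to its subrange label."""
    return LABELS[bisect_right(CUTOFFS, ahi)]


def _ratio(numerator, denominator):
    return numerator / denominator if denominator else float("nan")


def accuracy(ref, test):
    return sum(r == t for r, t in zip(ref, test)) / len(ref)


def one_vs_rest(ref, test, positive):
    """Sensitivity, specificity, PPV and NPV with one class treated as positive."""
    pairs = list(zip(ref, test))
    tp = sum(r == positive and t == positive for r, t in pairs)
    fn = sum(r == positive and t != positive for r, t in pairs)
    fp = sum(r != positive and t == positive for r, t in pairs)
    tn = sum(r != positive and t != positive for r, t in pairs)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def cohens_kappa(ref, test):
    """Observed agreement corrected for the agreement expected by chance."""
    ref, test = list(ref), list(test)
    n = len(ref)
    p_observed = accuracy(ref, test)
    p_expected = sum(
        (ref.count(c) / n) * (test.count(c) / n) for c in set(ref) | set(test)
    )
    return _ratio(p_observed - p_expected, 1.0 - p_expected)


reference_ahi = [3.0, 14.0, 16.0, 28.0, 40.0]  # hypothetical PSG values
device_ahi = [4.0, 16.0, 28.0, 16.0, 25.0]     # hypothetical device values
ref_labels = [classify(v) for v in reference_ahi]
dev_labels = [classify(v) for v in device_ahi]

print("accuracy:", accuracy(ref_labels, dev_labels))
print("kappa:", cohens_kappa(ref_labels, dev_labels))
print("moderate vs. rest:", one_vs_rest(ref_labels, dev_labels, "moderate"))
```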

Additionally, multi-class Receiver Operating Characteristic (multi-class ROC) analysis may provide the parameter of area under the curve (AUC), along with pair-wise ROC curves [15, 16].

2.3 Ranking function

As noted above, the difference between 16 and 28 may be less significant from the clinical point of view than that between 14 and 16 (which may cause different treatment to be prescribed). We would therefore like to propose a ranking function that introduces weights by which the original difference is multiplied, increasing its impact around hotspots (for standard AHI analysis, they are 5, 15 and 30) and decreasing its impact in the middles of the subranges.

Of course, the definition of such a ranking function may vary. For visual purposes, we decided to set a value of at hotspots and a value of at range midpoints, with square approximations in-between. The shape of such a function is presented in Fig. 2.

Figure 2: The shape of the proposed ranking function; the X-axis value is the reference AHI and the Y-axis presents the coefficient for multiplying the difference. The shape of the ranking function between hotspots is approximated by a quadratic function.
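One possible shape of such a function is sketched below. The weights `w_min=0.5` and `w_max=2.0` are placeholder values chosen only for this example, and two further assumptions are made: 0 is treated as the lower edge of the first subrange, and above the last hotspot the weight decays quadratically to `w_min` over half the width of the preceding subrange and then stays constant.

```python
HOTSPOTS = (5.0, 15.0, 30.0)  # adult thresholds named in the text


def ranking_weight(ahi, hotspots=HOTSPOTS, w_min=0.5, w_max=2.0):
    """Weight for a reference AHI: w_max at hotspots, w_min at subrange midpoints.

    w_min and w_max are hypothetical placeholders, not values from the paper.
    """
    if ahi < 0:
        raise ValueError("AHI cannot be negative")
    edges = (0.0,) + tuple(hotspots)
    # Bounded subranges: a parabola with its minimum at the midpoint.
    for low, high in zip(edges, edges[1:]):
        if low <= ahi <= high:
            midpoint = (low + high) / 2.0
            half_width = (high - low) / 2.0
            t = (ahi - midpoint) / half_width  # -1 at low edge, +1 at high edge
            return w_min + (w_max - w_min) * t * t
    # Open-ended last subrange (assumption): decay from w_max to w_min, then hold.
    half_width = (edges[-1] - edges[-2]) / 2.0
    t = (ahi - edges[-1]) / half_width
    if t >= 1.0:
        return w_min
    return w_min + (w_max - w_min) * (1.0 - t) ** 2


for value in (5.0, 10.0, 12.5, 15.0, 22.5, 30.0, 60.0):
    print(value, ranking_weight(value))
```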

Further generalization can take into account the case in which each of the points is near a different hotspot, with both still in the same subrange. For clarity and simplicity, we propose to apply the multiplier of , as explained below, for the extended mean absolute error (eMAE) formula (1)


where is the number of participants in the study, is the measured AHI value, is the reference value, is the iterator counting successive participants, is the ranking function, and is the function with a value of for points () in the same subrange of AHI values, and for others.
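A minimal sketch of an eMAE-style computation follows, assuming that the three factors described above (absolute difference, ranking weight at the reference value, and the subrange-dependent multiplier) are multiplied for each participant and then averaged. The function names, the argument names `same_value` and `other_value`, and all numbers in the example are hypothetical; the ranking function is passed in as a callable.

```python
from bisect import bisect_right

CUTOFFS = (5.0, 15.0, 30.0)  # assumed adult thresholds


def same_subrange(a, b, cutoffs=CUTOFFS):
    """True when both AHI values fall into the same subrange."""
    return bisect_right(cutoffs, a) == bisect_right(cutoffs, b)


def extended_mae(measured, reference, weight, same_value, other_value):
    """Average of |measured - reference| * weight(reference) * multiplier.

    Assumption: the multiplier equals `same_value` when both values share a
    subrange and `other_value` otherwise; both must be supplied by the caller.
    """
    if len(measured) != len(reference) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for m, r in zip(measured, reference):
        multiplier = same_value if same_subrange(m, r) else other_value
        total += abs(m - r) * weight(r) * multiplier
    return total / len(measured)


# Hypothetical example with a constant weight, so only the multiplier matters.
measured = [16.0, 14.0]
reference = [28.0, 16.0]
result = extended_mae(
    measured, reference, weight=lambda r: 1.0, same_value=0.5, other_value=2.0
)
print(result)
```

In this example the pair (16, 28) shares the moderate subrange and is down-weighted, while the pair (14, 16) crosses the threshold at 15 and is up-weighted, mirroring the reasoning behind the ranking function.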

2.4 Shiny web application

The Shiny web application [17] (written in R with external packages: shiny [18], shinythemes [19], ggplot2 [20], plotly [21], DT [22], BlandAltmanLeh [23], caret [24], e1071 [25], DescTools [12], pROC [16], grid [26] and ggExtra [27]) was developed for doctors and clinicians in order to enable work with these methods using only two vectors - reference and measurement AHI values. The home page, containing inputs (left panel), tabs (upper right) and the output display (lower right) is shown in Fig. 3.


Figure 3: The home page of the Shiny web application, supplemental for the paper [17].

The input file should be an Excel spreadsheet with two columns comprising reference AHI values and values from a tested device. Before any file is uploaded, the tabs present the results of the analysis of the default data, coming from [28], a study assessing the agreement between the AHI parameters measured by portable monitor and by reference polysomnogram. The data are not well concordant, which facilitates presentation.

The app enables choosing all thresholds to determine which subrange the specific result is located in (by default, these are 5, 15 and 30, as presented in the Introduction for the adult-related group). It is also possible to set the minimum and maximum of the ranking function (by default, and , respectively).

The Results section presents the most important results produced for the sample dataset using the app [17].

3 Results

As stated in the previous section, the presented sample analysis was carried out on the data from the paper [28].

A scatter-plot of 71 stored points is presented in Fig. 4, along with Pearson’s correlation coefficient, the p-value of the correlation test, the formula and line for the simple linear regression with intercept, and only the formula for another simple linear regression calculated without an intercept.

Figure 4: The scatter-plot and basic statistics of the data points (in the Shiny web app, the plot is divided into two tabs: Pearson’s Correlation and Linear models).

One can note that the intercept of the first considered linear model was , even higher than the second hotspot. On the other hand, the slope of the model with the intercept value forced to 0 is .

The Spearman’s rho was about , similar to the Lin’s coefficient, which does not differ much from the other correlation measures (the bias correction factor is very close to 1), regardless of poor data concordance.

As the data did not come from a normal distribution, a Wilcoxon signed-rank test for paired data should be performed. For this particular case, the p-value = (indicating no cause to reject the null hypothesis of equal medians).
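Why a non-significant paired test does not imply agreement can be seen in a small sketch. The implementation below is a simplified Wilcoxon signed-rank test (normal approximation, average ranks for ties, no tie or continuity correction), written with the standard library only; the function name and the data are hypothetical.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, two-sided p-value) using the normal approximation.

    Simplifications: zero differences are dropped; no tie or continuity
    correction is applied to the variance.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend the group while the absolute differences are tied.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    expected = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - expected) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided
    return w_plus, p_value


# Hypothetical data: large errors that happen to be symmetric around zero.
reference = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
device = [25.0, 5.0, 50.0, 20.0, 75.0, 35.0]

w_plus, p_value = wilcoxon_signed_rank(device, reference)
mae = sum(abs(d - r) for d, r in zip(device, reference)) / len(reference)
print("W+:", w_plus, "p-value:", p_value, "MAE:", mae)
```

The test finds no systematic shift because over- and underestimates cancel out, although the readings differ from the reference by 20 events per hour on average.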

The Bland-Altman plot is presented in Fig. 5. The mean value of the differences (without using their absolute values) is and the spread calculated by the measure is (very high). This can also be observed in the Bland-Altman plot’s high percentages of relative deviation.

Figure 5: The Bland-Altman plot of the analyzed data points.

For the modified Bland-Altman (Fig. 6), the slope of the linear model was relatively high, about , showing that the difference between values is related to the actual value.

Figure 6: The modified Bland-Altman plot of the analyzed data points, along with the linear model.

The simple heuristic ratio is . Finally, , and . These numbers make a lot of sense compared to other results.

For qualitative analysis, the three main parameters were:

  • Accuracy = ,

  • Cohen’s Kappa = , and

  • Multi-class AUC = ,

showing that, even though several methods treat the differences as statistically insignificant (as with the Wilcoxon test), the clinical significance is noticeable and the range of differences indisputable.

4 Discussion

The PSG is still the gold standard in sleep research, even if it is overly complex for breathing-related studies, in which the final analysis is based on several parameters, e.g., the apnea-hypopnea index, respiratory disturbance index, or percentage of snoring during the night, of which the first seems to be the most popular. However, it has already been observed that simpler and more comfortable setups allow estimation of those parameters. Therefore, there is an increasing spectrum of methods and devices available to perform an HST.

The American Academy of Sleep Medicine suggested in its guidelines four types of devices: in-laboratory, technician-attended, overnight PSG (Type I); full PSG outside of the laboratory not needing a technologist’s presence (Type II); devices not recording the signals needed to determine sleep stages or sleep disruption, typically including respiratory movement and airflow, heart rate or ECG, and arterial oxygen saturation (Type III); and those recording 1-2 variables and without a technician, typically arterial oxygen saturation and airflow (Type IV) [29]. Therefore, for Types III and IV, there are many novel applications, sometimes even failing to be accurately classified, e.g., peripheral arterial tonometry [30], or audio-based technologies [31].

Even with these techniques, one can measure sleep over several nights and calculate sophisticated parameters that summarize the statistics across many nights. The AHI values are the starting point, often only as raw values that are not connected to clinical ranges.

It should be mentioned that there are also studies in which the statistical analysis is reported in the correct, more specific, manner. For example, Yuceege et al. reported results of qualitative analysis, such as sensitivity, specificity, PPV and NPV [32].

This is consistent with the newest methodological recommendations presented by Miller et al., who stated that correlational analyses should be conducted alongside qualitative analysis during validity testing [33].

In addition to that, we proposed a list of possible parameters, consisting of well known, modified, or heuristically deduced parameters, which can be considered by physicians and statisticians, particularly for comparing and validating various techniques and methods.

Simple general interpretations of the proposed parameters are presented below.

  • For the intercept of the linear model - the closer to zero, the better.

  • For the slope of the linear model with a zero intercept - the closer to one, the better.

  • For all correlation analyses (Pearson’s, Spearman’s or Lin’s) and for the bias correction factor - the closer to one, the better.

  • A t-test or Wilcoxon test p-value greater than 0.05 indicates no reason to reject the null hypothesis that the two means or medians are equal.

  • The mean value of the differences (Bland-Altman plot) should be as close to zero as possible; however, one should also assess the distribution of the points vs. the mean of all pairs (there should be no relation).

  • The spread of the differences (Bland-Altman plot) should be as low as possible, probably not greater than 20.

  • The slope in the modified Bland-Altman plot should be as close to zero as possible, the simple heuristic ratio close to one, and the distribution of points below and above the Y=X line similar.

  • The smaller the MAE/eMAE, the better (but the values should be primarily used in relation to other studies).

  • The higher the accuracy, sensitivity, specificity, Cohen’s Kappa or multi-class AUC, the better.

We do not mean to endorse a single parameter over the range of options that allows considering different contexts of the analysis. Of course, all quantitative approaches remain sensible when considered only within specific subranges of AHI values; however, we omitted such analysis in order to preserve clarity and readability.

It is also important to remember that, in this type of analysis, outliers can have a very large impact. It is possible to estimate the distribution of the “lo-factors” (local outlier factors, coefficients indirectly assessing the distance between data points) and to choose the cut-off threshold to remove observations with higher lo-factor values [34]. However, as caution in using this option is recommended, we decided not to make it available in this version of the Shiny web app [17].

Also, the American Statistical Association has proposed that the Bayesian approach be used in similar research [35]. However, based on the presented distributions of AHI points in the studied populations, we think that adopting an appropriate prior distribution for such analysis could be difficult.

It should also be added that the presented consideration may be extended to different areas of research, where comparative analysis is part of the process.

5 Summary

This paper presents the ways in which AHI parameters established by new devices and methods can be compared with the gold standard, reference method. In order to speed up the analysis process, the Shiny web application was prepared for both clinicians and data scientists. An in-depth look at the results enables one to assess not only statistical, but also clinical significance.


The work is part of a short-term scholarship project (The Bekker Programme) funded by the Polish National Agency for Academic Exchange (NAWA).

The authors thank Martin Berka for linguistic adjustments.


  • [1] R. Heinzer, S. Vat, P. Marques-Vidal, H. Marti-Soler, D. Andries, N. Tobback, and P. Vollenweider, “Prevalence of sleep-disordered breathing in the general population: the HypnoLaus study,” The Lancet Respiratory Medicine, vol. 3, no. 4, pp. 310-318, 2015.
  • [2] J.C. Lumeng and R.D. Chervin, “Epidemiology of pediatric obstructive sleep apnea”, Proc Am Thorac Soc, vol. 5, no. 2, pp. 242–252, 2008.
  • [3] A.M. Li, C.T. Au, R.Y. Sung, et al., “Ambulatory blood pressure in children with obstructive sleep apnoea: a community based study”, Thorax, vol. 63, no. 9, pp. 803–809, 2008.
  • [4] O.S. Capdevila, L. Kheirandish-Gozal, E. Dayyat, et al., “Pediatric obstructive sleep apnea: complications, management, and long-term outcomes”, Proc Am Thorac Soc, vol. 5, no. 2, pp. 274–282, 2008.
  • [5] R.B. Mitchell and J. Kelly, “Behavior, neurocognition and quality-of-life in children with sleep-disordered breathing”, Int J Pediatr Otorhinolaryngol, vol. 70, no. 3, pp. 395–406, 2006.
  • [6] H.L. Tan, D. Gozal and L. Kheirandish-Gozal, “Obstructive sleep apnea in children: a critical update”, Nat Sci Sleep, vol. 5, pp. 109–123, 2013.
  • [7] W. R. Ruehland, P. D. Rochford, F. J. O’Donoghue, R. J. Pierce, P. Singh, and A. T. Thornton, “The new AASM criteria for scoring hypopneas: impact on the apnea hypopnea index,” Sleep, vol. 32, no. 2, pp. 150-157, 2009.
  • [8] S. Yalamanchali, V. Farajian, C. Hamilton, T. R. Pott, C. G. Samuelson, and M. Friedman, “Diagnosis of obstructive sleep apnea by peripheral arterial tonometry: meta-analysis”, JAMA Otolaryngology–Head & Neck Surgery, vol. 139, no. 12, pp. 1343-1350, 2013.
  • [9] M. Younes, M. Soiferman, W. Thompson and E. Giannouli, “Performance of a New Portable Wireless Sleep Monitor”, Journal of Clinical Sleep Medicine, vol. 13, no. 2, pp. 245-258, 2017.
  • [10] L. I. Lin, “A concordance correlation coefficient to evaluate reproducibility”, Biometrics, vol. 45, pp. 255-268, 1989.
  • [11] L. I. Lin, “A note on the concordance correlation coefficient”, Biometrics, vol. 56, pp. 324-325, 2000.
  • [12] A. Signorell et al., “DescTools: Tools for Descriptive Statistics”, R package version 0.99.28, 2019,
  • [13] J. M. Bland and D. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement”, Lancet, vol. 327, no. 8476, pp. 307-310, 1986.
  • [14] F. Seoane, S. Abtahi, F. Abtahi, L. Ellegård, G. Johannsson, I. Bosaeus, and L. C. Ward, “Mean expected error in prediction of total body water: a true accuracy comparison between bioimpedance spectroscopy and single frequency regression equations”, BioMed Research International, 656323, 2015.
  • [15] D. J. Hand and R. J. Till, “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems”, Machine Learning, vol. 45, no. 2, pp. 171–186, 2001.
  • [16] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.-C. Sanchez, and M. Muller, “pROC: an open-source package for R and S+ to analyze and compare ROC curves”, BMC Bioinformatics, vol. 12, p. 77, 2011.
  • [17]
  • [18] W. Chang, J. Cheng, J.J. Allaire, Y. Xie, and J. McPherson, “shiny: Web Application Framework for R”, R package version 1.2.0, 2018,
  • [19] W. Chang, “shinythemes: Themes for Shiny”, R package version 1.1.2, 2018,
  • [20] H. Wickham, “ggplot2: Elegant Graphics for Data Analysis”, Springer-Verlag New York, 2016.
  • [21] C. Sievert, “plotly for R”, 2016,
  • [22] Y. Xie, J. Cheng, and X. Tan, “DT: A Wrapper of the JavaScript Library ’DataTables’ ”, R package version 0.5, 2018,
  • [23] B. Lehnert, “BlandAltmanLeh: Plots (Slightly Extended) Bland-Altman Plots”, R package version 0.3.1, 2015,
  • [24] M. Kuhn et al., “caret: Classification and Regression Training”, R package version 6.0-81, 2018,
  • [25] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, “e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien”, R package version 1.7-0, 2018,
  • [26] R Core Team, “R: A language and environment for statistical computing”. R Foundation for Statistical Computing, Vienna, Austria, 2018,
  • [27] D. Attali, and C. Baker, “ggExtra: Add Marginal Histograms to ’ggplot2’, and More ’ggplot2’ Enhancements”, R package version 0.8, 2018,
  • [28] S. Nagubadi, R. Mehta, M. Abdoh, M. Nagori, S. Littleton, R. Gueret and A. Tulaimat, “The Accuracy of Portable Monitoring in Diagnosing Significant Sleep Disordered Breathing in Hospitalized Patients”, PLOS ONE, vol. 11, no. 12, pp. e0168073, 2016.
  • [29] V.K. Kapur, D.H. Auckley, S. Chowdhuri, D.C. Kuhlmann, R. Mehra, K. Ramar and C.G. Harrod, “Clinical practice guideline for diagnostic testing for adult obstructive sleep apnea: an American Academy of Sleep Medicine clinical practice guideline”, Journal of Clinical Sleep Medicine, vol. 13, no. 3, pp. 479-504, 2017.
  • [30] S. Yalamanchali, V. Farajian, C. Hamilton, T.R. Pott, C.G. Samuelson and M. Friedman, “Diagnosis of obstructive sleep apnea by peripheral arterial tonometry: meta-analysis”, JAMA Otolaryngology–Head and Neck Surgery, vol. 139, no. 12, pp. 1343-1350, 2013.
  • [31] M. Glos, A. Sabil, K.S. Jelavic, G. Baffet, C. Schöbel, I. Fietze, and T. Penzel, “Tracheal sound analysis for detection of sleep disordered breathing”, Somnologie, vol. 23, no. 2, pp. 80-85, 2019.
  • [32] M. Yuceege, H. Firat, A. Demir and S. Ardic, “Reliability of the Watch-PAT 200 in detecting sleep apnea in highway bus drivers”, Journal of Clinical Sleep Medicine, vol. 9 no. 4, pp. 339-344, 2013.
  • [33] J. N. Miller, P. Schulz, B. Pozehl, D. Fiedler, A. Fial and A. M. Berger, “Methodological strategies in using home sleep apnea testing in research and practice”, Sleep and Breathing, vol. 22, no. 3, pp. 569-577, 2018.
  • [34] M. M. Breunig, H. P. Kriegel, R. T. Ng and J. Sander, “LOF: identifying density-based local outliers”, In ACM sigmod record, vol. 29, no. 2, pp. 93-104, 2000.
  • [35] R. L. Wasserstein and N. A. Lazar, “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician, vol. 70, no. 2, pp. 129-133, 2016.