1 Introduction
Obstructive sleep apnea (OSA) is the most severe form of sleep-disordered breathing. It is a common health problem, moderate to severe forms affecting of men and of women, according to the latest epidemiological data [1]. This condition affects all age groups, with a prevalence of in children, with peak prevalence occurring at 2-8 years of age [2, 3, 4, 5, 6].
The gold standard for diagnosing OSA is polysomnography (PSG), and one of its main parameters is the Apnea-Hypopnea Index (AHI). The AHI expresses the degree of sleep-disordered breathing as the number of apneas and hypopneas in a single “average” hour of sleep. It is generally accepted and widely used. Based on the medical guidelines, there are four qualitative subranges of sleep apnea for adults [7]:

Normal: AHI < 5,

Mild: 5 ≤ AHI < 15,

Moderate: 15 ≤ AHI < 30, and

Severe: AHI ≥ 30.
Different thresholds are used for the same classifications in the pediatric population:

Normal: ,

Mild: ,

Moderate: , and

Severe: .
PSG is a complex and costly test. Therefore, many abbreviated sleep studies (usually home sleep tests, HST) are being developed, and all of them need to be validated against full polysomnography. New devices and methods for evaluating obstructive sleep apnea (and sleep-disordered breathing in general) are thus assessed in comparative studies, in which the AHI they measure is compared with that obtained from reference equipment.
However, despite the guidelines, the exact (raw) values are still compared, very often using correlation coefficients, as in a recent meta-analysis of peripheral arterial tonometry diagnostics [8] or an assessment of portable wireless sleep monitors [9].
There is nothing inherently wrong with analyzing numbers instead of diagnostic subranges; however, it appears that the meaning of the correlation coefficient was overlooked in those cases. Several issues with the correlation coefficient should be addressed in the context of AHI comparison:

it does not evaluate the clinical significance of the method, i.e., how often subjects are correctly classified (as normal, mild, moderate, or severe);

it is susceptible to outliers and influential observations;

Pearson’s version should be used only for data with a normal distribution, a condition usually unfulfilled due to a limited range of physiologically relevant values and the spectrum of the invited study group;

its interpretation is tied to a linear model: many regression lines, with different slopes and intercepts, may produce high correlation coefficients even when they are far from the clinically relevant one (a slope of 45 degrees with no intercept); and

there is no insight into whether the slope is slightly greater than or (to the same extent) less than 1, which might otherwise drive important conclusions on the reliability of the tested device.
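The points above can be illustrated with a minimal sketch. The AHI values and the three "devices" below are hypothetical and chosen only for illustration; `pearson_r` is a hand-written helper, not part of the Shiny application. Each device yields a Pearson coefficient of 1 against the reference, although two of them disagree with it substantially.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical reference AHI values (events per hour).
reference = [2.0, 6.0, 12.0, 18.0, 25.0, 40.0]

# Three hypothetical devices: one agrees exactly, two are biased.
device_ok = list(reference)                      # slope 1, intercept 0
device_scaled = [0.5 * v for v in reference]     # slope 0.5, intercept 0
device_shifted = [v + 12.0 for v in reference]   # slope 1, intercept 12

for name, dev in [("ok", device_ok),
                  ("scaled", device_scaled),
                  ("shifted", device_shifted)]:
    # Mean absolute difference shows the disagreement that r hides.
    mad = sum(abs(d - r) for d, r in zip(dev, reference)) / len(reference)
    print(f"{name}: r = {pearson_r(reference, dev):.3f}, mean |diff| = {mad:.2f}")
```

In this toy example, the scaled device would halve every AHI value and the shifted device would move most subjects into a higher subrange, yet the correlation coefficient alone cannot distinguish either of them from the device that agrees exactly.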
Therefore, the main aims of this work are:

to propose and discuss other approaches to possible AHI comparison, and their mathematical and clinical interpretations, and

to present the Shiny web application, which implements selected methods and enables extended analysis of data.
The motivation here is simple: to show how to quantify the performance of methods for automatically estimating the AHI.
2 Materials and Methods
2.1 Quantitative approaches
Parameters of approaches that still work with raw values, but avoid the aforementioned misinterpretations (or introduce them to a much lesser extent), are listed below (in arbitrary order):

the intercept of the linear model that best fits data points;

the slope of another linear model with the intercept value forced to 0;

the p-value (or test statistic) from a paired Wilcoxon signed-rank test (or from a paired t-test in the case of normally distributed data);

the rho value (along with its p-value) from Spearman’s rank correlation test (not assuming a normal distribution);

Lin’s Concordance Correlation Coefficient (with its confidence interval limits), measuring how far the data deviate from the line of perfect concordance (line at 45 degrees) [10, 11];

the mean difference of AHI scores and the spread of AHI score differences on a Bland-Altman plot (means of AHI scores on the X-axis, and differences of AHI scores on the Y-axis) [13];

slope and intercept of the linear model that best fits data points on the modified Bland-Altman plot (in which reference values are used instead of the mean AHI scores on the X-axis); this technique enables one to assess whether the distribution of the differences depends on the reference value;

simple heuristic ratio: the number of data points above the Y=X line divided by the number of points below it; and

mean absolute error (MAE).
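The regression-based items and Lin's coefficient from the list above can be sketched as follows. This is a minimal illustration with hand-written helpers (`ols_fit`, `slope_through_origin`, `lin_ccc` are names introduced here, not functions of the Shiny application), and the data are hypothetical. Lin's coefficient is computed in its population-variance form, without confidence limits.

```python
def mean(v):
    return sum(v) / len(v)

def ols_fit(ref, dev):
    """Least-squares line dev = intercept + slope * ref; returns (intercept, slope)."""
    mr, md = mean(ref), mean(dev)
    sxx = sum((r - mr) ** 2 for r in ref)
    sxy = sum((r - mr) * (d - md) for r, d in zip(ref, dev))
    slope = sxy / sxx
    return md - slope * mr, slope

def slope_through_origin(ref, dev):
    """Slope of the least-squares line dev = slope * ref (intercept forced to 0)."""
    return sum(r * d for r, d in zip(ref, dev)) / sum(r * r for r in ref)

def lin_ccc(ref, dev):
    """Lin's concordance correlation coefficient (population-variance form)."""
    n = len(ref)
    mr, md = mean(ref), mean(dev)
    var_r = sum((r - mr) ** 2 for r in ref) / n
    var_d = sum((d - md) ** 2 for d in dev) / n
    cov = sum((r - mr) * (d - md) for r, d in zip(ref, dev)) / n
    # The squared mean shift in the denominator penalizes departure from Y=X.
    return 2 * cov / (var_r + var_d + (mr - md) ** 2)

# Hypothetical data: the device reads 12 events/h too high for everyone.
ref = [2.0, 6.0, 12.0, 18.0, 25.0, 40.0]
dev = [r + 12.0 for r in ref]

intercept, slope = ols_fit(ref, dev)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
print(f"slope with zero intercept = {slope_through_origin(ref, dev):.2f}")
print(f"Lin's CCC = {lin_ccc(ref, dev):.3f}")
```

For this constant offset, the fitted slope is 1 and the Pearson coefficient would be 1, while the non-zero intercept and a concordance coefficient well below 1 reveal the disagreement.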
Another technique, the relative-deviation Bland-Altman plot, may also be used to complete the analysis with regard to the visual distribution of points [14].
Other visual/analytical methods are possible (like finding the distribution of absolute or square differences, Bayesian reasoning, etc.); due to their poorer interpretability from the physician’s perspective, they were not considered further.
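The Bland-Altman quantities, the simple heuristic ratio and the MAE listed earlier in this subsection can be computed with a few lines of code. The sketch below rests on two assumptions that are not taken from the paper: differences are defined as device minus reference, and the spread is reported as the sample standard deviation of the differences. All function names and data are hypothetical.

```python
import math

def bland_altman(ref, dev):
    """Return (pair means, differences, mean difference, sample SD of differences)."""
    means = [(d + r) / 2 for r, d in zip(ref, dev)]
    diffs = [d - r for r, d in zip(ref, dev)]   # assumed sign: device - reference
    bias = sum(diffs) / len(diffs)
    sd = math.sqrt(sum((x - bias) ** 2 for x in diffs) / (len(diffs) - 1))
    return means, diffs, bias, sd

def modified_ba_slope(ref, dev):
    """Slope of the differences regressed on the reference values."""
    diffs = [d - r for r, d in zip(ref, dev)]
    mr, mdiff = sum(ref) / len(ref), sum(diffs) / len(diffs)
    sxx = sum((r - mr) ** 2 for r in ref)
    sxy = sum((r - mr) * (x - mdiff) for r, x in zip(ref, diffs))
    return sxy / sxx

def above_below_ratio(ref, dev):
    """Number of points above the Y=X line divided by the number below it."""
    above = sum(d > r for r, d in zip(ref, dev))
    below = sum(d < r for r, d in zip(ref, dev))
    return above / below if below else math.inf

def mae(ref, dev):
    """Mean absolute error between device and reference values."""
    return sum(abs(d - r) for r, d in zip(ref, dev)) / len(ref)

# Hypothetical paired AHI values.
ref = [2.0, 6.0, 12.0, 18.0, 25.0, 40.0]
dev = [3.0, 5.0, 15.0, 16.0, 31.0, 52.0]

_, _, bias, sd = bland_altman(ref, dev)
print(f"mean difference = {bias:.2f}, SD of differences = {sd:.2f}")
print(f"modified Bland-Altman slope = {modified_ba_slope(ref, dev):.3f}")
print(f"above/below ratio = {above_below_ratio(ref, dev):.2f}")
print(f"MAE = {mae(ref, dev):.2f}")
```

A positive slope in the modified plot, as in this toy data set, would indicate that the device overestimates more strongly at higher reference values.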
2.2 Qualitative approaches
While the techniques listed above are mathematically correct, all of them use the numerical data directly, without supplying the right context (established subranges of AHI and their clinical interpretations). Therefore, we would like to emphasize the importance of qualitative techniques that can estimate “clinical significance” along with “statistical significance”. The introductory concept is presented in Fig. 1.
The main parameters (accuracy, sensitivity, specificity, positive and negative predictive values (PPV and NPV, respectively), and Cohen’s Kappa) may then be estimated for the specific case. The accuracy is the ratio of the number of correctly classified data points to the total number of data points. Even if there is a relatively big difference, as between 16 and 28, the assignment to a group will be the same, unlike for 14 and 16, which could result in the patient receiving different treatment. The sensitivity is the proportion of actual positives that are correctly identified (true positives); it can be extended to the multiclass case by treating the class being analyzed as the positive state and grouping all others into the negative. The extended specificity uses the same definition, with true positives replaced by true negatives.
PPV and NPV are the proportions of true positive and true negative results among all positive and all negative results of the test, respectively. Cohen’s Kappa is a more robust value, which describes the accuracy after removing the effect of chance agreement and takes into account possible imbalances in the data; all values greater than 0 mean that the method is better than chance (the maximum is 1, the same as for accuracy).
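A minimal sketch of these qualitative measures is given below. It assumes the adult thresholds of 5, 15 and 30 with lower bounds treated as inclusive (an assumption about boundary handling), uses hypothetical data, and relies on hand-written helpers rather than on the caret functions used by the application.

```python
from collections import Counter

LABELS = ["normal", "mild", "moderate", "severe"]

def classify(ahi, thresholds=(5, 15, 30)):
    """Map an AHI value to a subrange index (0 = normal ... 3 = severe)."""
    return sum(ahi >= t for t in thresholds)

def confusion(ref, dev, thresholds=(5, 15, 30)):
    """Counts of (reference class, device class) pairs."""
    return Counter((classify(r, thresholds), classify(d, thresholds))
                   for r, d in zip(ref, dev))

def accuracy(cm):
    """Share of pairs in which both methods give the same class."""
    return sum(c for (i, j), c in cm.items() if i == j) / sum(cm.values())

def one_vs_rest(cm, k):
    """Sensitivity, specificity, PPV and NPV with class k as the positive state."""
    tp = cm[(k, k)]
    fn = sum(c for (i, j), c in cm.items() if i == k and j != k)
    fp = sum(c for (i, j), c in cm.items() if i != k and j == k)
    tn = sum(c for (i, j), c in cm.items() if i != k and j != k)
    ratio = lambda a, b: a / b if b else float("nan")
    return {"sensitivity": ratio(tp, tp + fn), "specificity": ratio(tn, tn + fp),
            "ppv": ratio(tp, tp + fp), "npv": ratio(tn, tn + fn)}

def cohen_kappa(cm, n_classes=4):
    """Observed agreement corrected for the agreement expected by chance."""
    total = sum(cm.values())
    p_obs = accuracy(cm)
    p_exp = sum(
        (sum(c for (i, _), c in cm.items() if i == k) / total)
        * (sum(c for (_, j), c in cm.items() if j == k) / total)
        for k in range(n_classes))
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical paired AHI values.
ref = [2.0, 6.0, 12.0, 18.0, 25.0, 40.0]
dev = [3.0, 5.0, 15.0, 16.0, 31.0, 52.0]

cm = confusion(ref, dev)
print(f"accuracy = {accuracy(cm):.3f}, kappa = {cohen_kappa(cm):.3f}")
for k, label in enumerate(LABELS):
    print(label, one_vs_rest(cm, k))
```

Note that the pair (12, 15) differs by only 3 events per hour but falls into different classes, whereas larger differences within a single subrange leave the classification unchanged.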
2.3 Ranking function
Coming back to the comment that the difference between 16 and 28 may be less significant from the clinical point of view than that between 14 and 16 (which may cause different treatment to be prescribed), we would like to propose a ranking function that will allow the introduction of weights, which should be multiplied by the original difference, increasing its impact around hotspots (for standard AHI analysis, they are 5, 15 and 30) and decreasing impact in the middles of the subranges.
Of course, the definition of such a ranking function may vary. For visual purposes, we decided to set a value of at hotspots and a value of at range midpoints, with quadratic approximations in between. The shape of such a function is presented in Fig. 2.
Further generalization can take into account the case in which each of the points is near a different hotspot, with both still in the same subrange. For clarity and simplicity, we propose to apply the multiplier of , as explained below, for the extended mean absolute error (eMAE) formula (1)
(1) 
where is the number of participants in the study, is the measured AHI value, is the reference value, is the iterator counting successive participants, is the ranking function, and is the function with a value of for points () in the same subrange of AHI values, and for others.
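The idea of a ranking function and a weighted error can be sketched in code, but only under several loudly stated assumptions, since the exact constants and formula (1) are not reproduced here. In the sketch below, the weights `W_HOT` and `W_MID`, the artificial upper edge `UPPER_EDGE`, the flat behaviour below the first midpoint and above the last one, and the rule for combining the two weights are all hypothetical choices. The function `weighted_mae` is therefore one possible reading of the concept, not the eMAE formula itself.

```python
HOTSPOTS = (5.0, 15.0, 30.0)   # AHI thresholds named as hotspots in the text
W_HOT, W_MID = 2.0, 0.5        # hypothetical weights at hotspots and midpoints
UPPER_EDGE = 45.0              # hypothetical edge, used only to define a midpoint above 30

def ranking_weight(ahi):
    """Weight that peaks at hotspots and dips at subrange midpoints (quadratic in between)."""
    edges = (0.0,) + HOTSPOTS + (UPPER_EDGE,)
    last = len(edges) - 2
    for i in range(last + 1):
        lo, hi = edges[i], edges[i + 1]
        if ahi < hi or i == last:
            mid, half = (lo + hi) / 2, (hi - lo) / 2
            d = (ahi - mid) / half   # -1 at lower edge, 0 at midpoint, +1 at upper edge
            # Assumption: no hotspot at 0 or beyond the last threshold,
            # so the outer halves of the first and last segments stay flat.
            if (i == 0 and d < 0) or (i == last and d > 0):
                return W_MID
            return W_MID + (W_HOT - W_MID) * min(1.0, d * d)

def subrange(ahi):
    """Index of the AHI subrange (0 = normal ... 3 = severe)."""
    return sum(ahi >= h for h in HOTSPOTS)

def weighted_mae(measured, reference):
    """Mean absolute error with each difference scaled by a ranking weight.

    Assumed rule: pairs in the same subrange use the average of the two
    ranking weights; pairs in different subranges use the hotspot weight.
    """
    total = 0.0
    for m, r in zip(measured, reference):
        if subrange(m) == subrange(r):
            w = (ranking_weight(m) + ranking_weight(r)) / 2
        else:
            w = W_HOT
        total += abs(m - r) * w
    return total / len(measured)

# Per unit of difference, the pair (14, 16) is weighted more heavily than (16, 28).
print(weighted_mae([14.0], [16.0]) / 2.0)
print(weighted_mae([28.0], [16.0]) / 12.0)
```

With these hypothetical constants, a difference that crosses a threshold receives the full hotspot weight, while a difference inside one subrange is weighted less per unit, in line with the motivation given at the start of this subsection.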
2.4 Shiny web application
The Shiny web application [17] (written in R with external packages: shiny [18], shinythemes [19], ggplot2 [20], plotly [21], DT [22], BlandAltmanLeh [23], caret [24], e1071 [25], DescTools [12], pROC [16], grid [26] and ggExtra [27]) was developed for doctors and clinicians in order to enable work with these methods using only two vectors: reference and measured AHI values. The home page, containing inputs (left panel), tabs (upper right) and the output display (lower right), is shown in Fig. 3.
The input file should be an Excel spreadsheet with two columns comprising reference AHI values and values from a tested device. Before any file is uploaded, the tabs present the results of the analysis of the default data, coming from [28], a study assessing the agreement between the AHI parameters measured by a portable monitor and by a reference polysomnogram. The data are not highly concordant, which facilitates presentation.
The app allows the user to choose all thresholds that determine which subrange a specific result falls into (by default, these are 5, 15 and 30, as presented in the Introduction for the adult group). It is also possible to set the minimum and maximum of the ranking function (by default, and , respectively).
The Results section presents the most important results produced for the sample dataset using the app [17].
3 Results
As stated in the previous section, the presented sample analysis was carried out on the data from the paper [28].
A scatterplot of 71 stored points is presented in Fig. 4, along with Pearson’s correlation coefficient, the p-value of the correlation test, the formula and line for the simple linear regression with intercept, and only the formula for another simple linear regression calculated without an intercept.
One can note that the intercept of the first considered linear model was , even higher than the second hotspot. On the other hand, the slope of the model with the intercept value forced to 0 is .
The Spearman’s rho was about , similar to the Lin’s coefficient, which does not differ much from the other correlation measures (the bias correction factor is very close to 1), regardless of poor data concordance.
As the data did not come from a normal distribution, a paired Wilcoxon signed-rank test should be performed. For this particular case, the p-value = (indicating no reason to reject the null hypothesis of equal medians).
The Bland-Altman plot is presented in Fig. 5. The mean value of the differences (without using their absolute values) is and the spread calculated by the measure is (very high). This can also be observed in the high percentages of relative deviation on the Bland-Altman plot.
For the modified Bland-Altman plot (Fig. 6), the slope of the linear model was relatively high, about , showing that the difference between values is related to the actual value.
The simple heuristic ratio is . Finally, , and . These numbers make a lot of sense compared to other results.
For qualitative analysis, the three main parameters were:

Accuracy = ,

Cohen’s Kappa = , and

Multiclass AUC = ,
showing that, even though several methods treat the differences as statistically insignificant (as with the Wilcoxon test), the clinical significance is noticeable and the range of differences indisputable.
4 Discussion
PSG is still the gold standard in sleep research, even if it is overly complex for breathing-related studies, in which the final analysis is based on several parameters, e.g., the apnea-hypopnea index, respiratory disturbance index, or percentage of snoring during the night, of which the first seems the most popular. However, it has already been observed that simpler and more comfortable setups allow estimation of those parameters. Therefore, there is an increasing spectrum of methods and devices available to perform an HST.
The American Academy of Sleep Medicine suggested in its guidelines four types of devices: in-laboratory, technician-attended, overnight PSG (Type I); full PSG outside of the laboratory not needing a technologist’s presence (Type II); devices not recording the signals needed to determine sleep stages or sleep disruption, typically including respiratory movement and airflow, heart rate or ECG, and arterial oxygen saturation (Type III); and those recording 1-2 variables without a technician, typically arterial oxygen saturation and airflow (Type IV) [29]. For Types III and IV, there are many novel applications, some of which are difficult to classify accurately, e.g., peripheral arterial tonometry [30] or audio-based technologies [31].
Even though these techniques make it possible to measure sleep over several nights and to calculate sophisticated multi-night statistics, the AHI values remain the starting point, often only as raw values that are not connected to clinical ranges.
It should be mentioned that there are also studies in which the statistical analysis is reported in the correct, more specific manner. For example, Yuceege et al. reported results of qualitative analysis, such as sensitivity, specificity, PPV and NPV [32].
This is consistent with the newest methodological recommendations presented by Miller et al., who stated that correlational analyses should be conducted alongside qualitative analysis during validity testing [33].
In addition, we proposed a list of possible parameters, consisting of well-known, modified, or heuristically deduced ones, which can be considered by physicians and statisticians, particularly for comparing and validating various techniques and methods.
Simple general interpretations of the proposed parameters are presented below.

For the intercept of the linear model: the closer to zero, the better.

For the slope of the linear model with a zero intercept: the closer to one, the better.

For all correlation analyses (Pearson’s, Spearman’s or Lin’s) and for the bias correction factor  the closer to one, the better.

A t-test or Wilcoxon test p-value greater than 0.05 indicates no reason to reject the null hypothesis that the two means or medians are equal.

The mean value of the differences (Bland-Altman plot) should be as close to zero as possible; however, one should also assess the distribution of the points vs. the mean of all pairs (there should be no relation).

The spread of the differences (Bland-Altman plot) should be as low as possible, probably not greater than 20.

The slope in the modified Bland-Altman plot should be as close to zero as possible, the simple heuristic ratio close to one, and the distribution of points below and above the Y=X line similar.

The smaller the MAE/eMAE, the better (but the values should be primarily used in relation to other studies).

The higher the accuracy, sensitivity, specificity, Cohen’s Kappa or multiclass AUC, the better.
We do not mean to endorse a single parameter over the range of options that allows considering different contexts of the analysis. Of course, all quantitative approaches remain sensible when considered only within specific subranges of AHI values; however, we omitted such analysis in order to preserve clarity and readability.
It is also important to remember that, in this type of analysis, outliers can have a very large impact. It is possible to estimate the distribution of the “lofactors” (local outlier factors, coefficients indirectly assessing the distance between data points) and to choose a cut-off threshold to remove observations with higher lofactor values [34]. However, as caution is recommended when using this option, we decided not to make it available in this version of the Shiny web app [17].
Also, the American Statistical Association has proposed that the Bayesian approach be used in similar research [35]. However, based on the presented distributions of AHI points in the studied populations, we think that adopting an appropriate prior distribution for such analysis could be difficult.
It should also be added that the presented consideration may be extended to different areas of research, where comparative analysis is part of the process.
5 Summary
This paper presents ways in which AHI parameters established by new devices and methods can be compared with the gold-standard reference method. In order to speed up the analysis process, the Shiny web application was prepared for both clinicians and data scientists. An in-depth look at the results enables one to assess not only statistical, but also clinical significance.
Acknowledgment
The work is part of a short-term scholarship project (The Bekker Programme) funded by the Polish National Agency for Academic Exchange (NAWA).
The authors thank Martin Berka for linguistic adjustments.
References
 [1] R. Heinzer, S. Vat, P. Marques-Vidal, H. Marti-Soler, D. Andries, N. Tobback, and P. Vollenweider, “Prevalence of sleep-disordered breathing in the general population: the HypnoLaus study”, The Lancet Respiratory Medicine, vol. 3, no. 4, pp. 310–318, 2015.
 [2] J.C. Lumeng and R.D. Chervin, “Epidemiology of pediatric obstructive sleep apnea”, Proc Am Thorac Soc, vol. 5, no. 2, pp. 242–252, 2008.
 [3] A.M. Li, C.T. Au, R.Y. Sung, et al., “Ambulatory blood pressure in children with obstructive sleep apnoea: a community based study”, Thorax, vol. 63, no. 9, pp. 803–809, 2008.
 [4] O.S. Capdevila, L. Kheirandish-Gozal, E. Dayyat, et al., “Pediatric obstructive sleep apnea: complications, management, and long-term outcomes”, Proc Am Thorac Soc, vol. 5, no. 2, pp. 274–282, 2008.
 [5] R.B. Mitchell and J. Kelly, “Behavior, neurocognition and quality-of-life in children with sleep-disordered breathing”, Int J Pediatr Otorhinolaryngol, vol. 70, no. 3, pp. 395–406, 2006.
 [6] H.L. Tan, D. Gozal and L. Kheirandish-Gozal, “Obstructive sleep apnea in children: a critical update”, Nat Sci Sleep, vol. 5, pp. 109–123, 2013.
 [7] W. R. Ruehland, P. D. Rochford, F. J. O’Donoghue, R. J. Pierce, P. Singh, and A. T. Thornton, “The new AASM criteria for scoring hypopneas: impact on the apnea hypopnea index”, Sleep, vol. 32, no. 2, pp. 150–157, 2009.
 [8] S. Yalamanchali, V. Farajian, C. Hamilton, T. R. Pott, C. G. Samuelson, and M. Friedman, “Diagnosis of obstructive sleep apnea by peripheral arterial tonometry: meta-analysis”, JAMA Otolaryngology–Head & Neck Surgery, vol. 139, no. 12, pp. 1343–1350, 2013.
 [9] M. Younes, M. Soiferman, W. Thompson and E. Giannouli, “Performance of a New Portable Wireless Sleep Monitor”, Journal of Clinical Sleep Medicine, vol. 13, no. 2, pp. 245–258, 2017.
 [10] L. I. Lin, “A concordance correlation coefficient to evaluate reproducibility”, Biometrics, vol. 45, pp. 255–268, 1989.
 [11] L. I. Lin, “A note on the concordance correlation coefficient”, Biometrics, vol. 56, pp. 324–325, 2000.
 [12] A. Signorell et al., “DescTools: Tools for Descriptive Statistics”, R package version 0.99.28, 2019, https://cran.r-project.org/package=DescTools.
 [13] J. M. Bland and D. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement”, Lancet, vol. 327, no. 8476, pp. 307–310, 1986.
 [14] F. Seoane, S. Abtahi, F. Abtahi, L. Ellegård, G. Johannsson, I. Bosaeus, and L. C. Ward, “Mean expected error in prediction of total body water: a true accuracy comparison between bioimpedance spectroscopy and single frequency regression equations”, BioMed Research International, 656323, 2015.
 [15] D. J. Hand and R. J. Till, “A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems”, Machine Learning, vol. 45, no. 2, pp. 171–186, 2001.
 [16] X. Robin, N. Turck, A. Hainard, N. Tiberti, F. Lisacek, J.C. Sanchez, and M. Muller, “pROC: an open-source package for R and S+ to analyze and compare ROC curves”, BMC Bioinformatics, vol. 12, p. 77, 2011.
 [17] https://mmlynczak.shinyapps.io/AHIComparison/
 [18] W. Chang, J. Cheng, J.J. Allaire, Y. Xie, and J. McPherson, “shiny: Web Application Framework for R”, R package version 1.2.0, 2018, https://CRAN.R-project.org/package=shiny.
 [19] W. Chang, “shinythemes: Themes for Shiny”, R package version 1.1.2, 2018, https://CRAN.R-project.org/package=shinythemes.
 [20] H. Wickham, “ggplot2: Elegant Graphics for Data Analysis”, Springer-Verlag New York, 2016.
 [21] C. Sievert, “plotly for R”, 2016, https://plotly-book.cpsievert.me.
 [22] Y. Xie, J. Cheng, and X. Tan, “DT: A Wrapper of the JavaScript Library ’DataTables’”, R package version 0.5, 2018, https://CRAN.R-project.org/package=DT.
 [23] B. Lehnert, “BlandAltmanLeh: Plots (Slightly Extended) Bland-Altman Plots”, R package version 0.3.1, 2015, https://CRAN.R-project.org/package=BlandAltmanLeh.
 [24] M. Kuhn et al., “caret: Classification and Regression Training”, R package version 6.0-81, 2018, https://CRAN.R-project.org/package=caret.
 [25] D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, and F. Leisch, “e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien”, R package version 1.7-0, 2018, https://CRAN.R-project.org/package=e1071.
 [26] R Core Team, “R: A language and environment for statistical computing”, R Foundation for Statistical Computing, Vienna, Austria, 2018, https://www.R-project.org/.
 [27] D. Attali and C. Baker, “ggExtra: Add Marginal Histograms to ’ggplot2’, and More ’ggplot2’ Enhancements”, R package version 0.8, 2018, https://CRAN.R-project.org/package=ggExtra.
 [28] S. Nagubadi, R. Mehta, M. Abdoh, M. Nagori, S. Littleton, R. Gueret and A. Tulaimat, “The Accuracy of Portable Monitoring in Diagnosing Significant Sleep Disordered Breathing in Hospitalized Patients”, PLOS ONE, vol. 11, no. 12, pp. e0168073, 2016.
 [29] V.K. Kapur, D.H. Auckley, S. Chowdhuri, D.C. Kuhlmann, R. Mehra, K. Ramar and C.G. Harrod, “Clinical practice guideline for diagnostic testing for adult obstructive sleep apnea: an American Academy of Sleep Medicine clinical practice guideline”, Journal of Clinical Sleep Medicine, vol. 13, no. 3, pp. 479–504, 2017.
 [30] S. Yalamanchali, V. Farajian, C. Hamilton, T.R. Pott, C.G. Samuelson and M. Friedman, “Diagnosis of obstructive sleep apnea by peripheral arterial tonometry: meta-analysis”, JAMA Otolaryngology–Head and Neck Surgery, vol. 139, no. 12, pp. 1343–1350, 2013.
 [31] M. Glos, A. Sabil, K.S. Jelavic, G. Baffet, C. Schöbel, I. Fietze, and T. Penzel, “Tracheal sound analysis for detection of sleep disordered breathing”, Somnologie, vol. 23, no. 2, pp. 80–85, 2019.
 [32] M. Yuceege, H. Firat, A. Demir and S. Ardic, “Reliability of the WatchPAT 200 in detecting sleep apnea in highway bus drivers”, Journal of Clinical Sleep Medicine, vol. 9, no. 4, pp. 339–344, 2013.
 [33] J. N. Miller, P. Schulz, B. Pozehl, D. Fiedler, A. Fial and A. M. Berger, “Methodological strategies in using home sleep apnea testing in research and practice”, Sleep and Breathing, vol. 22, no. 3, pp. 569–577, 2018.
 [34] M. M. Breunig, H. P. Kriegel, R. T. Ng and J. Sander, “LOF: identifying density-based local outliers”, in ACM SIGMOD Record, vol. 29, no. 2, pp. 93–104, 2000.
 [35] R. L. Wasserstein and N. A. Lazar, “The ASA’s statement on p-values: context, process, and purpose”, The American Statistician, vol. 70, no. 2, pp. 129–133, 2016.