1 Introduction
Classification models are often compared by testing a global null hypothesis ($H_0$) of no difference in performance against the alternative that one performs significantly better than the other(s). There is no standard framework, nor is it clearly defined which statistical tests to use on which performance metrics. This often results in arbitrary choices, as noted by Demšar (2006), with classification accuracy being used most often.
Precision is an important performance statistic, especially useful in rare class predictions. It is the probability that the true label is positive conditioned on a positive prediction by the classifier, and it is often calculated per class with the average reported as a point estimate
$$\bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_i \tag{1}$$
where $p_i$ is the precision for class $i$ and $k$ is the number of classes.
Ideally, we could compare the average precision in (1) across models. However, a major issue is the use of the same dataset to build multiple models, which results in correlated precision values; if this correlation is not accounted for, inference will be biased. Another issue is the use of statistical tests designed to compare proportions: precision, being a conditional probability, is inherently different. When comparing precision or recall, it is not uncommon to use a Z-score test. Figures 1(a) and 1(b) show the reason to avoid such comparisons. For these plots, correlated values were generated using copula-based simulations with a predefined difference. The Z-score test, being designed to compare independent proportions, has lower power than the generalized score (GS) test when there is a correlation.
It is also often overlooked that some classes might be of greater interest than others, thus requiring higher precision scores. Reporting and comparing average precision as in (1) will dilute this effect. One example is classifying a malignant tumor from others, where we would prefer a model with higher precision for the malignant class. A second example is the higher precision desired when identifying stop signs and pedestrians for an autonomous vehicle. Class-wise comparisons are also advantageous when the sample size is large, i.e. when even small differences can be statistically significant. This is of special concern in today's age of big data, where problems involving deep learning can easily surpass a million observations.
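To make the dilution effect concrete, the following minimal sketch computes per-class precision and its unweighted average for two classifiers. The function names (`per_class_precision`, `macro_precision`) and the toy labels are illustrative assumptions, not part of the original study.

```python
from collections import defaultdict

def per_class_precision(y_true, y_pred):
    """Return {class: precision}; NaN when a class is never predicted."""
    correct = defaultdict(int)
    predicted = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            correct[pred] += 1
    classes = sorted(set(y_true) | set(y_pred))
    return {c: correct[c] / predicted[c] if predicted[c] else float("nan")
            for c in classes}

def macro_precision(per_class):
    """Unweighted mean of the per-class precision values."""
    values = list(per_class.values())
    return sum(values) / len(values)

# Toy data (hypothetical): "m" = malignant, "b" = benign.
y_true  = ["m", "m", "b", "b", "b", "b"]
model_a = ["m", "b", "m", "b", "b", "b"]
model_b = ["m", "m", "m", "m", "b", "b"]

prec_a = per_class_precision(y_true, model_a)
prec_b = per_class_precision(y_true, model_b)
print(prec_a, macro_precision(prec_a))
print(prec_b, macro_precision(prec_b))
```

In this toy case the two models have identical precision on the class of interest ("m") even though their averages differ, which is the kind of detail a single averaged figure hides.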
McNemar's test (McNemar, 1947), the paired t-test and Wilcoxon's signed-rank test (Wilcoxon, 1945) can be used to compare classifiers based on their sensitivity, specificity and accuracy, or on the mean of these. There is a need for an appropriate statistical test that can be used to compare models using precision. As much time and effort goes into model building, model comparison should be treated with the same care. To the best of our knowledge, there is no existing literature in machine learning that reviews or introduces tests to compare correlated precision values. However, similar studies exist for other metrics (Aslan et al.; Benavoli et al., 2014; Demšar, 2006; Dietterich, 1998; Joshi, 2002; Nadeau and Bengio, 2003).
In this paper we present a survey of statistical methods that can be used to compare classifiers based on precision. Comparisons are made on a per-class basis, with methods provided to combine inference for an overall classifier comparison. Methods are introduced to compare classifiers in the cross validation settings (single $k$-fold and repeated $k$-fold) commonly used by practitioners. We show that these methods can be used for simultaneous comparison of multiple multi-class classifiers. We also present a partial Bayesian approach to update precision when class prevalence is known, and demonstrate the application of these methods to compare models based on deep architectures. We intend to enrich the machine learning literature by providing methods for model comparison using precision. The methods presented are not intended to replace or compete with existing statistically sound methods, but to supplement them.
The next section presents an overview of statistical methods, followed by empirical evaluation, our extensions, and our recommendations and conclusion.
2 Statistical tests
This section introduces statistical tests that can be used for classifier comparison using precision alone. We start with standard notation and then introduce statistical tests based on the marginal regression framework and on relative precision.
2.1 Standard notation
We start by defining precision. Given two binary classifiers $C_1$ and $C_2$, we can write their results as Table 1 and Table 2.
Table 1: Results for classifier $C_1$
True label  Predicted 1  Predicted 0  Total
1           a            b            a+b
0           c            d            c+d
Total       a+c          b+d          N
Table 2: Results for classifier $C_2$
True label  Predicted 1  Predicted 0  Total
1           e            f            e+f
0           g            h            g+h
Total       e+g          f+h          N
The estimated precision for classifiers $C_1$ and $C_2$ can be written as
$$\hat{p}_1 = \frac{a}{a+c}, \qquad \hat{p}_2 = \frac{e}{e+g} \tag{2}$$
It is clear from (2) that precision is the probability of a correct prediction conditioned on the classifier's predicted label, i.e. $P(\text{true label}=1 \mid \text{predicted label}=1)$.
Our null hypothesis for comparing $C_1$ and $C_2$ is
$$H_0: p_1 = p_2 \tag{3}$$
i.e. "For a given training set, the estimated precision for classifiers $C_1$ and $C_2$ is not statistically significantly different."
But as the same dataset is used to calculate $\hat{p}_1$ and $\hat{p}_2$, the values compared are not independent. Also, being a conditional probability, precision does not lend itself well to methods designed for proportions. A suitable test that takes these factors into account is needed. Several such tests are presented in the following sections.
2.2 Methods based on marginal regression framework
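Let $D$ be the true label, where 1 is the class label of interest and 0 otherwise, let $T$ be the label predicted by a classifier, and let $Z$ indicate the classifier used (1 for the first classifier, $C_1$, and 0 for the second, $C_2$). We can define $p_1$ and $p_2$ as
$$p_1 = P(D=1 \mid T=1, Z=1) \tag{4}$$
$$p_2 = P(D=1 \mid T=1, Z=0) \tag{5}$$
i.e. the probability that the true label is positive conditioned on a positive predicted label, for classifier one and classifier two respectively.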
Leisenring et al. (2000) used methods based on Generalized Estimating Equations (GEE) (Liang and Zeger, 1986) by restructuring the performance data of a medical diagnostic test to fit a marginal regression framework. We do the same: in our case, the restructured data have one row per classifier prediction, so a two-classifier comparison has two rows per observation. Implementation details are given in Algorithm 1. A brief primer on the marginal regression framework and GEE estimation is given in Appendix B.
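The restructuring step can be sketched as follows. This is a minimal illustration under our own assumptions, not a transcription of Algorithm 1: the function name `to_long_format`, the dictionary keys and the toy labels are hypothetical. Each observation contributes one row per classifier, and the rows with a positive predicted label are the ones that carry information about precision.

```python
def to_long_format(y_true, predictions):
    """Stack predictions into one row per (observation, classifier).

    predictions: dict mapping a classifier name to a list of 0/1 predicted labels.
    """
    rows = []
    for obs_id, truth in enumerate(y_true):
        for name, preds in predictions.items():
            rows.append({
                "obs": obs_id,        # cluster identifier (the original data point)
                "classifier": name,   # which classifier produced the prediction
                "predicted": preds[obs_id],
                "true": truth,
            })
    return rows

# Toy data (hypothetical): 1 is the class of interest.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
predictions = {
    "C1": [1, 1, 0, 1, 0, 1, 0, 0],
    "C2": [1, 0, 1, 1, 1, 1, 0, 0],
}

long_rows = to_long_format(y_true, predictions)
# Rows used when modelling precision: positive predictions only.
positive_rows = [r for r in long_rows if r["predicted"] == 1]
print(len(long_rows), len(positive_rows))
```

Keeping the `obs` identifier on every row is what later allows a GEE fit to treat rows from the same data point as one correlated cluster.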
As the dataset has repeated observations (multiple rows per data point), we use a GEE-based Generalized Linear Model (GLM) with a logit link. Parameters are estimated by GEE and the associated standard errors are calculated using robust sandwich variance estimates (Huber, 1967; White, 1980; Liang and Zeger, 1986).
$$\operatorname{logit} P(D=1 \mid T=1, Z) = \alpha + \beta Z \tag{6}$$
Here $D$ is the true label, $T$ the predicted label and $Z$ the classifier indicator. For a two-classifier comparison, $e^{\beta}$ is the odds ratio of a correct positive prediction for the classifier coded $Z=1$ versus the classifier coded $Z=0$. It describes the degree to which one classifier is more predictive of the true class than the other.
An advantage of this method is the possibility of simultaneously comparing multiple classifiers over multiple datasets. Model estimates such as odds ratios and related confidence intervals can be calculated for supplemental information.
2.2.1 Empirical Wald test
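After fitting the GLM in (6), a Wald test can be used to test the null hypothesis $H_0: \beta = 0$, where $\beta$ is the coefficient of the classifier indicator. This is equivalent to $H_0: p_1 = p_2$. The test statistic is given as
$$W = \frac{\hat{\beta}^2}{\widehat{\operatorname{Var}}(\hat{\beta})} \tag{7}$$
where the denominator is the second (last) diagonal element of the empirical variance-covariance matrix.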
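For the two-classifier case the Wald statistic can be sketched without a GEE solver. The sketch assumes an independence working correlation and relies on the model being saturated (an intercept plus one binary covariate), so that the estimated coefficient is the difference of the two logit-transformed precisions and the uncorrected sandwich variance reduces to a sum of squared per-observation contributions. The function name `empirical_wald` and the toy data are hypothetical; GEE software may apply small-sample corrections that this sketch omits.

```python
import math

def empirical_wald(y_true, pred_1, pred_2):
    """Wald statistic and chi-square (1 df) p-value for H0: equal precision."""
    n1, n2 = sum(pred_1), sum(pred_2)          # numbers of positive predictions
    p1 = sum(d for d, t in zip(y_true, pred_1) if t == 1) / n1
    p2 = sum(d for d, t in zip(y_true, pred_2) if t == 1) / n2
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("logit is undefined when a precision equals 0 or 1")
    # Coefficient of the classifier indicator: difference of logits.
    beta = math.log(p1 / (1 - p1)) - math.log(p2 / (1 - p2))
    w1 = n1 * p1 * (1 - p1)
    w2 = n2 * p2 * (1 - p2)
    # Sandwich variance: each observation is a cluster contributing its
    # residuals from both classifiers (zero when no positive prediction).
    var = sum((t1 * (d - p1) / w1 - t2 * (d - p2) / w2) ** 2
              for d, t1, t2 in zip(y_true, pred_1, pred_2))
    stat = beta ** 2 / var
    p_value = math.erfc(math.sqrt(stat / 2))   # upper tail of chi-square, 1 df
    return stat, p_value

# Toy data (hypothetical): precision is 0.75 for C1 and 0.60 for C2.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
pred_1 = [1, 1, 0, 1, 0, 1, 0, 0]
pred_2 = [1, 0, 1, 1, 1, 1, 0, 0]
wald_stat, wald_p = empirical_wald(y_true, pred_1, pred_2)
print(round(wald_stat, 4), round(wald_p, 4))
```

Observations for which both classifiers make the same positive prediction contribute little or nothing to the variance, which is how the pairing enters the test.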
Reformulation of empirical Wald test statistic was given in Kosinski (2013) as
(8) 
where is the estimated regression parameter from (6), are totals from Table 1 and 2.
Note that the GEE-based Wald test statistic is closely related to the multinomial Wald statistic. This relation is expanded on in Appendix A.
2.2.2 Score test
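The score test statistic based on GEE was given in Leisenring et al. (2000) as
$$T_{GS} = \frac{\left\{\sum_{i=1}^{N_p}\sum_{j=1}^{k_i} D_{ij}\,(Z_{ij}-\bar{Z})\right\}^2}{\sum_{i=1}^{N_p}\left\{\sum_{j=1}^{k_i}(D_{ij}-\bar{D})(Z_{ij}-\bar{Z})\right\}^2} \tag{9}$$
where $k_i$ is the number of positive predicted labels for observation $i$. In a two-classifier setting, $D_{ij}$ is the indicator variable for correct predictions and $Z_{ij}$ indicates which classifier made the $j$th positive prediction. Also
$$\bar{Z} = \frac{\sum_{i=1}^{N_p}\sum_{j=1}^{k_i} Z_{ij}}{\sum_{i=1}^{N_p} k_i} \tag{10}$$
and
$$\bar{D} = \frac{\sum_{i=1}^{N_p}\sum_{j=1}^{k_i} D_{ij}}{\sum_{i=1}^{N_p} k_i} \tag{11}$$
$N_p$ is the number of observations with at least one positive predicted label.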
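The score statistic can be sketched directly from paired predictions. The sketch follows our reading of the statistic of Leisenring et al. (2000): observations with at least one positive prediction form clusters, each positive prediction contributes a row holding the true label and a classifier indicator, and the squared sum of score contributions is divided by the sum of squared per-cluster contributions. The function name `generalized_score` and the toy data are hypothetical.

```python
import math

def generalized_score(y_true, pred_1, pred_2):
    """Generalized score statistic and chi-square (1 df) p-value."""
    clusters = []
    for d, t1, t2 in zip(y_true, pred_1, pred_2):
        rows = []
        if t1 == 1:
            rows.append((d, 1))   # (true label, indicator = 1 for classifier 1)
        if t2 == 1:
            rows.append((d, 0))   # (true label, indicator = 0 for classifier 2)
        if rows:                  # keep observations with >= 1 positive prediction
            clusters.append(rows)
    flat = [row for cluster in clusters for row in cluster]
    d_bar = sum(d for d, _ in flat) / len(flat)
    z_bar = sum(z for _, z in flat) / len(flat)
    numerator = sum(d * (z - z_bar) for d, z in flat) ** 2
    denominator = sum(
        sum((d - d_bar) * (z - z_bar) for d, z in cluster) ** 2
        for cluster in clusters
    )
    if denominator == 0:
        raise ValueError("statistic undefined: no discordant information")
    stat = numerator / denominator
    return stat, math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square, 1 df

# Toy data (hypothetical): precision is 0.75 for C1 and 0.60 for C2.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
pred_1 = [1, 1, 0, 1, 0, 1, 0, 0]
pred_2 = [1, 0, 1, 1, 1, 1, 0, 0]
gs_stat, gs_p = generalized_score(y_true, pred_1, pred_2)
print(round(gs_stat, 4), round(gs_p, 4))
```

Unlike the Wald test, the score statistic is evaluated under the null hypothesis, so it only needs the pooled means and never divides by a per-classifier precision of 0 or 1.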
2.3 Methods based on relative precision
True label=0  True label=1  

=1  =0  =1  =0  
=1  
=0 
and estimated precision for and is now rewritten as
(15) 
(16) 
Relative precision (RP) is defined as the ratio of the two estimated precisions,
$$\widehat{RP} = \frac{\hat{p}_1}{\hat{p}_2} \tag{17}$$
Using log transformation, variance of is estimated with . Where
(18) 
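Confidence intervals are then constructed as
$$\log \widehat{RP} \pm z_{1-\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\log \widehat{RP})} \tag{19}$$
The limits in (19) are exponentiated to obtain the upper and lower limits of RP. Confidence intervals from (19) can be used to test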
(20) 
where
(21) 
is rejected if lower confidence interval of is greater than or upper confidence interval is less than .
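A relative precision interval can be sketched as below. Because the variance expression in (18) is not reproduced here, the sketch substitutes a standard delta-method (linearization) variance for the log of a ratio of two paired proportions; it is meant to play the same role as (18) but is not guaranteed to match it term by term. The function name `relative_precision`, the `level` parameter and the toy data are hypothetical.

```python
import math
from statistics import NormalDist

def relative_precision(y_true, pred_1, pred_2, level=0.95):
    """Return (RP, lower limit, upper limit) on the original scale."""
    n1, n2 = sum(pred_1), sum(pred_2)          # numbers of positive predictions
    p1 = sum(d for d, t in zip(y_true, pred_1) if t == 1) / n1
    p2 = sum(d for d, t in zip(y_true, pred_2) if t == 1) / n2
    if p1 == 0 or p2 == 0:
        raise ValueError("log relative precision is undefined for zero precision")
    rp = p1 / p2
    # Linearization: each observation's contribution to log(p1) - log(p2).
    var_log = sum((t1 * (d - p1) / (n1 * p1) - t2 * (d - p2) / (n2 * p2)) ** 2
                  for d, t1, t2 in zip(y_true, pred_1, pred_2))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(var_log)
    # Build the interval on the log scale, then exponentiate.
    return rp, rp * math.exp(-half_width), rp * math.exp(half_width)

# Toy data (hypothetical): precision is 0.75 for C1 and 0.60 for C2.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
pred_1 = [1, 1, 0, 1, 0, 1, 0, 0]
pred_2 = [1, 0, 1, 1, 1, 1, 0, 0]
rp, rp_low, rp_high = relative_precision(y_true, pred_1, pred_2)
print(round(rp, 3), round(rp_low, 3), round(rp_high, 3))
```

With so few observations the interval is wide and covers 1, so this toy comparison would not be declared significant.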
3 Empirical evaluation
3.1 Experimental setup
To demonstrate the feasibility of applying the methods described in this paper, we used publicly available datasets from the UCI machine learning repository (Blake and Merz, 1998) with varying characteristics and sample sizes, as shown in Table 4.
Dataset  Instances  Attributes  Class 

Wilt  4889  6  2 
Diabetic Retinopathy  1151  20  2 
Phishing  2456  30  2 
Bank note  1372  5  2 
Magic  19020  11  2 
Urban land cover  675  148  9 
If a dataset was not already partitioned, a training-test split of 70%/30% was used. Although most evaluations were performed using fixed training and test splits, the same procedures can be adapted to cross validation, as shown in the following sections.
Random forest (Breiman, 2001) with 1000 trees and Naive Bayes were used for initial comparisons. All comparisons were implemented in R (R Core Team, 2015; Halekoh et al., 2006; Stock and Hielscher, 2013; Hongying Dai and Cui, 2014). Sample code is made publicly available (https://github.com/lgondara/prec_compare).
3.2 Comparison using Generalized Score and Empirical Wald test
Per-class precision was calculated for all datasets using random forest and Naive Bayes. Values were then compared using the Generalized Score (GS) and Wald (GW) test statistics. Results are shown in Table 5.
Dataset     Class  NB    RF    PGS      PGW
Wilt        N      0.65  0.74  <0.0001  <0.0001
Wilt        W      0.73  0.98  0.001    0.003
Diab. Ret.  0      0.54  0.63  0.002    0.0002
Diab. Ret.  1      0.76  0.67  0.07     0.09
Phishing    1      0.95  0.98  <0.0001  <0.0001
Phishing    1      0.92  0.96  <0.0001  <0.0001
Bank note   0      0.83  0.99  <0.0001  0.0001
Bank note   1      0.85  0.98  <0.0001  <0.0001
MAGIC       G      0.72  0.88  <0.0001  <0.0001
MAGIC       H      0.70  0.87  <0.0001  <0.0001
Land Cover  Asp    0.94  0.95  0.87     0.87
Land Cover  Bld    0.91  0.85  0.04     0.04
Land Cover  Cr     0.77  0.69  0.25     0.26
Land Cover  Cnr    0.85  0.78  0.04     0.05
Land Cover  Grs    0.76  0.75  0.77     0.77
Land Cover  Pl     0.92  0.92  0.95     0.95
Land Cover  Shd    0.82  0.79  0.48     0.48
Land Cover  Sl     0.36  0.60  0.01     0.02
Land Cover  Tr     0.70  0.86  0.0001   0.001
NB: Naive Bayes (precision), RF: Random forest (precision), PGS: p-value from GS, PGW: p-value from GW.
Lower p-values (typically below 0.05) signify a statistically significant difference between the precision values of the two classifiers. Results from the GS and GW statistics agree for all comparisons. GS has more power and performs better with small sample sizes than GW, as was also noted by Leisenring et al. (2000).
3.3 Comparison based on relative precision
Concerns around the use/misuse of p-values (Nickerson, 2000; Gill, 1999; Anderson et al., 2000) can be alleviated by using relative precision (RP) and related confidence intervals (CIs). If necessary, a p-value and a test statistic can be calculated as well. Comparison results using RP are shown in Table 6.
Dataset     Class  RP (95% CI)        P-value
Wilt        N      0.88 (0.85,0.92)   <0.0001
Wilt        W      0.75 (0.62,0.91)   0.003
Diab. Ret.  0      0.85 (0.78,0.92)   0.0001
Diab. Ret.  1      1.13 (0.99,1.29)   0.06
Phishing    1      0.97 (0.96,0.98)   <0.0001
Phishing    1      0.96 (0.95,0.97)   <0.0001
Bank note   0      0.83 (0.79,0.88)   <0.0001
Bank note   1      0.87 (0.82,0.92)   <0.0001
MAGIC       G      0.82 (0.81,0.83)   <0.0001
MAGIC       H      0.80 (0.77,0.83)   <0.0001
Land Cover  Asp    0.99 (0.92,1.10)   0.87
Land Cover  Bld    1.07 (1.0,1.15)    0.04
Land Cover  Cr     1.12 (0.92,1.35)   0.26
Land Cover  Cnr    1.1 (1.0,1.17)     0.05
Land Cover  Grs    1.02 (0.90,1.16)   0.77
Land Cover  Pl     1.01 (0.81,1.3)    0.95
Land Cover  Shd    1.03 (0.94,1.14)   0.48
Land Cover  Sl     0.6 (0.39,0.93)    0.02
Land Cover  Tr     0.81 (0.73,0.91)   0.0002
RP: Relative precision, 95% CI: Confidence intervals; comparisons are based on NB/RF.
Results using RP are in agreement with GW and GS. If only RP and the related confidence intervals are used, standard statistical interpretation applies: CIs not including 1 indicate a statistically significant difference.
Another advantage of using RP is that it lends itself to a clear graphical representation of results. An example is shown in Figure 2, where results from Table 6 for the first five datasets are plotted as a forest plot. Boxes represent point estimates and the extended lines represent 95% CIs. A reference line at 1 is plotted for visual inspection of statistical significance. Confidence intervals not overlapping the reference line are considered significant.
3.4 Combining inference
Investigators are often interested in testing a global $H_0$ for an overall classifier comparison. The methods presented so far provide granular per-class control, but an overall comparison is still desired. The results in Table 5 and Table 6 represent a multiple comparison scenario which, if not accounted for, makes it difficult to control the classifier-wide Type I error rate. Common methods to adjust for multiple comparisons include Family Wise Error Rate (FWER) correction and controlling the False Discovery Rate (FDR). In this case, however, we also need to acknowledge the dependence between p-values, which results from observations probably contributing to more than one class. Hence, specially tailored methods to combine dependent p-values are needed.
For dependent p-values, the distribution of the combined test statistic does not have an explicit analytical form. It is approximated using a scaled version (Li et al., 2011) with a new distribution. The Satterthwaite method (Satterthwaite, 1946) was used by Hongying Dai and Cui (2014) to derive new degrees of freedom; the scaled test statistic in this case is an extension of Lancaster (1961) and is given as
(22) 
where
(23) 
and
(24) 
(25) 
(26) 
is test statistic from Lancaster (1961) and takes correlated pvalues into account. When the covariance is unknown, permutation or bootstrap methods can be used to simulate large enough sample (usually 1000) of pvalues.
A much simpler method by Simes (Simes, 1986) can be used as well. For an ordered set of $n$ p-values, we have
$$p_{(1)} \le p_{(2)} \le \dots \le p_{(n)} \tag{27}$$
rejecting the global $H_0$ if $p_{(i)} \le i\alpha/n$ for at least one $i$. The global p-value is then given as $\min_i \{ n\,p_{(i)}/i \}$. Although originally designed for independent p-values, it has been shown (Sarkar and Chang, 1997) to work well for positively correlated p-values.
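The Simes combination is short enough to sketch directly. The function name `simes_global_p` and the input p-values are illustrative assumptions; the results reported in Table 7 were obtained with the method of Hongying Dai and Cui (2014), so the numbers below are not meant to reproduce that table.

```python
def simes_global_p(p_values):
    """Global p-value from the Simes procedure: min over i of n * p_(i) / i."""
    ordered = sorted(p_values)
    n = len(ordered)
    combined = min(n * p / rank for rank, p in enumerate(ordered, start=1))
    return min(1.0, combined)   # a p-value cannot exceed 1

# Hypothetical per-class p-values for one classifier comparison.
per_class_p = [0.07, 0.0002]
global_p = simes_global_p(per_class_p)
print(global_p)
```

The global null hypothesis is rejected at level $\alpha$ whenever the returned value is at most $\alpha$, which is equivalent to finding at least one ordered p-value below its threshold $i\alpha/n$.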
To demonstrate, we focus on combining inference from one type of test and on the first five datasets, where we would expect a combined test to reject the global $H_0$. We used p-values from the individual class comparisons based on the GS test. The method from Hongying Dai and Cui (2014) was used to combine p-values, as a positive correlation cannot be guaranteed. Table 7 shows the results. As expected, the combined p-values are statistically significant at $\alpha = 0.05$, rejecting the global $H_0$ in concordance with the individual precision comparisons.
Dataset    Class  P-value  Combined P-value
Wilt       N      <0.0001  <0.0001
Wilt       W      0.001
Diabetes   0      0.0002   0.002
Diabetes   1      0.07
Phishing   1      <0.0001  <0.0001
Phishing   1      <0.0001
Bank note  0      <0.0001  <0.0001
Bank note  1      <0.0001
MAGIC      G      <0.0001  <0.0001
MAGIC      H      <0.0001
3.5 Multiple classifier comparison
The GEE-based marginal regression framework can be used to compare multiple multi-class classifiers. With a logistic regression model that uses the state-of-the-art classifier as the reference category, we can compare the performance of newly proposed models to the state-of-the-art. This procedure is valuable in large-scale testing where we want to compare tens of classifiers to select the best fit for a problem. We used two datasets to show its feasibility. Random forest with 50 trees (RF2) and Support Vector Machines (SVM) were added, resulting in a four-classifier comparison. Results are shown in Table 8. The p-values inform us that at least one of the precision values is statistically significantly different from the others. The direction and size of the difference can be estimated from the parameter estimates/odds ratios and related CIs.
Dataset   Class  NB    RF1   SVM   RF2   P-value
Wilt      N      0.65  0.74  0.68  0.74  <0.0001
Wilt      W      0.73  0.98  1.0   0.95  <0.0001
Diabetic  0      0.54  0.63  0.63  0.62  0.003
Diabetic  1      0.76  0.67  0.72  0.66  0.12
Dataset  Class  Comparison  OR    LCL   UCL
Wilt     N      RF1 vs NB   1.52  1.34  1.73
Wilt     N      SVM vs NB   1.12  1.03  1.22
Wilt     N      RF2 vs NB   1.51  1.33  1.71
OR: odds ratio, LCL: Lower confidence limit, UCL: Upper confidence limit
Table 9 shows the comparison of the four learning algorithms on one class of the Wilt dataset using odds ratios and related 95% confidence intervals, with Naive Bayes as the reference. Only one class of one dataset is used for demonstration. These estimates carry a wealth of information. Inferential statements such as "Random forest with 1000 trees has 52% (95% CI: 34%, 73%) higher odds than Naive Bayes that a tree it predicts as non-wilted is truly non-wilted" can be made. An odds ratio greater than 1 indicates that the model being compared performs better than the reference model.
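The link between a pair of precision values and the reported odds ratio can be sketched as follows. With a single categorical covariate, the fitted odds ratio is the ratio of the odds of a correct positive prediction, so it can be approximated from the rounded precisions in Table 8. The function names are hypothetical, and the confidence limits are not reproduced because they require the fitted model's robust standard errors.

```python
def odds(p):
    """Odds corresponding to a probability strictly between 0 and 1."""
    return p / (1 - p)

def odds_ratio(p_model, p_reference):
    """Odds ratio of a correct positive prediction, model versus reference."""
    return odds(p_model) / odds(p_reference)

# Rounded precision values for class N of the Wilt dataset (Table 8).
or_rf1_vs_nb = odds_ratio(0.74, 0.65)
or_svm_vs_nb = odds_ratio(0.68, 0.65)
print(round(or_rf1_vs_nb, 2), round(or_svm_vs_nb, 2))
```

The values obtained this way are close to, but not identical with, the odds ratios in Table 9; the gap is plausibly due to the two-decimal rounding of the precision values used as input.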
3.6 Partial Bayesian update of precision
We introduce here a special case of Bayes' rule: updating precision when class prevalence is known. Most datasets are understood to be sampled from a larger population. When the prevalence of a class in that population is known, precision can be updated as follows
$$p^{*} = \frac{Se \cdot \pi}{Se \cdot \pi + (1 - Sp)(1 - \pi)} \tag{28}$$
where $Se$ is the classifier sensitivity, $\pi$ is the population prevalence and $Sp$ is the classifier specificity. This update is well known in medical statistics. Precision can change substantially with a change in class prevalence (Altman and Bland, 1994).
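The prevalence update is the usual Bayes' rule expression for a positive predictive value and can be sketched in a few lines. The function name `updated_precision` and the input values are hypothetical and are not taken from Table 10.

```python
def updated_precision(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity and specificity at a given prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)

# Same classifier, two assumed population prevalences.
rare = updated_precision(sensitivity=0.9, specificity=0.8, prevalence=0.1)
balanced = updated_precision(sensitivity=0.9, specificity=0.8, prevalence=0.5)
print(round(rare, 3), round(balanced, 3))
```

The same classifier yields a much lower precision when the class is rare, which is why a precision estimated on a rebalanced test set can overstate performance in the target population.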
A classic example is medical diagnostics, where the disease prevalence in a population is known and can be used to update precision. Another example is object detection in indoor vs. outdoor images, where some objects are more prevalent indoors (tables, chairs, kettles etc.) and others outdoors (cars, buses, traffic signs etc.).
This update can be applied to most scenarios. Even a justifiable assumption about prevalence should yield better population-level estimates than a non-informative approach. An example is shown in Table 10, where we used the diabetic retinopathy dataset to calculate precision and then updated it using population prevalence from Schneider and Süveges (2004) and Lee et al. (2015). Relative precision is the recommended method to compare updated precision values, with confidence intervals calculated using bootstrap methods. The update can still be used if prevalence is not known, by substituting a normalized prevalence rate.
Method  Class  Empirical precision  Updated precision
NB      0      0.54                 0.35
RF      0      0.63                 0.45
NB      1      0.76                 0.87
RF      1      0.67                 0.81
Empirical precision is the value estimated from the test data; updated precision is the value adjusted for population prevalence.
3.7 Comparison using Cross Validation
So far, the methods described in this paper have only been applied to a fixed train-test split. In this section we show their applicability when using cross validation. We used the GEE-based GLM on $k$-fold and repeated $k$-fold cross validation. For $k$-fold CV, we used $k = 10$. For repeated $k$-fold CV, 10 repetitions were used, keeping $k$ fixed at 10. After saving the predictions for each fold and each classifier, the datasets are vertically stacked to generate a single dataset with multiple observations per record. A GEE-based GLM is then fitted. This is a slight modification of Algorithm 1, presented as Algorithm 2. Results are reported in Table 11 using the diabetic retinopathy dataset. Results from both cross validation variations agree with the results from a fixed train-test split, although cross validation based statistical comparisons have more power in a limited sample size setting. The same methods can be used with any resampling scheme.
Class        10-fold CV     10 × 10-fold CV
Disease      50.6 (0.0001)  50.6 (0.0001)
NonDisease   10.2 (0.001)   10.2 (0.001)
Numbers outside parentheses are test statistics, with p-values inside.
3.8 Application to deep architectures
With the shift to deep learning, aided by the availability of better hardware and larger datasets, we are at a point where models are trained and tested on tens of thousands of objects. This large sample size makes even small differences statistically significant. Also, as new models are often proposed with claims of performing better than the state-of-the-art, thorough comparisons are vital. The methods presented in this paper, applied on a per-class basis, can provide additional insights and can overcome some of these issues. This section shows the application of precision-based comparison to deep architectures. We use two modified versions of the deep convolutional network described in Simonyan and Zisserman (2014). For a simple demonstration we use the models to classify images of cats and dogs from the Kaggle dataset (cats vs dogs) (Kaggle). As the original model was trained on 1000 image classes including many instances of cats and dogs (Russakovsky et al., 2015), it has been modified to work as a binary classifier (Chollet). The two versions differed only in dropout rate: the first had a dropout rate of 0.5 and the second 0.7. Precision outcomes for both classes from both models are reported in Table 12. As expected, the overall accuracy of the two models is very similar (90.76 and 90.88 respectively). The class breakdown, however, shows each model performing better on a different class, although neither difference is significant at the 0.05 level. This type of analysis can also be used to adjust hyperparameters for optimal performance.
Class  Model 1  Model 2  P-value
dogs   0.926    0.916    0.17
cats   0.889    0.902    0.06
Values shown in the table above are precision values compared using the GS statistic.
Although the scenario described above is overly simplistic, it demonstrates the usefulness of the presented methods for comparing state-of-the-art models. More complicated comparisons, such as multiple object detection/classification in images, can be implemented with similar ease.
4 Replicability
High replicability of a test statistic is vital: it not only facilitates reproducible research but is also an estimate of the degree to which random partitioning and other dataset features are related to test results. We focus on replicability as a function of the test dataset proportion and the overall sample size.
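We use the replicability measure introduced by Bouckaert (Bouckaert and Frank, 2004; Bouckaert, 2004), based on the number of rejections of $H_0$ and given as
$$R = \frac{\sum_{1 \le i < j \le n} I(e_i = e_j)}{n(n-1)/2} \tag{29}$$
where $e_i$ is the outcome of the $i$th experiment and $n$ is the total number of experiments. $e_i$ is 1 if $H_0$ is accepted and 0 otherwise. $I$ is an indicator function which is 1 when its argument is true and 0 otherwise. The statistic above can also be calculated with a simpler formula
$$R = \frac{n_a(n_a-1) + n_r(n_r-1)}{n(n-1)} \tag{30}$$
where $n_a$ is the number of experiments in which $H_0$ is accepted and $n_r$ is the number in which it is rejected.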
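Both forms of the replicability measure can be sketched and checked against each other. The function names and the outcome vector are hypothetical; outcomes are coded 1 when the null hypothesis is accepted and 0 when it is rejected.

```python
from itertools import combinations

def replicability_pairwise(outcomes):
    """Fraction of experiment pairs that reach the same decision."""
    pairs = list(combinations(outcomes, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

def replicability_counts(outcomes):
    """Same measure computed from the accept/reject counts only."""
    n = len(outcomes)
    accepted = sum(1 for e in outcomes if e == 1)
    rejected = n - accepted
    return (accepted * (accepted - 1) + rejected * (rejected - 1)) / (n * (n - 1))

# Hypothetical outcomes of four repeated experiments.
outcomes = [1, 1, 1, 0]
print(replicability_pairwise(outcomes), replicability_counts(outcomes))
```

The measure equals 1 when every experiment reaches the same decision, regardless of whether that decision is to accept or to reject, which is why a run of consistent non-rejections also scores as highly replicable.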
We used the GS statistic and an initial sample size of 600 from the "Diabetic retinopathy" dataset, with varying test set proportions and initial sample sizes. Class "0" was used for this test. The dataset and class were chosen because the p-value was neither too small nor too large to skew replicability outcomes, as highly significant or clearly non-significant p-values tend to be more replicable than marginally significant ones.
Results are shown in Figure 3. Using the full dataset, high replicability is maintained even when the test set size is just 20%. Replicability deteriorates with decreasing sample size. Note that when using 40% of the original dataset, replicability falls sharply and then increases with decreasing test set proportion. This is due to an increased number of failures to reject $H_0$, which boosts replicability but with the opposite decision.
5 Conclusion and Recommendations
While the machine learning literature is rich with evaluations and recommendations for statistical tests to compare classifiers based on classification accuracy, AUC, F-measure etc., it has lacked a detailed study of statistical tests that can be used to compare classifiers based on precision or recall alone, which are important performance metrics, especially for rare event classifiers. In this paper we have reviewed statistical methods based on the marginal regression framework and on relative precision. These can be used for classifier comparison using correlated precision values. We have presented an empirical evaluation and the implementation feasibility of these methods. As precision is usually calculated per class, methods are presented to combine p-values for an overall classifier comparison. When class prevalence is known, a partial Bayesian update of precision is introduced. We have shown that the methods can be used in a cross validation setting and applied to compare deep architectures.
We recommend using the GS statistic or RP for comparing two classifiers. Users concerned about the use/misuse of p-values in statistical tests should use RP, as conclusions can be based solely on the RP value and its CIs. To simultaneously compare multiple classifiers, we recommend using a GLM with GEE. Dai's method is recommended over Simes' method for combining dependent p-values, as it retains appropriate power even when p-values are not positively correlated. Whenever possible, it is also recommended to use updated precision based on population prevalence.
Appendix A Multinomial Wald test is similar to GEE based empirical Wald test
Multinomial Wald statistic for comparing precision with logit transformed values is given assuming cells in Table 1 and 2 to be multinomially distributed. Using the delta method, Wald statistic for testing is given as
(31) 
For and estimated covariance matrix of and , we can rewrite (31) as
(32) 
where
(33) 
and is number of times both classifiers predicted correct (positive concordance) and when both were wrong (negative concordance). We refer readers to Kosinski (2013) for derivation of covariance matrix.
for , we have
(34) 
.
(35) 
Appendix B Generalized Estimating Equations (GEE)
If we have dependent sampling, i.e. repeated measures, matched pairs etc., we need specialized modelling approaches to account for correlation, as simpler models assume independent responses. In our case this dependence arises when we build multiple models on the same dataset. To test for differences in precision values using Generalized Linear Models (GLMs), or more specifically logistic regression, we use Generalized Estimating Equations, an extension of GLMs that provides an estimating framework for responses that are not independent. Regression models estimated using GEE are often referred to as marginal regression models or population-averaged models, i.e. inferences are made about population averages. The term "marginal" signifies that the mean response is modelled conditional only on covariates and not on other responses or random effects.
The basic idea of GEE is to model the mean response while treating the within-observation correlation structure as a nuisance parameter. In this framework, we do not need to specify the correlation structure correctly to get reasonable estimates of the coefficients and their standard errors (both are needed in our case, to calculate p-values for comparison and to obtain the magnitude of the difference). The main difference between a GLM for independent observations and a GEE-based GLM is the need to model the covariance structure of the correlated responses. The model is then estimated using quasi-likelihood rather than maximum likelihood.
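Quasi-likelihood estimators are solutions of the quasi-likelihood equations known as Generalized Estimating Equations. There is no closed-form solution in general, so estimation is done using an iterative process. Standard errors can be calculated using the sandwich estimator, given as
$$\widehat{\operatorname{Var}}(\hat{\beta}) = \left(\sum_i \Delta_i^{\top} V_i^{-1} \Delta_i\right)^{-1}\left(\sum_i \Delta_i^{\top} V_i^{-1}(y_i-\hat{\mu}_i)(y_i-\hat{\mu}_i)^{\top}V_i^{-1}\Delta_i\right)\left(\sum_i \Delta_i^{\top} V_i^{-1} \Delta_i\right)^{-1} \tag{36}$$
where, for cluster $i$, $y_i$ is the response vector, $\hat{\mu}_i$ the fitted mean, $\Delta_i = \partial \mu_i / \partial \beta^{\top}$ and $V_i$ the working covariance matrix.
Full details of GEE or GLMs are out of scope for this paper. We refer readers to Agresti and Kateri (2011) for further details. However complex it may sound, GEE is implemented out of the box in most major statistical packages.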
References
 Agresti and Kateri (2011) Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.
 Altman and Bland (1994) Douglas G Altman and J Martin Bland. Statistics notes: Diagnostic tests 2: predictive values. Bmj, 309(6947):102, 1994.
 Anderson et al. (2000) David R Anderson, Kenneth P Burnham, and William L Thompson. Null hypothesis testing: problems, prevalence, and an alternative. The journal of wildlife management, pages 912–923, 2000.
 Aslan et al. Ozlem Aslan, Olcay Taner Yıldız, and Ethem Alpaydın. Statistical comparison of classifiers using area under the ROC curve.
 Benavoli et al. (2014) Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. A Bayesian Wilcoxon signed-rank test based on the Dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1026–1034, 2014.
 Blake and Merz (1998) Catherine Blake and Christopher J Merz. UCI repository of machine learning databases. 1998.
 Bouckaert (2004) Remco R Bouckaert. Estimating replicability of classifier learning experiments. In Proceedings of the twenty-first international conference on Machine learning, page 15. ACM, 2004.
 Bouckaert and Frank (2004) Remco R Bouckaert and Eibe Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 3–12. Springer, 2004.
 Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Chollet Francois Chollet. Building powerful image classification models using very little data. https://blog.keras.io/buildingpowerfulimageclassificationmodelsusingverylittledata.html. Accessed: 2016-07-29.
 Demšar (2006) Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
 Dietterich (1998) Thomas G Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
 Gill (1999) Jeff Gill. The insignificance of null hypothesis significance testing. Political Research Quarterly, 52(3):647–674, 1999.
 Halekoh et al. (2006) Ulrich Halekoh, Søren Højsgaard, and Jun Yan. The R package geepack for generalized estimating equations. Journal of Statistical Software, 15(2):1–11, 2006.
 Hongying Dai and Cui (2014) J Hongying Dai and Yuehua Cui. A modified generalized fisher method for combining probabilities from dependent tests. Frontiers in genetics, 5, 2014.
 Huber (1967) Peter J Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 221–233, 1967.
 Joshi (2002) Mahesh V Joshi. On evaluating performance of classifiers for rare classes. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on, pages 641–644. IEEE, 2002.
 Kaggle Kaggle. Kaggle dogs vs cats competition. https://www.kaggle.com/c/dogsvscats. Accessed: 2016-07-29.
 Kosinski (2013) Andrzej S Kosinski. A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Statistics in medicine, 32(6):964–977, 2013.
 Lancaster (1961) HO Lancaster. The combination of probabilities: an application of orthonormal functions. Australian Journal of Statistics, 3(1):20–33, 1961.
 Lee et al. (2015) Ryan Lee, Tien Y Wong, and Charumathi Sabanayagam. Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye and Vision, 2(1):1, 2015.
 Leisenring et al. (2000) Wendy Leisenring, Todd Alonzo, and Margaret Sullivan Pepe. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics, 56(2):345–351, 2000.
 Li et al. (2011) Shaoyu Li, Barry L Williams, and Yuehua Cui. A combined pvalue approach to infer pathway regulations in eqtl mapping. Statistics and Its Interface, 4:389–401, 2011.
 Liang and Zeger (1986) Kung-Yee Liang and Scott L Zeger. Longitudinal data analysis using generalized linear models. Biometrika, pages 13–22, 1986.
 McNemar (1947) Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
 Nadeau and Bengio (2003) Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.
 Nickerson (2000) Raymond S Nickerson. Null hypothesis significance testing: a review of an old and continuing controversy. Psychological methods, 5(2):241, 2000.
 R Core Team (2015) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015. URL https://www.Rproject.org.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Sarkar and Chang (1997) Sanat K Sarkar and Chung-Kuei Chang. The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440):1601–1608, 1997.
 Satterthwaite (1946) Franklin E Satterthwaite. An approximate distribution of estimates of variance components. Biometrics bulletin, 2(6):110–114, 1946.
 Schneider and Süveges (2004) Miklós Schneider and Ildikó Süveges. Retinopathia diabetica: magyarországi epidemiológiai adatok. Szemészet, 141:441–444, 2004.
 Simes (1986) R John Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754, 1986.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Stock and Hielscher (2013) C Stock and T Hielscher. DTComPair: comparison of binary diagnostic tests in a paired study design. R package version, 1, 2013.
 White (1980) Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838, 1980.
 Wilcoxon (1945) Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.