
Classifier comparison using precision

Newly proposed models are often compared to the state of the art using statistical significance testing. Literature is scarce for classifier comparison using metrics other than accuracy. We present a survey of statistical methods that can be used for classifier comparison using precision, accounting for the inter-precision correlation arising from use of the same dataset. Comparisons are made using per-class precision, and methods are presented to test the global null hypothesis of an overall model comparison. Comparisons are extended to multiple multi-class classifiers and to models using cross validation or its variants. A partial Bayesian update to precision is introduced for when the population prevalence of a class is known. Applications to compare deep architectures are studied.



1 Introduction

Classification models are often compared by testing a global null hypothesis ($H_0$) to establish whether one performs significantly better than the other(s). There is no standard framework, nor is it clearly defined which statistical tests to use on which performance metrics. This often results in arbitrary choices, as noted by Demšar (2006), with classification accuracy being used most often.

Precision is an important performance statistic, especially useful in rare class predictions. It is the probability of a true positive label conditioned on the classifier's prediction, and is often calculated per class with an average reported as a point estimate

$$\bar{P} = \frac{1}{k}\sum_{i=1}^{k} P_i \qquad (1)$$

where $P_i$ is the precision for class $i$ and $k$ is the number of classes.

Ideally, we could compare $\bar{P}$ for different models. However, a major issue is the use of the same dataset to build multiple models, which results in correlated precision values. If not accounted for, this correlation will result in biased inference. Another issue is the use of statistical tests designed to compare proportions; precision, being a conditional probability, is inherently different. When comparing precision or recall, it is not uncommon to use a Z-score test. Figures 1(a) and 1(b) show the reason to avoid such comparisons. For these plots, correlated values were generated using copula based simulations with a predefined difference. The Z-score test, being designed to compare independent proportions, has lower power than the generalized score (GS) test when there is a correlation.

Figure 1: Comparing the Z-score test and the GS test for correlated proportions, with (a) sample size = 100 and (b) sample size = 1000. The X-axis shows the percent difference being tested and the Y-axis the number of times the null hypothesis was rejected at the chosen significance level.

It is also often overlooked that some classes might be of greater interest than others, thus requiring higher precision scores. Reporting and comparing average precision as in (1) will dilute this effect. One example is distinguishing a malignant tumor from others, where we would prefer a model with higher precision for the malignant class. A second example is the higher precision desired when identifying stop signs and pedestrians for an autonomous vehicle. Classwise comparisons are also advantageous when the sample size is large, i.e. when even small differences can be statistically significant. This is of special concern in today's age of big data, where problems involving deep learning can easily surpass a million observations.

McNemar's test (McNemar, 1947), the paired t-test and Wilcoxon's signed rank test (Wilcoxon, 1945) can be used to compare classifiers based on their sensitivity, specificity and accuracy, or the mean of these. An appropriate statistical test is still needed to compare models using precision. As much time and effort goes into model building, model comparison should be treated with the same care. To the best of our knowledge, there is no existing literature in machine learning that reviews or introduces tests to compare correlated precision values. However, similar studies exist for other metrics (Aslan et al.; Benavoli et al., 2014; Demšar, 2006; Dietterich, 1998; Joshi, 2002; Nadeau and Bengio, 2003).

In this paper we present a survey of statistical methods that can be used to compare classifiers based on precision. Comparisons are made on a per-class basis, with methods provided to combine inference for an overall classifier comparison. Methods are introduced to compare classifiers in the cross validation settings commonly used by practitioners (single $k$-fold and $n$ times repeated $k$-fold). We show that these methods can be used for simultaneous comparison of multiple multiclass classifiers. We also present a partial Bayesian approach to update precision when class prevalence is known, and demonstrate application of these methods to compare models based on deep architectures. We intend to enrich the machine learning literature by providing methods for model comparison using precision. The methods presented are not intended to replace or compete with existing statistically sound methods, but to supplement them.

The next section presents an overview of statistical methods, followed by empirical evaluation, our extensions, and our recommendations and conclusion.

2 Statistical tests

This section introduces statistical tests that can be used for classifier comparison using precision alone. We start with standard notation and move on to introduce statistical tests based on marginal regression framework and relative precision.

2.1 Standard notation

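We start by defining precision. Given two binary classifiers $C_1$ and $C_2$, we can write the results as Table 1 and Table 2.

| | Predicted label = 1 | Predicted label = 0 | Total |
|---|---|---|---|
| True label = 1 | a | b | a + b |
| True label = 0 | c | d | c + d |

Table 1: True label vs predicted label for $C_1$.

| | Predicted label = 1 | Predicted label = 0 | Total |
|---|---|---|---|
| True label = 1 | e | f | e + f |
| True label = 0 | g | h | g + h |

Table 2: True label vs predicted label for $C_2$.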

Estimated precision for classifiers $C_1$ and $C_2$ can be written as

$$\hat{P}_1 = \frac{a}{a + c}, \qquad \hat{P}_2 = \frac{e}{e + g} \qquad (2)$$

It is clear from (2) that precision is the probability of making a correct prediction conditioned on the classifier's predicted label, i.e. $P(\text{true label} = 1 \mid \text{predicted label} = 1)$.

Our null hypothesis for comparing $C_1$ and $C_2$ is

$$H_0: P_1 = P_2 \qquad (3)$$

i.e. "For a given training set, the estimated precision for classifiers $C_1$ and $C_2$ is not statistically significantly different."

But as the same dataset is used to calculate $\hat{P}_1$ and $\hat{P}_2$, the values compared are not independent. Also, being a conditional probability, precision does not lend itself well to methods designed for proportions. A suitable test that takes these factors into account is needed. Several such tests are presented in the following sections.
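The notation above can be made concrete with a small sketch. The function names and the toy labels below are hypothetical and are not taken from the paper; the code only counts the cells of Tables 1 and 2 and computes precision as the first cell divided by the number of positive predictions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return the cells (a, b, c, d) of Table 1 for one classifier."""
    a = b = c = d = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            a += 1  # true label 1, predicted 1
        elif truth == positive:
            b += 1  # true label 1, predicted 0
        elif pred == positive:
            c += 1  # true label 0, predicted 1
        else:
            d += 1  # true label 0, predicted 0
    return a, b, c, d


def precision(y_true, y_pred, positive=1):
    """Estimated precision a / (a + c), as in equation (2)."""
    a, _, c, _ = confusion_counts(y_true, y_pred, positive)
    if a + c == 0:
        raise ValueError("no positive predictions, precision is undefined")
    return a / (a + c)


# Toy data (illustrative only): both classifiers score the same observations.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_c1 = [1, 1, 1, 0, 1, 0, 0, 0]
pred_c2 = [1, 0, 0, 0, 1, 1, 0, 1]
print(precision(y_true, pred_c1), precision(y_true, pred_c2))
```

Because both precision values are computed from the same observations, they are correlated, which is the situation the tests in the following sections are meant to handle.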

2.2 Methods based on marginal regression framework

Let $Y$ be the true label, where 1 is the class label of interest and 0 otherwise, $\hat{Y}$ the label predicted by a classifier and $Z$ the classifier used (1 for $C_1$ and 0 for $C_2$). We can define $P_1$ and $P_2$ as

$$P_1 = P(Y = 1 \mid \hat{Y} = 1, Z = 1) \qquad (4)$$

$$P_2 = P(Y = 1 \mid \hat{Y} = 1, Z = 0) \qquad (5)$$

i.e. the probability that the true label is positive conditioned on a positive predicted label, for classifier one and classifier two respectively.

Leisenring et al. (2000) used methods based on Generalized Estimating Equations (GEE) (Liang and Zeger, 1986) by restructuring the performance data of a medical diagnostic test to fit a marginal regression framework. We shall do the same; in our case, the restructured data will have one row per classifier prediction. For a two classifier comparison we will have two rows per observation. Implementation details are given in Algorithm 1. A brief primer on the marginal regression framework and GEE estimation is given in the appendix.

As the dataset has repeated observations (multiple rows per data point), we use a GEE based Generalized Linear Model (GLM) with a logit link,

$$\mathrm{logit}\, P(Y = 1 \mid \hat{Y} = 1, Z) = \beta_0 + \beta_1 Z \qquad (6)$$

Parameter estimates and associated standard errors are calculated using robust sandwich variance estimates (Huber, 1967; White, 1980; Liang and Zeger, 1986).

Algorithm 1: GEE based comparison using logistic regression
Input:
    k: number of classes
    c: number of classifiers to be compared
    y: outcome/target class
    ŷ: predicted class
Procedure:
    Generate an id variable for each observation
    for classifier in 1:c do
        Save prediction as a
        Save classifier name and y as b
        Merge a and b horizontally
    end for
    Stack the c datasets vertically to generate a single dataset d
    for i in 1:k do
        Subset dataset d to rows with ŷ = i
        Fit a GEE based GLM with binomial family, logit link and independent working correlation matrix as shown in (6)
        Save required parameters
    end for
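The data restructuring steps of Algorithm 1 can be sketched as follows. This is a minimal illustration rather than the authors' R implementation; the function names, the dictionary-based row layout and the toy data are assumptions. The model fitting step is left out here.

```python
def stack_predictions(y_true, predictions):
    """Build the long dataset d: one row per observation per classifier.

    predictions maps a classifier name to its list of predicted labels.
    """
    rows = []
    for name, y_pred in predictions.items():
        if len(y_pred) != len(y_true):
            raise ValueError("every classifier must score every observation")
        for obs_id, (truth, pred) in enumerate(zip(y_true, y_pred)):
            rows.append({"id": obs_id, "classifier": name,
                         "y": truth, "y_hat": pred})
    return rows


def subset_for_class(rows, cls):
    """Keep rows predicted as cls and add the binary outcome for the GLM."""
    return [dict(row, correct=int(row["y"] == cls))
            for row in rows if row["y_hat"] == cls]


# Toy data (illustrative only).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
predictions = {
    "C1": [1, 1, 1, 0, 1, 0, 0, 0],
    "C2": [1, 0, 0, 0, 1, 1, 0, 1],
}
d = stack_predictions(y_true, predictions)
class_1_rows = subset_for_class(d, cls=1)
print(len(d), len(class_1_rows))
```

The shared `id` column is what lets a GEE fit treat the rows of one observation as a cluster.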

For a two classifier comparison, the exponentiated coefficient of the classifier indicator in (6), $\exp(\beta_1)$, is the odds ratio of a true positive label given a positive predicted label from $C_1$ vs $C_2$. It describes the degree to which one classifier is more predictive of the true classification than the other.

An advantage of this method is the possibility of simultaneous comparison of multiple classifiers over multiple datasets. Model estimates such as odds ratios and related confidence intervals can be calculated for supplemental information.

2.2.1 Empirical Wald test

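After fitting our GLM (6), a Wald test can be used to test $H_0: \beta_1 = 0$, where $\beta_1$ is the coefficient of the classifier indicator. This is equivalent to $H_0: P_1 = P_2$, with the test statistic given as

$$W = \frac{\hat{\beta}_1^2}{\widehat{\mathrm{Var}}(\hat{\beta}_1)} \qquad (7)$$

where the denominator is the second (last) diagonal element of the empirical variance-covariance matrix. Under $H_0$, $W$ asymptotically follows a $\chi^2$ distribution with one degree of freedom.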
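For the two classifier case, the empirical Wald statistic can be sketched without a statistics package. The sketch rests on these assumptions: with an intercept and a single binary covariate, the independence-working-correlation GEE estimates coincide with the per-classifier logits; the robust variance is the usual cluster-summed sandwich with no small-sample correction; and both precision estimates lie strictly between 0 and 1. The function name and row layout are hypothetical.

```python
import math
from collections import defaultdict


def empirical_wald(rows):
    """rows: (obs_id, z, y) tuples for positively predicted rows only.

    z is 1 for the first classifier and 0 for the second; y is 1 when the
    true label is positive. Returns (beta1, var_beta1, wald, p_value).
    """
    rows = list(rows)
    n, positives = [0, 0], [0, 0]
    for _, z, y in rows:
        n[z] += 1
        positives[z] += y
    if 0 in n:
        raise ValueError("each classifier needs at least one positive prediction")
    p = [positives[z] / n[z] for z in (0, 1)]
    if not all(0.0 < q < 1.0 for q in p):
        raise ValueError("precision of 0 or 1 makes the logit undefined")
    beta1 = math.log(p[1] / (1 - p[1])) - math.log(p[0] / (1 - p[0]))

    # Bread: sum of x x^T * p(1 - p) with x = (1, z); only its inverse's
    # second row is needed for Var(beta1).
    w = [n[z] * p[z] * (1 - p[z]) for z in (0, 1)]
    a00, a01, a11 = w[0] + w[1], w[1], w[1]
    det = a00 * a11 - a01 * a01
    inv_row = (-a01 / det, a00 / det)

    # Meat: score contributions summed within each observation (cluster).
    scores = defaultdict(lambda: [0.0, 0.0])
    for obs_id, z, y in rows:
        resid = y - p[z]
        scores[obs_id][0] += resid
        scores[obs_id][1] += z * resid
    var_beta1 = sum((inv_row[0] * u0 + inv_row[1] * u1) ** 2
                    for u0, u1 in scores.values())

    wald = beta1 ** 2 / var_beta1
    p_value = math.erfc(math.sqrt(wald / 2.0))  # chi-square, 1 df, upper tail
    return beta1, var_beta1, wald, p_value


# Toy data (illustrative only): keep rows where the prediction is positive.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
pred_c1 = [1, 1, 1, 0, 1, 0, 0, 0]
pred_c2 = [1, 0, 0, 0, 1, 1, 0, 1]
rows = [(i, z, y_true[i])
        for z, pred in ((1, pred_c1), (0, pred_c2))
        for i in range(len(y_true)) if pred[i] == 1]
beta1, var_beta1, wald, p_value = empirical_wald(rows)
print(beta1, var_beta1, wald, p_value)
```

In this toy example the clustered variance is smaller than the value obtained by treating all rows as independent, which is the effect of accounting for positively correlated predictions.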

Reformulation of empirical Wald test statistic was given in Kosinski (2013) as


where is the estimated regression parameter from (6), are totals from Table 1 and 2.

It is to be noted that GEE based Wald test statistic is similar to multinomial Wald statistic as . This relation is further expanded on in appendix.

2.2.2 Score test

Score test statistic based on GEE was given in Leisenring et al. (2000) as


where is number of positive predicted labels for observation . In a two classifier setting, is the indicator variable for correct predictions. Also




is the number of observations with at least one true predicted label and is number of true predicted labels for observation.

Simple reformulation of complicated (9) was given by Kosinski (2013) as




and is pooled precision, estimated from Table 1 and 2 as


2.3 Methods based on relative precision

Table 1 and 2 can be restructured as Table 3

True label=0 True label=1
=1 =0 =1 =0
Table 3: True label vs predicted label for and

and estimated precision for and is now rewritten as


Relative precision (RP) is defined as

$$RP = \frac{P_1}{P_2}$$

Using log transformation, variance of is estimated with . Where


confidence intervals are then constructed as


(19) is exponentiated to obtain the upper and lower limits of $RP$. Confidence intervals from (19) can be used to test

$$H_0: RP = 1 \quad \text{vs} \quad H_1: RP \neq 1$$

$H_0$ is rejected if the lower confidence limit of $RP$ is greater than 1 or the upper confidence limit is less than 1.
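The decision rule above can be sketched in a few lines. The function names are hypothetical, and the standard error of $\log RP$ is taken as an input because its formula is not reproduced here; the default normal quantile of 1.959964 corresponds to a 95% interval.

```python
import math


def relative_precision(p1, p2):
    """Ratio of two precision estimates."""
    if p2 <= 0:
        raise ValueError("the reference precision must be positive")
    return p1 / p2


def rp_confidence_interval(rp, se_log_rp, z=1.959964):
    """Build the interval on the log scale, then exponentiate it."""
    centre = math.log(rp)
    return math.exp(centre - z * se_log_rp), math.exp(centre + z * se_log_rp)


def rejects_equal_precision(lcl, ucl):
    """True when the confidence interval for RP excludes 1."""
    if lcl > ucl:
        raise ValueError("lower limit must not exceed upper limit")
    return lcl > 1.0 or ucl < 1.0


# Precision values 0.65 and 0.74 are those reported for class N of Wilt.
rp = relative_precision(0.65, 0.74)
print(round(rp, 2), rejects_equal_precision(0.85, 0.92))
```

An interval such as (0.92, 1.10) contains 1, so `rejects_equal_precision(0.92, 1.10)` returns `False`.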

3 Empirical evaluation

3.1 Experimental setup

To demonstrate the feasibility of applying the methods described in this paper, we used publicly available datasets from the UCI machine learning repository (Blake and Merz, 1998) with varying characteristics and sample sizes, as shown in Table 4.

| Dataset | Instances | Attributes | Classes |
|---|---|---|---|
| Wilt | 4889 | 6 | 2 |
| Diabetic Retinopathy | 1151 | 20 | 2 |
| Phishing | 2456 | 30 | 2 |
| Bank note | 1372 | 5 | 2 |
| Magic | 19020 | 11 | 2 |
| Urban land cover | 675 | 148 | 9 |

Table 4: Datasets used for evaluation

If a dataset was not already partitioned, a training-test split of 70%-30% was used. Although most evaluations were performed using fixed training and test splits, the same procedures can be adapted when using cross validation, as shown in the following sections.

Random forest (Breiman, 2001) with 1000 trees and Naive Bayes were used for initial comparisons. All comparisons were implemented in R (R Core Team, 2015; Halekoh et al., 2006; Stock and Hielscher, 2013; Hongying Dai and Cui, 2014). Sample code is made publicly available.

3.2 Comparison using Generalized Score and Empirical Wald test

Per-class precision was calculated for all datasets using random forest and Naive Bayes. Values were then compared using Generalized Score (GS) and Wald test statistic (GW). Results are shown in Table 5.


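| Dataset | Class | NB | RF | P-GS | P-GW |
|---|---|---|---|---|---|
| Wilt | N | 0.65 | 0.74 | <0.0001 | <0.0001 |
| | W | 0.73 | 0.98 | 0.001 | 0.003 |
| Diab. Ret. | 0 | 0.54 | 0.63 | 0.002 | 0.0002 |
| | 1 | 0.76 | 0.67 | 0.07 | 0.09 |
| Phishing | -1 | 0.95 | 0.98 | <0.0001 | <0.0001 |
| | 1 | 0.92 | 0.96 | <0.0001 | <0.0001 |
| Bank note | 0 | 0.83 | 0.99 | <0.0001 | 0.0001 |
| | 1 | 0.85 | 0.98 | <0.0001 | <0.0001 |
| MAGIC | G | 0.72 | 0.88 | <0.0001 | <0.0001 |
| | H | 0.70 | 0.87 | <0.0001 | <0.0001 |
| Land Cover | Asp | 0.94 | 0.95 | 0.87 | 0.87 |
| | Bld | 0.91 | 0.85 | 0.04 | 0.04 |
| | Cr | 0.77 | 0.69 | 0.25 | 0.26 |
| | Cnr | 0.85 | 0.78 | 0.04 | 0.05 |
| | Grs | 0.76 | 0.75 | 0.77 | 0.77 |
| | Pl | 0.92 | 0.92 | 0.95 | 0.95 |
| | Shd | 0.82 | 0.79 | 0.48 | 0.48 |
| | Sl | 0.36 | 0.60 | 0.01 | 0.02 |
| | Tr | 0.70 | 0.86 | 0.0001 | 0.001 |

NB: Naive Bayes (precision), RF: Random forest (precision), P-GS: p-value from GS, P-GW: p-value from GW.

Table 5: Comparison of NB and RF using GS and GW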

Lower p-values (typically $< 0.05$) signify a statistically significant difference between the precision values of two classifiers. Results from the GS and GW statistics agree for all comparisons. GS has more power and performs better with small sample sizes compared to GW, as was also noted by Leisenring et al. (2000).

3.3 Comparison based on relative precision

Concerns around the use and misuse of p-values (Nickerson, 2000; Gill, 1999; Anderson et al., 2000) can be alleviated by using relative precision (RP) and related confidence intervals (CIs). If necessary, a p-value and a test statistic can be calculated as well. Comparison results using RP are shown in Table 6.

| Dataset | Class | RP (95% CI) | P-value |
|---|---|---|---|
| Wilt | N | 0.88 (0.85,0.92) | <0.0001 |
| | W | 0.75 (0.62,0.91) | 0.003 |
| Diab. Ret. | 0 | 0.85 (0.78,0.92) | 0.0001 |
| | 1 | 1.13 (0.99,1.29) | 0.06 |
| Phishing | -1 | 0.97 (0.96,0.98) | <0.0001 |
| | 1 | 0.96 (0.95,0.97) | <0.0001 |
| Bank note | 0 | 0.83 (0.79,0.88) | <0.0001 |
| | 1 | 0.87 (0.82,0.92) | <0.0001 |
| MAGIC | G | 0.82 (0.81,0.83) | <0.0001 |
| | H | 0.80 (0.77,0.83) | <0.0001 |
| Land Cover | Asp | 0.99 (0.92,1.10) | 0.87 |
| | Bls | 1.07 (1.0,1.15) | 0.04 |
| | Cr | 1.12 (0.92,1.35) | 0.26 |
| | Cnr | 1.1 (1.0,1.17) | 0.05 |
| | Grs | 1.02 (0.90,1.16) | 0.77 |
| | Pl | 1.01 (0.81,1.3) | 0.95 |
| | Shd | 1.03 (0.94,1.14) | 0.48 |
| | Sl | 0.6 (0.39,0.93) | 0.02 |
| | Tr | 0.81 (0.73,0.91) | 0.0002 |

RP: Relative precision, 95% CI: Confidence intervals, comparisons are based on ”

Table 6: Comparison of NB and RF using relative precision

Results using RP are in agreement with GW and GS. If using only RP and related confidence intervals, the standard statistical interpretation applies: CIs not including 1 indicate a statistically significant difference.

Another advantage of using RP is that it lends itself to a clear graphical representation of results. An example is shown in Figure 2, with results from Table 6 for the first five datasets plotted as a forest plot. Each box represents a point estimate, with extended lines representing 95% CIs. A reference line at 1 is plotted for visual inspection of a statistically significant difference. Confidence intervals not overlapping the reference line are considered significant.

Figure 2: Forest plot for relative comparison with 95% confidence intervals

3.4 Combining inference

Investigators are often interested in testing a global $H_0$ of an overall classifier comparison. The methods presented so far provide per-class granular control, but an overall comparison is still desired. The results in Table 5 and Table 6 replicate a multiple comparison scenario which, if not accounted for, can pose a challenge to controlling the classifier-wide Type I error rate. Common methods to adjust for multiple comparisons include Family Wise Error Rate (FWER) correction or controlling the False Discovery Rate (FDR). In this case, however, we also need to acknowledge dependence between p-values, resulting from the probable contribution of observations across classes. Hence, specially tailored methods to combine dependent p-values are needed.

For dependent p-values, the distribution of the combined test statistic does not have an explicit analytical form. It is approximated using a scaled version (Li et al., 2011) with a new distribution. The Satterthwaite method (Satterthwaite, 1946) was used by Hongying Dai and Cui (2014) to derive new degrees of freedom; the scaled test statistic in this case is an extension of Lancaster (1961) and is given as






is test statistic from Lancaster (1961) and takes correlated p-values into account. When the covariance is unknown, permutation or bootstrap methods can be used to simulate large enough sample (usually 1000) of p-values.

A much simpler method by Simes (Simes, 1986) can be used as well. For an ordered set of $m$ p-values, we have

$$p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$$

rejecting the global $H_0$ if $p_{(i)} \le i\alpha/m$ for at least one $i$. The global p-value is then given as $\min_i \{ m\, p_{(i)} / i \}$. Originally designed for independent p-values, it has been shown (Sarkar and Chang, 1997) to work well for positively correlated p-values.
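Simes' procedure is short enough to sketch directly. The function names are hypothetical; the two p-values in the example are the per-class GS values reported for the diabetic retinopathy data, used here only to illustrate the arithmetic.

```python
def simes_global_p(p_values):
    """Global p-value: the minimum over i of m * p_(i) / i."""
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    m = len(p_values)
    ordered = sorted(p_values)
    return min(m * p / rank for rank, p in enumerate(ordered, start=1))


def simes_rejects(p_values, alpha=0.05):
    """Reject the global null when some p_(i) <= i * alpha / m."""
    return simes_global_p(p_values) <= alpha


per_class = [0.0002, 0.07]
print(simes_global_p(per_class), simes_rejects(per_class))
```

The value this produces need not match a combined p-value obtained with a method designed for arbitrary dependence, since the two procedures make different assumptions.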

To demonstrate, we focus on combining inference from one type of test, and on the first five datasets, where we would expect a combined test to reject the global $H_0$. We used p-values from individual class comparisons using the GS test. The method from Hongying Dai and Cui (2014) was used to combine p-values, as a positive correlation cannot be guaranteed. Table 7 shows the results. As expected, the combined p-values are statistically significant at conventional significance levels, rejecting the global $H_0$ in concordance with the individual precision comparisons.

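| Dataset | Class | P-value | Combined P-value |
|---|---|---|---|
| Wilt | N | <0.0001 | <0.0001 |
| | W | 0.001 | |
| Diabetes | 0 | 0.0002 | 0.002 |
| | 1 | 0.07 | |
| Phishing | -1 | <0.0001 | <0.0001 |
| | 1 | <0.0001 | |
| Bank note | 0 | <0.0001 | <0.0001 |
| | 1 | <0.0001 | |
| MAGIC | G | <0.0001 | <0.0001 |
| | H | <0.0001 | |

Table 7: Test of global null hypothesis using combined p-values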

3.5 Multiple classifier comparison

The GEE based marginal regression framework can be used to compare multiple multiclass classifiers. With a logistic regression model, using the state-of-the-art classifier as the reference category, we can compare the performance of newly proposed models to the state of the art. This procedure is valuable in large scale testing where we want to compare tens of classifiers to select the best fit for a problem. We have used two datasets to show its feasibility. Random forest with 50 trees (RF2) and Support Vector Machines (SVM) were added, resulting in a four classifier comparison. Results are shown in Table 8. The p-values inform us whether at least one of the precision values is statistically significantly different from the others. The magnitude and direction of a difference can be estimated from the parameter estimates or odds ratios and related CIs.

| Dataset | Class | NB | RF1 | SVM | RF2 | P-value |
|---|---|---|---|---|---|---|
| Wilt | N | 0.65 | 0.74 | 0.68 | 0.74 | <0.0001 |
| | W | 0.73 | 0.98 | 1.0 | 0.95 | <0.0001 |
| Diabetic | 0 | 0.54 | 0.63 | 0.63 | 0.62 | 0.003 |
| | 1 | 0.76 | 0.67 | 0.72 | 0.66 | 0.12 |

Table 8: Multiple classifier comparison

| Class | Comparison | OR | LCL | UCL |
|---|---|---|---|---|
| N | RF1 vs NB | 1.52 | 1.34 | 1.73 |
| N | SVM vs NB | 1.12 | 1.03 | 1.22 |
| N | RF2 vs NB | 1.51 | 1.33 | 1.71 |

OR: odds ratio, LCL: Lower confidence limit, UCL: Upper confidence limit

Table 9: Multiple classifier comparison, odds ratio and 95% confidence intervals

Table 9 shows the comparison of four learning algorithms on one class of the Wilt dataset using odds ratios and related 95% confidence intervals. Only one class of one dataset is used for demonstration. These estimates provide a wealth of information. Inferential statements such as "Random forest with 1000 trees has 52% (95% CI: 34%, 73%) higher odds of correctly predicting non-wilted trees compared to Naive Bayes" can be made. An odds ratio greater than 1 indicates that the model being compared is performing better than the reference model.
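The percentages in the statement above follow directly from the odds ratio and its confidence limits. A minimal sketch, with a hypothetical helper name, using the RF1 vs NB row of Table 9:

```python
def odds_ratio_as_percent(odds_ratio):
    """Percent change in odds relative to the reference classifier."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return (odds_ratio - 1.0) * 100.0


# RF1 vs NB for class N of Wilt: estimate, lower and upper confidence limits.
estimate, lcl, ucl = 1.52, 1.34, 1.73
summary = tuple(round(odds_ratio_as_percent(v)) for v in (estimate, lcl, ucl))
print(summary)
```

A negative result would indicate lower odds than the reference classifier.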

3.6 Partial Bayesian update of precision

We introduce here a special case of Bayes' law, updating precision when class prevalence is known. As with most datasets used, it is understood that they are sampled from a larger population. When the prevalence of a class in the population is known, precision can be updated as follows

$$P_{\text{updated}} = \frac{Se \cdot \pi}{Se \cdot \pi + (1 - Sp)(1 - \pi)}$$

where $Se$ is the classifier sensitivity, $\pi$ is the population prevalence and $Sp$ is the classifier specificity. This update is well known in medical statistics. Precision can change substantially with a change in class prevalence (Altman and Bland, 1994).

Classic example is from medical diagnostics where disease prevalence in a population is known and it can be used to update precision. Another example can be in object detection, as in indoor vs. outdoor images where prevalence of certain objects would be greater indoors (Tables, chairs, kettle etc.) and some outdoors (cars, buses, traffic signs etc.).

This update can be applied to most scenarios. Even using a justifiable assumption should yield better population level estimates compared to a non-informative approach. An example is shown in Table 10 where we have used diabetic retinopathy dataset to calculate precision. Then it is updated using population prevalence from (Schneider and Süveges, 2004; Lee et al., 2015). Relative precision is the recommended method to compare updated precision values. Confidence intervals are suggested to be calculated using bootstrap methods. This update can still be used if prevalence is not known by substituting a normalized prevalence rate.

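| Classifier | Class | Empirical precision | Updated precision |
|---|---|---|---|
| NB | 0 | 0.54 | 0.35 |
| RF | 0 | 0.63 | 0.45 |
| NB | 1 | 0.76 | 0.87 |
| RF | 1 | 0.67 | 0.81 |

Empirical precision is the value estimated from the data; updated precision is based on population prevalence.

Table 10: Updating precision using class prevalence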
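The prevalence update is the standard positive predictive value formula and can be sketched as below. The function name and the example sensitivity, specificity and prevalence values are illustrative assumptions; they are not the inputs behind Table 10.

```python
def updated_precision(sensitivity, specificity, prevalence):
    """Precision implied by sensitivity, specificity and class prevalence."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive predictions are expected")
    return true_pos / (true_pos + false_pos)


# The same classifier looks very different at 50% and at 10% prevalence.
balanced = updated_precision(0.9, 0.9, 0.5)
rare = updated_precision(0.9, 0.9, 0.1)
print(round(balanced, 3), round(rare, 3))
```

The example shows how strongly precision depends on prevalence even when sensitivity and specificity are unchanged.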

3.7 Comparison using Cross Validation

So far, the methods described in this paper have only been applied to a fixed train-test split. In this section we show their applicability when using cross validation. We used a GEE based GLM on $k$-fold and $n$ times repeated $k$-fold cross validation. For $k$-fold CV, we used $k = 10$. For $n$ times repeated $k$-fold CV, a value of 10 was used for $n$, keeping $k$ fixed at 10. After saving predictions for each fold and for each classifier, the datasets are vertically stacked to generate a single dataset with multiple observations per record. Then a GEE based GLM is fitted. This is a slight modification of Algorithm 1, presented as Algorithm 2. Results are reported in Table 11 using the diabetic retinopathy dataset. Results from both cross validation variations agree with the results using a fixed train-test split, although cross validation based statistical comparisons have more power in a limited sample size setting. The same methods can be used for any resampling method used during CV.

| Class | 10-fold CV | 10 × 10-fold CV |
|---|---|---|
| Disease | 50.6 (0.0001) | 50.6 (0.0001) |
| Non-Disease | 10.2 (0.001) | 10.2 (0.001) |

Numbers outside parentheses are test statistics, with p-values inside

Table 11: Results from 10-fold CV and 10 × 10-fold CV
Algorithm 2: GEE based comparison using CV and logistic regression
Input:
    k: number of classes
    c: number of classifiers to be compared
    y: outcome/target class
    ŷ: predicted class
    f: number of folds
Procedure:
    Generate an id variable for each observation
    for classifier in 1:c do
        for fold in 1:f do
            Save prediction as a
            Save classifier name and y as b
            Merge a and b horizontally
        end for
        Stack the f datasets vertically
    end for
    Stack the c datasets vertically to generate a single dataset d
    for i in 1:k do
        Subset dataset d to rows with ŷ = i
        Fit a GEE based GLM with binomial family, logit link and independent working correlation matrix
        Save required parameters
    end for
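The stacking part of Algorithm 2 can be sketched as follows. Everything here is illustrative: the function names, the `fit_predict(train_idx, test_idx)` callback convention and the two trivial stand-in classifiers are assumptions, and the same splits are reused for every classifier so that predictions stay paired by record id.

```python
import random


def repeated_kfold_indices(n_obs, n_folds, n_repeats, seed=0):
    """Yield (repeat, fold, test_indices) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(n_repeats):
        order = list(range(n_obs))
        rng.shuffle(order)
        for fold in range(n_folds):
            yield rep, fold, order[fold::n_folds]


def stack_cv_predictions(y, classifiers, n_folds=10, n_repeats=1, seed=0):
    """Return one row per record, per repeat, per classifier."""
    splits = list(repeated_kfold_indices(len(y), n_folds, n_repeats, seed))
    rows = []
    for name, fit_predict in classifiers.items():
        for rep, fold, test_idx in splits:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict(train_idx, test_idx)
            for i, pred in zip(test_idx, preds):
                rows.append({"id": i, "classifier": name, "repeat": rep,
                             "fold": fold, "y": y[i], "y_hat": pred})
    return rows


# Stand-in classifiers (hypothetical): they ignore the training data.
y = [0, 1] * 10
classifiers = {
    "always_one": lambda train_idx, test_idx: [1] * len(test_idx),
    "always_zero": lambda train_idx, test_idx: [0] * len(test_idx),
}
d = stack_cv_predictions(y, classifiers, n_folds=5, n_repeats=3)
print(len(d))
```

Each record id then appears once per repeat for every classifier, which is the clustering structure the GEE fit relies on.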

3.8 Application to deep architectures

With the shift to deep learning, aided by the availability of better hardware and larger datasets, we are at a point where models are trained and tested on tens of thousands of objects. This large sample size makes even small differences statistically significant. Also, as new models are often proposed with claims of performing better than the state of the art, thorough comparisons are vital. The methods presented in this paper, applied on a per-class basis, can provide additional insights and can overcome some of these issues.

This section shows the application of precision based comparison to deep architectures. We use two modified versions of the deep convolutional network described in Simonyan and Zisserman (2014). For a simple demonstration we use the models to classify images of cats and dogs from the Kaggle dataset (cats vs dogs) (Kaggle). As the original model was trained on 1000 image classes including many instances of cats and dogs (Russakovsky et al., 2015), it has been modified to work as a binary classifier (Chollet). The two versions differed only in dropout rate: the first had a dropout rate of 0.5 and the second 0.7. Precision outcomes for both classes from both models are reported in Table 12. As expected, the overall accuracy of both models is very similar (90.76 and 90.88 respectively). However, the difference can be clearly seen in the class breakdown, where the two models perform better on different classes. This type of analysis can also be used to adjust hyper-parameters for optimal performance.

| Class | Model 1 | Model 2 | P-value |
|---|---|---|---|
| dogs | 0.926 | 0.916 | 0.17 |
| cats | 0.889 | 0.902 | 0.06 |

Values shown in the table above are precision values compared using the GS statistic

Table 12: Comparing deep architectures using precision

Although the scenario described above is overly simplistic, it demonstrates the usefulness of the presented methods for comparing state-of-the-art models. More complicated comparisons, such as multiple object detection or classification in images, can be implemented with similar ease.

4 Replicability

High replicability of a test statistic is vital. It not only facilitates reproducible research but is also an estimate of the degree to which random partitioning and other dataset features are related to test results. We focus on replicability as a function of test dataset proportion and overall sample size.

We use the replicability measure introduced by Bouckaert (Bouckaert and Frank, 2004; Bouckaert, 2004), based on the number of rejections of $H_0$, given as

$$R = \frac{\sum_{1 \le i < j \le n} I(e_i = e_j)}{n(n-1)/2}$$

where $e_i$ is the outcome of the $i$-th experiment and $n$ is the total number of experiments. $e_i$ is 1 if $H_0$ is accepted and 0 otherwise. $I$ is an indicator function which is 1 when its argument is true and 0 otherwise. The above statistic can also be calculated with a simpler formula

$$R = \frac{n_a(n_a - 1) + n_r(n_r - 1)}{n(n-1)}$$

where $n_a$ is the number of times $H_0$ is accepted and $n_r$ is the number of times it is rejected.
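Both forms of the replicability measure can be sketched and checked against each other. The function names are hypothetical; outcomes are coded 1 when the null hypothesis is accepted and 0 when it is rejected.

```python
from itertools import combinations


def replicability_pairwise(outcomes):
    """Share of experiment pairs that reach the same decision."""
    n = len(outcomes)
    if n < 2:
        raise ValueError("at least two experiments are required")
    agree = sum(1 for first, second in combinations(outcomes, 2)
                if first == second)
    return agree / (n * (n - 1) / 2)


def replicability_counts(outcomes):
    """Same measure computed from acceptance and rejection counts."""
    n = len(outcomes)
    if n < 2:
        raise ValueError("at least two experiments are required")
    accepted = sum(1 for outcome in outcomes if outcome == 1)
    rejected = n - accepted
    return (accepted * (accepted - 1) + rejected * (rejected - 1)) / (n * (n - 1))


outcomes = [1, 1, 1, 0]
print(replicability_pairwise(outcomes), replicability_counts(outcomes))
```

The measure equals 1 when every experiment reaches the same decision, regardless of whether that decision is to accept or to reject.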

Figure 3: Reproducibility as a function of overall sample size and test set proportion for Diabetic retinopathy data (initial p-value: 0.0002)

We used the GS statistic and an initial sample size of 600 from the "Diabetic retinopathy" dataset, with varying test set proportions and initial sample sizes. Class "0" was used for this test. The dataset and class were chosen because the p-value was neither too small nor too large to skew replicability outcomes, as highly significant or non-significant p-values tend to be more replicable than marginally significant ones.

Results are shown in Figure 3. It can be seen that using the full dataset, high replicability is maintained even when the test set size is just 20%. Replicability deteriorates with decreasing sample size. Note that when using 40% of the original dataset, replicability falls sharply and then increases with decreasing test set proportion. This is due to an increased number of failures to reject $H_0$, which boosts replicability but with the opposite decision.

5 Conclusion and Recommendations

While the machine learning literature is rich with evaluations and recommendations for statistical tests to compare classifiers based on classification accuracy, AUC, F-measure etc., it lacked a detailed study of statistical tests that can be used to compare classifiers based on precision or recall alone. These are important performance metrics, especially for rare event classifiers. In this paper we have reviewed statistical methods based on the marginal regression framework and relative precision. These can be used for classifier comparison using correlated precision values. We have presented an empirical evaluation and the implementation feasibility of these methods. As precision is usually calculated per class, methods are presented to combine p-values for an overall classifier comparison. When class prevalence is known, a partial Bayesian update to precision is introduced. We have shown that the methods can be used in a cross validation setting and applied to compare deep architectures.

We recommend using the GS statistic or RP for comparing two classifiers. Users concerned about the use or misuse of p-values in statistical tests should use RP, as conclusions can be based solely on the RP value and its CIs. To simultaneously compare multiple classifiers, we recommend using a GLM with GEE. Dai's method is recommended over Simes' method for combining dependent p-values, as it retains appropriate power even when p-values are not positively correlated. Whenever possible, it is also recommended to use updated precision based on population prevalence.

Appendix A Multinomial Wald test is similar to GEE based empirical Wald test

Multinomial Wald statistic for comparing precision with logit transformed values is given assuming cells in Table 1 and 2 to be multinomially distributed. Using the delta method, Wald statistic for testing is given as


For and estimated covariance matrix of and , we can rewrite (31) as




and is number of times both classifiers predicted correct (positive concordance) and when both were wrong (negative concordance). We refer readers to Kosinski (2013) for derivation of covariance matrix.

for , we have



Equivalency of 34 and 8 can be seen when substituting


Appendix B Generalized Estimating Equations (GEE)

If we have dependent sampling, i.e. repeated measures, matched pairs etc., we need specialized modelling approaches to account for correlation as simpler models work on an assumption of independent response. In our case this dependence arises when we build multiple models on same dataset. To model our data to test for differences in precision values using Generalized Linear Models(GLM) or more specific logistic regression, we will use Generalized Estimating Equations, which is an extension of GLMs, or specifically an estimating framework when responses are not independent. Regression models estimated using GEE are often referred to as marginal regression models or population averaged models i.e. inferences are made about population averages and term ”marginal” signifies that mean response is modelled conditional only on covariates and not on other responses or random effects.

Basic idea of GEE is to model the mean response treating within observation correlation structure as a nuisance parameter. In this framework, we don’t need to correctly specify correlation structure to get reasonable estimates for parameter coefficients and standard errors (both needed to calculate p-values for comparison and to get magnitude of difference in our case). Main difference between a GLM in independent observations scenario and a GEE based GLM is the need of modelling covariance structure of correlated responses. Then the model is estimated using quasi-likelihood rather than maximum likelihood.

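Quasi-likelihood estimators are solutions of quasi-likelihood equations known as Generalized Estimating Equations. There is no closed form solution in general, so estimation is done using an iterative process. Standard errors can be calculated using the sandwich estimator, given as

$$\widehat{\mathrm{Var}}(\hat{\beta}) = B^{-1} M B^{-1}, \qquad B = \sum_i D_i^\top V_i^{-1} D_i, \qquad M = \sum_i D_i^\top V_i^{-1} (y_i - \hat{\mu}_i)(y_i - \hat{\mu}_i)^\top V_i^{-1} D_i$$

where, for observation (cluster) $i$, $y_i$ is the response vector, $\hat{\mu}_i$ is the fitted mean, $D_i = \partial \mu_i / \partial \beta$ and $V_i$ is the working covariance matrix.

Full details of GEE and GLMs are out of scope for this paper. We refer readers to Agresti and Kateri (2011) for further details. However complex it may sound, GEE is implemented out of the box in most major statistical packages.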


  • Agresti and Kateri (2011) Alan Agresti and Maria Kateri. Categorical data analysis. Springer, 2011.
  • Altman and Bland (1994) Douglas G Altman and J Martin Bland. Statistics notes: Diagnostic tests 2: predictive values. BMJ, 309(6947):102, 1994.
  • Anderson et al. (2000) David R Anderson, Kenneth P Burnham, and William L Thompson. Null hypothesis testing: problems, prevalence, and an alternative. The journal of wildlife management, pages 912–923, 2000.
  • Aslan et al. Ozlem Aslan, Olcay Taner Yıldız, and Ethem Alpaydın. Statistical comparison of classifiers using area under the ROC curve.
  • Benavoli et al. (2014) Alessio Benavoli, Giorgio Corani, Francesca Mangili, Marco Zaffalon, and Fabrizio Ruggeri. A bayesian wilcoxon signed-rank test based on the dirichlet process. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1026–1034, 2014.
  • Blake and Merz (1998) Catherine Blake and Christopher J Merz. UCI repository of machine learning databases. 1998.
  • Bouckaert (2004) Remco R Bouckaert. Estimating replicability of classifier learning experiments. In Proceedings of the twenty-first international conference on Machine learning, page 15. ACM, 2004.
  • Bouckaert and Frank (2004) Remco R Bouckaert and Eibe Frank. Evaluating the replicability of significance tests for comparing learning algorithms. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 3–12. Springer, 2004.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • (10) Francois Chollet. Building powerful image classification models using very little data. Accessed: 2016-07-29.
  • Demšar (2006) Janez Demšar. Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7:1–30, 2006.
  • Dietterich (1998) Thomas G Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
  • Gill (1999) Jeff Gill. The insignificance of null hypothesis significance testing. Political Research Quarterly, 52(3):647–674, 1999.
  • Halekoh et al. (2006) Ulrich Halekoh, Søren Højsgaard, and Jun Yan. The R package geepack for generalized estimating equations. Journal of Statistical Software, 15(2):1–11, 2006.
  • Hongying Dai and Cui (2014) J Hongying Dai and Yuehua Cui. A modified generalized fisher method for combining probabilities from dependent tests. Frontiers in genetics, 5, 2014.
  • Huber (1967) Peter J Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 221–233, 1967.
  • Joshi (2002) Mahesh V Joshi. On evaluating performance of classifiers for rare classes. In Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on, pages 641–644. IEEE, 2002.
  • (18) Kaggle. Kaggle dogs vs cats competition. Accessed: 2016-07-29.
  • Kosinski (2013) Andrzej S Kosinski. A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Statistics in medicine, 32(6):964–977, 2013.
  • Lancaster (1961) HO Lancaster. The combination of probabilities: an application of orthonormal functions. Australian Journal of Statistics, 3(1):20–33, 1961.
  • Lee et al. (2015) Ryan Lee, Tien Y Wong, and Charumathi Sabanayagam. Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye and Vision, 2(1):1, 2015.
  • Leisenring et al. (2000) Wendy Leisenring, Todd Alonzo, and Margaret Sullivan Pepe. Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics, 56(2):345–351, 2000.
  • Li et al. (2011) Shaoyu Li, Barry L Williams, and Yuehua Cui. A combined p-value approach to infer pathway regulations in eqtl mapping. Statistics and Its Interface, 4:389–401, 2011.
  • Liang and Zeger (1986) Kung-Yee Liang and Scott L Zeger. Longitudinal data analysis using generalized linear models. Biometrika, pages 13–22, 1986.
  • McNemar (1947) Quinn McNemar. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2):153–157, 1947.
  • Nadeau and Bengio (2003) Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Machine Learning, 52(3):239–281, 2003.
  • Nickerson (2000) Raymond S Nickerson. Null hypothesis significance testing: a review of an old and continuing controversy. Psychological methods, 5(2):241, 2000.
  • R Core Team (2015) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2015.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Sarkar and Chang (1997) Sanat K Sarkar and Chung-Kuei Chang. The simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440):1601–1608, 1997.
  • Satterthwaite (1946) Franklin E Satterthwaite. An approximate distribution of estimates of variance components. Biometrics bulletin, 2(6):110–114, 1946.
  • Schneider and Süveges (2004) Miklós Schneider and Ildikó Süveges. Retinopathia diabetica: magyarországi epidemiológiai adatok. Szemészet, 141:441–444, 2004.
  • Simes (1986) R John Simes. An improved bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754, 1986.
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Stock and Hielscher (2013) C Stock and T Hielscher. Dtcompair: comparison of binary diagnostic tests in a paired study design. R package version, 1, 2013.
  • White (1980) Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica: Journal of the Econometric Society, pages 817–838, 1980.
  • Wilcoxon (1945) Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80–83, 1945.