1 Introduction
In a prevalence estimation problem, one is presented with a sample of unlabelled instances (the test sample) and is asked to estimate the distribution of the labels in the sample. If the problem sits in a binary (two-class) context, all instances belong to exactly one of two possible classes and, accordingly, can be labelled either positive or negative. The distribution of the labels is then characterised by the prevalence (i.e. proportion) of the positive labels (‘class prevalence’ for short) in the test sample. However, the labels are latent at estimation time, so the class prevalence cannot be determined by simple inspection of the labels. Instead, the class prevalence can only be inferred from the features of the instances in the sample, i.e. from observable covariates of the labels. The relationship between features and labels must be learnt from a training sample of labelled instances in a separate step before the class prevalence in the test sample can be estimated.
This whole process is called ‘supervised prevalence estimation’ (Barranquero et al., 2013), ‘quantification’ (Forman, 2008), ‘class distribution estimation’ (González-Castro et al., 2013) or ‘class prior estimation’ (Du Plessis et al., 2017) in the literature. See González et al. (2017) for a recent overview of the quantification problem and approaches to deal with it. The emergence of further recent papers with new proposals of prevalence estimation methods suggests that the subject is still of high interest for both researchers and practitioners (Castaño et al., 2018; Keith and O’Connor, 2018; Maletzke et al., 2019; Vaz et al., 2019).
A variety of different methods for prevalence point estimation have been proposed and a considerable number of comparative studies of such methods have been published in the literature (González et al., 2017). But the question of how to construct confidence and prediction intervals for class prevalences seems to have attracted less attention. Hopkins and King (2010) routinely provided confidence intervals for their estimates “via standard bootstrapping procedures”, without commenting much on details of the procedures or on any issues encountered with them. Keith and O’Connor (2018) proposed and compared a number of methods for constructing such confidence intervals. Some of these methods involve Monte Carlo simulation and some do not. Daughton and Paul (2019) also proposed a new method for constructing bootstrap confidence intervals and compared its results with the confidence intervals based on popular prevalence estimation methods. Vaz et al. (2019) introduced the ‘ratio estimator’ for class prevalences and used its asymptotic properties for determining confidence intervals without involving Monte Carlo techniques.
This paper presents a simulation study that seeks to illustrate some observations from these previous papers on confidence intervals for class prevalences in the binary case and to provide answers to some questions raised by them:

Would it be worthwhile to distinguish confidence and prediction intervals for class prevalences and deploy different methods for their estimation? This question is raised against the backdrop that, for instance, Keith and O’Connor (2018) talked about estimating confidence intervals but in fact constructed prediction intervals, which are conceptually different (Meeker et al., 2017).

Would it be worthwhile to base class prevalence estimation on more accurate classifiers? The background for this question is a set of conflicting statements in the literature as to the benefit of using accurate classifiers for prevalence estimation. On the one hand, Forman (2008, p. 168) stated: “A major benefit of sophisticated methods for quantification is that a much less accurate classifier can be used to obtain reasonably precise quantification estimates. This enables some applications of machine learning to be deployed where otherwise the raw classification accuracy would be unacceptable or the training effort too great.” On the other hand, as an example of the opposite position, Barranquero et al. (2015, p. 595) commented with respect to prevalence estimation: “We strongly believe that it is also important for the learner to consider the classification performance as well. Our claim is that this aspect is crucial to ensure a minimum level of confidence for the deployed models.”

Which prevalence estimation methods show the best performance with respect to the construction of as short as possible confidence intervals for class prevalences?

Do non-simulation approaches to the construction of confidence intervals for class prevalences work?
In addition, this paper introduces two new methods for class prevalence estimation which are specifically designed for delivering as short as possible confidence intervals.
Deploying a simulation study for finding answers to the above questions has some advantages compared to working with real-world data:

The true class prevalences are known and can even be chosen with a view to facilitate obtaining clear answers.

The setting of the study can be freely modified, say with regard to sample sizes or accuracy of the involved classifiers, in order to investigate the topics in question more precisely.

In a simulation study, it is easy to apply an ablation approach to assess the relative impact of factors that influence the performance of methods for estimating confidence intervals.

The results can be easily replicated.

Simulation studies are good for delivering counter-examples. A method performing poorly in the study reported in this paper may be considered unlikely to perform much better in complex real-world settings.
Naturally, these advantages are bought at the cost of accepting certain obvious drawbacks:

Most findings of the study are suggestive and illustrative only. No firm conclusions can be drawn from them.

Important features of the problem which only occur in real-world situations might be overlooked.

The prevalence estimation problem is primarily caused by data set shift. For capacity reasons, the scope of the simulation study in this paper is restricted to prior probability shift, a special type of data set shift. (In the literature, prior probability shift is known under a number of different names, for instance ‘target shift’ (Zhang et al., 2013), ‘global drift’ (Hofer and Krempl, 2013), or ‘label shift’ (Lipton et al., 2018). See Moreno-Torres et al. (2012) for a categorisation of types of data set shift.)
With these qualifications in mind, the main findings of this paper can be summarised as follows:

Extra efforts to construct prediction intervals instead of confidence intervals for class prevalences appear to be unnecessary.

‘Error Adjusted Bootstrapping’ as proposed by Daughton and Paul (2019) for the construction of prevalence confidence or prediction intervals may fail in the presence of prior probability shift.

Deploying more accurate classifiers for class prevalence estimation results in shorter confidence intervals. (In this paper, the term ‘discriminatory power’ is also used instead of ‘accuracy’, and the adjective ‘powerful’ instead of ‘accurate’.)

Compared to the other estimation methods considered in this paper, straightforward ‘adjusted classify & count’ methods for prevalence estimation (Forman, 2008, called ‘confusion matrix method’ in Saerens et al., 2001) without any further tuning produce the longest confidence intervals and hence, given identical coverage, perform worst. Methods based on minimisation of the Hellinger distance (González-Castro et al., 2013, with different numbers of bins) produce much shorter confidence intervals, but sometimes do not guarantee sufficient coverage. The maximum likelihood approach (with bootstrapping for the confidence intervals) and ‘adjusted probabilistic classify & count’ (Bella et al., 2010, called there ‘scaled probability average’) appear to stably produce the shortest confidence intervals among the methods considered in the paper.
The paper is organised as follows:

Section 3 ‘Results of the simulation study’ provides some tables with results of the study and comments on the results, in order to explore the questions stated above. Results in subsection 3.3 show that certain standard non-simulation approaches cannot take into account estimation uncertainty in the training sample and that bootstrap-based construction of confidence intervals could be used instead.

Section 4 ‘Conclusions’ wraps up and closes the paper.

In Appendix A ‘Particulars for the implementation of the simulation study’, the mathematical details needed for coding the simulation study are listed.
The calculations of the simulation study have been performed with the statistical software R (R Core Team, 2014). The R scripts utilised can be downloaded at URL https://www.researchgate.net/profile/Dirk_Tasche.
2 Setting of the simulation study
The setup of the simulation study is intended to reflect the situation that occurs when a prevalence estimation problem as described in Section 1 has to be solved:

There is a training sample of observations of features and class labels for instances (instances with label 0 belong to the negative class, instances with label 1 belong to the positive class). By assumption, this sample was generated from a joint distribution (the training population distribution) of the feature random variable and the label (or class) random variable.
There is a test sample of observations of features for instances. By assumption, each instance has a latent class label, and both the features and the labels were generated from a joint distribution (the test population distribution) of the feature random variable and the label random variable.
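The prevalence estimation or quantification problem is then to estimate the prevalence of the positive class labels in the test population. Of course, this is only a problem if there is data set shift, i.e. if the test population distribution differs from the training population distribution and, as a likely consequence, the positive class prevalence in the test population differs from the one in the training population.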
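This paper deals only with the situation where the training population distribution $P$ and the test population distribution $Q$ are related by prior probability shift, which means in mathematical terms that
$$P[X \in A \mid Y = y] \;=\; Q[X \in A \mid Y = y], \qquad y \in \{0, 1\}, \tag{2.1}$$
for all subsets $A$ of the feature space such that $P[X \in A \mid Y = y]$ and $Q[X \in A \mid Y = y]$ are well-defined. Here $X$ denotes the feature random variable and $Y$ the label random variable.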
2.1 The model for the simulation study
The classical binormal model with equal variances fits well into the prior probability shift setting for prevalence estimation of this paper. Kawakubo et al. (2016) used it as part of their experiments for comparing the performance of prevalence estimation methods. Logistic regression is a natural and optimal approach to the estimation of the binormal model with equal variances (Section 6.1, Cramer, 2003). Hence, when logistic regression is used for the estimation of the model in the simulation study, there is no need to worry about the results being invalidated by the deployment of a suboptimal regression or classification technique. The binormal model is specified by defining the two class-conditional feature distributions, for the negative and for the positive class respectively.
Training population distribution. Both class-conditional feature distributions under the training population distribution $P$ are normal, with equal variances, i.e.
$$P[X \in \cdot \mid Y = 0] = \mathcal{N}(\mu, \sigma^2), \qquad P[X \in \cdot \mid Y = 1] = \mathcal{N}(\nu, \sigma^2), \tag{2.2a}$$
with means $\mu < \nu$ and standard deviation $\sigma > 0$.
Test population distribution. Same as the training population distribution, with $P$ replaced by the test population distribution $Q$, in order to satisfy the assumption (2.1) on prior probability shift between training and test times.
For the sake of brevity, in the following the setting with (2.2a) for both the training and the test sample is referred to as ‘double’ binormal setting.
Given the class-conditional population distributions as specified in (2.2a), the unconditional training and test population distributions can be represented as
$$P[X \in \cdot] = p\,\mathcal{N}(\nu, \sigma^2) + (1-p)\,\mathcal{N}(\mu, \sigma^2), \qquad Q[X \in \cdot] = q\,\mathcal{N}(\nu, \sigma^2) + (1-q)\,\mathcal{N}(\mu, \sigma^2), \tag{2.2b}$$
with the positive class prevalences $p = P[Y = 1]$ and $q = Q[Y = 1]$ as parameters whose values in the course of the simulation study are selected depending on the purposes of the specific numerical experiments.
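The double binormal setting can be illustrated with a minimal Python sketch. It is not the R code used for the study: the function names and all numeric values (sample sizes, prevalences, means, standard deviation) are hypothetical choices for illustration, and, for simplicity, the labels of both samples are drawn at random here, whereas the study fixes the number of positive training instances in advance.

```python
import random

def draw_binormal_sample(size, prevalence, mean_neg, mean_pos, sd, rng):
    """Draw (feature, label) pairs from a two-component normal mixture
    with equal variances; label 1 marks the positive class."""
    sample = []
    for _ in range(size):
        label = 1 if rng.random() < prevalence else 0
        mean = mean_pos if label == 1 else mean_neg
        sample.append((rng.gauss(mean, sd), label))
    return sample

def class_mean(sample, label):
    """Average feature value of the instances carrying the given label."""
    values = [x for x, y in sample if y == label]
    return sum(values) / len(values)

rng = random.Random(2019)
# Hypothetical parameter values, chosen for illustration only.
# Prior probability shift: only the prevalence differs between the samples,
# the class-conditional feature distributions are identical.
train = draw_binormal_sample(5000, 0.5, 0.0, 1.0, 1.0, rng)
test = draw_binormal_sample(5000, 0.2, 0.0, 1.0, 1.0, rng)

train_prevalence = sum(y for _, y in train) / len(train)
test_prevalence = sum(y for _, y in test) / len(test)
```

Comparing `class_mean(train, 1)` with `class_mean(test, 1)` gives a quick check that the class-conditional distributions agree while the prevalences differ.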
Control parameters. For this paper’s numerical experiments, the values for the parametrisation of the model are selected from the ranges specified in the following list:

Prevalence of the positive class in the training population.

Size of the training sample. In the case of an infinite training sample, the training sample is considered identical with the training population and learning of the model is unnecessary. In the case of a finite training sample, the number of instances with positive labels is non-random in order to reflect the fact that, for model development purposes, a predefined stratification of the training sample might be desirable and can be achieved by undersampling of the majority class or by oversampling of the minority class. The training sample then consists of a subsample with positive labels and a subsample with negative labels, both of fixed size. Hence, for a finite training sample, the two subsample sizes add up to the training sample size, and the share of the positive subsample equals the training prevalence.

Prevalence of the positive class in the test population.

Size of the test sample. In the test sample, the number of instances with positive labels is random.

The population distribution underlying the features of the negativeclass training subsample is always with and . The population distribution underlying the features of the positiveclass training subsample is with and .

The population distribution underlying the features of the negativeclass test subsample is always with and . The population distribution underlying the features of the positiveclass test subsample is with and .

The number of simulation runs in all of the experiments is , i.e. times a training sample and a test sample as specified above are generated and subjected to some estimation procedures.

The number of bootstrap iterations where needed in any of the interval estimation procedures is always (Davison and Hinkley, 1997).

All confidence and prediction intervals are constructed at 90% confidence level.
Choosing the lower of the two values for the positive-class mean in one of the following simulation experiments reflects a situation where no accurate classifier can be found. This is suggested by the AUC (area under the curve) of the feature taken as a soft classifier, which in the binormal model with equal variances equals $\Phi\bigl((\nu - \mu)/(\sqrt{2}\,\sigma)\bigr)$, where $\Phi$ denotes the standard normal distribution function, $\mu$ and $\nu$ the negative-class and positive-class means, and $\sigma$ the common standard deviation. For the higher value, the same soft classifier is very accurate. The different performance of the classifier depending on the value of this parameter is also demonstrated in Figure 1 by the ROCs (receiver operating characteristics) corresponding to the two values (‘low power’ and ‘high power’).
For the sake of completeness, it is also noted that the feature-conditional class probability under the training population distribution is given by
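$$P[Y = 1 \mid X = x] \;=\; \frac{1}{1 + \exp(a\,x + b)}, \tag{2.3a}$$
with $a = \frac{\mu - \nu}{\sigma^2}$ and $b = \frac{\nu^2 - \mu^2}{2\,\sigma^2} + \log\frac{1-p}{p}$, where $\mu$ and $\nu$ are the negative-class and positive-class means, $\sigma$ is the common standard deviation and $p$ is the training prevalence of the positive class. For the density ratio under both the training and test population distributions one obtains (with $f_m$ denoting the density function of the one-dimensional normal distribution with mean $m$ and standard deviation $\sigma$)
$$\frac{f_\mu(x)}{f_\nu(x)} \;=\; \exp\Bigl(a\,x + \frac{\nu^2 - \mu^2}{2\,\sigma^2}\Bigr). \tag{2.3b}$$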
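The quantities discussed in this subsection (AUC of the feature, density ratio, feature-conditional class probability) follow from standard properties of the equal-variance binormal model and can be sketched in a few lines of Python. The function and argument names are hypothetical; the density ratio is taken here as negative-class density over positive-class density, which is an assumption about the orientation.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binormal_auc(mean_neg, mean_pos, sd):
    """AUC of the raw feature used as a score in the equal-variance binormal model."""
    return std_normal_cdf((mean_pos - mean_neg) / (math.sqrt(2.0) * sd))

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def density_ratio(x, mean_neg, mean_pos, sd):
    """Negative-class density divided by positive-class density, in closed form."""
    slope = (mean_neg - mean_pos) / sd ** 2
    offset = (mean_pos ** 2 - mean_neg ** 2) / (2.0 * sd ** 2)
    return math.exp(slope * x + offset)

def posterior_positive(x, prevalence, mean_neg, mean_pos, sd):
    """Probability of the positive class given feature value x (logistic in x)."""
    ratio = density_ratio(x, mean_neg, mean_pos, sd)
    return 1.0 / (1.0 + (1.0 - prevalence) / prevalence * ratio)
```

Because the exponent is linear in `x`, the posterior has exactly the logistic form, which is why logistic regression is a natural fit for this model.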
2.2 Methods for prevalence estimation considered in this paper
The following criteria have been applied for the selection of the methods deployed in the simulation study:

The methods must be Fisher consistent in the sense of Tasche (2017). This criterion excludes, for instance, ‘classify & count’ (Forman, 2008), the ‘Q-measure’ approach (Barranquero et al., 2013) and the distance-minimisation approaches based on the Inner Product, Kumar-Hassebrook, Cosine, and Harmonic Mean distances mentioned in Maletzke et al. (2019).
The methods should enjoy some popularity in the literature.

Two new methods based on already established methods and designed to minimise the lengths of confidence intervals are introduced and tested.
According to these criteria the following prevalence estimation methods have been included in the simulation study:

ACC50: Adjusted Classify & Count (ACC: Gart and Buck, 1966; Saerens et al., 2001; Forman, 2008), based on the Bayes classifier that maximises accuracy. ‘50’ because, if the Bayes classifier is represented by means of the posterior probability of the positive class and a threshold, the threshold has to be 50%.

ACCp: Adjusted Classify & Count, based on the Bayes classifier that maximises the difference of TPR (true positive rate) and FPR (false positive rate). ‘p’ because, if the Bayes classifier is represented by means of the posterior probability of the positive class and a threshold, the threshold needs to be the a priori probability (or prevalence) of the positive class in the training population. ACCp was called ‘method max’ in Forman (2008).

ACCv: New version of ACC where the threshold for the classifier is selected in such a way that the variance of the prevalence estimates is minimised among all ACC-type estimators based on classifiers represented by means of the posterior probability of the positive class and some threshold.

MS: ‘Median sweep’ as proposed by Forman (2008).

APCC: ‘Adjusted probabilistic classify & count’ (Bella et al., 2010, there called ‘scaled probability average’).

APCCv: New version of APCC where the a priori positive class probability parameter in the posterior positive class probability is selected in such a way that the variance of the prevalence estimates is minimised among all APCC-type estimators based on posterior positive class probabilities where the a priori positive class probability parameter varies between 0 and 1.

MLinf / MLboot: ML is the maximum likelihood approach to prevalence estimation (Peters and Coberly, 1976). Note that the EM (expectation maximisation) approach of Saerens et al. (2001) is one way to implement ML. ‘MLinf’ refers to construction of the prevalence confidence interval based on the asymptotic normality of the ML estimator (using the Fisher information for the variance). ‘MLboot’ refers to construction of the prevalence confidence interval solely based on bootstrap sampling.
For the readers’ convenience, the particulars needed to implement the methods in this list are presented in Appendix A. Note that ACC50, ACCp, ACCv, APCC and APCCv are all special cases of the ‘ratio estimator’ discussed in Vaz et al. (2019).
On the basis of the general asymptotic efficiency of maximum likelihood estimators (Theorem 10.1.12, Casella and Berger, 2002), the maximum likelihood approach for class prevalences is a promising approach for achieving minimum confidence interval lengths. In addition, the ML approach may be considered a representative of the class of entropy-related estimators and, as such, is closely related to the Topsøe approach, which was found to perform very well in Maletzke et al. (2019).
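As an illustration of the ACC-type and APCC-type estimators in the list above, the following Python sketch shows the two basic adjustment formulas in their textbook form. The function names are hypothetical, the rates are estimated from the training subsamples, and clipping the result to the unit interval is an assumption of this sketch rather than a documented detail of the study.

```python
def clip_to_unit_interval(value):
    return min(1.0, max(0.0, value))

def acc_estimate(test_scores, pos_train_scores, neg_train_scores, threshold):
    """Adjusted classify & count: correct the share of test instances
    classified as positive by the true and false positive rates."""
    def positive_rate(scores):
        return sum(1 for s in scores if s > threshold) / len(scores)
    tpr = positive_rate(pos_train_scores)
    fpr = positive_rate(neg_train_scores)
    if tpr == fpr:
        return None  # the adjustment is undefined, the estimate fails
    return clip_to_unit_interval((positive_rate(test_scores) - fpr) / (tpr - fpr))

def apcc_estimate(test_posteriors, pos_train_posteriors, neg_train_posteriors):
    """Adjusted probabilistic classify & count: the same adjustment applied
    to average posterior probabilities instead of classification rates."""
    def mean(values):
        return sum(values) / len(values)
    mean_pos = mean(pos_train_posteriors)
    mean_neg = mean(neg_train_posteriors)
    if mean_pos == mean_neg:
        return None
    return clip_to_unit_interval((mean(test_posteriors) - mean_neg) / (mean_pos - mean_neg))
```

With a threshold of 0.5 on the posterior probability, `acc_estimate` corresponds in spirit to ACC50; the variance-minimising threshold choice of ACCv is not covered by this sketch.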
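The EM route to the maximum likelihood estimate mentioned above (Saerens et al., 2001) can be sketched as follows. This is a minimal Python illustration under the assumption that the posterior probabilities under the training distribution are known exactly; the helper names, the starting value and all numeric settings are hypothetical.

```python
import math
import random

def posterior_positive(x, train_prevalence, mean_neg, mean_pos, sd):
    """Posterior probability of the positive class under the training distribution."""
    slope = (mean_neg - mean_pos) / sd ** 2
    offset = (mean_pos ** 2 - mean_neg ** 2) / (2.0 * sd ** 2)
    odds_neg = (1.0 - train_prevalence) / train_prevalence * math.exp(slope * x + offset)
    return 1.0 / (1.0 + odds_neg)

def em_prevalence(posteriors, train_prevalence, max_iter=1000, tol=1e-10):
    """EM iteration for the test prevalence: reweight the training posteriors
    by the current prevalence guess and average the adjusted posteriors."""
    p = train_prevalence
    q = p  # start from the training prevalence
    for _ in range(max_iter):
        total = 0.0
        for post in posteriors:
            weight_pos = q / p * post
            weight_neg = (1.0 - q) / (1.0 - p) * (1.0 - post)
            total += weight_pos / (weight_pos + weight_neg)
        new_q = total / len(posteriors)
        if abs(new_q - q) < tol:
            return new_q
        q = new_q
    return q

rng = random.Random(11)
# Hypothetical setting: training prevalence 0.5, test prevalence 0.2.
test_features = [rng.gauss(2.0, 1.0) if rng.random() < 0.2 else rng.gauss(0.0, 1.0)
                 for _ in range(4000)]
test_posteriors = [posterior_positive(x, 0.5, 0.0, 2.0, 1.0) for x in test_features]
estimate = em_prevalence(test_posteriors, 0.5)
```

In practice the posteriors would come from a fitted model such as logistic regression, which adds training-sample uncertainty that this sketch leaves out.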
2.3 Calculations performed in the simulation study
The calculations performed as part of the simulation study serve the purpose of providing facts for answers to the questions listed in Section 1 ‘Introduction’.
Calculations for constructing confidence intervals. For each simulation run, perform the following steps:

Create the training sample: Simulate the features of the positive instances from the positive class-conditional feature distribution and the features of the negative instances from the negative class-conditional feature distribution, with the predefined sizes of the two training subsamples.

Create the test sample: Simulate the number of positive instances as a binomial random variable with size equal to the test sample size and success probability equal to the test prevalence. Then simulate the features of the positive instances from the positive class-conditional feature distribution and the features of the negative instances from the negative class-conditional feature distribution. The information of whether a feature was sampled from the positive or the negative class-conditional distribution is assumed to be unknown in the estimation step. Therefore, the generated features are combined in a single sample.

Iterate the bootstrap procedure: Generate by stratified sampling with replacement bootstrap samples of features of positive instances and of features of negative instances from the respective training subsamples, and of features with unknown labels from the test sample. Calculate, based on the three resulting bootstrap samples, estimates of the positive class prevalence in the test population according to all the estimation methods listed in Section 2.2.

For each estimation method, the bootstrap procedure from the previous step creates a sample of estimates of the positive class prevalence. Based on this sample of estimates, construct confidence intervals at level for the positive class prevalence in the test population.
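The bootstrap steps above can be sketched in Python as follows. This is not the R implementation of the study: the names, the sample sizes, the number of bootstrap iterations and the threshold are hypothetical, a simple ACC estimator on the raw feature stands in for the estimation methods, and a nearest-rank quantile rule replaces the interpolation used by the ‘perc’ method of boot.ci.

```python
import random

def resample(sample, rng):
    """Sample with replacement, keeping the original sample size."""
    return rng.choices(sample, k=len(sample))

def nearest_rank_interval(values, level):
    """Equal-tailed percentile interval by a simplified nearest-rank rule."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    low_index = round(tail * (len(ordered) - 1))
    high_index = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

def acc_on_feature(pos_train, neg_train, test, threshold):
    """Adjusted classify & count with the raw feature as score."""
    def rate(values):
        return sum(1 for v in values if v > threshold) / len(values)
    tpr, fpr = rate(pos_train), rate(neg_train)
    if tpr == fpr:
        return None
    return min(1.0, max(0.0, (rate(test) - fpr) / (tpr - fpr)))

def bootstrap_interval(pos_train, neg_train, test, threshold, n_boot, level, rng):
    """Stratified bootstrap: both training subsamples and the test sample
    are resampled separately in every iteration."""
    estimates = []
    for _ in range(n_boot):
        estimate = acc_on_feature(resample(pos_train, rng), resample(neg_train, rng),
                                  resample(test, rng), threshold)
        if estimate is not None:
            estimates.append(estimate)
    return nearest_rank_interval(estimates, level)

rng = random.Random(1)
# Hypothetical setting for illustration only.
pos_train = [rng.gauss(2.0, 1.0) for _ in range(500)]
neg_train = [rng.gauss(0.0, 1.0) for _ in range(500)]
test = [rng.gauss(2.0, 1.0) if rng.random() < 0.2 else rng.gauss(0.0, 1.0)
        for _ in range(500)]
interval = bootstrap_interval(pos_train, neg_train, test, 1.0, 200, 0.9, rng)
```

Resampling the training subsamples as well as the test sample is what lets the interval reflect estimation uncertainty from both sources.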
Tabulated results of the simulation algorithm for confidence intervals.

For each estimation method, estimates of the positive class prevalence are calculated. From this set of estimates, the following summary results are derived and tabulated:

The average of the prevalence estimates.

The average absolute deviation of the prevalence estimates from the true prevalence parameter.

The percentage of simulation runs with failed prevalence estimates.

The percentage of estimates equal to 0 or 1.


For each estimation method, confidence intervals at level for the positive class prevalence are produced. From this set of confidence intervals, the following summary results are derived and tabulated:

The average length of the confidence intervals.

The percentage of confidence intervals that contain the true prevalence parameter (coverage rate).

For the construction of the bootstrap confidence intervals in Step 4 of the list of calculations, the method ‘perc’ (Davison and Hinkley, 1997, Section 5.3.1) of the function boot.ci of the R package ‘boot’ is used. More accurate methods for bootstrap confidence intervals are available, but these tend to require more computational time and to be less robust. Given that the performance of ‘perc’ in the setting of this simulation study can be controlled via checking the coverage rates, the loss in performance seems tolerable. In the cases where calculations have resulted in coverage rates of less than 85%, the calculations have been repeated with the ‘bca’ method (Davison and Hinkley, 1997, Section 5.3.2) of boot.ci in order to confirm the results.
Step 1 of the calculations can be omitted in the case of an infinite training sample, i.e. when the training sample is identical with the training population distribution. However, in this case some quantities of relevance for the estimates have to be pre-calculated before entering the loop for the simulation runs. The details of these pre-calculations are provided in Appendix A. Also in the case of an infinite training sample, for the prevalence estimation methods ACC50, ACCp, ACCv, APCC and APCCv, the bootstrap confidence intervals for the prevalences are replaced by “conservative binomial intervals” (Meeker et al., 2017, Section 6.2.2), computed with the ‘exact’ method of the R function binconf. Moreover, as explained in Section 2.2, in this case method MLinf is applied instead of MLboot for the construction of the maximum likelihood confidence interval.
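A conservative (‘exact’) binomial interval of the kind referred to above can be sketched in Python with the Clopper-Pearson construction. The code below is an illustration only: it solves the defining tail equations by bisection instead of using the quantile formulas behind binconf, the function names are hypothetical, and how the interval is subsequently mapped to a prevalence interval for each method is not covered.

```python
import math

def binomial_cdf(k, n, prob):
    """P[X <= k] for X binomial with n trials and success probability prob."""
    return sum(math.comb(n, j) * prob ** j * (1.0 - prob) ** (n - j)
               for j in range(k + 1))

def bisect_root(func, low=0.0, high=1.0, steps=60):
    """Root of a monotone function with a sign change on [low, high]."""
    f_low = func(low)
    for _ in range(steps):
        mid = 0.5 * (low + high)
        f_mid = func(mid)
        if (f_mid > 0) == (f_low > 0):
            low, f_low = mid, f_mid
        else:
            high = mid
    return 0.5 * (low + high)

def clopper_pearson(successes, trials, level):
    """Conservative two-sided interval for a binomial success probability."""
    alpha = 1.0 - level
    if successes == 0:
        lower = 0.0
    else:
        # P[X >= successes | prob] increases in prob; solve for alpha / 2.
        lower = bisect_root(
            lambda prob: 1.0 - binomial_cdf(successes - 1, trials, prob) - alpha / 2.0)
    if successes == trials:
        upper = 1.0
    else:
        # P[X <= successes | prob] decreases in prob; solve for alpha / 2.
        upper = bisect_root(
            lambda prob: binomial_cdf(successes, trials, prob) - alpha / 2.0)
    return lower, upper
```

The interval is called conservative because its coverage is at least the nominal level for every true success probability, at the price of being somewhat longer than approximate alternatives.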
As mentioned in Section 1 ‘Introduction’, one of the purposes of the simulation study is to illustrate the differences between confidence and prediction intervals. Conceptually, the difference may be described by their definitions as given in Meeker et al. (2017) (the symbol used by Meeker et al. (2017) in these definitions corresponds to a differently named symbol in this paper):

“A confidence interval for an unknown quantity may be formally characterized as follows: If one repeatedly calculates such intervals from many independent random samples, of the intervals would, in the long run, correctly include the actual value . Equivalently, one would, in the long run, be correct of the time in claiming that the actual value of is contained within the confidence interval.” (Meeker et al., 2017, Section 2.2.5)

“If from many independent pairs of random samples, a prediction interval is computed from the data of the first sample to contain the value(s) of the second sample, of the intervals would, in the long run, correctly bracket the future value(s). Equivalently, one would, in the long run, be correct of the time in claiming that the future value(s) will be contained within the prediction interval.” (Meeker et al., 2017, Section 2.3.6)
In order to construct prediction intervals instead of confidence intervals in the simulation runs, Step 4 of the calculations is modified as follows:

For each estimation method, the bootstrap procedure from the previous step creates a sample of estimates of the positive class prevalence. For each estimate, generate a virtual number of realisations of positive instances by simulating an independent binomial variable with size equal to the test sample size and success probability given by the estimate. Divide these virtual numbers by the test sample size to obtain (for each estimation method) a sample of relative frequencies of positive labels. Based on this additional sample of relative frequencies, construct prediction intervals at the chosen confidence level for the percentage of instances with positive labels in the test sample.
As in the case of the construction of confidence intervals, for the construction of the prediction intervals again the method ‘perc’ of the function boot.ci of the R package ‘boot’ is deployed.
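The modified step for prediction intervals can be sketched in Python as below. The bootstrap estimates are replaced by a hypothetical list of numbers, the test sample size of 50 is an arbitrary illustrative choice, and the nearest-rank quantile rule is a simplification of the ‘perc’ method.

```python
import random

def binomial_draw(size, prob, rng):
    """Number of successes in `size` independent trials with probability `prob`."""
    return sum(1 for _ in range(size) if rng.random() < prob)

def virtual_relative_frequencies(bootstrap_estimates, test_size, rng):
    """One independent binomial draw per bootstrap estimate, divided by the test size."""
    return [binomial_draw(test_size, estimate, rng) / test_size
            for estimate in bootstrap_estimates]

def nearest_rank_interval(values, level):
    """Equal-tailed percentile interval by a simplified nearest-rank rule."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    low_index = round(tail * (len(ordered) - 1))
    high_index = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

rng = random.Random(7)
# Hypothetical bootstrap estimates standing in for the output of the previous step.
bootstrap_estimates = [0.15 + 0.001 * i for i in range(100)]
frequencies = virtual_relative_frequencies(bootstrap_estimates, 50, rng)
prediction_interval = nearest_rank_interval(frequencies, 0.9)
confidence_interval = nearest_rank_interval(bootstrap_estimates, 0.9)
```

The extra binomial draw adds sampling noise on top of the estimation uncertainty, so the prediction interval will typically be wider than the confidence interval built from the same estimates, and the more so the smaller the test sample.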
Tabulated results of the simulation algorithm for prediction intervals.

For each estimation method, virtual relative frequencies of positive labels in the test sample are simulated under the assumption that the estimated positive class prevalence equals the true prevalence. From this set of frequencies, the following summary results are derived and tabulated:

The average of the virtual relative frequencies.

The average absolute deviation of the virtual relative frequencies from the true prevalence parameter.

The percentage of simulation runs with failed prevalence estimates and hence also failed simulations of virtual relative frequencies of positive labels.

The percentage of virtual relative frequencies equal to 0 or 1.


For each estimation method, prediction intervals at level for the realised relative frequencies of positive labels are produced. From this set of prediction intervals, the following summary results are derived and tabulated:

The average length of the prediction intervals.

The percentage of prediction intervals that contain the true relative frequencies of positive labels (coverage rate).

Table 1: Explanation of the row names in the result tables.

Row name  Explanation

‘Av prev’  Average of the prevalence estimates (for confidence intervals)
‘Av freq’  Average of the relative frequencies of simulated positive class labels (for prediction intervals)
‘Av abs dev’  Average of the absolute deviation of the prevalence estimates or the simulated relative frequencies from the true prevalence
‘Perc fail est’  Percentage of simulation runs with failed prevalence estimates
‘Av int length’  Average of the confidence or prediction interval lengths
‘Coverage’  Percentage of confidence intervals containing the true prevalence or of prediction intervals containing the true realised relative frequencies of positive labels
‘Perc 0 or 1’  Percentage of prevalence estimates or simulated frequencies with value 0 or 1
3 Results of the simulation study
All simulation procedures are performed with parameter setting , and (see Section 2.1 for the complete list of control parameters). At each table in the following, the values selected for the remaining control parameters are listed in the captions or within the table bodies.
In all the simulation procedures run for this paper, the boot.ci method for determining the statistical intervals (both confidence and prediction) has been ‘perc’. In cases where the coverage found with ‘perc’ is significantly lower than 90% (given the number of simulation runs, at the 5% significance level this means lower than 85%), the calculation has been repeated with the boot.ci method ‘bca’ for confirmation or correction.
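The significance threshold for the coverage rate can be checked with a one-sided binomial test. The Python sketch below assumes 100 simulation runs; this run count is an assumption of the sketch, and the function names are hypothetical.

```python
import math

def binomial_cdf(k, n, prob):
    """P[X <= k] for X binomial with n trials and success probability prob."""
    return sum(math.comb(n, j) * prob ** j * (1.0 - prob) ** (n - j)
               for j in range(k + 1))

def critical_coverage_count(n_runs, nominal_coverage, significance):
    """Largest count k of covering intervals for which observing k or fewer
    has probability at most `significance` under the nominal coverage.
    Returns -1 if no such count exists."""
    critical = -1
    for k in range(n_runs + 1):
        if binomial_cdf(k, n_runs, nominal_coverage) <= significance:
            critical = k
        else:
            break
    return critical

# Assumed setting: 100 runs, nominal coverage 90%, significance level 5%.
critical = critical_coverage_count(100, 0.9, 0.05)
```

Under these assumptions `critical` equals 84, so an observed coverage below 85% is flagged as significantly too low, in line with the threshold quoted above.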
The naming of the table rows and table columns has been standardized. Unless mentioned otherwise, the columns always display results for all or some of the prevalence estimation methods listed in Section 2.2. Short explanations of the meaning of the row names are given in Table 1. A more detailed explanation of the row names can be found in Section 2.3.
3.1 Prediction vs. confidence intervals
In the simulation study performed for this paper, the values of the true positive class prevalences of the test samples, understood in the sense of the a priori positive class prevalences of the populations from which the samples were generated (see Section 2), are always known. In contrast, when one is working with real-world data sets, there is no way to know with certainty the true positive class prevalences of the test samples. Inevitably, therefore, in studies of prevalence estimation methods on real-world data sets, the performance has to be measured by comparison between the estimates and the relative frequencies of the positive labels observed in the test samples.
This was stated explicitly, for instance, in Keith and O’Connor (2018). The authors said in the section ‘Problem definition’ of the paper that they estimated ‘prevalence confidence intervals’ with the property that “ of the predicted intervals ought to contain the true value ”. For this purpose, Keith and O’Connor defined the ‘true value’ as follows: “For each group , let be the true proportion of positive labels (where ).” As ‘group’ was used by Keith and O’Connor as equivalent to sample, and the label indicator was 1 for positive labels and 0 otherwise, it is clear that Keith and O’Connor estimated prediction intervals rather than confidence intervals (see Section 2.3 for the definitions of both types of intervals).
Hence, would it be worthwhile to distinguish confidence and prediction intervals for class prevalences and deploy different methods for their estimation, as has been asked in Section 1?
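By assumption (see Section 2), the test sample is interpreted as the feature components of independent, identically distributed random variables $(X_1, Y_1), \ldots, (X_m, Y_m)$, where $m$ denotes the size of the test sample. While the positive class prevalence in the test population is given by a constant (the probability of a positive label under the test population distribution), the relative frequency of the positive labels in the test sample is represented by the random variable
$$\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}_{\{Y_i = 1\}}, \tag{3.1}$$
where $\mathbf{1}_{\{Y_i = 1\}} = 1$ if $Y_i = 1$ and $\mathbf{1}_{\{Y_i = 1\}} = 0$ otherwise.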
The simulation procedures for the panels of Table 2 are intended to gauge the impact of using a confidence interval instead of a prediction interval for capturing the relative frequency of positive labels in the test sample as defined in (3.1). By the law of large numbers, the difference between this relative frequency and the positive class prevalence in the test population ought to be small for large test samples. Therefore, if there is any impact of using a confidence interval when a prediction interval would be needed, it should be more visible for smaller test samples.
The algorithm devised in this paper for the construction of prediction intervals (see Section 2.3) involves the simulation of binomial random variables with the prevalence estimates as success probabilities, which are independent of the test samples. This procedure, however, is likely to exaggerate the variance of the relative frequencies of the positive labels because the prevalence estimates and the test samples are not independent but, by design, should even be strongly dependent. The more accurate the classifier underlying the estimator, the stronger the dependence between the prevalence estimate and the test sample should be. This implies that, for prevalence estimation, differences between prediction and confidence intervals should be more discernible for lower accuracy of the classifiers deployed.
, , , , prediction intervals  

ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av freq  19.26  20.70  16.02  20.72  20.42  18.74  18.92  19.72  19.76  19.38 
Av abs dev  7.50  8.82  9.02  7.60  7.58  7.58  7.40  7.24  7.28  7.50 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  32.20  33.30  29.98  30.36  30.04  29.68  30.32  29.86  30.00  29.88 
Coverage  100.0  99.0  94.0  100.0  99.0  98.0  99.0  99.0  99.0  98.0 
Perc 0 or 1  1.0  2.0  6.0  1.0  2.0  2.0  4.0  1.0  3.0  1.0 
, , , , confidence intervals  
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av prev  20.34  21.05  17.01  20.50  20.54  20.34  20.63  20.31  20.55  20.65 
Av abs dev  6.68  7.23  7.61  6.22  6.12  6.13  6.06  5.88  6.18  6.00 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  28.27  28.98  26.13  25.67  25.15  24.55  25.57  25.10  25.10  24.83 
Coverage  97.0  97.0  89.0  98.0  97.0  96.0  98.0  95.0  97.0  98.0 
Perc 0 or 1  0.0  2.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0 
, , , , prediction intervals  
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av freq  18.96  22.30  23.72  21.33  17.08  18.20  22.80  28.76  19.12  17.48 
Av abs dev  19.00  16.98  17.32  15.56  15.16  14.32  15.12  17.20  15.04  14.48 
Perc fail est  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  71.69  58.92  72.82  54.15  47.88  49.96  56.24  59.32  49.44  47.38 
Coverage  95.0  94.0  99.0  95.0  92.0  94.0  92.0  87.0  93.0  91.0 
Perc 0 or 1  43.0  26.0  19.0  17.2  32.0  24.0  20.0  9.0  24.0  24.0 
Panel 4: confidence intervals
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av prev  21.36  22.90  21.16  23.61  19.25  20.93  23.65  28.81  21.26  19.59 
Av abs dev  19.28  16.45  15.21  14.60  14.63  13.91  15.64  15.53  14.56  13.37 
Perc fail est  0.0  0.0  0.0  4.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  66.65  57.89  69.97  51.98  47.20  49.07  54.25  54.26  48.45  47.08 
Coverage  97.0  95.0  98.0  96.0  90.0  95.0  90.0  77.0  94.0  96.0 
Perc 0 or 1  36.0  24.0  22.0  15.6  27.0  22.0  19.0  7.0  22.0  16.0 
Panel 5: prediction intervals
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av freq  13.90  10.78  12.48  12.02  10.50  10.96  13.74  19.86  11.80  8.48 
Av abs dev  13.06  10.34  11.84  10.94  10.36  10.44  12.88  16.52  11.12  8.44 
Perc fail est  0.0  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  62.61  44.90  59.71  44.94  39.12  41.48  46.78  52.30  41.30  35.30 
Coverage  98.0  97.0  97.0  94.0  93.0  92.0  90.0  75.0  92.0  93.0 
Perc 0 or 1  41.0  42.0  42.0  36.4  47.0  42.0  39.0  13.0  38.0  47.0 
Panel 6: confidence intervals
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av prev  13.43  11.88  11.65  12.15  9.52  10.40  12.77  18.30  10.70  8.33 
Av abs dev  13.79  11.11  11.03  10.68  9.56  9.71  11.44  15.43  10.01  8.34 
Perc fail est  0.0  0.0  0.0  4.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  58.34  45.14  55.64  41.20  34.38  37.22  43.41  48.10  37.30  32.95 
Coverage  98.0  93.0  96.0  93.0  89.0  91.0  83.0  62.0  94.0  92.0 
Perc 0 or 1  52.0  40.0  42.0  31.2  42.0  36.0  34.0  17.0  37.0  44.0 
Table 2 shows a number of simulation results, all for a test sample size of 50, i.e. for a small test sample:

Top two panels: Simulation of a ‘benign’ situation, with not too much difference of positive class prevalences (33% vs. 20%) in training and test population distributions, and high power of the score underlying the classifiers and distance minimisation approaches. Results suggest ‘overshooting’ by the binomial prediction interval approach, i.e. intervals are so long that coverage is much higher than requested, even reaching 100%. The confidence intervals clearly show sufficient coverage of the true realised percentages of positive labels for all estimation methods. Interval lengths are quite uniform, with only the straight ACC methods ACC50 and ACCp showing distinctly longer intervals. Also in terms of average absolute deviation from the true positive class prevalence, the performance is rather uniform. However, it is interesting to see that ACCv which has been designed for minimising confidence interval length among the ACC estimators shows the distinctly worst performance with regard to average absolute deviation.

Central two panels: Simulation of a rather adverse situation, with very different (67% vs. 20%) positive class prevalences in training and test population distributions and low power of the score underlying the classifiers and distance minimisation approaches. There is still overshooting by the binomial prediction interval approach for all methods but H8. For all methods but H8, sufficient coverage by the confidence intervals is still clearly achieved. H8 coverage of the relative positive class frequency is significantly too low with the confidence intervals but still sufficient with the prediction intervals. However, H8 also displays heavy bias of the average relative frequency of positive labels, possibly a consequence of the combined difficulties of there being 8 bins for only 50 points (test sample size) and little difference between the densities of the score conditional on the two classes. In terms of interval length, MLboot performs best, closely followed by APCC and Energy. But even for these methods, confidence interval lengths of more than 47% suggest that the estimation task is rather hopeless.

Bottom two panels: Similar picture to the central panels, but even more adverse with a small test sample prevalence of 5%. Results similar, but much higher proportions of 0% estimates for all methods. H8 now has insufficient coverage with both prediction and confidence intervals, and also H4 coverage with the confidence intervals is insufficient. Note the strong estimation bias suggested by all average frequency estimates, presumably caused by the clipping of negative estimates (i.e. replacing such estimates by zero). Among all these bad estimators, MLboot is clearly best in terms of bias, average absolute deviation and interval lengths.

General conclusion: For all methods from Section 2.2 but the Hellinger methods, it suffices to construct confidence intervals. No need to apply special prediction interval techniques.

Performance in terms of interval length (with sufficient coverage in all circumstances): MLboot best, followed by APCC and Energy.
Panel 1: moderate difference between training and test sample prevalences (33% vs. 20%)

ACC50  predACC50  ACCp  predACCp  DnPACC50  DnPACCp  
Av prev or freq  19.97  20.22  19.49  19.78  24.32  25.40 
Av abs dev  6.19  7.70  6.56  8.54  5.68  6.80 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  27.91  32.84  29.28  33.90  21.68  21.78 
Coverage  100.0  100.0  99.0  100.0  95.0  93.0 
Perc 0 or 1  0.0  0.0  1.0  2.0  0.0  0.0 
Panel 2: large difference between training and test sample prevalences (67% vs. 20%)
ACC50  predACC50  ACCp  predACCp  DnPACC50  DnPACCp  
Av prev  19.69  19.38  19.58  20.28  38.06  39.26 
Av abs dev  9.91  9.82  9.03  10.32  18.18  19.26 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  32.41  36.34  30.49  34.72  26.28  26.32 
Coverage  98.0  99.0  95.0  98.0  22.0  14.0 
Perc 0 or 1  12.0  12.0  3.0  5.0  0.0  0.0 
Daughton and Paul (2019) proposed ‘Error Adjusted Bootstrapping’ as an approach to constructing “confidence intervals” (prediction intervals, as a matter of fact) for prevalences and showed by example that its performance in terms of coverage was sufficient. However, the theoretical analysis of ‘Error Adjusted Bootstrapping’ presented in Appendix B suggests that this approach is not appropriate for constructing prediction intervals in the presence of prior probability shift. Indeed, Table 3 demonstrates that ‘Error Adjusted Bootstrapping’ intervals based on the classifiers ACC50 and ACCp (see Section 2.2) achieve sufficient coverage if the difference between the training and test sample prevalences is moderate (33% vs. 20%) but break down if the difference is large (67% vs. 20%).
Panel 1: high power

ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLinf  
Av prev  20.28  20.28  19.68  20.34  20.35  20.27  20.27  20.33  20.34  20.35 
Av abs dev  1.97  1.97  1.90  1.79  1.79  1.72  1.81  1.76  1.80  1.72 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  8.47  8.47  8.04  7.74  7.71  7.50  7.54  7.42  7.72  7.35 
Coverage  91.0  91.0  93.0  90.0  89.0  92.0  89.0  89.0  89.0  91.0 
Perc 0 or 1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Panel 2: low power
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLinf  
Av prev  20.71  20.71  18.60  20.74  20.53  20.00  20.47  20.47  20.66  20.41 
Av abs dev  4.68  4.68  4.21  3.64  3.37  3.23  3.65  3.30  3.39  3.19 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  19.08  19.08  18.47  16.70  15.73  15.15  16.14  15.50  15.91  15.05 
Coverage  92.0  92.0  93.0  94.0  94.0  94.0  96.0  95.0  95.0  94.0 
Perc 0 or 1  0.0  0.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Panel 3: high power
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLinf  
Av prev  4.96  5.49  2.75  4.99  5.14  4.60  5.07  4.56  5.15  5.09 
Av abs dev  3.63  4.03  3.26  3.32  3.47  2.92  3.33  2.93  3.54  3.06 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  16.75  17.98  14.44  12.65  12.73  10.88  12.95  12.03  12.84  16.96 
Coverage  95.0  98.0  97.0  90.0  86.0  85.0  88.5  94.4  88.0  97.0 
Perc 0 or 1  24.0  24.0  27.0  16.0  18.0  11.0  18.0  10.0  18.0  13.0 
Panel 4: low power
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLinf  
Av prev  8.37  9.56  2.77  8.07  7.74  6.56  7.86  7.37  7.93  7.26 
Av abs dev  8.31  9.28  5.70  7.78  7.64  7.00  7.73  7.47  7.68  7.50 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  38.27  35.30  29.85  27.52  25.46  24.98  27.63  25.80  26.30  47.01 
Coverage  96.0  92.0  90.0  89.0  86.0  86.0  89.0  86.3  89.0  97.0 
Perc 0 or 1  38.0  44.0  65.0  32.0  41.0  48.0  41.0  45.0  43.0  48.0 
Panel 5: high power
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av prev  5.72  6.91  3.62  5.67  5.88  5.72  5.53  5.65  5.97  5.59 
Av abs dev  4.46  5.10  3.54  4.11  4.11  3.97  4.00  3.95  4.18  3.55 
Perc fail est  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  16.62  18.68  11.35  14.96  14.92  13.90  14.22  13.44  15.09  13.95 
Coverage  89.0  87.0  91.0  88.0  86.0  87.0  84.8  82.0  85.0  85.0 
Perc 0 or 1  24.0  22.0  24.0  18.0  17.0  17.0  18.0  15.0  17.0  14.0 
Panel 6: low power
ACC50  ACCp  ACCv  MS  APCC  APCCv  H4  H8  Energy  MLboot  
Av prev  10.49  11.59  6.54  10.43  8.50  8.38  10.96  12.12  9.50  8.39 
Av abs dev  10.47  10.94  6.58  9.55  8.30  7.88  10.00  10.43  8.83  7.72 
Perc fail est  0.0  0.0  0.0  4.0  0.0  0.0  0.0  0.0  0.0  0.0 
Av int length  52.48  46.04  40.00  36.00  32.37  31.70  39.38  39.11  32.85  32.96 
Coverage  97.0  95.0  96.0  94.0  91.0  90.0  91.0  92.0  92.0  92.0 
Perc 0 or 1  43.0  37.0  43.0  32.3  41.0  38.0  36.0  28.0  40.0  38.0 
3.2 Does higher accuracy help for shorter confidence intervals?
As mentioned in Section 1, views in the literature differ on whether or not the performance of prevalence estimators is impacted by the discriminatory power of the score underlying the estimation method. Table 4 shows a number of simulation results, for a variety of sets of circumstances, both benign and adverse. Results for high and low power are juxtaposed:

Top two panels: Simulation of a ‘benign’ situation, with moderate difference of positive class prevalences (50% vs. 20%) in training and test population distributions, no estimation uncertainty on the training sample and a rather large test sample. Results for all estimation methods suggest that the lengths of the confidence intervals are strongly dependent upon the discriminatory power of the score which is the basic building block of all the methods. Coverage is accurate for the high power situation whereas there is even slight overshooting of coverage in the low power situation.

Central two panels: Simulation of a less benign situation, with small test sample size and low true positive class prevalence in the test sample but still without uncertainty on the training sample. There is nonetheless again evidence for the strong dependence of the lengths of the confidence intervals upon the discriminatory power of the score. For all estimation methods, low power leads to strong bias of the prevalence estimates. The percentage of zero estimates jumps between the 3rd and the 4th panel. Hence, decrease of power of the score entails much higher rates of zero estimates. For the maximum likelihood method, the interval length results in both panels show that constructing confidence intervals based on the central limit theorem for maximum likelihood estimators may become unstable for small test sample size and small positive class prevalence.

Bottom two panels: Simulation of an adverse situation, with small test and training sample sizes and low true positive class prevalence in the test sample. Results show qualitatively very much the same picture as in the central panels. The impact of estimation uncertainty in the training sample, which marks the difference to the situation for the central panels, however, is moderate for high power of the score but dramatic for low power of the score. Again there is a jump of the rate of zero estimates between the two panels differentiated by different levels of discriminatory power. For the Hellinger methods, results of the high power panel suggest a performance issue with respect to the coverage rate. (Footnote 7: This observation is not confirmed by a repetition of the calculations for Panel 5 with deployment of the R boot.ci method ‘bca’ instead of ‘perc’. However, the ‘better’ results are accompanied by a high rate of failures of the confidence interval construction.) In contrast to MLinf, MLboot (using only bootstrapping for constructing the confidence intervals) performs well, even with relatively low bias for the prevalence estimate in the low power case.

General conclusion: The results displayed in Table 4 suggest that there should be a clear benefit in terms of shorter confidence intervals when high power scores and classifiers are deployed for prevalence estimation. In addition, the results illustrate the statement on the asymptotic variance of ratio estimators like ACC50, ACCp, ACCv, APCC and APCCv in Corollary 11 of Vaz et al. (2019).

Performance in terms of interval length (with sufficient coverage in all circumstances): Both APCC estimation methods show good and stable performance when compared to all other methods. Energy and MLboot follow closely. The Hellinger methods also produce short confidence intervals but may have insufficient coverage.
3.3 Do approaches to confidence intervals without Monte Carlo simulations work?
For the prevalence estimation methods ACC50, ACCp, ACCv, MS, and MLinf, confidence intervals can be constructed without bootstrapping and, therefore, with much less numerical effort. For ACC50, ACCp, ACCv, and MS, conservative binomial intervals by means of the ‘exact’ method of the R function binconf can be deployed (Meeker et al., 2017, Section 6.2.2). For the maximum likelihood approach, an asymptotically most efficient normal approximation with variance expressed in terms of the Fisher information can be used (Theorem 10.1.12, Casella and Berger, 2002). This approach is denoted by ‘MLinf’ in order to distinguish it from ‘MLboot’, maximum likelihood estimation combined with bootstrapping for the confidence intervals.
However, it can be shown by examples that these non-simulation approaches fail in the sense of producing insufficient coverage rates if training sample sizes are finite, i.e. if parameters like true positive and false positive rates needed for the estimators have to be estimated (e.g. by means of regression) before being plugged in. Table 5, with panels juxtaposing results for infinite and finite sizes of the training sample, provides such an example.
The estimation problem whose results are shown in Table 5 is pretty well-posed, with a large test sample, a high power score underlying the estimation methods and moderate difference between training and test sample positive class prevalences. Panel 1 shows that without estimation uncertainty on the training sample (infinite sample size) the non-simulation approaches produce confidence intervals with sufficient coverage. In contrast, Panel 2 demonstrates that for all five methods coverage breaks down when estimation uncertainty is introduced into the training sample (finite sample size). According to Panel 3, this issue can be remedied by deploying bootstrapping for the construction of the confidence intervals.
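The binomial non-simulation approach for the ACC-type estimators can be sketched in Python as follows, for the case where the true positive rate and the false positive rate are known exactly (infinite training sample). This is a hedged illustration rather than the code used for the study: it replaces the call to binconf with a hand-written Clopper-Pearson interval solved by bisection, and the way the interval is mapped through the ACC formula, the function names and the default level of 90% are assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P[X <= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(f, target):
    """Find p in [0, 1] with f(p) = target for a decreasing function f, by bisection."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(k, n, level=0.90):
    """'Exact' (conservative) binomial interval for k successes in n trials."""
    alpha = 1.0 - level
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1.0 - alpha / 2.0)
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), alpha / 2.0)
    return lower, upper


def acc_binomial_interval(k_pred_pos, n_test, tpr, fpr, level=0.90):
    """Map the interval for the predicted-positive proportion through the ACC formula.

    Assumes tpr and fpr are known without error; estimates are clipped to [0, 1].
    """
    lo, hi = clopper_pearson(k_pred_pos, n_test, level)
    a = (lo - fpr) / (tpr - fpr)
    b = (hi - fpr) / (tpr - fpr)
    clip = lambda x: min(1.0, max(0.0, x))
    return clip(min(a, b)), clip(max(a, b))
```

The sketch makes the source of the problem visible: the only randomness accounted for is that of the test sample, so any error in estimated values of `tpr` and `fpr` is ignored.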
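A percentile bootstrap that also reflects the estimation uncertainty in the training sample can be sketched as below. The sketch assumes that both the training sample and the test sample are resampled in every bootstrap iteration and that a fixed score threshold defines the classifier; these choices, the function names and the default level of 90% are assumptions for illustration and not a description of the R implementation behind Table 5.

```python
import random


def acc_estimate(train, test_scores, threshold):
    """ACC estimate from labelled (score, label) pairs and unlabelled test scores."""
    pos = [s for s, y in train if y == 1]
    neg = [s for s, y in train if y == 0]
    if not pos or not neg:
        return None  # a resample lost one class, so TPR or FPR cannot be estimated
    tpr = sum(s > threshold for s in pos) / len(pos)
    fpr = sum(s > threshold for s in neg) / len(neg)
    if tpr == fpr:
        return None  # ACC is undefined for a completely inaccurate classifier
    q_pred = sum(s > threshold for s in test_scores) / len(test_scores)
    return min(1.0, max(0.0, (q_pred - fpr) / (tpr - fpr)))


def bootstrap_percentile_interval(train, test_scores, threshold,
                                  level=0.90, n_boot=500, seed=0):
    """Percentile bootstrap interval for the ACC prevalence estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample training and test data with replacement.
        train_star = rng.choices(train, k=len(train))
        test_star = rng.choices(test_scores, k=len(test_scores))
        est = acc_estimate(train_star, test_star, threshold)
        if est is not None:
            estimates.append(est)
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    last = len(estimates) - 1
    return estimates[round(alpha * last)], estimates[round((1.0 - alpha) * last)]
```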
Panel 1: no bootstrap, infinite training sample

ACC50  ACCp  ACCv  MS  MLinf  
Av prev  20.15  20.28  19.40  20.14  20.27 
Av abs dev  2.25  2.24  2.18  2.22  2.02 
Perc fail est  0.0  0.0  0.0  0.0  0.0 
Av int length  8.11  8.46  8.02  8.17  7.33 
Coverage  92.0  86.0  88.0  87.0  92.0 
Perc 0 or 1  0.0  0.0  0.0  0.0  0.0 
Panel 2: no bootstrap, finite training sample
ACC50  ACCp  ACCv  MS  MLinf  
Av prev  19.95  19.48  19.01  19.76  20.06 
Av abs dev  3.23  3.88  2.98  3.35  2.57 
Perc fail est  0.0  0.0  0.0  0.0  0.0 
Av int length  8.13  8.47  7.60  8.16  7.22 
Coverage  69.0  64.0  66.0  66.0  75.0 
Perc 0 or 1  0.0  0.0  0.0  0.0  0.0 
Panel 3: bootstrap, finite training sample
ACC50  ACCp  ACCv  MS  MLboot  
Av prev  19.79  19.94  18.63  20.18  20.34 
Av abs dev  3.11  3.38  3.17  2.79  2.67 
Perc fail est  0.0  0.0  0.0  0.0  0.0 
Av int length  15.13  16.96  14.18  12.94  12.07 
Coverage  95.0  93.0  91.0  91.0  92.0 
Perc 0 or 1  0.0  0.0  0.0  0.0  0.0 
4 Conclusions
The simulation study whose results are reported in this paper has been intended to shed some light on certain questions from the literature regarding the construction of confidence or prediction intervals for the prevalence of positive labels in binary quantification problems. In particular, the results of the study should help to provide answers to the questions of

whether estimation techniques for confidence intervals are appropriate if in practice most of the time prediction intervals are needed, and

whether the discriminatory power of the soft classifier or score at the basis of a prevalence estimation method matters when it comes to minimising the length of the confidence interval for an estimate.
The answers suggested by the results of the simulation study are subject to a number of qualifications. Most prominent among the qualifications are

the fact that the findings of the paper apply only for problems where it is clear that training and test sample are related by prior probability shift, and

the general observation that the scope of a simulation study necessarily is rather restricted and therefore findings of such studies can be suggestive and illustrative at best.
Hence the findings from the study do not allow firm or general conclusions. As a consequence, the answers to the questions suggested by the simulation study have to be treated with caution:

For test sample sizes that are not too small (50 or more), there is no need to deploy special techniques for prediction intervals.

It is worthwhile to base prevalence estimation on powerful classifiers or scores because this way the lengths of the confidence intervals can be much reduced. The use of less accurate classifiers may entail confidence intervals so long that the estimates have to be considered worthless.
In most of the experiments performed as part of the simulation study, the maximum likelihood approach (method MLboot) to the estimation of the positive class prevalence turned out to deliver on average the shortest confidence intervals. As shown in Appendix A.2.3, application of the maximum likelihood approach requires that in a previous step the density ratio or the posterior class probabilities are estimated on the training samples. To achieve this with sufficient precision is a notoriously hard problem. Note, however, the promising recent progress made on this issue (Kull et al., 2017). Not much worse and in a few cases even superior was the performance of APCC (Adjusted Probabilistic Classify & Count). In contrast the performance of the Energy distance and Hellinger distance estimation methods was not outstanding and, in the case of the latter methods, even insufficient in the sense of not guaranteeing the required coverage rates of the confidence intervals.
Recent research by Maletzke et al. (2019) singled out prevalence estimation methods based on minimising distances related to the Earth Mover’s distance as performing very well and robustly. Earlier research by Hofer (2015) already found that prevalence estimation by minimising the Earth Mover’s distance worked well in the presence of general data set shift (‘local drift’). Hence it might be worthwhile to compare the performance of such estimators with respect to the length of confidence intervals to the performance of other estimators like the ones considered in this paper.
References
 Barranquero et al. (2013) J. Barranquero, P. González, J. Díez, and J.J. Del Coz. On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2):472–482, 2013.
 Barranquero et al. (2015) J. Barranquero, J. Díez, and J.J. del Coz. Quantification-oriented learning based on reliable classifiers. Pattern Recognition, 48(2):591–604, 2015.
 Bella et al. (2010) A. Bella, C. Ferri, J. Hernandez-Orallo, and M.J. Ramírez-Quintana. Quantification via probability estimators. In Data Mining (ICDM), 2010 IEEE 10th International Conference on, pages 737–742. IEEE, 2010.
 Casella and Berger (2002) G. Casella and R.L. Berger. Statistical Inference. Duxbury Press, second edition, 2002.
 Castaño et al. (2018) A. Castaño, L. Morán-Fernández, J. Alonso, V. Bolón-Canedo, A. Alonso-Betanzos, and J.J. del Coz. Análisis de algoritmos de cuantificacíon basados en ajuste de distribuciones. In IX Simposio de Teoría y Aplicaciones de la Minería de Datos (IX TAMIDA), pages 913–918. Asociación Española para la Inteligencia Artificial, 2018.
 Cramer (2003) J.S. Cramer. Logit Models From Economics and Other Fields. Cambridge University Press, 2003.
 Daughton and Paul (2019) A.R. Daughton and M.J. Paul. Constructing Accurate Confidence Intervals when Aggregating Social Media Data for Public Health Monitoring. In AAAI International Workshop on Health Intelligence (W3PHIAI), January 2019.
 Davison and Hinkley (1997) A.C. Davison and D.V. Hinkley. Bootstrap Methods and their Application. Cambridge University Press, 1997.
 Du Plessis and Sugiyama (2014) M.C. Du Plessis and M. Sugiyama. Semi-supervised learning of class balance under class-prior change by distribution matching. Neural Networks, 50:110–119, 2014.
 Du Plessis et al. (2017) M.C. Du Plessis, G. Niu, and M. Sugiyama. Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4):463–492, 2017. doi: 10.1007/s1099401656046.
 Forman (2008) G. Forman. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2):164–206, 2008.
 Frühwirth-Schnatter (2006) S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. Springer, 2006.
 Gart and Buck (1966) J.J. Gart and A.A. Buck. Comparison of a screening test and a reference test in epidemiologic studies. II. A probabilistic model for the comparison of diagnostic tests. American Journal of Epidemiology, 83(3):593–602, 1966.
 González et al. (2017) P. González, A. Castaño, N.V. Chawla, and J.J. Del Coz. A Review on Quantification Learning. ACM Comput. Surv., 50(5):74:1–74:40, 2017.
 González-Castro et al. (2013) V. González-Castro, R. Alaiz-Rodríguez, and E. Alegre. Class distribution estimation based on the Hellinger distance. Information Sciences, 218:146–164, 2013.
 Hofer (2015) V. Hofer. Adapting a classification rule to local and global shift when only unlabelled data are available. European Journal of Operational Research, 243(1):177–189, 2015.
 Hofer and Krempl (2013) V. Hofer and G. Krempl. Drift mining in data: A framework for addressing drift in classification. Computational Statistics & Data Analysis, 57(1):377–391, 2013.
 Hopkins and King (2010) D.J. Hopkins and G. King. A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science, 54(1):229–247, 2010.
 Kawakubo et al. (2016) H. Kawakubo, M.C. du Plessis, and M. Sugiyama. Computationally Efficient Class-Prior Estimation under Class Balance Change Using Energy Distance. IEICE Transactions on Information and Systems, 99(1):176–186, 2016.

 Keith and O’Connor (2018) K. Keith and B. O’Connor. Uncertainty-aware generative models for inferring document class prevalence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4575–4585. Association for Computational Linguistics, 2018.
 Kull et al. (2017) M. Kull, T.M. Silva Filho, and P. Flach. Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electron. J. Statist., 11(2):5052–5080, 2017. doi: 10.1214/17EJS1338SI.
 Lipton et al. (2018) Z. Lipton, Y.X. Wang, and A. Smola. Detecting and Correcting for Label Shift with Black Box Predictors. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 3122–3130. PMLR, 10–15 Jul 2018.

 Maletzke et al. (2019) A. Maletzke, D.M. Dos Reis, E. Cherman, and G. Batista. DyS: a Framework for Mixture Models in Quantification. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), 2019.
 Meeker et al. (2017) W.Q. Meeker, G.J. Hahn, and L.A. Escobar. Statistical intervals: a guide for practitioners and researchers. John Wiley & Sons, second edition, 2017.
 Moreno-Torres et al. (2012) J.G. Moreno-Torres, T. Raeder, R. Alaiz-Rodriguez, N.V. Chawla, and F. Herrera. A unifying view on dataset shift in classification. Pattern Recognition, 45(1):521–530, 2012.
 Peters and Coberly (1976) C. Peters and W.A. Coberly. The numerical evaluation of the maximum-likelihood estimate of mixture proportions. Communications in Statistics – Theory and Methods, 5(12):1127–1135, 1976.
 R Core Team (2014) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.
 Redner and Walker (1984) R.A. Redner and H.F. Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM review, 26(2):195–239, 1984.
 Saerens et al. (2001) M. Saerens, P. Latinne, and C. Decaestecker. Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation, 14(1):21–41, 2001.
 Tasche (2013) D. Tasche. The law of total odds. arXiv preprint arXiv:1312.0365, 2013.
 Tasche (2017) D. Tasche. Fisher Consistency for Prior Probability Shift. Journal of Machine Learning Research, 18(95):1–32, 2017. URL http://jmlr.org/papers/v18/17048.html.
 Vaz et al. (2019) A.F. Vaz, R. Izbicki, and R.B. Stern. Quantification Under Prior Probability Shift: the Ratio Estimator and its Extensions. Journal of Machine Learning Research, 20(79):1–33, 2019. URL http://jmlr.org/papers/v20/18456.html.
 Zhang et al. (2013) K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain Adaptation Under Target and Conditional Shift. In Proceedings of the 30th International Conference on International Conference on Machine Learning – Volume 28, ICML’13, pages III–819–III–827. JMLR.org, 2013. URL http://dl.acm.org/citation.cfm?id=3042817.3043028.
Appendix A Appendix: Particulars for the implementation of the simulation study
This appendix presents the mathematical details needed for coding the prevalence estimation methods listed in Section 2.2. In particular, the case of infinite training samples (i.e. where the training sample is actually the training population and the parameters of the model are exactly known) is covered.
A.1 Adjusted Classify & Count (ACC) and related prevalence estimators
ACC and APCC as mentioned in Section 2.2 are special cases of the ‘ratio estimator’ of Vaz et al. (2019). From an even more general perspective, they are instances of estimation by the ‘Method of Moments’ (Frühwirth-Schnatter, 2006, Section 2.4.1 and the references therein). By Theorem 6 of Vaz et al. (2019), ratio estimators are Fisher consistent for estimating the positive class prevalence of the test population under prior probability shift.
Adjusted Classify & Count (ACC). In the setting of Section 2, denote the feature space (i.e. the range of values which the feature variable can take) by . Let be a crisp classifier in the sense that if for an instance it holds that , a positive class label is predicted, and if a negative class label is predicted. With the notation introduced in Section 2, the ACC estimator based on the classifier of the test population positive class prevalence is given by
(A.1) 
Recall that

is the proportion of instances in the test population whose labels are predicted positive by the classifier .

is the false positive rate (FPR) associated with the classifier . The FPR equals and, therefore, also of the classifier .

is the true positive rate (TPR) associated with the classifier . The TPR is also called ‘recall’ or ‘sensitivity’ of .
Of course, the ACC estimator of (A.1) is defined only if , i.e. if is not completely inaccurate. González et al. (2017, Section 6.2) gave some background information on the history of ACC estimators.
When a threshold is fixed, the soft classifier gives rise to a crisp classifier , defined by
(A.2) 
The classifiers with
(A.3) 
are Bayes classifiers which minimise cost-sensitive Bayes errors, see for instance Tasche (2017, Section 2.1). Thresholds of special interest are

for maximum accuracy (i.e. minimum classification error) which leads to the estimator ACC50 listed in Section 2.2, and
For the simulation procedures run for this paper, a sample version of has been used:
(A.4) 
where denotes a sample generated under the test population distribution .
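The ACC estimator with the sample proportion plugged in can be illustrated by the short Python sketch below. It follows the verbal description given above: the proportion of test instances predicted positive, minus the FPR, divided by the difference of TPR and FPR. The function and argument names are hypothetical, and the optional clipping to the unit interval mirrors the replacement of negative estimates by zero mentioned in Section 3; it is not part of (A.1) itself.

```python
def acc_prevalence(test_predictions, tpr, fpr, clip=True):
    """Adjusted Classify & Count estimate of the positive class prevalence.

    test_predictions: iterable of 0/1 crisp predictions on the test sample.
    tpr, fpr:         true and false positive rates of the classifier, either
                      known (infinite training sample) or estimated beforehand.
    """
    if tpr == fpr:
        raise ValueError("ACC is undefined when TPR equals FPR")
    predictions = list(test_predictions)
    # Sample version of the proportion of instances predicted positive.
    q_pred = sum(predictions) / len(predictions)
    estimate = (q_pred - fpr) / (tpr - fpr)
    # Clipping is an assumption of this sketch, see the lead-in.
    return min(1.0, max(0.0, estimate)) if clip else estimate
```

For example, with 30 of 100 test instances predicted positive, a TPR of 0.9 and an FPR of 0.1, the estimate is (0.3 - 0.1) / (0.9 - 0.1) = 0.25.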
To deal with the case where in the setting of Section 2.1 with the double binormal model the training sample is infinite, the following formulae have been coded for the right-hand side of (A.1) with and (A.4) (with parameters as in (2.3a)):
(A.5)  
Adjusted Probabilistic Classify & Count (APCC). APCC, called scaled probability average by Bella et al. (2010), generalises (A.1) by replacing the indicator variable with a real-valued random variable . If only takes values in the unit interval the variable is a randomised decision classifier (RDC) which may be interpreted as the probability with which the positive label should be assigned. Eq. (A.1) modified for APCC reads:
(A.6) 
Bella et al. (2010) suggested the choice .
For the simulation procedures run for this paper, a sample version of has been used:
(A.7) 
where denotes a sample generated under the test population distribution .
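A sample-based version of APCC can be sketched in the same way as ACC, with averages of a real-valued output taking the place of the predicted-positive proportion, the TPR and the FPR. The sketch assumes that the output has already been evaluated on the test sample and, separately, on the positive and negative training instances; the function and argument names are hypothetical, and the clipping is again an assumption.

```python
def apcc_prevalence(test_outputs, train_outputs_pos, train_outputs_neg, clip=True):
    """Adjusted Probabilistic Classify & Count estimate.

    test_outputs:      real-valued classifier outputs on the test sample.
    train_outputs_pos: outputs on training instances with positive labels.
    train_outputs_neg: outputs on training instances with negative labels.
    """
    mean = lambda values: sum(values) / len(values)
    mean_pos = mean(train_outputs_pos)   # plays the role of the TPR in ACC
    mean_neg = mean(train_outputs_neg)   # plays the role of the FPR in ACC
    if mean_pos == mean_neg:
        raise ValueError("APCC is undefined when both class-conditional means agree")
    estimate = (mean(test_outputs) - mean_neg) / (mean_pos - mean_neg)
    return min(1.0, max(0.0, estimate)) if clip else estimate
```

With 0/1 outputs the function reduces to the ACC estimate, which is one way to check an implementation.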
To deal with the case where in the setting of Section 2.1 with the double binormal model the training sample is infinite, the following formulae have been coded for the right-hand side of (A.6) with and (A.7) (with parameters as in (2.3a)):
(A.8)  
Median sweep (MS). Forman (2008) proposed to stabilise the prevalence estimates from ACC based on a soft classifier via (A.2), by taking the median of all ACC estimates based on for all thresholds such that the denominator of the righthand side of (A.1) exceeds 25%. For the purpose of this paper, the base soft classifier is in connection with (A.3), and the set of possible thresholds is restricted to .
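The median sweep can be sketched as follows, assuming score samples for the two training classes and for the test sample. The grid of thresholds is passed in by the caller because the set used in the paper is not reproduced here; the function name, the convention that scores above the threshold are predicted positive, and the clipping of the final estimate are assumptions of this sketch.

```python
from statistics import median


def median_sweep(train_pos, train_neg, test_scores, thresholds, min_denominator=0.25):
    """Median of the ACC estimates over all thresholds with TPR - FPR above 25%.

    Returns None if no threshold qualifies, i.e. if the estimation fails.
    """
    estimates = []
    for t in thresholds:
        tpr = sum(s > t for s in train_pos) / len(train_pos)
        fpr = sum(s > t for s in train_neg) / len(train_neg)
        denominator = tpr - fpr
        if denominator > min_denominator:
            q_pred = sum(s > t for s in test_scores) / len(test_scores)
            estimates.append((q_pred - fpr) / denominator)
    if not estimates:
        return None
    return min(1.0, max(0.0, median(estimates)))
```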
Tuning ACC for ACCv. Observe that a main factor impacting the length of a confidence interval for a parameter is the standard deviation of the underlying estimator. This suggests the following approach to choosing a good threshold for the classifier in (A.3):
(A.9) 
The test population distribution appears in the numerator of (A.9) because the confidence interval is calculated for a sample generated from . The training population distribution is used in the denominator of (A.9) because the confidence interval is scaled by the denominator of (A.1). See (A.5) for the formulae used for the calculations of this paper for (A.9) in the setting of Section 2.1. Like in the case of MS, for the purpose of this paper the set of possible thresholds is restricted to .
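One possible reading of the selection rule described above is sketched below: for every candidate threshold, the standard deviation of the predicted-positive indicator on the test sample is divided by the difference of TPR and FPR on the training sample, and the threshold with the smallest ratio is chosen. Since (A.9) is not reproduced here, the exact form of the criterion, the function name and the use of sample quantities are assumptions made for illustration.

```python
from math import sqrt


def accv_threshold(train_pos, train_neg, test_scores, thresholds):
    """Pick the threshold minimising sd(test indicator) / (TPR - FPR).

    Thresholds with TPR - FPR <= 0 are skipped; returns None if none remains.
    """
    best_threshold, best_ratio = None, float("inf")
    for t in thresholds:
        tpr = sum(s > t for s in train_pos) / len(train_pos)
        fpr = sum(s > t for s in train_neg) / len(train_neg)
        denominator = tpr - fpr
        if denominator <= 0:
            continue
        q_pred = sum(s > t for s in test_scores) / len(test_scores)
        # Standard deviation of a Bernoulli indicator with success rate q_pred.
        ratio = sqrt(q_pred * (1.0 - q_pred)) / denominator
        if ratio < best_ratio:
            best_threshold, best_ratio = t, ratio
    return best_threshold
```

The chosen threshold would then be used in the ACC estimator in place of a fixed threshold.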
Tuning APCC for APCCv. Similarly to (A.9), the idea is to minimise the variance of the estimator under while controlling the size of the denominator in (A.6). For define
where and are the classconditional densities of the features. Then it holds that
A good choice for could be with
(A.10) 
For the purpose of this paper the set of possible parameters in (A.10) is restricted to . In the setting of Section 2.1, let be defined as in (2.3a) and let
Then, analogously to (A.8), in the setting of Section 2.1 the following formulae are obtained for use in the calculations of this paper for (A.10):
(A.11)  
A.2 Prevalence estimation by distance minimisation
The idea for prevalence estimation by distance minimisation is to obtain an estimate of by solving the following optimisation problem:
(A.12) 
Here $d$ denotes a distance measure of probability measures with the following two properties:
1) $d(\mu, \nu) \ge 0$ for all probability measures $\mu$, $\nu$ to which $d$ is applicable.
2) $d(\mu, \nu) = 0$ if and only if $\mu = \nu$.
There is no need for $d$ to be a metric (i.e. asymmetric distance measures with $d(\mu, \nu) \neq d(\nu, \mu)$ for some $\mu$, $\nu$ are permitted). By property 2), distance minimisation estimators defined by (A.12) are Fisher consistent for estimating the positive class prevalence of the test population under prior probability shift. In the following subsections, three approaches to prevalence estimation based on distance minimisation are introduced that have been suggested in the literature and appear to be popular.
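The Fisher consistency claim can be spelled out in a short derivation. The notation is an assumption made for this illustration only: $Q$ stands for the test population distribution of the features, $P_1$ and $P_0$ for the class-conditional feature distributions, $q$ for the true positive class prevalence in the test population, and $d$ for the distance measure. The derivation further assumes $P_1 \neq P_0$.

```latex
% Assumed notation (illustrative): Q test feature distribution,
% P_1, P_0 class-conditional feature distributions, q true test prevalence.
\begin{align*}
Q &= q\,P_1 + (1-q)\,P_0
  && \text{(prior probability shift)} \\
d\bigl(Q,\; p\,P_1 + (1-p)\,P_0\bigr) = 0
  &\iff p\,P_1 + (1-p)\,P_0 = Q
  && \text{(property 2)} \\
  &\iff (p-q)\,(P_1 - P_0) = 0 \\
  &\iff p = q
  && \text{(since } P_1 \neq P_0\text{)}
\end{align*}
```

Hence the distance attains its minimum value of zero at $p = q$ and nowhere else, irrespective of the order of the arguments of $d$.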
A.2.1 Prevalence estimation by minimising the Hellinger distance
The Hellinger distance (see González-Castro et al. (2013) and Castaño et al. (2018) for more information on the Hellinger distance approach to prevalence estimation) of two probability measures , on the same domain is defined in measure-theoretic terms by
(A.13) 
where is any measure on the same domain such that both and are absolutely continuous with respect to . The value of does not depend upon the choice of .
In practice, the calculation of the Hellinger distance must take into account that most of the time it has to be estimated from sample data. Therefore, the right-hand side of (A.13) is discretised by (in the setting of Section 2) decomposing the feature space into a finite number of subsets or bins and evaluating the probability measures whose distance is to be measured on these bins. This leads to the following approximate version of the minimisation problem (A.12):
(A.14) 
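The binned minimisation can be sketched for a one-dimensional feature as follows. The sketch assumes that the test distribution is compared with the mixture of the two binned class-conditional training distributions, that the squared Hellinger distance is used up to a constant factor (which does not change the minimiser), and that the minimisation is done by a plain grid search. The names `binned_probs`, `hellinger_prevalence`, `cuts` and `n_grid` are hypothetical.

```python
import numpy as np

def binned_probs(x, cuts):
    """Relative frequencies of x in the bins defined by the sorted interior cut points."""
    idx = np.searchsorted(np.asarray(cuts, dtype=float), np.asarray(x, dtype=float))
    counts = np.bincount(idx, minlength=len(cuts) + 1)
    return counts / counts.sum()

def hellinger_sq(p, q):
    """Squared Hellinger distance of two discrete distributions, up to a constant factor."""
    return float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hellinger_prevalence(x_test, x_pos, x_neg, cuts, n_grid=1001):
    """Grid search for the prevalence whose binned mixture is closest to the test sample."""
    q_test = binned_probs(x_test, cuts)   # binned test distribution
    p_pos = binned_probs(x_pos, cuts)     # binned positive-class training distribution
    p_neg = binned_probs(x_neg, cuts)     # binned negative-class training distribution
    grid = np.linspace(0.0, 1.0, n_grid)
    dists = [hellinger_sq(q_test, p * p_pos + (1.0 - p) * p_neg) for p in grid]
    return float(grid[int(np.argmin(dists))])
```

A grid search is used only to keep the sketch short; any one-dimensional optimiser on the unit interval could take its place.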
If the feature space is multidimensional, e.g. for some , González-Castro et al. (2013) also suggest minimising the Hellinger distance separately across all the dimensions of the feature vector . In this case, the feature space is decomposed componentwise into bins and (A.14) is modified to become
where . For the purposes of this paper, (A.14) has been adapted to become
(A.15) 
where is a sample of features of instances generated under the test population distribution and the terms must be estimated from the training sample if it is finite and can be exactly precalculated in the case of an infinite training sample. In the latter case, (A.15) has to be modified to reflect the binormal setting of Section 2.1 for the training population distribution:
(A.16a)  
if for . For this paper, the number of bins in (A.16a) has been chosen to be 4 or 8 (see Maletzke et al. (2019) for critical comments regarding the choice of the number of bins), and the boundaries of the bins have been defined as follows in terms of the inverse function of the standard normal distribution function:
(A.16b) 
A.2.2 Prevalence estimation by minimising the Energy distance
Kawakubo et al. (2016) and Castaño et al. (2018) provide background information for the application of the Energy distance approach to prevalence estimation.
Denote by and the projections on the first components and the last components, respectively, of , i.e.