In multi-class classification, one observes pairs where are feature vectors, and are unknown labels, which lie in a countable label set . The goal is to construct a classification rule for predicting the label of a new data point; generally, the classification rule is learned from previously observed data points. In many applications of multi-class classification, such as face recognition or image recognition, the space of potential labels is practically infinite. In such a setting, one might consider a sequence of classification problems on finite label subsets, where in the -th problem, one constructs the classification rule . Supposing that have a joint distribution, define the accuracy for the -th problem as
Using data from only , can one predict the accuracy achieved on the larger label set , with ? This is the problem of performance extrapolation.
A practical instance of performance extrapolation occurs in neuroimaging studies, where the number of classes is limited by experimental considerations. Kay et al. [1] obtained fMRI brain scans which record how a single subject’s visual cortex responds to natural images. The label set corresponds to the space of all grayscale photographs of natural images, and the set is a subset of 1750 photographs used in the experiment. They construct a classifier which achieves over 0.75 accuracy for classifying the 1750 photographs; based on exponential extrapolation, they estimate that it would take on the order of photographs before the accuracy of the model drops below 0.10! Directly validating this estimate would take immense resources, so it would be useful to develop the theory needed to understand how to compute such extrapolations in a principled way.
However, in the fully general setting, it is impossible to construct non-trivial bounds on the accuracy achieved on the new classes based only on knowledge of : after all, could consist entirely of well-separated classes while the new classes consist entirely of highly inseparable classes, or vice-versa. Thus, the most important assumption for our theory is that of exchangeable sampling. The labels in are assumed to be an exchangeable sample from . The condition of exchangeability ensures that the separability of random subsets of can be inferred by looking at the empirical distributions in , and therefore that some estimate of the achievable accuracy on can be obtained.
The assumption of exchangeability greatly limits the scope of application for our methods. Many multi-class classification problems have a hierarchical structure [2], or have class labels distributed according to non-uniform discrete distributions, e.g. power laws [3]; in either case, exchangeability is violated. It would be interesting to extend our theory to the hierarchical setting, or to handle non-hierarchical settings with non-uniform prior class probabilities, but we leave these extensions for future work.
In addition to the assumption of exchangeability, we consider a restricted set of classifiers. We focus on generative classifiers, which are classifiers that work by training a model separately on each class. This convenient property allows us to characterize the accuracy of the classifier by selectively conditioning on one class at a time. In section 3, we use this technique to relate the expected accuracies to moments of a common distribution. This moment equivalence result allows standard approaches in statistics, such as U-statistics and nonparametric pseudolikelihood, to be directly applied to the extrapolation problem, as we discuss in section 4. In non-generative classifiers, the classification rule has a joint dependence on the entire set of classes, and cannot be analyzed by conditioning on individual classes. In section 5, we empirically study the performance of our methods. Since generative classifiers comprise only a minority of the classifiers used in practice, we applied our methods to a variety of generative and non-generative classifiers in simulations and in one OCR dataset. Our methods have varying success on generative and non-generative classifiers, but seem to work badly for neural networks.
To our knowledge, we are the first to formalize the problem of performance extrapolation. We introduce three methods for performance extrapolation: the method of extended unbiased estimation and the constrained pseudolikelihood method are novel. The third method, based on asymptotics, is a new application of a recently proposed method for estimating mutual information [4].
Having motivated the problem of performance extrapolation, we now reformulate the problem for notational and theoretical convenience. Instead of requiring to be a random subset of as we did in section 1, take and . We fix the size of without losing generality, since any monotonic sequence of finite subsets can be embedded in a sequence with . In addition, rather than randomizing the labels, we will randomize the marginal distribution of each label. Toward that end, let be a space of feature vectors, and let be a measurable space of probability distributions on . Let be a probability measure on , and let be an infinite sequence of i.i.d. draws from . We refer to , a probability measure on probability measures, as a meta-distribution. The distributions are the marginal distributions of the first classes. Further assuming that the labels are equiprobable, we rewrite the accuracy as
where the probabilities are taken over .
In order to construct the classification rule , we need data from the classes . In most instances of multi-class classification, one observes independent observations from each which are used to construct the classifier. Since the order of the observations does not generally matter, a sufficient statistic for the training data for the -th classification problem is the collection of empirical distributions for each class. Henceforth, we make the simplifying assumption that the training data for the -th class remains fixed from , so we drop the superscript on . Write for the conditional distribution of given ; also write for the marginal distribution of when As an example, suppose every class has the number of training examples ; then is the empirical distribution of i.i.d. observations from , and is the empirical meta-distribution of . Meanwhile, is the true meta-distribution of the empirical distribution of i.i.d. draws from a random .
2.1 Multiclass classification
Extending the formalism of Tewari and Bartlett [5], we define a classifier as a collection of mappings called classification functions. (Footnote 1: As in their framework, we define a classifier as a vector-valued function. However, we introduce the notion of a classifier as a multiple-argument functional on empirical distributions, which echoes the functional formulation of estimators common in the statistical literature.) Intuitively speaking, each classification function learns a model from the first arguments, which are the empirical marginals of the classes, . For each class, the classifier assigns a real-valued classification score to the query point . A higher score indicates a higher estimated probability that belongs to the -th class. Therefore, the classification rule corresponding to a classifier assigns a class with maximum classification score to :
For some classifiers, the classification functions are especially simple in that is only a function of and . Furthermore, due to symmetry, in such cases one can write
where is called a single-class classification function (or simply classification function),
and we say that is a generative classifier.
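As a rough illustration of this structure (not taken from the paper), the following Python sketch builds a generative classifier from a single-class score function. The function names `gaussian_score` and `classify`, and the choice of a one-dimensional Gaussian score (a 1-D analogue of QDA, up to a monotone rescaling of the log-density), are assumptions made purely for illustration.

```python
import math
from statistics import mean, pvariance


def gaussian_score(train_sample, y):
    """Single-class score: depends only on one class's training sample
    (i.e. its empirical distribution) and on the query point y."""
    mu = mean(train_sample)
    var = pvariance(train_sample, mu)
    # Twice the Gaussian log-density, up to an additive constant.
    return -((y - mu) ** 2) / var - math.log(var)


def classify(train_samples, y, score=gaussian_score):
    """Generative rule: score every class separately, then take the argmax."""
    scores = [score(sample, y) for sample in train_samples]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical training data: three classes, four observations each.
train_samples = [
    [-1.2, -0.8, -1.0, -1.1],
    [0.9, 1.1, 1.0, 1.3],
    [4.8, 5.1, 5.3, 4.9],
]
label = classify(train_samples, 1.05)
```

Because each score is computed from one class at a time, adding a new class leaves the scores of the existing classes unchanged; this is the property that the conditioning argument in the next section relies on.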
Quadratic discriminant analysis (QDA) and Naive Bayes [6] are two examples of generative classifiers.
Footnote 2: For QDA, the classification function is given by
3 Performance extrapolation for generative classifiers
Let us specialize to the case of a generative classifier, with classification function . Consider estimating the expected accuracy for the -th classification problem,
In the case of a generative classifier, we have
Define the conditional accuracy function which maps a distribution on and a test observation to a real number in . The conditional accuracy gives the probability, for independently drawn from , that will be greater than :
Define the conditional accuracy distribution as the law of where and are generated as follows: (i) a true distribution is drawn from ; (ii) the empirical distribution is drawn from (i.e., the training data for the class); (iii) the query is drawn from , with independent of (i.e., a single test data point from the same class). The significance of the conditional accuracy distribution is that the expected accuracy can be written in terms of its moments.
Theorem 3.1. Let be a single-distribution classification function, and let , be a distribution on Further assume that and jointly satisfy the tie-breaking property:
for all , where .
Let be defined as the random variable for , , and with . Then
where is the expected accuracy as defined by (1).
Proof. Write . By using conditioning and conditional independence, can be written
Theorem 3.1 tells us that the problem of extrapolation can be approached by attempting to estimate the conditional accuracy distribution. The -th moment of gives us , which will in turn be a good estimate of .
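The moment identity can be checked numerically in a toy model. The sketch below is an illustration under stated assumptions, not the paper's experiment: class means are drawn from N(0, 1), observations from N(mean, sigma^2), the classifier scores a class by negative squared distance to its true mean (perfect sampling), and the relevant moment is taken to be the (k-1)-th, since the correct class competes with k-1 independently drawn classes. All function names are hypothetical.

```python
import math
import random


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def conditional_accuracy(mu, y):
    """Probability that a fresh class mean mu' ~ N(0, 1) lies farther from
    the query y than the true class mean mu does."""
    d = abs(y - mu)
    return 1.0 - (normal_cdf(y + d) - normal_cdf(y - d))


def moment_estimate(k, n_draws, sigma, rng):
    """Monte Carlo estimate of the (k-1)-th moment of the conditional accuracy."""
    total = 0.0
    for _ in range(n_draws):
        mu = rng.gauss(0.0, 1.0)   # (i) draw the class distribution
        y = rng.gauss(mu, sigma)   # (iii) draw a test point from that class
        total += conditional_accuracy(mu, y) ** (k - 1)
    return total / n_draws


def direct_accuracy(k, n_trials, sigma, rng):
    """Accuracy of the k-class problem, estimated by simulating it directly."""
    correct = 0
    for _ in range(n_trials):
        mus = [rng.gauss(0.0, 1.0) for _ in range(k)]
        y = rng.gauss(mus[0], sigma)  # the test point belongs to class 0
        scores = [-((y - m) ** 2) for m in mus]
        if max(range(k), key=scores.__getitem__) == 0:
            correct += 1
    return correct / n_trials


rng = random.Random(0)
via_moment = moment_estimate(k=5, n_draws=20000, sigma=0.5, rng=rng)
via_simulation = direct_accuracy(k=5, n_trials=20000, sigma=0.5, rng=rng)
```

Up to Monte Carlo error, `via_moment` and `via_simulation` should agree, and the first can be evaluated for any number of classes without simulating the larger problem.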
While is not directly observed, we can obtain unbiased estimates of by using test data. For any , and independent test point , define
Then is an unbiased estimate of , as stated in the following theorem.
Theorem 3.2. Assume the conditions of theorem 3.1. Then defining
In section 4, we will use this result to estimate the moments of . Meanwhile, since is a random variable on , we also conclude that follows a mixed exponential decay. Let be the law of . Then from change-of-variables , we get
This fact immediately suggests the technique of fitting a mixture of exponentials to the test accuracy at : we explore this idea further in Section 4.1.
3.1 Properties of the conditional accuracy distribution
The conditional accuracy distribution is determined by and . What can we say about the conditional accuracy distribution without making any assumptions on either or ? The answer is: not much. For an arbitrary probability measure on , one can construct and such that the conditional accuracy has the distribution , even if one makes the perfect sampling assumption that
Theorem 3.3. Let be defined as in Theorem 3.1, and let denote the law of . Then, for any probability distribution on , one can construct a meta-distribution and a classification function such that the conditional accuracy has distribution under perfect sampling (that is, .)
Proof. Let be the cdf of , , and let . Define by
Let , and define by , and also A straightforward calculation yields that .
On the other hand, we can obtain a positive result if we assume that the classifier approximates a Bayes classifier. Assuming that is absolutely continuous with respect to Lebesgue measure with probability one, a Bayes classifier results from assuming perfect sampling () and taking . Theorem 3.4 states that for a Bayes classifier, the measure has a density which is monotonically increasing. Since a ‘good’ classifier approximates the Bayes classifier, we intuitively expect that a monotonically increasing density is a good model for the conditional accuracy distribution of a ‘good’ classifier.
Theorem 3.4. Assume the conditions of Theorem 3.1, and further suppose that , is absolutely continuous with respect to with probability one, that , and that has a regular conditional probability distribution. Let denote the law of . Then has a density on which is monotonic in .
Proof. It suffices to prove that
for all and . Let denote the space of distributions supported on which are absolutely continuous with respect to -dimensional Lebesgue measure . Let denote the marginal distribution of for with . Define the set
for all One can verify that for all ,
using the fact that has no atoms. Hence, we obtain
Taking , we conclude the theorem.
Suppose we have independent test repeats per class, . Let us define
which coincides with the definition (4) in the special case that is generative.
At a high level, we have a hierarchical model where is drawn from a distribution on and then . Let us assume that has a density : then the marginal distribution of can be written
However, the observed do not comprise an i.i.d. sample.
We discuss the following three approaches for estimating based on . The first is an extension of unbiased estimation based on binomial U-statistics, which is discussed in Section 4.1. The second is the pseudolikelihood approach. In problems where the marginal distributions are known, but the dependence structure between variables is unknown, the pseudolikelihood is defined as the product of the marginal distributions. For certain problems in time series analysis and spatial statistics, the maximum pseudolikelihood estimator (MPLE) is proved to be consistent [7]. We discuss pseudolikelihood-based approaches in Section 4.2. Thirdly, we note that the high-dimensional theory of Anon 2016 [4] can be applied for prediction accuracy, which we discuss in Section 4.3.
4.1 Extensions of unbiased estimation
If , then an unbiased estimator of exists if and only if .
The theory of U-statistics [8] provides the minimum-variance unbiased estimator for :
This result can be immediately applied to yield an unbiased estimator of , when :
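For concreteness, here is a small Python sketch of the classical binomial U-statistic. It assumes the setting is a count v drawn from Binomial(k, u) and that the target is the power u^t; the symbols `v`, `k`, `u`, `t` and the function names are labels introduced for illustration. It uses the standard identity E[C(V, t)] = C(k, t) u^t, which is why an unbiased estimator is available only for 0 <= t <= k.

```python
from math import comb


def unbiased_power_estimate(v, k, t):
    """Unbiased estimate of u**t from a single count v ~ Binomial(k, u)."""
    if not 0 <= t <= k:
        raise ValueError("an unbiased estimator of u**t requires 0 <= t <= k")
    return comb(v, t) / comb(k, t)


def binomial_pmf(v, k, u):
    return comb(k, v) * u**v * (1 - u) ** (k - v)


def expected_estimate(k, t, u):
    """Exact expectation of the estimator under Binomial(k, u)."""
    return sum(
        binomial_pmf(v, k, u) * unbiased_power_estimate(v, k, t)
        for v in range(k + 1)
    )


def average_estimate(counts, k, t):
    """Average the per-observation estimates over all observed counts."""
    return sum(unbiased_power_estimate(v, k, t) for v in counts) / len(counts)


# The expectation matches u**t exactly (up to floating-point rounding).
gap = abs(expected_estimate(k=10, t=3, u=0.7) - 0.7**3)
```

Averaging over counts keeps the estimate unbiased even when the counts are dependent, because unbiasedness only uses the marginal distribution of each count.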
However, since is undefined for , we can use exponential extrapolation to define an extended estimator for . Let be a measure defined by solving the optimization problem
After discretizing the measure , we obtain a convex optimization problem which can be solved using non-negative least squares [9]. Then define
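The discretized fit can be sketched in a few lines of Python. This is an illustration under assumptions rather than the paper's implementation: the accuracy at exponent t is modeled as a non-negative mixture sum_j w_j exp(-kappa_j t) over a fixed grid of decay rates `kappas`, and the non-negative least squares problem is solved by cyclic coordinate descent, swapped in here for a library NNLS routine to keep the example dependency-free. All names are hypothetical.

```python
import math


def nnls_coordinate_descent(A, b, sweeps=500):
    """Minimize ||A w - b||^2 subject to w >= 0 by cyclic coordinate descent."""
    n_rows, n_cols = len(A), len(A[0])
    w = [0.0] * n_cols
    residual = list(b)  # residual = b - A w, and w starts at zero
    col_sq = [sum(A[i][j] ** 2 for i in range(n_rows)) for j in range(n_cols)]
    for _ in range(sweeps):
        for j in range(n_cols):
            grad = sum(A[i][j] * residual[i] for i in range(n_rows))
            new_wj = max(0.0, w[j] + grad / col_sq[j])  # exact 1-D minimizer, clipped at 0
            delta = new_wj - w[j]
            if delta != 0.0:
                for i in range(n_rows):
                    residual[i] -= delta * A[i][j]
                w[j] = new_wj
    return w


def extrapolate(acc_by_exponent, target_exponent, kappas):
    """Fit a mixture of exponentials to accuracies at exponents 0, 1, 2, ...
    and evaluate the fitted mixture at a larger exponent."""
    A = [
        [math.exp(-kappa * t) for kappa in kappas]
        for t in range(len(acc_by_exponent))
    ]
    w = nnls_coordinate_descent(A, acc_by_exponent)
    return sum(wj * math.exp(-kappa * target_exponent) for wj, kappa in zip(w, kappas))


# Synthetic accuracies generated from a two-component mixture on the grid.
kappas = [0.05 * j for j in range(20)]
observed = [0.6 * math.exp(-0.1 * t) + 0.4 * math.exp(-0.5 * t) for t in range(10)]
estimate = extrapolate(observed, target_exponent=50, kappas=kappas)
```

Since every component decays, the extrapolated value can never exceed the fitted value at the largest observed exponent. The fit is generally not unique, which is one reason such extrapolations can be variable.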
4.2 Maximum pseudolikelihood
[Figure 1: two panels, "Estimated density" and "Estimated moment".]
The (log) pseudolikelihood is defined as
and a maximum pseudolikelihood estimator (MPLE) is defined as any density such that
The motivation for is that it consistently estimates in the limit where . However, in finite samples, is not uniquely defined, and if we define the plug-in estimator
can vary over a large range, depending on which is selected. These shortcomings motivate the adoption of additional constraints on the estimator .
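To make the unconstrained estimator concrete, the following Python sketch maximizes a discretized binomial-mixture pseudolikelihood. It is a sketch under assumptions: each count is treated as marginally Binomial(k, u) with u drawn from an unknown density, the density is replaced by weights on a fixed grid, and the maximization uses the standard EM update for mixing weights instead of a general-purpose convex solver. Names and data are hypothetical.

```python
from math import comb, log


def binomial_pmf(v, k, u):
    return comb(k, v) * u**v * (1 - u) ** (k - v)


def log_pseudolikelihood(weights, lik):
    """Sum over observations of the log marginal (mixture) probability."""
    return sum(log(sum(w * f for w, f in zip(weights, row))) for row in lik)


def em_mple(counts, k, grid, iterations=200):
    """EM iterations for the mixing weights on a fixed grid of u values."""
    lik = [[binomial_pmf(v, k, u) for u in grid] for v in counts]
    weights = [1.0 / len(grid)] * len(grid)
    history = []
    for _ in range(iterations):
        history.append(log_pseudolikelihood(weights, lik))
        new = [0.0] * len(grid)
        for row in lik:
            denom = sum(w * f for w, f in zip(weights, row))
            for j, f in enumerate(row):
                new[j] += weights[j] * f / denom  # posterior responsibility
        weights = [x / len(counts) for x in new]
    return weights, history


def plug_in_moment(weights, grid, exponent):
    """Plug-in estimate of E[u**exponent] under the fitted weights."""
    return sum(w * u**exponent for w, u in zip(weights, grid))


# Hypothetical counts out of k = 10; grid of interval midpoints in (0, 1).
counts = [9, 10, 7, 8, 10, 6, 9, 10, 5, 9]
grid = [(j + 0.5) / 20 for j in range(20)]
weights, history = em_mple(counts, k=10, grid=grid)
moment_50 = plug_in_moment(weights, grid, exponent=50)
```

Each EM step cannot decrease the log pseudolikelihood, but different starting weights or grids can reach fits with nearly equal pseudolikelihood and quite different high-order moments, which is the instability described above.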
Theorem 3.4 motivates the monotonicity constraint that . A second constraint is to restrict the -th moment of to match the unbiased estimate. The addition of these constraints yields the constrained MPLE , which is obtained by solving
By discretizing , all of the above maximization problems can be solved using a general-purpose convex solver. (Footnote 3: We found that the disciplined convex programming language CVX, using the ECOS second-order cone programming solver, succeeds in optimizing the problems where the dimension of the discretized is as large as 10,000 [10, 11].) While the added constraints do not guarantee a unique solution, they improve estimation of and thus improve moment estimation (Figure 1).
4.3 High-dimensional asymptotics
Under a number of conditions on the distribution , including (but not limited to) having a large dimension , Anon [4] relates the accuracy of the Bayes classifier to the mutual information between the label and the response :
While our goal is not to estimate the mutual information, we note that the results of Anon 2016 imply a relationship between and for the Bayes accuracy under the high-dimensional regime:
Therefore, under the high-dimensional conditions of [4] and assuming that the classifier approximates the Bayes classifier, we naturally obtain the following estimator
We applied the methods described in Section 4 on a simulated Gaussian mixture (Figure 2) and on a Telugu character classification task [12] (Table 1).
For the simulated Gaussian mixture, we vary the size of the initial subset from classes to classes, and extrapolate the performance for Gaussian mixture model, multinomial logistic, and one-layer neural network (with 10 sigmoidal units). Figure 3 shows how the predicted-class accuracy changes as is varied. We see that the predicted accuracy curves for QDA and logistic have similar behavior, even though QDA is generative and multinomial logistic is not. All three methods perform better on QDA and logistic classifiers than on the neural network: in fact, for the neural network, the test accuracy of the initial set, , becomes a better estimator of than the three proposed methods for most of the curve. We also see that the exponential extrapolation method, , is more variable than constrained pseudolikelihood and high-dimensional estimator . Additional simulation results can be found in the supplement.
In the character classification task, we predict the 400-class accuracy of naive Bayes, multinomial logistic regression, SVM, k-nearest neighbors, and deep neural networks using 20-class data with 103 training examples per class (Table 1). (Footnote 4: k-nearest neighbors with for fixed . Footnote 5: The network architecture is as follows: 48x48-4C3-MP2-6C3-8C3-MP2-32C3-50C3-MP2-200C3-SM. 48x48 is the binary input image, nC3 is a 3x3 convolutional layer with n output maps, MP2 is a 2x2 max-pooling layer, and SM is a softmax output layer on 20 or 400 classes.) Taking the test accuracy on 400 classes (using 50 test examples per class) as a proxy for , we compare the performance of the three extrapolation methods; as a benchmark, we also consider using the test accuracy on 20 classes as an estimate. The exponential extrapolation method performs well only for the deep neural network. Meanwhile, constrained MPLE achieves accurate extrapolation for two out of four classifiers (logistic and SVM) but fails to converge for the deep neural network (due to the high test accuracy). The high-dimensional estimator performs well on the multinomial logistic, SVM, and deep neural network classifiers. All three methods beat the benchmark (taking the test accuracy at 20) for the first four classifiers; however, the benchmark is the best estimator for the deep neural network, similar to what we observe in the simulation (albeit with a shallow network rather than a deep network).
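Table 1 (only the final row is recoverable; (*) marks a fit that failed to converge):
| Classifier | Test acc. (20 classes) | Test acc. (400 classes) | Exponential extrapolation | Constrained pseudolikelihood | High-dimensional |
| Deep neural net | 0.995 | 0.986 | 0.973 | (*) | 0.957 |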
Empirical results indicate that our methods generalize beyond generative classifiers. A possible explanation is that since the Bayes classifier is generative, any classifier which approximates the Bayes classifier is also ‘approximately generative.’ However, an important caveat is that the classifier must already attain close to the Bayes accuracy on the smaller subset of classes. If the classifier is initially far from the Bayes classifier, and then becomes more accurate as more classes are added, our theory could underestimate the accuracy on the larger subset. This is a non-issue for generative classifiers when the training data per class is fixed, since a generative classifier approximates the Bayes rule if and only if the single-class classification function approximates the Bayes optimal single-class classification function. On the other hand, for classifiers with built-in model selection or representation learning, it is expected that the classification functions become more accurate, in the sense that they better approximate a monotonic function of the Bayes classification functions, as data from more classes is added.
Our results are still too inconclusive for us to recommend the use of any of these estimators in practice. Theoretically, it still remains to derive confidence bounds for the generative case; practically, additional experiments are needed to establish the reliability of these estimators in specific applications. There also remains plenty of room for new and improved estimators in this area: for instance, fixing the instability of the constrained pseudolikelihood estimator when the test accuracy is high.
We thank John Duchi, Steve Mussmann, Qingyun Sun, Jonathan Taylor, Trevor Hastie, and Robert Tibshirani for useful discussions. CZ is supported by an NSF graduate research fellowship.
[1] Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). "Identifying natural images from human brain activity." Nature, 452(March), 352-355.
[2] Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). "What does classifying more than 10,000 image categories tell us?" Lecture Notes in Computer Science, 6315 LNCS(PART 5), 71-84.
[3] Garfield, S., Wermter, S., & Devlin, S. (2005). "Spoken language classification using hybrid classifier combination." International Journal of Hybrid Intelligent Systems, 2.1, 13-33.
[4] Anonymous, A. (2016). "Estimating mutual information in high dimensions via classification error." Submitted to NIPS 2016.
[5] Tewari, A., & Bartlett, P. L. (2007). "On the Consistency of Multiclass Classification Methods." Journal of Machine Learning Research, 8, 1007-1025.
[6] Hastie, T., Tibshirani, R., & Friedman, J. (2008). The elements of statistical learning. Vol. 1. Springer, Berlin: Springer series in statistics.
[7] Arnold, B. C., & Strauss, D. (1991). "Pseudolikelihood estimation: some examples." Sankhya: The Indian Journal of Statistics, Series B, 233-243.
[8] Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. Chapman and Hall. ISBN 0-412-12420-3.
[9] Lawson, C. L., & Hanson, R. J. (1974). Solving least squares problems. Vol. 161. Englewood Cliffs, NJ: Prentice-Hall.
[10] Hong, J., Mohan, K., & Zeng, D. (2014). "CVX.jl: A Convex Modeling Environment in Julia."
[11] Domahidi, A., Chu, E., & Boyd, S. (2013). "ECOS: An SOCP solver for embedded systems." Control Conference (ECC), 2013 European. IEEE.
[12] Achanta, R., & Hastie, T. (2015). "Telugu OCR Framework using Deep Learning." arXiv preprint arXiv:1509.05962.