## 1 Introduction

In multi-class classification, one observes pairs $(z, y)$, where $y$ are feature vectors and $z$ are unknown labels, which lie in a countable label set $\mathcal{Z}$. The goal is to construct a classification rule for predicting the label of a new data point; generally, the classification rule is learned from previously observed data points. In many applications of multi-class classification, such as face recognition or image recognition, the space of potential labels is practically infinite. In such a setting, one might consider a sequence of classification problems on finite label subsets $\mathcal{S}_1 \subset \mathcal{S}_2 \subset \cdots \subset \mathcal{Z}$, where in the $i$-th problem, one constructs the classification rule $h^{(i)}$, which maps feature vectors to labels in $\mathcal{S}_i$. Supposing that $(Z, Y)$ have a joint distribution, define the accuracy for the $i$-th problem as

$$\text{acc}^{(i)} = \Pr[h^{(i)}(Y) = Z \mid Z \in \mathcal{S}_i].$$

Using data from only $\mathcal{S}_k$, can one predict the accuracy achieved on the larger label set $\mathcal{S}_K$, with $K > k$? This is the problem of *performance extrapolation*.

A practical instance of performance extrapolation occurs in neuroimaging studies, where the number of classes is limited by experimental considerations. Kay et al. [1] obtained fMRI brain scans which record how a single subject’s visual cortex responds to natural images. The label set corresponds to the space of all grayscale photographs of natural images, and the set is a subset of 1750 photographs used in the experiment. They construct a classifier which achieves over 0.75 accuracy for classifying the 1750 photographs; based on exponential extrapolation, they estimate that it would take on the order of photographs before the accuracy of the model drops below 0.10! Directly validating this estimate would take immense resources, so it would be useful to develop the theory needed to understand how to compute such extrapolations in a principled way.

However, in the fully general setting, it is impossible to construct
non-trivial bounds on the accuracy achieved on the new classes
based only on knowledge of the observed label subset: after all, the observed subset could consist entirely of well-separated classes
while the new classes consist entirely of highly inseparable classes, or vice-versa.
Thus, the most important assumption for our theory is that of *exchangeable sampling*.
The labels in the observed subset are assumed to be an exchangeable sample from the full label set.
The condition of exchangeability ensures that the separability of random subsets of the label set can be inferred
by looking at the empirical distributions in the observed subset, and therefore that some estimate of the achievable
accuracy on the larger label set can be obtained.

The assumption of exchangeability greatly limits the scope of application for our methods. Many multi-class classification problems have a hierarchical structure [2], or have class labels distributed according to non-uniform discrete distributions, e.g. power laws [3]; in either case, exchangeability is violated. It would be interesting to extend our theory to the hierarchical setting, or to handle non-hierarchical settings with non-uniform prior class probabilities, but we leave these extensions for future work.

In addition to the assumption of exchangeability, we consider a restricted set of classifiers. We focus on *generative classifiers*, which are classifiers that work by training a model separately on each class. This convenient property allows us to characterize the accuracy of the classifier by selectively conditioning on one class at a time. In section 3, we use this technique to reveal an equivalence between the expected accuracies on label subsets of different sizes and moments of a common distribution. This moment equivalence result allows standard approaches in statistics, such as U-statistics and nonparametric pseudolikelihood, to be directly applied to the extrapolation problem, as we discuss in section 4. In non-generative classifiers, the classification rule has a joint dependence on the entire set of classes, and cannot be analyzed by conditioning on individual classes. In section 5, we empirically study the performance of our extrapolation methods. Since generative classifiers only comprise a minority of the classifiers used in practice, we applied our methods to a variety of generative and non-generative classifiers in simulations and in one OCR dataset. Our methods have varying success on generative and non-generative classifiers, but seem to work badly for neural networks.

*Our contribution.*

To our knowledge, we are the first to formalize the problem of prediction extrapolation. We introduce three methods for prediction extrapolation: the method of extended unbiased estimation and the constrained pseudolikelihood method are novel. The third method, based on asymptotics, is a new application of a recently proposed method for estimating mutual information [4].

## 2 Setting

Having motivated the problem of performance extrapolation, we now reformulate the problem for notational and theoretical convenience. Instead of requiring the observed label subset $\mathcal{S}_k$ to be a random subset of the label set $\mathcal{Z}$ as we did in section 1, take $\mathcal{Z} = \mathbb{N}$ and $\mathcal{S}_k = \{1, \dots, k\}$. We fix the size of $\mathcal{S}_k$ without losing generality, since any monotonic sequence of finite subsets can be embedded in a sequence with $|\mathcal{S}_k| = k$. In addition, rather than randomizing the labels, we will randomize the marginal distribution of each label. Towards that end, let $\mathcal{Y}$ be a space of feature vectors, and let $\mathcal{P}$ be a measurable space of probability distributions on $\mathcal{Y}$. Let $\mathbb{F}$ be a probability measure on $\mathcal{P}$, and let $F_1, F_2, \dots$ be an infinite sequence of i.i.d. draws from $\mathbb{F}$. We refer to $\mathbb{F}$, a probability measure on probability measures, as a *meta-distribution*. The distributions $F_1, \dots, F_k$ are the marginal distributions of the first $k$ classes. Further assuming that the labels are equiprobable, we rewrite the accuracy of the classification rule $h^{(k)}$ for the $k$-th problem as

$$\text{acc}^{(k)} = \frac{1}{k} \sum_{i=1}^{k} \Pr_{F_i}[h^{(k)}(Y) = i],$$

where the probabilities are taken over $Y \sim F_i$.

In order to construct the classification rule $h^{(k)}$ for the $k$-th problem, we need data from the class distributions $F_1, \dots, F_k$.
In most instances of multi-class classification, one observes independent observations from each $F_i$,
which are used to construct the classifier. Since the order of the observations
does not generally matter, a sufficient statistic for the training data for the $k$-th classification problem
is the collection of empirical distributions $\hat{F}_1^{(k)}, \dots, \hat{F}_k^{(k)}$
for each class.
Henceforth, we make the simplifying assumption that the training data for the $i$-th class remains fixed
as $k$ increases, so we drop the superscript on $\hat{F}_i^{(k)}$.
Write $\Pi_F$ for the conditional distribution of $\hat{F}_i$ given $F_i = F$;
also write $\Pi$ for the marginal distribution of $\hat{F}_i$ when $F_i$ is drawn from the meta-distribution $\mathbb{F}$.
As an example, suppose every class has the same number of training examples $n$; then
$\hat{F}_i$ is the empirical distribution of $n$ i.i.d. observations from $F_i$, and the uniform distribution over $\hat{F}_1, \dots, \hat{F}_k$ is the *empirical meta-distribution* of the observed classes.
Meanwhile, $\Pi$ is the true meta-distribution of the empirical distribution of $n$ i.i.d. draws from a random $F \sim \mathbb{F}$.

### 2.1 Multiclass classification

Extending the formalism of Tewari and Bartlett [5] (as in their framework, we define a classifier as a vector-valued function; however, we introduce the notion of a classifier as a multiple-argument functional on empirical distributions, which echoes the functional formulation of estimators common in the statistical literature),
we define a classifier as a collection of mappings $\mathcal{M}_i: \mathcal{P}^k \times \mathcal{Y} \to \mathbb{R}$, $i = 1, \dots, k$,
called *classification functions*.
Intuitively speaking, each classification function *learns a model* from the first $k$ arguments, which are
the empirical marginals of the $k$ classes, $\hat{F}_1, \dots, \hat{F}_k$. For each class, the classifier assigns a real-valued *classification score* $\mathcal{M}_i(\hat{F}_1, \dots, \hat{F}_k, y)$ to the *query point* $y$. A higher score indicates a higher estimated probability that $y$ belongs to the $i$-th class.
Therefore, the classification rule corresponding to a classifier assigns
a class with maximum classification score to $y$:

$$h(y) = \arg\max_{i \in \{1, \dots, k\}} \mathcal{M}_i(\hat{F}_1, \dots, \hat{F}_k, y).$$

For some classifiers, the classification functions are especially simple in that $\mathcal{M}_i$ is only a function of $\hat{F}_i$ and $y$. Furthermore, due to symmetry, in such cases one can write

$$\mathcal{M}_i(\hat{F}_1, \dots, \hat{F}_k, y) = \mathcal{Q}(\hat{F}_i, y),$$

where $\mathcal{Q}$ is called a *single-class classification function* (or simply *classification function*),
and we say that $\mathcal{M}$ is a *generative classifier*.
Quadratic discriminant analysis and Naive Bayes [6] are two examples of
generative classifiers. (For QDA, the classification function is given by the Gaussian log-density up to constants, $\mathcal{Q}(\hat{F}, y) = -\tfrac{1}{2}(y - \mu(\hat{F}))^\top \Sigma(\hat{F})^{-1} (y - \mu(\hat{F})) - \tfrac{1}{2} \log\det \Sigma(\hat{F})$, where $\mu(\hat{F})$ and $\Sigma(\hat{F})$ are the mean and covariance of $\hat{F}$.)

The *generative* property allows us to prove strong results about the accuracy of the classifier under the exchangeable sampling assumption, as we see in Section 3.
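To make the generative property concrete, here is a minimal Python sketch. The function names `gaussian_score` and `classify`, the one-dimensional Gaussian score, and the toy samples are illustrative assumptions rather than anything specified in the paper. The point it shows is that each class score depends only on that class's own training sample and the query point, so adding classes never changes the scores of existing classes.

```python
import math
from statistics import mean, pvariance


def gaussian_score(train_sample, y):
    """Hypothetical single-class classification function (1-D Gaussian log-density).

    It depends only on one class's training sample and the query point y.
    """
    mu = mean(train_sample)
    var = pvariance(train_sample, mu)
    return -0.5 * math.log(2.0 * math.pi * var) - (y - mu) ** 2 / (2.0 * var)


def classify(train_samples, y, score=gaussian_score):
    """Classification rule: return the index of the class with the highest score."""
    scores = [score(sample, y) for sample in train_samples]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy training samples for two classes (made-up numbers).
classes = [[-1.0, 0.0, 1.0], [4.0, 5.0, 6.0]]
print(classify(classes, 0.2))  # query near the first class
print(classify(classes, 4.8))  # query near the second class
```

A non-generative classifier such as multinomial logistic regression would not fit this template, because its score for one class changes when the other classes' data change.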

## 3 Performance extrapolation for generative classifiers

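Let us specialize to the case of a generative classifier, with classification function $\mathcal{Q}$. Consider estimating the expected accuracy for the $k$-th classification problem,

$$p_k = \mathbf{E}[\text{acc}^{(k)}]. \tag{1}$$

In the case of a generative classifier, we have

$$p_k = \mathbf{E}\left[\frac{1}{k} \sum_{i=1}^{k} \Pr_{Y \sim F_i}\left[\mathcal{Q}(\hat{F}_i, Y) > \max_{j \neq i} \mathcal{Q}(\hat{F}_j, Y)\right]\right].$$

Define the *conditional accuracy* function $u(\hat{F}, y)$, which maps a
distribution $\hat{F}$ on $\mathcal{Y}$ and a *test* observation $y$ to
a real number in $[0, 1]$. The conditional accuracy gives the
probability that, for $\hat{F}'$ independently drawn from the marginal distribution $\Pi$ of empirical class distributions,
$\mathcal{Q}(\hat{F}, y)$ will be greater than $\mathcal{Q}(\hat{F}', y)$:

$$u(\hat{F}, y) = \Pr_{\hat{F}' \sim \Pi}\left[\mathcal{Q}(\hat{F}, y) > \mathcal{Q}(\hat{F}', y)\right].$$

Define the *conditional accuracy* distribution $\nu$ as the law
of $u(\hat{F}, Y)$, where $\hat{F}$ and $Y$ are generated as follows:
(i) a true distribution $F$ is drawn from the meta-distribution $\mathbb{F}$;
(ii) the empirical distribution $\hat{F}$ is drawn from $\Pi_F$ (i.e., the training data for the class);
(iii) the query $Y$ is drawn from $F$, with $Y$ independent of $\hat{F}$ given $F$ (i.e., a single test data point from the same class).
The significance of the conditional accuracy
distribution is that the expected accuracy $p_k$ can be
written in terms of its moments.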

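Theorem 3.1. *Let $\mathcal{Q}$ be a single-distribution classification function, and let $\mathbb{F}$, $\Pi_F$ be distributions on $\mathcal{P}$. Further assume that $\mathbb{F}$ and $\mathcal{Q}$ jointly satisfy the tie-breaking property:*

$$\Pr\left[\mathcal{Q}(\hat{F}, y) = \mathcal{Q}(\hat{F}', y)\right] = 0 \tag{2}$$

*for all $y \in \mathcal{Y}$, where $\hat{F}, \hat{F}' \overset{\text{iid}}{\sim} \Pi$. Let $U$ be defined as the random variable*

$$U = u(\hat{F}, Y)$$

*for $F \sim \mathbb{F}$, $Y \sim F$, and $\hat{F} \sim \Pi_F$ with $Y \perp \hat{F} \mid F$. Then*

$$p_k = \mathbf{E}[U^{k-1}],$$

*where $p_k$ is the expected accuracy as defined by (1).*

Proof. Write $q_i(y) = \mathcal{Q}(\hat{F}_i, y)$. Since the classes are exchangeable, every term of the average in (1) has the same expectation, so it suffices to consider a test point $Y \sim F_1$ from the first class. By using conditioning and conditional independence, $p_k$ can be written

$$p_k = \mathbf{E}\left[\Pr\left[q_1(Y) > \max_{j > 1} q_j(Y) \,\middle|\, \hat{F}_1, Y\right]\right] = \mathbf{E}\left[\prod_{j=2}^{k} \Pr\left[q_1(Y) > q_j(Y) \mid \hat{F}_1, Y\right]\right] = \mathbf{E}\left[u(\hat{F}_1, Y)^{k-1}\right] = \mathbf{E}[U^{k-1}],$$

where the product form holds because $\hat{F}_2, \dots, \hat{F}_k$ are i.i.d. draws from $\Pi$, independent of $(\hat{F}_1, Y)$, and the tie-breaking property (2) rules out ties.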

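Theorem 3.1 tells us that the problem of extrapolation can be approached by attempting to estimate the conditional accuracy distribution. The $(K-1)$-th moment of the conditional accuracy $U$ gives us the expected accuracy $p_K$ on $K$ classes, which will in turn be a good estimate of the accuracy $\text{acc}^{(K)}$ on the larger label set.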
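The moment identity can be checked numerically on a toy problem. The sketch below is an illustration under assumptions that are not from the paper: class means are drawn from $N(0, \tau^2)$, observations are $N(\mu, 1)$, the class means are known exactly (perfect sampling), and the score is the negative squared distance to the class mean. It compares the simulated $k$-class accuracy with the Monte Carlo average of $U^{k-1}$, where the conditional accuracy is available in closed form for this model.

```python
import math
import random


def normal_cdf(x, mean=0.0, sd=1.0):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def conditional_accuracy(mu, y, tau):
    """Chance that one random rival class, with mean ~ N(0, tau^2), lies farther from y than mu."""
    d = abs(mu - y)
    rival_closer = normal_cdf(y + d, 0.0, tau) - normal_cdf(y - d, 0.0, tau)
    return 1.0 - rival_closer


def simulate(k, tau=2.0, n_rep=20000, seed=0):
    """Return (simulated k-class accuracy, Monte Carlo mean of U^(k-1))."""
    rng = random.Random(seed)
    hits = 0
    moment = 0.0
    for _ in range(n_rep):
        mus = [rng.gauss(0.0, tau) for _ in range(k)]
        y = rng.gauss(mus[0], 1.0)  # test point from class 0
        nearest = min(range(k), key=lambda j: abs(y - mus[j]))
        hits += nearest == 0
        moment += conditional_accuracy(mus[0], y, tau) ** (k - 1)
    return hits / n_rep, moment / n_rep


acc, moment = simulate(k=5)
print(f"simulated accuracy: {acc:.3f}, mean of U^(k-1): {moment:.3f}")
```

Both quantities estimate the same expected accuracy, so they should agree up to Monte Carlo error.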

While is not directly observed, we can obtain unbiased estimates of by using test data. For any , and independent test point , define

(3) |

Then is an unbiased estimate of , as stated in the following theorem.

Theorem 3.2.*
Assume the conditions of theorem 3.1.
Then defining*

(4) |

*we have*

*Hence,*

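In section 4, we will use this result to estimate the moments of the conditional accuracy $U$.
Meanwhile, since $U$ is a random variable on $[0, 1]$, we also conclude that the expected accuracy $p_k = \mathbf{E}[U^{k-1}]$ follows a *mixed exponential decay*.
Let $\alpha$ be the law of $-\log U$.
Then from the change of variables $\kappa = -\log u$, we get

$$p_k = \mathbf{E}[U^{k-1}] = \int_0^\infty e^{-\kappa (k-1)} \, d\alpha(\kappa).$$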

This fact immediately suggests the technique of fitting a mixture of exponentials to the test accuracy at : we explore this idea further in Section 4.1.

### 3.1 Properties of the conditional accuracy distribution

The conditional accuracy distribution is determined by the meta-distribution $\mathbb{F}$
and the classification function $\mathcal{Q}$. What can we say about the conditional accuracy
distribution without making any assumptions on either $\mathbb{F}$ or
$\mathcal{Q}$? The answer is: not much. For an arbitrary probability
measure $\nu$ on $[0, 1]$, one can construct $\mathbb{F}$ and $\mathcal{Q}$
such that the conditional accuracy $U$ has the distribution $\nu$, even if one makes the *perfect sampling assumption* that $\hat{F} = F$.

Theorem 3.3. *Let $U$ be defined as in Theorem 3.1. Then, for any probability
distribution $\nu$ on $[0, 1]$, one can construct a
meta-distribution $\mathbb{F}$ and a classification function $\mathcal{Q}$ such
that the law of the conditional accuracy $U$ is $\nu$ under perfect sampling (that is, $\hat{F} = F$).*

Proof. Let be the cdf of , , and let . Define by

Let , and define by , and also A straightforward calculation yields that .

On the other hand, we can obtain a positive result if we assume that
the classifier approximates a *Bayes classifier*.
Assuming that each class distribution $F$ is absolutely continuous with respect to Lebesgue measure with probability one,
a Bayes classifier results from assuming perfect sampling ($\hat{F} = F$) and taking
the classification function $\mathcal{Q}(F, y)$ to be the density of $F$ evaluated at $y$.
Theorem 3.4 states that for a Bayes classifier, the law of the conditional accuracy has a density which is monotonically increasing.
Since a ‘good’ classifier approximates the Bayes classifier, we intuitively expect that a monotonically
increasing density is a good model for the conditional accuracy distribution of a ‘good’ classifier.

Theorem 3.4. * Assume the conditions of theorem 3.1, and further suppose
that , is absolutely continuous with respect to with probability one,
that , and that *

*has a regular conditional probability distribution.
Let *

*denote the law of . Then has a density on which is monotonic in .*

Proof. It suffices to prove that

for all and . Let denote the space of distributions supported on which are absolutely continuous with respect to -dimensional Lebesgue measure . Let denote the marginal distribution of for with . Define the set

for all One can verify that for all ,

using the fact that has no atoms. Hence, we obtain

Taking , we conclude the theorem.

## 4 Estimation

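Suppose we have $r$ independent test repeats per class, $Y_{i,1}, \dots, Y_{i,r} \sim F_i$ for $i = 1, \dots, k$. Let us define the number of rival classes whose classification score falls below that of the correct class,

$$V_{i,j} = \sum_{l \neq i} I\left(\mathcal{M}_i(\hat{F}_1, \dots, \hat{F}_k, Y_{i,j}) > \mathcal{M}_l(\hat{F}_1, \dots, \hat{F}_k, Y_{i,j})\right),$$

which coincides with the definition (4) in the special case that the classifier $\mathcal{M}$ is generative.

At a high level, we have a hierarchical model where $U$ is drawn from a distribution $\nu$ on $[0, 1]$ and then $V \mid U \sim \text{Binomial}(k - 1, U)$. Let us assume that $\nu$ has a density $\eta$: then the marginal distribution of $V$ can be written

$$\Pr[V = v] = \binom{k-1}{v} \int_0^1 u^{v} (1 - u)^{k - 1 - v} \eta(u) \, du.$$

However, the observed $V_{i,j}$ do *not* comprise an i.i.d. sample.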

We discuss the following three approaches for estimating the expected accuracy on the larger label set based on the test data. The first is an extension of *unbiased estimation* based on binomial U-statistics, which is discussed in Section 4.1. The second is the *pseudolikelihood* approach. In problems where the marginal distributions are known, but the dependence structure between variables is unknown, the *pseudolikelihood* is defined as the product of the marginal distributions. For certain problems in time series analysis and spatial statistics, the maximum pseudolikelihood estimator (MPLE) is proved to be consistent [7]. We discuss pseudolikelihood-based approaches in Section 4.2. Thirdly, we note that the high-dimensional theory of Anon 2016 [4] can be applied to prediction accuracy, which we discuss in Section 4.3.

### 4.1 Extensions of unbiased estimation

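If $V \sim \text{Binomial}(n, u)$, then an unbiased estimator of $u^m$ exists if and only if $0 \le m \le n$.

This result can be immediately applied to yield an unbiased estimator of the expected accuracy $p_m$ on $m$ classes, when $m \le k$. Writing $V_{i,j}$ for the number of rival classes (out of $k - 1$) scored below the correct class on the $j$-th of $r$ test points from class $i$,

$$\hat{p}_m^{UN} = \frac{1}{rk} \sum_{i=1}^{k} \sum_{j=1}^{r} \binom{V_{i,j}}{m-1} \binom{k-1}{m-1}^{-1}. \tag{5}$$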
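The binomial fact behind the unbiased estimator can be verified exactly. The sketch below is illustrative (the function names `unbiased_power` and `expected_value` are hypothetical): it uses the standard U-statistic $\binom{V}{m} / \binom{n}{m}$ for $V \sim \text{Binomial}(n, u)$ and sums over the binomial distribution to confirm that its expectation equals $u^m$ whenever $m \le n$.

```python
from math import comb


def unbiased_power(v, n, m):
    """Unbiased estimate of u**m from a count v ~ Binomial(n, u), valid for 0 <= m <= n."""
    if not 0 <= m <= n:
        raise ValueError("no unbiased estimator of u**m exists when m > n")
    return comb(v, m) / comb(n, m)


def expected_value(n, m, u):
    """Exact expectation of unbiased_power under Binomial(n, u)."""
    return sum(
        comb(n, v) * u**v * (1.0 - u) ** (n - v) * unbiased_power(v, n, m)
        for v in range(n + 1)
    )


# With n = 19 rival classes, the expectation matches u**4 up to rounding.
print(expected_value(n=19, m=4, u=0.7), 0.7**4)
```

Averaging `unbiased_power` over all test points gives an accuracy estimate for any smaller number of classes, but never for a larger one, which is what motivates extrapolation.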

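However, since the unbiased estimator $\hat{p}_m^{UN}$ is undefined for $m > k$, we can use exponential extrapolation to define an extended estimator for $m > k$. Let $\hat{\alpha}$ be a measure on $[0, \infty)$ defined by solving the optimization problem

$$\hat{\alpha} = \arg\min_{\alpha} \sum_{m=2}^{k} \left(\int_0^\infty e^{-\kappa (m-1)} \, d\alpha(\kappa) - \hat{p}_m^{UN}\right)^2.$$

After discretizing the measure $\alpha$, we obtain a convex optimization problem which can be solved using non-negative least squares [9]. Then define, for $m > k$,

$$\hat{p}_m^{EXP} = \int_0^\infty e^{-\kappa (m-1)} \, d\hat{\alpha}(\kappa).$$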
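A discretized version of this fit can be sketched in a few lines of Python. The grid of decay rates, the synthetic accuracy values, and the function names are assumptions for illustration. To stay self-contained, the sketch uses plain projected gradient descent as a stand-in for a dedicated non-negative least squares solver; it is not the solver used in the paper.

```python
import math


def design(ms, kappas):
    """Design matrix with entries exp(-kappa * (m - 1))."""
    return [[math.exp(-kap * (m - 1)) for kap in kappas] for m in ms]


def predict(weights, kappas, m):
    """Mixture-of-exponentials accuracy curve evaluated at m classes."""
    return sum(w * math.exp(-kap * (m - 1)) for w, kap in zip(weights, kappas))


def sum_sq_residual(ms, p_hat, kappas, weights):
    return sum((predict(weights, kappas, m) - p) ** 2 for m, p in zip(ms, p_hat))


def fit_weights(ms, p_hat, kappas, n_iter=2000):
    """Non-negative least squares by projected gradient descent (stand-in solver)."""
    A = design(ms, kappas)
    # Step 1 / ||A||_F^2 is no larger than 1 / ||A||_2^2, so the objective never increases.
    step = 1.0 / sum(a * a for row in A for a in row)
    w = [0.0] * len(kappas)
    for _ in range(n_iter):
        resid = [sum(a * wj for a, wj in zip(row, w)) - p for row, p in zip(A, p_hat)]
        grad = [sum(A[i][j] * resid[i] for i in range(len(ms))) for j in range(len(kappas))]
        w = [max(0.0, wj - step * g) for wj, g in zip(w, grad)]  # project onto w >= 0
    return w


ms = list(range(2, 11))                         # observed class counts 2..10
p_hat = [math.exp(-0.1 * (m - 1)) for m in ms]  # synthetic accuracies, not real data
kappas = [0.05 * j for j in range(21)]          # hypothetical grid of decay rates
w = fit_weights(ms, p_hat, kappas)
print(predict(w, kappas, 40))                   # extrapolated accuracy at 40 classes
```

Because the weights and decay rates are non-negative, the fitted curve is always non-increasing in the number of classes, whatever the data.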

### 4.2 Maximum pseudolikelihood

[Figure 1: estimated density and estimated moment; rows: Truth, MPLE, CONS.]

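Let $\eta$ denote the density of the conditional accuracy $U$, and let $V_{i,j}$ be the number of rival classes (out of $k - 1$) scored below the correct class on the $j$-th of $r$ test points from class $i$. The (log) pseudolikelihood is defined as

$$\ell(\eta) = \sum_{i=1}^{k} \sum_{j=1}^{r} \log\left(\int_0^1 u^{V_{i,j}} (1 - u)^{k - 1 - V_{i,j}} \eta(u) \, du\right), \tag{6}$$

and a maximum pseudolikelihood estimator (MPLE) is defined as any density $\hat{\eta}_{MPLE}$ such that

$$\ell(\hat{\eta}_{MPLE}) = \sup_{\eta} \ell(\eta).$$

The motivation for $\hat{\eta}_{MPLE}$ is that it consistently estimates $\eta$ in the limit where $k \to \infty$. However, in finite samples, $\hat{\eta}_{MPLE}$ is not uniquely defined, and if we define the plug-in estimator

$$\hat{p}_K^{MPLE} = \int_0^1 u^{K-1} \hat{\eta}_{MPLE}(u) \, du,$$

$\hat{p}_K^{MPLE}$ can vary over a large range, depending on which $\hat{\eta}_{MPLE}$ is selected. These shortcomings motivate the adoption of additional constraints on the estimator $\hat{\eta}$.

Theorem 3.4 motivates the *monotonicity constraint* that $\hat{\eta}$ is non-decreasing on $[0, 1]$.
A second constraint is to restrict the $(k-1)$-th moment of $\hat{\eta}$ to match the unbiased estimate $\hat{p}_k^{UN}$.
The addition of these constraints yields the constrained MPLE
$\hat{\eta}_{CON}$, which is obtained by solving

$$\text{maximize } \ell(\eta) \quad \text{subject to} \quad \eta \text{ non-decreasing on } [0, 1], \quad \int_0^1 u^{k-1} \eta(u) \, du = \hat{p}_k^{UN}.$$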

By discretizing the density $\eta$, all of the above maximization problems can be solved using a general-purpose convex solver. (We found that the disciplined convex programming language CVX, using the ECOS second-order cone programming solver, succeeds in optimizing the problems where the dimension of the discretized $\eta$ is as large as 10,000 [10, 11].)
While the added constraints do not guarantee a unique solution,
they improve estimation of $\eta$ and thus improve moment estimation (Figure 1).
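As a rough illustration of the discretized objective, the Python sketch below evaluates a log pseudolikelihood and a plug-in moment for a distribution supported on a finite grid. The grid, the weights, the counts, and the function names are all hypothetical, and the binomial coefficient is kept in the likelihood for readability even though it does not affect the maximizer. No optimizer is included; this only shows the quantities that a convex solver would work with.

```python
from math import comb, log


def binomial_mixture_pmf(v, n, grid, weights):
    """P(V = v) when U equals grid[j] with probability weights[j] and V | U ~ Binomial(n, U)."""
    return sum(
        w * comb(n, v) * u**v * (1.0 - u) ** (n - v) for u, w in zip(grid, weights)
    )


def log_pseudolikelihood(v_obs, n, grid, weights):
    """Sum of marginal log-probabilities, ignoring dependence between the observed counts."""
    return sum(log(binomial_mixture_pmf(v, n, grid, weights)) for v in v_obs)


def plug_in_moment(power, grid, weights):
    """Plug-in estimate of E[U**power] under the discretized distribution."""
    return sum(w * u**power for u, w in zip(grid, weights))


grid = [0.1, 0.3, 0.5, 0.7, 0.9]            # hypothetical support points for U
weights = [0.05, 0.10, 0.20, 0.30, 0.35]    # non-decreasing weights, summing to one
v_obs = [19, 17, 12, 19, 8]                 # made-up counts out of n = 19 rival classes
print(log_pseudolikelihood(v_obs, 19, grid, weights))
print(plug_in_moment(399, grid, weights))   # plug-in accuracy for 400 classes
```

The log pseudolikelihood is concave in the weights, and the monotonicity and moment-matching constraints are linear in the weights, which is why a general-purpose convex solver applies.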

### 4.3 High-dimensional asymptotics

Under a number of conditions on the distribution , including (but not limited to) having a large dimension , Anon [4] relate the accuracy of the Bayes classifier to the mutual information between the label and the response :

where

While our goal is not to estimate the mutual information, we note that the results of Anon 2016 imply a relationship between and for the Bayes accuracy under the high-dimensional regime:

Therefore, under the high-dimensional conditions of [4] and assuming that the classifier approximates the Bayes classifier, we naturally obtain the following estimator

## 5 Results

We applied the methods described in Section 4 to a simulated Gaussian mixture (Figure 2) and to a Telugu character classification task [12] (Table 1).

For the simulated gaussian mixture, we vary the size of the initial subset from classes to

classes, and extrapolate the performance for Gaussian mixture model, multinomial logistic, and one-layer neural network (with 10 sigmoidal units) classifiers. Figure 3 shows how the predicted accuracy on the larger label set changes as the size of the initial subset is varied. We see that the predicted accuracy curves for QDA and Logistic have similar behavior, even though QDA is generative and multinomial logistic is not. All three methods perform better on QDA and logistic classifiers than on the neural network: in fact, for the neural network, the test accuracy of the initial set becomes a better estimator of the accuracy on the larger label set than the three proposed methods for most of the curve. We also see that the exponential extrapolation method is more variable than the constrained pseudolikelihood and high-dimensional estimators. Additional simulation results can be found in the supplement.

In the character classification task, we predict the 400-class accuracy of naive Bayes, multinomial logistic regression, SVM [6], $k$-nearest neighbors (with the number of neighbors determined by a fixed constant), and deep neural networks, using 20-class data with 103 training examples per class (Table 1). The network architecture is as follows: 48x48-4C3-MP2-6C3-8C3-MP2-32C3-50C3-MP2-200C3-SM, where 48x48 is the binary input image, $n$C3 is a 3x3 convolutional layer with $n$ output maps, MP2 is a 2x2 max-pooling layer, and SM is a softmax output layer on 20 or 400 classes.

Taking the test accuracy on 400 classes (using 50 test examples per class) as a proxy for the true 400-class accuracy, we compare the performance of the three extrapolation methods; as a benchmark, we also consider using the test accuracy on 20 classes as an estimate. The exponential extrapolation method performs well only for the deep neural network. Meanwhile, the constrained MPLE achieves accurate extrapolation for two out of four classifiers, logistic and SVM, but failed to converge for the deep neural network (due to the high test accuracy). The high-dimensional estimator performs well on the multinomial logistic, SVM, and deep neural network classifiers. All three methods beat the benchmark (taking the test accuracy at 20) for the first four classifiers; however, the benchmark is the best estimator for the deep neural network, similarly to what we observe in the simulation (albeit with a shallow network rather than a deep network).

[Figure panels: QDA, Logistic, Neural Net.]

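| Classifier | Test acc. (20 classes) | Test acc. (400 classes) | Exponential extrapolation | Constrained MPLE | High-dimensional |
|---|---|---|---|---|---|
| Naive Bayes | 0.947 | 0.601 | 0.884 | 0.659 | 0.769 |
| Logistic | 0.922 | 0.711 | 0.844 | 0.682 | 0.686 |
| SVM | 0.860 | 0.545 | 0.737 | 0.473 | 0.546 |
| $k$-NN | 0.964 | 0.591 | 0.895 | 0.395 | 0.839 |
| Deep neural net | 0.995 | 0.986 | 0.973 | (*) | 0.957 |

(*) The constrained estimator failed to converge.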

## 6 Discussion

Empirical results indicate that our methods generalize beyond generative classifiers.
A possible explanation is that since the Bayes classifier is generative,
any classifier which approximates the Bayes classifier is also ‘approximately generative.’
However, an important caveat is that the classifier must already attain close to the Bayes accuracy
on the smaller subset of classes. If the classifier is initially far from the Bayes classifier,
and then becomes more accurate as more classes are added, our theory could underestimate the
accuracy on the larger subset. This is a non-issue for generative classifiers when the training data per class is fixed,
since a generative classifier approximates the Bayes rule if and only if the single-class classification function approximates the
Bayes optimal single-class classification function. On the other hand, for classifiers with built-in *model selection*
or *representation learning*, it is expected that the classification functions become more accurate,
in the sense that they better approximate a monotonic function of the Bayes classification functions,
as data from more classes is added.

Our results are still too inconclusive for us to recommend the use of any of these estimators in practice. Theoretically, it still remains to derive confidence bounds for the generative case; practically, additional experiments are needed to establish the reliability of these estimators in specific applications. There also remains plenty of room for new and improved estimators in this area: for instance, fixing the instability of the constrained pseudolikelihood estimator when the test accuracy is high.

#### Acknowledgments

We thank John Duchi, Steve Mussmann, Qingyun Sun, Jonathan Taylor, Trevor Hastie, and Robert Tibshirani for useful discussion. CZ is supported by an NSF graduate research fellowship.

## References

[1] Kay, K. N., Naselaris, T., Prenger, R. J., & Gallant, J. L. (2008). “Identifying natural images from human brain activity.” *Nature*, 452(March), 352-355.

[2] Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). “What does classifying more than 10,000 image categories tell us?” *Lecture Notes in Computer Science*, 6315 LNCS(PART 5), 71-84.

[3] Garfield, S., Stefan W., & Devlin, S. (2005). “Spoken language classification using hybrid classifier combination.” *International Journal of Hybrid Intelligent Systems*, 2.1: 13-33.

[4] Anonymous, A. (2016). “Estimating mutual information in high dimensions via classification error.” Submitted to *NIPS 2016*.

[5] Tewari, A., & Bartlett, P. L. (2007). “On the Consistency of Multiclass Classification Methods.” *Journal of Machine Learning Research*.

[6] Hastie, T., Tibshirani, R., & Friedman, J. (2008). *The elements of statistical learning.* Vol. 1. Springer, Berlin: Springer series in statistics.

[7] Arnold, Barry C., & Strauss, D. (1991). “Pseudolikelihood estimation: some examples.” *Sankhya: The Indian Journal of Statistics, Series B*: 233-243.

[8] Cox, D. R., & Hinkley, D. V. (1974). *Theoretical statistics.* Chapman and Hall. ISBN 0-412-12420-3.

[9] Lawson, C. L., & Hanson, R. J. (1974). *Solving least squares problems.* Vol. 161. Englewood Cliffs, NJ: Prentice-Hall.

[10] Hong, J., Mohan, K., & Zeng, D. (2014). “CVX.jl: A Convex Modeling Environment in Julia.”

[11] Domahidi, A., Chu, E., & Boyd, S. (2013). “ECOS: An SOCP solver for embedded systems.” *Control Conference (ECC), 2013 European.* IEEE.

[12] Achanta, R., & Hastie, T. (2015). “Telugu OCR Framework using Deep Learning.” arXiv preprint arXiv:1509.05962.
