Many machine learning tasks involve recognizing or identifying an individual instance within a large set of possible candidates. These problems are usually modeled as multi-class classification problems, with a large and possibly complex label set. Leading examples include detecting the speaker from his voice patterns (Togneri and Pullella, 2011), identifying the author from her written text (Stamatatos et al., 2014), or labeling the object category from its image (Duygulu et al., 2002, Deng et al., 2010, Oquab et al., 2014). In all these examples, the algorithm observes an input , and uses the classifier function to guess the label from a large label set .
There are multiple practical challenges in developing classifiers for large label sets. Collecting high quality training data is perhaps the main obstacle, as the costs scale with the number of classes. It can be affordable to first collect data for a small set of classes, even if the long-term goal is to generalize to a larger set. Furthermore, classifier development can be accelerated by training first on fewer classes, as each training cycle may require substantially fewer resources. Indeed, due to interest in how small-set performance generalizes to larger sets, such comparisons can be found in the literature (Oquab et al., 2014, Griffin et al., 2007). A natural question is: how does changing the size of the label set affect the classification accuracy?
We consider a pair of classification problems on finite label sets: a source task with label set of size , and a target task with a larger label set of size . For each label set , one constructs the classification rule . Supposing that in each task, the test example has a joint distribution, define the generalization accuracy for label set as
The problem of performance extrapolation is the following: using data from only the source task , predict the accuracy for a target task with a larger unobserved label set .
A natural use case for performance extrapolation would be in the deployment of a facial recognition system. Suppose a system was developed in the lab on a database of individuals. Clients would like to deploy this system on a new larger set of individuals. Performance extrapolation could allow the lab to predict how well the algorithm will perform on the client’s problem, accounting for the difference in label set size.
Extrapolation should be possible when the source and target classifications belong to the same problem domain. In many cases, the set of categories is to some degree a random or arbitrary selection out of a larger, perhaps infinite, set of potential categories . Yet any specific experiment uses a fixed finite set. For example, categories in the classical Caltech-256 image recognition data set (Griffin et al., 2007) were assembled by aggregating keywords proposed by students and then collecting matching images from the web. The arbitrary nature of the label set is even more apparent in biometric applications (face recognition, authorship, fingerprint identification) where the labels correspond to human individuals (Togneri and Pullella, 2011, Stamatatos et al., 2014). In all these cases, the number of labels used to define a concrete data set is therefore an experimental choice rather than a property of the domain. Despite the arbitrary nature of these choices, such data sets are viewed as representing the larger problem of recognition within the given domain, in the sense that success on such a data set should inform performance on similar problems.
In this paper, we assume that both and are independent and identically distributed (i.i.d.) samples from a population (or prior distribution) of labels , which is defined on the label space . These assumptions help concretely analyze the generalization accuracy, although both are only approximate characterizations of the label selection process, which is often at least partially manual. Since we assume the label set is random, the generalization accuracy of a given classifier becomes a random variable. Performance extrapolation then becomes the problem of estimating the average generalization accuracy of an i.i.d. label set of size . The condition of i.i.d. sampling of labels ensures that the separation of labels in a random set can be inferred by looking at the empirical separation in , and therefore that some estimate of the average accuracy on can be obtained. We also make the assumption that the classifiers train a separate model for each class. This convenient property allows us to characterize the accuracy of the classifier by selectively conditioning on one class at a time.
Our paper presents two main contributions related to extrapolation. First, we present a theoretical formula describing how average accuracy for smaller is linked to average accuracy for a label set of size . We show that accuracy at any size depends on a discriminability function , which is determined by properties of the data distribution and the classifier. Second, we propose an estimation procedure that allows extrapolation of the observed average accuracy curve from -class data to a larger number of classes, based on the theoretical formula. Under certain conditions, the estimation method is an unbiased estimator of the average accuracy.
The paper is organized as follows. In the rest of this section, we discuss related work. The framework of randomized classification is introduced in Section 2, and there we also introduce a toy example which is revisited throughout the paper. Section 3 develops our theory of extrapolation, and in Section 3.3 we suggest an estimation method. We evaluate our method using simulations in Section 4. In Section 5, we demonstrate our method on a facial recognition problem, as well as an optical character recognition problem. In Section 6 we discuss modeling choices and limitations of our theory, as well as potential extensions.
1.1 Related Work
Linking performance between two different but related classification tasks can be considered an instance of transfer learning (Pan and Yang, 2010). Under Pan and Yang’s terminology, our setup is an example of multi-task learning, because the source task has labeled data, which is used to predict performance on a target task that also has labeled data. Applied examples of transfer learning from one label set to another include Oquab et al. (2014), Donahue et al. (2014), Sharif Razavian et al. (2014). However, there is little theory for predicting the behavior of the learned classifier on a new label set. Instead, most research on classification for large label sets deals with the computational challenges of jointly optimizing the many parameters required for these models for specific classification algorithms (Crammer and Singer, 2001, Lee et al., 2004, Weston and Watkins, 1999). Gupta et al. (2014) present a method for estimating the accuracy of a classifier which can be used to improve performance for general classifiers, but it does not apply to different set sizes.
The theoretical framework we adopt is one where there exists a family of classification problems with increasing number of classes. This framework can be traced back to Shannon (1948), who considered the error rate of a random codebook, which is a special case of randomized classification. More recently, a number of authors have considered the problem of high-dimensional feature selection for multiclass classification with a large number of classes (Pan et al., 2016, Abramovich and Pensky, 2015, Davis et al., 2011). All of these works assume specific distributional models for classification compared to our more general setup. However, we do not deal with the problem of feature selection.
Perhaps the most similar method that deals with extrapolation of classification error to a larger number of classes can be found in Kay et al. (2008). They trained a classifier for identifying the observed stimulus from a functional MRI scan of brain activity, and were interested in its performance on larger stimuli sets. They proposed an extrapolation algorithm as a heuristic with little theoretical discussion. In Section 4.1 we interpret their method within our theory, and discuss cases where it performs well compared to our algorithm.
2 Randomized Classification
The randomized classification model we study has the following features. We assume that there exists an infinite, perhaps continuous, label space and an example space . We assume there exists a prior distribution on the label space . For each label , there exists a distribution of examples . In other words, for an example-label pair , the conditional distribution of given is given by .
A random classification task can be generated as follows. The label set is generated by drawing labels i.i.d. from . For each label, we sample a training set and a test set. The training set is obtained by sampling observations i.i.d. from for and . The test set is likewise obtained by sampling observations i.i.d. from for .
We assume that the classifier works by assigning a score to each label , then choosing the label with the highest score. That is, there exist real-valued score functions for each label . Since the classifier is allowed to depend on the training data, it is convenient to view it (and its associated score functions) as random. We write when we wish to work with the classifier as a random function, and likewise to denote the score functions whenever they are considered as random.
For a fixed instance of the classification task with labels and associated score functions , recall the definition of the -class generalization accuracy (1). Assuming that there are no ties, it can be written in terms of score functions as
where for . However, when we consider the labels and associated score functions to be random, the generalization accuracy also becomes a random variable.
Suppose we specify but do not fix any of the random quantities in the classification task. Then the -class average generalization accuracy of a classifier is the expected value of the generalization accuracy resulting from a random set of labels, , and their associated score functions,
The last line follows from noting that all summands in the previous line are identical. The definition of average generalization accuracy is illustrated in Figure 1.
2.1 Marginal Classifier
In our analysis, we do not want the classifier to rely too strongly on complicated interactions between the labels in the set. We therefore propose the following property of marginal separability for classification models:
The classifier is called a marginal classifier if the score function only depends on the label and the class training set ; that is, for some function ,
This means that the score function for does not depend on other labels or their training samples. Therefore, each can be considered to have been drawn from a distribution . Classes “compete” only through selecting the highest score, but not in constructing the score functions. The operation of a marginal classifier is illustrated in Figure 2.
The marginal property allows us to prove strong results about the accuracy of the classifier under i.i.d. sampling assumptions.
If is a marginal classifier then is independent of and for .
Estimated Bayes classifiers are primary examples of marginal classifiers. Let be a density estimate of the example distribution under label obtained from the empirical distribution . Then, we can use the estimated density to produce the score functions:
The resulting empirical approximation for the Bayes classifier would be
Both Quadratic Discriminant Analysis (QDA) and naive Bayes classifiers can be seen as specific instances of an estimated Bayes classifier. (Footnote 1: QDA is the special case of the estimated Bayes classifier when is obtained as the multivariate Gaussian density with mean and covariance parameters estimated from the data. Naive Bayes is the estimated Bayes classifier when is obtained as the product of estimated componentwise marginal distributions of .) For QDA, the score function is given by
where and . In Naive Bayes, the score function is
where is a density estimate for the -th component of .
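The marginal property of an estimated Bayes classifier can be made concrete with a small sketch. The code below is an illustration only, not the paper's implementation: it assumes Gaussian componentwise density estimates (one common naive Bayes variant), and the names `fit_class_model`, `score`, and `classify` are hypothetical. The point it shows is that each class model is fitted from that class's training examples alone, so adding or removing other classes leaves the score function unchanged.

```python
import math

def fit_class_model(train_rows):
    """Per-feature mean and variance estimated from ONE class's training examples."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(d)]
    # A small floor keeps the variance positive (an assumption made for numerical safety).
    variances = [max(sum((row[j] - means[j]) ** 2 for row in train_rows) / n, 1e-9)
                 for j in range(d)]
    return means, variances

def score(model, x):
    """Log of the product of estimated componentwise Gaussian densities at x."""
    means, variances = model
    return sum(-0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
               for xj, m, v in zip(x, means, variances))

def classify(models, x):
    """Choose the label whose score function is largest at x."""
    return max(models, key=lambda label: score(models[label], x))

# Illustrative training data (made up for this sketch).
train = {
    "a": [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]],
    "b": [[3.0, 3.0], [3.1, 2.8], [2.9, 3.2]],
}
models = {label: fit_class_model(rows) for label, rows in train.items()}

# Adding a third class does not change the model (hence the score) of class "a".
train_more = dict(train, c=[[-3.0, 3.0], [-2.9, 3.1], [-3.2, 2.8]])
models_more = {label: fit_class_model(rows) for label, rows in train_more.items()}
```

Classes interact only inside `classify`, through the maximum over scores, which mirrors the description of marginal classifiers given above.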
For some classifiers, is a deterministic function of (and therefore is degenerate). A prime example is when there exist fixed or pre-trained embeddings that map labels and examples into . Then
Notational remark. Henceforth, we shall relax the assumption that the classifier is based on a training set. Instead, we assume that there exist score functions associated with the random label set , and that the score functions are independent of the test set. The classifier is marginal if and only if are independent of both and for .
2.2 Estimation of Average Accuracy
Before tackling extrapolation, it is useful to discuss a simpler task of generalizing accuracy results when the target set is not larger than the source set. Suppose we have test data for a classification task with classes. That is, we have a label set and its associated set of score functions , as well as test observations for . What would be the predicted accuracy for a new randomly sampled set of labels?
Note that is the expected value of the accuracy on the new set of labels. Therefore, any unbiased estimator of will be an unbiased predictor for the accuracy on the new set.
Let us start with the case . For each test observation , define the ranks of the candidate classes by
The test accuracy is the fraction of observations for which the correct class also has the highest rank
Taking expectations over both the test set and the random labels, the expected value of the test accuracy is . Therefore, in this special case, provides an unbiased estimator for .
Next, let us consider the case where . Consider label set obtained by sampling labels uniformly without replacement from . Since is unconditionally an i.i.d. sample from the population of labels , the test accuracy of is an unbiased estimator of . However, we can get a better unbiased estimate of by averaging over all the possible subsamples . This defines the average test accuracy over subsampled tasks, .
Remark. Naïvely, computing requires us to train and evaluate classification rules. However, for marginal classifiers, retraining the classifier is not necessary. The rank of the correct label for allows us to determine how many subsets will result in a correct classification. Specifically, there are labels with a lower score than the correct label . Therefore, as long as one of the classes in is , and the other labels are from the set of labels with lower score than , the classification of will be correct. This implies that there are such subsets where is classified correctly, and therefore the average test accuracy for all subsets is
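The counting argument in the remark can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's implementation: every class has the same number of test observations, there are no ties, and the rank of an observation is the number of classes whose score does not exceed that of the correct class. The function names and the synthetic scores are hypothetical. The brute-force routine enumerates all label subsets and is included only to check the closed-form count.

```python
import random
from itertools import combinations
from math import comb

def correct_label_ranks(scores, true_idx):
    """scores[t][c] is the score of class c on test observation t.
    Returns, for each observation, the number of classes scoring <= the correct class."""
    return [sum(s <= row[c] for s in row) for row, c in zip(scores, true_idx)]

def average_test_accuracy(ranks, k1, k):
    """Average accuracy over all size-k label subsets, computed from ranks alone."""
    n = len(ranks)  # total number of test observations over all k1 classes
    # An observation with rank R is correct in C(R-1, k-1) subsets: its own class
    # plus k-1 classes drawn from the R-1 classes with lower scores.
    hits = sum(comb(R - 1, k - 1) for R in ranks)
    # Each subset evaluates n*k/k1 observations when classes have equal test sizes.
    return hits / (comb(k1, k) * n * k / k1)

def brute_force_ata(scores, true_idx, k1, k):
    """The same quantity by explicit enumeration of subsets (for checking only)."""
    accuracies = []
    for subset in combinations(range(k1), k):
        n_obs = n_correct = 0
        for row, c in zip(scores, true_idx):
            if c in subset:
                n_obs += 1
                predicted = max(subset, key=lambda j: row[j])
                n_correct += predicted == c
        accuracies.append(n_correct / n_obs)
    return sum(accuracies) / len(accuracies)

# Synthetic scores: 5 classes, 3 test observations per class, correct class favoured.
rng = random.Random(0)
k1, r = 5, 3
true_idx = [c for c in range(k1) for _ in range(r)]
scores = [[rng.random() + (0.5 if c == t else 0.0) for c in range(k1)] for t in true_idx]
ranks = correct_label_ranks(scores, true_idx)
for k in range(2, k1 + 1):
    print(k, average_test_accuracy(ranks, k1, k), brute_force_ata(scores, true_idx, k1, k))
```

The rank-based version needs only one pass over the test observations for each value of the subset size, whereas the enumeration grows combinatorially with the number of classes.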
2.3 Toy Example: Bivariate Normal
Let us illustrate these ideas using a toy example. Let have a bivariate normal joint distribution,
as illustrated in Figure 3(a). Therefore, for a given randomly drawn label , the conditional distribution of for that label is univariate normal with mean
Supposing we draw labels , the classification problem will be to assign a test instance to the correct label. The test instance would be drawn with equal probability from one of three conditional distributions, as illustrated in Figure 3(b, top). The Bayes rule assigns to the class with the highest density , as illustrated by Figure 3(b, bottom): it is therefore a marginal classifier, with score function
Figure 3: (a) Joint distribution of the label and the example. (b) Problem instance with three labels.
For this model, the generalization accuracy of the Bayes rule for any label set is given by
where is the standard normal cdf, are the sorted labels, and . We numerically computed for randomly drawn labels , and the distributions of for are illustrated in Figure 4. The mean of the distribution of is the -class average accuracy, . The theory presented in the next section deals with how to analyze the average accuracy as a function of .
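The toy example can be explored numerically. The sketch below is an illustration under assumptions that are not confirmed by the text above: labels are standard normal, the example given label y is normal with mean rho*y and variance 1 - rho^2 (unit-variance bivariate normal with correlation rho), and rho = 0.7 is an arbitrary illustrative value. With equal class variances the Bayes rule assigns x to the nearest class mean, so the accuracy for fixed labels is a sum of normal probabilities over intervals bounded by midpoints between adjacent means. The function names are hypothetical, and the expression is derived here rather than copied from the paper.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf

def toy_accuracy(labels, rho):
    """Accuracy of the nearest-mean (Bayes) rule for a fixed label set."""
    mu = sorted(rho * y for y in labels)   # class means, sorted
    sd = (1.0 - rho ** 2) ** 0.5           # common conditional standard deviation
    k = len(mu)
    total = 0.0
    for i, m in enumerate(mu):
        # Class i wins between the midpoints to its left and right neighbours.
        lo = PHI(((mu[i - 1] + m) / 2 - m) / sd) if i > 0 else 0.0
        hi = PHI(((mu[i + 1] + m) / 2 - m) / sd) if i < k - 1 else 1.0
        total += hi - lo
    return total / k  # classes are equally likely for the test instance

def toy_average_accuracy(k, rho, n_sets=2000, seed=0):
    """Monte Carlo average of toy_accuracy over random i.i.d. label sets of size k."""
    rng = random.Random(seed)
    draws = (toy_accuracy([rng.gauss(0, 1) for _ in range(k)], rho) for _ in range(n_sets))
    return sum(draws) / n_sets

for k in (2, 3, 5, 10):
    print(k, round(toy_average_accuracy(k, rho=0.7), 3))
```

Each call to `toy_accuracy` gives one draw of the random generalization accuracy, and averaging many draws approximates the average accuracy for that number of classes.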
This section is organized as follows. We begin by introducing an explicit formula for the average accuracy . The formula reveals that is determined by moments of a one-dimensional function . Using this formula, we can estimate using subsampled accuracies. These estimates allow us to extrapolate the average generalization accuracy to an arbitrary number of labels.
The result of our analysis is to expose the average accuracy as the weighted average of a function , where is independent of , and where only changes the weighting. The result is stated as follows.
Suppose , , and score functions satisfy the tie-breaking condition. Then, there exists a cumulative distribution function defined on the interval such that
The tie-breaking allows us to neglect specifying the case when margins are tied.
Tie-breaking condition: for all , with probability one for independently drawn from .
In practice, one can simply break ties randomly, which is mathematically equivalent to adding a small amount of random noise to the function .
3.1 Analysis of Average Accuracy
For the following discussion, we often consider a random label with its associated score function and example vector. Explicitly, this sampling can be written:
Similarly we use and for two more triplets with independent and identical distributions. Specifically, will typically denote the test example, and therefore the true label and its score function.
The function is related to a favorability function. Favorability measures the probability that the score for the example is going to be maximized by a particular score function , compared to a random competitor . Formally, we write
Note that for fixed example , favorability is monotonically increasing in . If , then , because the event contains the event .
Therefore, given labels and test instance , we can think of the classifier as choosing the label with the greatest favorability:
Furthermore, via a conditioning argument, we see that this is still the case even when the test instance and labels are random:
The favorability takes values between 0 and 1, and when any of its arguments are random, it becomes a random variable with a distribution supported on . In particular, we consider the following two random variables:
the incorrect-label favorability between a given fixed test instance , and the score function of a random incorrect label , and
the correct-label favorability between a random test instance , and the score function of the correct label, .
3.1.1 Incorrect-Label Favorability
The incorrect-label favorability can be written explicitly as
Note that and are identically distributed, and are both unrelated to , which is fixed. This leads to the following result:
Under the tie-breaking condition, the incorrect-label favorability is uniformly distributed for any , meaning
Write , where and for . The tie-breaking condition implies that . Now observe that for independent random variables with and , the conditional probability is uniformly distributed. ∎
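The uniformity claim can be checked numerically. The sketch below is an illustration, not part of the proof: it fixes a test instance, uses a hypothetical toy score (negative distance between the instance and rho times the label, with standard normal labels), and estimates the favorability of a random incorrect label as the fraction of independent competitor scores that it exceeds. If the lemma holds, the resulting values should be spread evenly over the unit interval. The values of `x`, `rho`, and the sample sizes are arbitrary choices.

```python
import random
from bisect import bisect_left

def score(x, y, rho=0.7):
    """Hypothetical toy score: larger when rho*y is closer to x."""
    return -abs(x - rho * y)

rng = random.Random(1)
x = 0.8  # fixed test instance

# Reference sample of competitor scores at x, sorted for fast counting.
reference = sorted(score(x, rng.gauss(0, 1)) for _ in range(20000))

def favorability(m):
    """Monte Carlo estimate of the probability that m exceeds a random competitor's score."""
    return bisect_left(reference, m) / len(reference)

# Favorability of many independently drawn incorrect labels.
u = [favorability(score(x, rng.gauss(0, 1))) for _ in range(20000)]

mean_u = sum(u) / len(u)
deciles = [sum(0.1 * b <= v < 0.1 * (b + 1) for v in u) / len(u) for b in range(10)]
print(round(mean_u, 3), [round(d, 3) for d in deciles])
```

Changing `x` or the score function should leave the histogram flat as long as scores are continuous, which is the role played by the tie-breaking condition.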
3.1.2 Correct-Label Favorability
The correct-label favorability is
The distribution of will depend on , and , and generally cannot be written in a closed form. However, this distribution is central to our analysis. Indeed, we will see that the function appearing in Theorem 1 is defined as the cumulative distribution function of .
The special case of shows the relation between the distribution of and the average generalization accuracy, . In the two-class case, the average generalization accuracy is the probability that a random correct label score function gives a larger value than a random distractor:
where is the correct label, and is a random incorrect label. If we condition on , and , we get
Here, the conditional probability inside the expectation is the correct-label favorability. Therefore,
where is the cumulative distribution function of . Theorem 1 extends this to general ; we now give the proof.
Without loss of generality, suppose that the true label is and the incorrect labels are . We have
recalling that . Now, if we condition on , and , then the random variable becomes fixed, with value
Now define . Since by Lemma 1, are i.i.d. uniform conditional on , we know that
Furthermore, is independent of conditional on . Therefore, the conditional probability can be computed as
By defining as the cumulative distribution function of on ,
Theorem 1 expresses the average accuracy as a weighted integral of the function . Essentially, this theoretical result allows us to reduce the problem of estimating to one of estimating . But how shall we estimate from data? We propose using non-parametric regression for this purpose in Section 3.3.
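A short numerical sketch may help fix the weighted-integral form. It assumes, following the proof steps above, that the k-class average accuracy equals the expectation of U^(k-1), where U is the correct-label favorability with cumulative distribution function D, and that integration by parts then gives 1 - (k-1) times the integral over [0, 1] of D(u) u^(k-2). This form is reconstructed here and should be checked against the theorem's displayed equation. The function name `aga_from_D` and the example choices of D are hypothetical.

```python
def aga_from_D(D, k, n_grid=20000):
    """Evaluate 1 - (k-1) * integral_0^1 D(u) * u**(k-2) du with the midpoint rule."""
    h = 1.0 / n_grid
    integral = 0.0
    for i in range(n_grid):
        u = (i + 0.5) * h
        integral += D(u) * u ** (k - 2)
    return 1.0 - (k - 1) * integral * h

# Uninformative scores: favorability is uniform, D(u) = u, and accuracy is chance, 1/k.
chance = [aga_from_D(lambda u: u, k) for k in (2, 5, 10)]

# More favourable case: D(u) = u**2, for which E[U**(k-1)] = 2 / (k + 1).
better = [aga_from_D(lambda u: u ** 2, k) for k in (2, 5, 10)]

print(chance)
print(better)
```

In both examples the function D stays fixed while k changes only the weight u^(k-2), which concentrates near u = 1 as k grows. This is why accuracy at large k is governed by the behavior of D near 1.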
3.2 Favorability and Average Accuracy for the Toy Example
Recall that for the toy example from Section 2.3, the score function was a non-random function of that measures the distance between and
For this model, the favorability function compares the distance between and to the distance between and for a randomly chosen distractor :
where is the standard normal cumulative distribution function. Figure 5(a) illustrates the level sets of the function . The highest values of are near the line corresponding to the conditional mean of , and as one moves farther from the line, decays. Note, however, that large values of and (with the same sign) result in larger values of since it becomes unlikely for to exceed .
Using the formula above, we can calculate the correct-label favorability and its cumulative distribution function . The function is illustrated in Figure 5(b) for the current example with . The red curve in Figure 4 was computed using the formula
It is illuminating to consider how the average accuracy curves and the functions vary as we change the parameter . Higher correlations lead to higher accuracy, as seen in Figure 6(a), where the accuracy curves are shifted upward as increases from 0.3 to 0.9. The favorability tends to be higher on average as well, which leads to lower values of the cumulative distribution function, as we see in Figure 6(b), where the function becomes smaller as increases.
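The favorability computation for the toy example can be sketched as follows. The assumptions are the same as in the toy setup as understood here and are not confirmed by the displayed equations: standard normal labels, conditional mean rho*y, conditional variance 1 - rho^2, a score equal to the negative distance between x and rho*y, and 0 < rho < 1. Under these assumptions the favorability is one minus the probability that rho times a standard normal distractor falls within that distance of x. The function names and the value rho = 0.7 are illustrative.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal cdf

def favorability(x, y, rho):
    """Probability that a random distractor Y' ~ N(0, 1) is farther from x than y is,
    measured as |x - rho*Y'| > |x - rho*y|. Assumes 0 < rho < 1."""
    d = abs(x - rho * y)
    return 1.0 - (PHI((x + d) / rho) - PHI((x - d) / rho))

def toy_aga(k, rho, n_draws=20000, seed=0):
    """Monte Carlo estimate of E[U**(k-1)], U being the correct-label favorability."""
    rng = random.Random(seed)
    sd = (1.0 - rho ** 2) ** 0.5
    total = 0.0
    for _ in range(n_draws):
        y = rng.gauss(0, 1)                   # correct label
        x = rho * y + sd * rng.gauss(0, 1)    # test example drawn given the label
        total += favorability(x, y, rho) ** (k - 1)
    return total / n_draws

for k in (2, 5, 10, 50):
    print(k, round(toy_aga(k, rho=0.7), 3))
```

Rerunning with a larger `rho` should raise every value in the printed curve, in line with the qualitative behavior described for Figure 6.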
Next, we discuss how to use data from smaller classification tasks to extrapolate average accuracy. Assume that we have data from a -class random classification task, and would like to estimate the average accuracy for classes. Our estimation method will use the -class average test accuracies, (see Eq 4), for its inputs.
The key to understanding the behavior of the average accuracy is the function . We adopt a linear model
where are known basis functions, and are the linear coefficients to be estimated. Since our proposed method is based on the linearity assumption (17), we refer to it as ClassExReg, meaning Classification Extrapolation using Regression.
The constants are moments of the basis function . Note that can be precomputed numerically for any .
Now, since the test accuracies are unbiased estimates of , this implies that the regression estimate
is unbiased for . The estimate of is similarly obtained from (20), via
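The regression step can be illustrated with a small self-contained sketch. It is not the paper's implementation, and several choices are assumptions made for illustration: the average accuracy is taken to be 1 - (k-1) times the integral of D(u) u^(k-2), the basis consists of monomials u^p (chosen because their moments have the closed form (k-1)/(k-1+p); the simulations described later use a radial basis instead), and the least-squares problem is solved through the normal equations. All function names are hypothetical. The inputs are noiseless accuracies generated from a known D, so exact recovery is expected.

```python
def moment(p, k):
    """(k-1) * integral_0^1 u**p * u**(k-2) du for the monomial basis function u**p."""
    return (k - 1) / (k - 1 + p)

def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z

def fit_beta(ata, powers):
    """Least squares of (1 - accuracy at k) on the basis moments; ata maps k -> accuracy."""
    ks = sorted(ata)
    H = [[moment(p, k) for p in powers] for k in ks]
    y = [1.0 - ata[k] for k in ks]
    m = len(powers)
    HtH = [[sum(row[a] * row[b] for row in H) for b in range(m)] for a in range(m)]
    Hty = [sum(row[a] * yi for row, yi in zip(H, y)) for a in range(m)]
    return solve(HtH, Hty)

def extrapolate(beta, powers, k2):
    """Predicted average accuracy for k2 classes under the fitted linear model."""
    return 1.0 - sum(b * moment(p, k2) for b, p in zip(beta, powers))

# Noiseless accuracies for k = 2..10 generated from D(u) = 0.3*u + 0.7*u**2.
powers = [1, 2]
ata = {k: 1.0 - 0.3 * moment(1, k) - 0.7 * moment(2, k) for k in range(2, 11)}
beta = fit_beta(ata, powers)
print(beta, extrapolate(beta, powers, 100))
```

With noisy subsampled accuracies the fit is no longer exact, and the quality of the extrapolation depends on how well the chosen basis can represent the true function.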
3.4 Model Selection
Accurate extrapolation using ClassExReg depends on a good fit between the linear model (17) and the true discriminability function . However, since the function depends on the unknown joint distribution of the data, it makes sense to let the data help us choose a good basis from a set of candidate bases.
Let be a set of candidate bases, with . Ideally, we would like our model selection procedure to choose the that obtains the best root-mean-squared error (RMSE) on the extrapolation from to classes. As an approximation, we estimate the RMSE of extrapolation from source classes to target classes, by means of the “bootstrap principle.” This amounts to a resampling-based model selection approach, where we perform extrapolations from classes to classes, and evaluate methods based on how closely the predicted matches the test accuracy . To elaborate, our model selection procedure is as follows.
For resampling steps:
Subsample from uniformly with replacement.
Compute average test accuracies from the subsample .
For each candidate basis , with :
Compute by solving the least-squares problem
Select the basis by
Use the basis to extrapolate from classes (the full data) to classes.
4 Simulation Study
We ran simulations to check how the proposed extrapolation method, ClassExReg, performs in different settings. The results are displayed in Figure 7. We varied the number of classes in the source data set, the difficulty of classification, and the basis functions. We generated data according to a mixture of isotropic multivariate Gaussian distributions: labels were sampled from , and the examples for each label were sampled from . The noise-level parameter determines the difficulty of classification. Similarly to the real-data example, we consider a 1-nearest neighbor classifier, which is given a single training instance per class.
For the estimation, we use the model selection procedure described in Section 3.4 to select the parameter of the “radial basis”
where are a set of regularly spaced knots which are determined by and the problem parameters. Additionally, we add a constant element to the basis, equivalent to adding an intercept to the linear model (17).
The rationale behind the radial basis is to model the density of as a mixture of Gaussian kernels with variance . To control overfitting, the knots are separated by at least a distance of , and the largest knots have absolute value . The size of the maximum knot is set this way since is the number of ranks that are calculated and used by our method. Therefore, we do not expect the training data to contain enough information to allow our method to distinguish between more than possible accuracies, and hence we set the maximum knot to prevent the inclusion of a basis element that has on average a higher mean value than . However, in simulations we find that the performance of the basis depends only weakly on the exact positioning and maximum size of the knots, as long as sufficiently large knots are included. As is the case throughout non-parametric statistics, the bandwidth is the most crucial parameter. In the simulation, we use a grid for bandwidth selection.
4.1 Comparison to Kay
In their paper, Kay et al. (2008) proposed a method for extrapolating classification accuracy to a larger number of classes. (Footnote 2: The KDE extrapolation method is described on page 29 of the supplement to Kay et al. (2008). While the method is only described for a one-nearest neighbor classifier and for the setting where there is at most one test observation per class, we have taken the liberty of extending it to a generic multi-class classification problem.) The method depends on repeated kernel-density estimation (KDE) steps. Because the method is only briefly motivated in the original text, we present it in our notation.
For observed classes, let be the observed score comparing feature vector of the ’th test example of the ’th class to the model trained for the ’th class . For each feature-vector , the density of wrong-class scores is estimated by smoothing the observed scores with a kernel function with bandwidth ,