The Shape of Learning Curves: a Review

03/19/2021, by Tom Viering, et al.

Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.


1 Introduction

The more often we are confronted with a particular problem to solve, the better we typically get at it. The same goes for machines. A learning curve is an important graphical representation that can provide insight into such learning behavior by plotting generalization performance against the number of training examples.

We review learning curves in the context of standard supervised learning problems such as classification and regression. The primary focus is on the shapes that learning curves can take on. We make a distinction between well-behaved learning curves that show improved performance with more data and ill-behaved learning curves that, perhaps surprisingly, do not. We discuss theoretical and empirical evidence in favor of different shapes, underlying assumptions made, how knowledge about those shapes can be exploited, and further results of interest. In addition, we provide the necessary background to interpret and use learning curves as well as a comprehensive overview of the important research directions.

1.1 Outline

The next section starts off with a definition of learning curves and discusses how to estimate them in practice. It also briefly considers so-called feature curves, which offer a complementary view. Section 3 covers the use of learning curves, such as the insight into model selection they can give us, and how they are employed, for instance, in meta-learning and reducing the cost of labeling or computation. Section 4 considers evidence supporting well-behaved learning curves: curves that generally show improved performance with more training data. We review the parametric models that have been studied empirically and cover the theoretical findings in favor of some of these. Many of the more theoretical results in the literature have been derived particularly for Gaussian process regression as its learning curve is more readily analyzed analytically. Section 5 is primarily devoted to those specific results. Section 6 then follows with an overview of important cases of learning curves that do not behave well and considers possible causes. We believe that especially this section shows that our understanding of the behavior of learners is more limited than one might expect. Section 7 provides an extensive discussion. It also concludes our review. The remainder of the current section goes into the origins and meanings of the term “learning curve” and its synonyms.

1.2 Learning Curve Origins and Meanings

With his 1885 book Über das Gedächtnis [1], Ebbinghaus is generally considered to be the first to employ and qualitatively describe learning curves. His curves report on the number of repetitions it takes a human to perfectly memorize an increasing number of meaningless syllables.

While Ebbinghaus is the originator of the learning curve concept, it should be noted that the curves he considered differ in an important way from the curves central to this review. In the machine learning setting, we typically care about the generalization performance, i.e., the learner's performance on new and unseen data. The aim of Ebbinghaus's subjects, however, was to recite exactly the string of syllables that had been provided to them, which means that, in a way, the performance on the training set is considered. This last measure is also called the training, resubstitution, or apparent error in classification [2, 3, 4]. Indeed, as is most often the case for the training error as well, memorization performance gets worse as the amount of training data increases, i.e., it is more difficult to memorize an increasing number of syllables.

The term learning curve considered in this review is different from the curve that displays the training error—or the value of any objective function—as a function of the number of epochs or iterations used for optimization. Especially in the neural network literature, this is what the learning curve often signifies [5, 6]. What it has in common with those of Ebbinghaus is that the performance is plotted against the number of times that (part of) the data has been revisited, which corresponds directly to the number of repetitions in [1]. These curves, used to monitor the optimality of a learner in the training phase, are also referred to as training curves and this terminology can be traced back to [7]. We use training curve exclusively to refer to these curves that are used during training. Many researchers and practitioners, however, use the term learning curve instead of training curve [5], which, at times, may lead to confusion.

In the machine learning literature, synonyms for learning curve are error curve, experience curve, improvement curve, and generalization curve [5, 8, 9]. Improvement curve can be traced back to an 1897 study on learning the telegraphic language [10]. Generalization curve was first used in machine learning in 1990 [8, 9]. Decades earlier, the term was already used to indicate the plot of the intensity of an animal's response against stimulus size [11]. Learning curve or, rather, its German equivalent was not used as an actual term in the original work of Ebbinghaus [1]. The English variant seems to appear 18 years later, in 1903 [12]. Lehrkurve follows a year after [13].

We traced the first mention of learning curve in connection to learning machines back to a discussion in a 1957 issue of Philosophy [14]. A year later, Rosenblatt, in his famous 1958 work [15], uses learning curves in the analysis of his perceptron. Following this review's terminology, he uses the term to refer to a training curve. Foley [16] was possibly the first to use a learning curve, as it is defined in this review, in an experimental setting such as is common nowadays. The theoretical study of learning curves for supervised learners dates back to 1965 [17].

2 Definition, Estimation, Feature Curves

This section makes the notion of a learning curve more precise and describes how learning curves can be estimated from data. We give some recommendations when it comes to plotting learning curves and summarizing them. Finally, feature curves offer a view on learners that is complementary to that of learning curves. These and combined learning-feature curves are covered at the end of this section.

2.1 Definition of Learning Curves

Let $S_n$ denote a training set of size $n$, which acts as input to some learning algorithm $A$. In standard classification and regression, $S_n$ consists of pairs $(x_i, y_i)$, where $x_i \in \mathcal{X}$ is the $d$-dimensional input vector (i.e., the features, measurements, or covariates) and $y_i \in \mathcal{Y}$ is the corresponding output (e.g., a class label or regression target). $\mathcal{X}$ denotes the input space and $\mathcal{Y}$ the output space. The $n$ pairs of the training set are i.i.d. samples of an unknown probability distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. Predictors $h : \mathcal{X} \to \mathcal{Y}$ come from the hypothesis class $\mathcal{H}$, which contains all models that can be returned by the learner $A$. An example of a hypothesis class is the set of all linear models.

When $h$ is evaluated on a sample $x$, its prediction for the corresponding $y$ is given by $h(x)$. The performance of a particular hypothesis $h$ is measured by a loss function $L$ that compares $h(x)$ to $y$. Examples are the squared loss $L(h(x), y) = (h(x) - y)^2$ for regression and the zero-one loss $\mathbf{1}[h(x) \neq y]$ for (binary) classification.

The typical goal is that our predictor performs well on average on new and unseen observations. Ideally, this is measured by the expected loss or risk over the true distribution $P$:

$R(h) = \mathbb{E}_{(x,y) \sim P}\, L(h(x), y)$   (1)

Here, as in most of what follows, we omit the subscript $(x,y) \sim P$.

Now, an individual learning curve considers a single training set $S_n$ for every $n$ and calculates its corresponding risk $R(A(S_n))$ as a function of $n$. However, a single $S_n$ may deviate significantly from the expected behavior. Therefore, we are often interested in an average over many different sets $S_n$ and, ideally, in the expectation

$\bar{R}_n(A) = \mathbb{E}_{S_n \sim P^n}\, R(A(S_n))$   (2)

The plot of $\bar{R}_n(A)$ against the training set size $n$ gives us the (expected) learning curve. From this point onward, when we talk about the learning curve, this is what is meant.

The preceding learning curve is defined for a single problem $P$. Sometimes we wish to study how a model performs over a range of problems or, more generally, a full distribution over problems. The learning curve that considers such averaged performance is referred to as the problem-average (PA) learning curve:

$\bar{R}^{\mathrm{PA}}_n(A) = \mathbb{E}_{P}\, \bar{R}_n(A)$   (3)

The general term problem-average was coined in [18]. PA learning curves make sense for Bayesian approaches in particular, where an assumed prior over possible problems often arises naturally. For this reason, much of Section 5 relies on this notion. The risk integrated over the prior, in the Bayesian literature, is also called the Bayes risk, integrated risk, or preposterior risk [19, page 195]. The term preposterior signifies that we can determine this quantity without observing any data.

In semi-supervised learning [20] and active learning [21], it can be of additional interest to study the learning behavior as a function of the number of unlabeled and actively selected samples, respectively.

2.2 Estimating Learning Curves

In practice, we merely have a finite sample from $P$ and we cannot measure $\bar{R}_n(A)$ or consider all possible training sets sampled from $P$. We can only get an estimate of the learning curve. Popular approaches are to use a hold-out dataset or $k$-fold cross validation for this [22, 23, 24, 25], as is also apparent from the Weka documentation [26] and the Scikit-learn implementation [27]. Using cross validation, $k$ folds are generated from the dataset. For each split, a training set and a test set are formed. The size of the training set is varied over a range of values by removing samples from the originally formed training set. For each size, the learning algorithm is trained and the performance is measured on the test fold. The process is repeated for all folds, leading to $k$ individual learning curves. The final estimate is their average. The variance of the estimated curve can be reduced by carrying out the $k$-fold cross validation multiple times [28]. This is done, for instance, in [29, 22, 23, 24].

Using cross validation to estimate the learning curve has some drawbacks. For one, when making the training set smaller, not using the discarded samples for testing seems wasteful. Also note that the training fold size limits the range of the estimated learning curve, especially if $k$ is small. Directly taking a random training set of the preferred size and leaving the remainder as a test set can be a good alternative. This can be repeated to come to an averaged learning curve. This recipe, employed, for instance, in [30, 29], allows for easy use of any range of $n$, leaving no sample unused. Note that the test risks are not independent for this approach. Alternatively, the bootstrap can also be considered to come to variable-sized training sets [31, 32].
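As a minimal illustration of this last recipe, the sketch below (our own construction, not the Scikit-learn learning_curve utility) estimates a learning curve by repeatedly drawing a random training set of each size and testing on the remaining samples; the learner, dataset, and size grid are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

def learning_curve_estimate(X, y, train_sizes, n_repeats=20, seed=0):
    """Estimate a learning curve by repeated random train/test splits.

    For each training set size n, a random subset of size n is drawn and the
    remaining samples form the test set; the zero-one losses are averaged
    over n_repeats draws.
    """
    rng = np.random.RandomState(seed)
    curve = []
    for n in train_sizes:
        losses = []
        for _ in range(n_repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n, random_state=rng.randint(10**6), stratify=y)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            losses.append(zero_one_loss(y_te, clf.predict(X_te)))
        curve.append(np.mean(losses))
    return np.array(curve)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes = [16, 32, 64, 128, 256, 512, 1024]   # geometrically spaced training set sizes
print(learning_curve_estimate(X, y, sizes))
```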

An altogether different approach to learning curve estimation is to assume an underlying parametric model for the learning curve and fit it to learning curve estimates obtained via the approaches described previously. This approach is not widespread and is largely confined to the research work that studies and exploits the general shape of learning curves (see Subsection 4.1).

Finally, note that all of the foregoing pertains to PA learning curves as well. In that setting, we may occasionally be able to exploit the special structure of the assumed problem prior. This is the case, for instance, with Gaussian process regression, where problem averages can sometimes be computed with no additional cost (Section 5).

2.3 Plotting Notes

When plotting the learning curve, it can be useful to consider logarithmic axes. Plotting linearly may mask small but non-trivial gains [33]. Also from a computational standpoint, it often makes sense to have $n$ traverse a logarithmic scale [34] (see also Subsection 3.2). A log-log or semi-log plot can be useful if we expect the learning curve to display power-law or exponential behavior (Section 4), as the curve becomes a straight line in that case. In such a plot, it can also be easier to discern small deviations from such behavior. Finally, it is common to use error bars to indicate the standard deviation over the folds, to give an estimate of the variability of the individual curves.

2.4 Summarizing Learning Curves

Figure 1: Crossing learning curves. Red starts at a lower error, while blue reaches a lower error rate given enough data. The AULC is approximately equal for both curves.

It may be useful at times to summarize learning curves into a single number. A popular metric to that end is the area under the learning curve (AULC) [35, 36, 37] (see [38] for early use). To compute this metric, one needs to settle on a set of sample sizes and then average the performance at all those sample sizes to obtain the area under the learning curve. The AULC thus makes the curious assumption that all sample sizes are equally likely.

Important information can get lost when summarizing. The measure is, for instance, not able to distinguish between two methods whose learning curves cross (Figure 1), i.e., where one method is better in the small-sample regime while the other is better in the large-sample regime. Others have proposed to report the asymptotic value of the learning curve and the number of samples needed to reach it [39], or the exponent of the power-law fit [40].

Depending on the application at hand, and particularly in view of the large diversity in learning curve shapes, such summaries are likely inadequate. Recently, [41] suggested summarizing a curve by all parameters of its parametric fit, which should suffice for most applications. However, we want to emphasize that one should try multiple parametric models and report the parameters of the best fit together with the fit quality (e.g., the MSE).

2.5 Fitting Notes

Popular parametric forms for fitting learning curves are given in Table 1 and their performance is discussed in Section 4.1. Here, we make a few general remarks.

Some works [42, 43, 44, 45] seem to perform simple least squares fitting on log-values in order to fit power laws or exponentials. If, however, one would like to find the optimal parameters in the original space in terms of the mean squared error, non-linear curve fitting is necessary (for example, using Levenberg–Marquardt [46]). Then again, assuming Gaussian errors in the original space may be questionable, since the loss is typically non-negative. Therefore, confidence intervals, hypothesis tests, and p-values should be interpreted with care.

For many problems, one should consider a model that allows for a nonzero asymptotic error (like POW3 and EXP3 in Table 1). Also, often the goal is to interpolate or extrapolate to previously unseen training set sizes. This is a generalization task in itself, and we have to deal with the problems that it may entail, such as overfitting to the learning curve data [46, 47]. Thus, learning curve data should also be split into train and test sets for a fair evaluation.
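As an illustration of such non-linear fitting, the sketch below fits a three-parameter power law with offset (our rendering of a POW3-style model, error ≈ a·n^(−b) + c) to made-up learning curve points with SciPy's curve_fit, which uses Levenberg–Marquardt for unconstrained problems, and holds out the largest sizes to judge extrapolation.

```python
import numpy as np
from scipy.optimize import curve_fit

def pow3(n, a, b, c):
    # three-parameter power law with offset: error(n) = a * n**(-b) + c
    return a * n ** (-b) + c

# illustrative (size, error) pairs of an estimated learning curve
ns = np.array([16, 32, 64, 128, 256, 512], dtype=float)
errs = np.array([0.35, 0.28, 0.23, 0.19, 0.16, 0.145])

# fit on the smaller sizes, hold out the largest sizes to check extrapolation
fit_mask = ns <= 128
popt, _ = curve_fit(pow3, ns[fit_mask], errs[fit_mask],
                    p0=(1.0, 0.5, 0.1), maxfev=10000)
pred = pow3(ns[~fit_mask], *popt)
print("fitted (a, b, c):", popt)
print("extrapolation MSE:", np.mean((pred - errs[~fit_mask]) ** 2))
```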

2.6 Feature Curves and Complexity

Figure 2: Top left: image of the error for varying sample size $n$ and dimensionality $d$, for the pseudo-Fisher learning algorithm (without intercept) on a toy dataset with two Gaussian classes having identity covariance matrices. Their means are a distance 6 apart in 100 dimensions, with every dimension adding the same amount to the overall distance. Bottom left: by fixing $d$ and varying $n$, i.e., taking a horizontal section, we obtain a learning curve (red). We also show the curve where $d$ is chosen optimally for each $n$ (blue). Top right (rotated by 90 degrees): by fixing $n$ and varying $d$, i.e., taking a vertical section, we obtain a feature curve (purple). We also show the curve where the optimal sample size $n$ is chosen for each $d$ (yellow). Bottom right: here we show the paths taken through the image to obtain the learning and feature curves. The learning curve and feature curve are the straight lines, while the curves that optimize $d$ or $n$ take other paths. Observe that the largest $n$ and $d$ are not always optimal.

The word feature refers to the individual measurements that constitute an input vector $x$. A feature curve is obtained by plotting the performance of a machine learning algorithm against the varying number of measurements it is trained on [48, 49]. To be a bit more specific, consider a procedure that selects $d'$ of the original $d$ features, hence reducing the dimensionality of the data to $d'$. A feature curve is then obtained by plotting the performance of the learner trained on these $d'$-dimensional inputs versus $d'$. As opposed to the learning curve, where the dimensionality is kept constant and $n$ varies, it is now the training set size $n$ that is fixed. As such, the feature curve gives a complementary view.

The selection of features can be performed in various ways. Sometimes features have some sort of inherent ordering. In other cases, PCA or feature selection can provide such an ordering. When no ordering can be assumed, features can be selected at random from the data, possibly even with replacement. In this scenario, it is sensible to construct a large number of different feature curves, based on different random subsets, and report their average as the final curve.
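A minimal sketch of such an averaged feature curve, assuming scikit-learn and a synthetic dataset (learner, sizes, and dimensionalities are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import train_test_split

def feature_curve(X_tr, y_tr, X_te, y_te, dims, n_repeats=20, seed=0):
    """Feature curve: error versus the number of randomly selected features,
    at a fixed training set size, averaged over random feature subsets."""
    rng = np.random.RandomState(seed)
    curve = []
    for d in dims:
        losses = []
        for _ in range(n_repeats):
            idx = rng.choice(X_tr.shape[1], size=d, replace=False)
            clf = LogisticRegression(max_iter=1000).fit(X_tr[:, idx], y_tr)
            losses.append(zero_one_loss(y_te, clf.predict(X_te[:, idx])))
        curve.append(np.mean(losses))
    return np.array(curve)

X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=100, random_state=0)
print(feature_curve(X_tr, y_tr, X_te, y_te, dims=[2, 5, 10, 20, 50]))
```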

Typically, an increase in the number of input dimensions means that the complexity of the learner also increases. As such, it can, more generally, be of interest to plot performance against any actual, approximate, or substitute measure of complexity. Instead of changing the dimensionality, changing parameters of the learner, such as the smoothness of a kernel or the number of filters in a CNN, can also be used to vary the complexity and obtain similar curves [50, 51, 52, 53, 48]. Such curves are sometimes called complexity curves [52], parameter curves [54], or generalization curves [55].

One of the better-known phenomena of feature curves is the so-called peaking phenomenon (also the peak effect, peaking, or Hughes phenomenon [56, 57, 58, 2]). The peaking phenomenon of feature curves is related to the curse of dimensionality and illustrates that adding features may actually degrade the performance of a classifier, leading to the classical U-shaped feature curve. Behavior more complex than the simple U-shape has been observed as well [59, 60, 61] and has recently been referred to as double descent [62] (see Figure 2, top right). This is closely related to peaking of (ill-behaving) learning curves (Subsection 6.2).

2.7 Combined Feature and Learning Curves

Generally, the performance of a learner is not influenced independently by the number of training samples $n$ and the number of features $d$. In fact, several theoretical works suggest that the ratio of these two quantities is essential; see, for instance, [63, 64, 65]. Because of this feature–sample interaction, it can be insightful to plot multiple learning curves for a varying number of input dimensions, or multiple feature curves for different training set sizes. Another option is to make a 3D plot—e.g., a surface plot—or a 2D image of the performance against both $n$ and $d$ directly. Instead of the number of features, we can use any other measure of complexity as well.

As an illustration, Figure 2 shows a plot of the performance of pseudo-Fisher's linear discriminant (PFLD; see Section 6.2) when varying both $n$ and $d$. Taking a section of this surface, we obtain either a learning curve (horizontal section, fixed $d$) or a feature curve (vertical section, fixed $n$). The full 2D contour plot gives a more complete view of the interaction between $n$ and $d$. Observe, for instance, in the bottom right subfigure that the optimal $d$ depends on $n$. Likewise, for the particular classifier that we study, there is an optimal $n$ for each $d$, i.e., the largest possible value of $n$ is not necessarily the best.

Duin [60] is possibly the first to include such a 3D plot, though figures that combine multiple learning or feature curves have been in use since the work of Hughes [48]; see, for example, [66, 67], and see [68, 51, 69, 50] for combinations with complexity curves. More recently, [53, 70] give 2D images of the performance of deep neural networks as a function of both model and training set size.

3 General Practical Usage

The study of learning curves has both practical and research/theoretical value. While we do not necessarily aim to make a very strict separation between the two, more emphasis is put on the latter further on. This section focuses on part of the former and covers the current, most important uses of learning curves when it comes to applications, i.e., model selection and extrapolation to reduce data collection and computational costs.

3.1 Better Model Selection and Crossing Curves

Machine learning as a field has shifted more and more to benchmarking learning algorithms, e.g., in the last 20 years, more than 2000 benchmark datasets have been created (see [71] for an overview). These benchmarks are often set up as competitions [72] and investigate which algorithms are better or which novel procedure outperforms existing ones [33]. Typically, a single number, summarizing performance, is used as evaluation measure.

A recent meta-analysis indicates that the most popular measures are accuracy, the F-measure, and precision [73]. An essential issue these metrics ignore is that sample size can have a large influence on the relative ranking of different learning algorithms. In a plot of learning curves this would be visible as a crossing of the different curves (see Figure 1). In that light, it is beneficial if benchmarks consider multiple sample sizes, to get a better picture of the strengths and weaknesses of the approaches. The learning curve provides a concise picture of this sample size dependent behavior.

Crossing curves have also been referred to as the scissor effect and have been investigated since the 1970s [74, 75, 67] (see also [76]). Contrary to such evidence, there are papers that suggest that learning curves do not cross [77, 78]. The latter claim is specific to deep learning, where, perhaps, exceptions may occur that are currently not understood.

Perhaps the most convincing evidence for crossing curves is given in [33]. The paper compares logistic regression and decision trees on 36 datasets. In 15 of the 36 cases the learning curves cross. This may not always be apparent, however, as large sample sizes may be needed to find the crossing point. In the paper, the complex model (decision tree) is better for large sample sizes, while the simple model (logistic regression) is better for small ones. Similarly, Strang et al. [79] performed a large-scale meta-learning study on 294 datasets, comparing linear versus nonlinear models, and found evidence that non-linear methods are better when datasets are large. Ng and Jordan [25] found, when comparing naive Bayes to logistic regression, that in 7 out of 15 datasets considered the learning curves crossed. [42, 80, 81, 82, 83] provide further evidence.

Also using learning curves, [33] finds that, besides sample size, the separability of the problem can be an indicator of which algorithm will dominate the other in terms of the learning curve. Beyond that, the learning curve, when plotted together with the training error of the algorithm, can be used to detect whether a learner is overfitting [2, 54, 84, 3]. Besides sample size, dimensionality also seems to be an important factor in determining whether linear or non-linear methods will dominate [79]. To that end, learning curves combined with feature curves may offer further insights.

3.2 Extrapolation to Reduce Data Collection Costs

When collecting data is time-consuming, difficult, or otherwise expensive, the possibility to accurately extrapolate a learner's learning curve can be useful. Extrapolations (typically based on a parametric learning curve model, see Subsection 4.1) give an impression beforehand of how many examples need to be collected to reach a specific performance and allow one to judge when data collection can be stopped [43]. Examples of such practice can, for instance, be found in machine translation [47] and medical applications [22, 29, 23]. Last [45] quantifies the potential savings, assuming a fixed cost per collected sample and a cost for the generalization error. By extrapolating the learning curve from an initial set of labeled data, the point at which labeling more data is no longer worth the cost can be determined and data collection stopped.
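As a small illustration, and assuming a power law with offset has already been fitted (the parameter values and function name below are made up for the example), the training set size needed to reach a target error can be obtained by inverting the fit:

```python
import numpy as np

def samples_needed(a, b, c, target_error):
    """Given a fitted power law error(n) = a * n**(-b) + c, return the training
    set size at which the extrapolated error reaches target_error.
    Returns inf if the target lies below the estimated asymptote c."""
    if target_error <= c:
        return float("inf")
    return (a / (target_error - c)) ** (1.0 / b)

# e.g. with illustrative fitted parameters a=0.9, b=0.41, c=0.12:
print(samples_needed(0.9, 0.41, 0.12, target_error=0.15))
```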

Determining a minimal sample size is called sample size determination. For standard statistical procedures, this is done through what is called a power calculation [85]. For classifiers, sample size determination using a power calculation is unfeasible according to [22, 23]. John and Langley [86] illustrate that a power calculation that ignores the machine learning model indeed fails to accurately predict the minimal sample size.

Sample size determination can be combined with meta-learning, which uses experience on previous datasets to inform decisions on new datasets. To that end, [87] builds a small learning curve on a new and unseen dataset and compares it to a database of previously collected learning curves to determine the minimum sample size.

3.3 Speeding Up Training and Tuning

Learning curves can be used to reduce the computation time and memory requirements of model training, model selection, and hyperparameter tuning.

To speed up training, so-called progressive sampling [34] uses a learning curve to determine whether adequate performance can be reached with less training data. If the slope of the curve becomes too flat, learning is stopped, making training potentially much faster. It is recommended to use a geometric series for $n$ to reduce the computational complexity.
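The sketch below shows the basic idea under simple, illustrative choices; the geometric factor, stopping threshold, learner, and data are all ours and not prescribed by [34].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def progressive_sampling(X_tr, y_tr, X_val, y_val, n0=64, factor=2,
                         min_gain=0.002):
    """Train on a geometrically growing schedule n0, 2*n0, 4*n0, ... and stop
    once the accuracy gain of the last step drops below min_gain
    (a crude proxy for the learning curve flattening out)."""
    n, prev_acc, clf = n0, -np.inf, None
    while n <= len(y_tr):
        clf = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
        acc = accuracy_score(y_val, clf.predict(X_val))
        print(f"n={n:6d}  validation accuracy={acc:.3f}")
        if acc - prev_acc < min_gain:
            return clf, n
        prev_acc, n = acc, n * factor
    return clf, len(y_tr)

X, y = make_classification(n_samples=20000, n_features=30, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=1)
progressive_sampling(X_tr, y_tr, X_val, y_val)
```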

Several variations on progressive sampling exist. John and Langley [86] propose the notion of probably close enough, where a power-law fit is used to determine whether the learner is epsilon-close to its asymptotic performance. [88] gives a rigorous decision-theoretic treatment of the topic: by assigning costs to computation times and performances, they estimate what should be done to minimize the expected costs. Progressive sampling has also been adapted to the setting of active learning [89]. [87] combines meta-learning with progressive sampling to obtain a further speedup.

To speed up model selection, [90] compares initial learning curves to a database of learning curves to predict which of two classifiers will perform best on a new dataset. This can be used to avoid costly evaluations using cross validation. Leite and Brazdil [91] propose an iterative process that predicts the required sample sizes, builds learning curves, and updates the performance estimates in order to compare two classifiers. Rijn et al. [92] extend the technique to rank many machine learning models according to their predicted performance, tuning their approach to come to an acceptable answer in as little time as possible.

With regard to hyperparameter tuning, already in 1994 Cortes et al. [42] devised an extrapolation scheme for learning curves, based on the fitting of power laws, to determine whether it is worthwhile to fully train a neural network. In the deep learning era, this has received renewed attention. [93] extrapolates the learning curve to optimize hyperparameters. [41] takes this a step further and actually optimizes several design choices, such as the data augmentation. One obstacle for such applications is that it remains unclear when the learning curve has which shape.

4 Well-Behaved Learning Curves

We deem a learning curve well-behaved if it shows improved performance with increased training sample sizes, i.e., if $\bar{R}_{n+1}(A) \le \bar{R}_n(A)$ for all $n$. In slightly different settings, learners that satisfy this property are called smart [94, page 106] and monotone [95].

There is both experimental and theoretical evidence for well-behaved curves. We discuss empirical evidence, which shows that, for most models, the power law offers the best fit; here, the evidence for deep nets is most convincing. For problems with binary features and decision trees, exponential curves cannot be ruled out. Theory supports these findings and indicates that properties of the problem and the model class determine whether the curve will have an exponential or a power-law shape. We also cover PAC learning and why it may not tell us much about the shape of the curve. Finally, theory regarding the PA curve also favors exponential and power-law shapes. These curves turn out to be provably monotone if the problem is well-specified and a Bayesian approach is used.

4.1 In Depth Empirical Studies of Parametric Fits

Reference   Used in
POW2    [43]*, [46], [44], [45]
POW3    [46]*, [47]*, [42], [96]
LOG2    [44]*, [43], [46], [45], [96]
EXP3    [96]*, [47]
EXP2    [43], [44], [45]
LIN2    [43], [44], [45], [96]
VAP3    [46]
MMF4    [46]
WBL4    [46]
EXP4    [47]
EXPP3   [47]
POW4    [47]
ILOG2   [47]
EXPD3   [97]
Table 1: Parametric learning curve models. Note that some curves model performance increase rather than loss decrease. The first column gives the abbreviation used and the number of parameters. An asterisk marks the papers in which the model came out as the best fit.

Various works have studied the fitting of empirical learning curves and found that they can typically be modeled with function classes depending on few parameters. Table 1 provides a comprehensive overview of the parametric models studied in machine learning; models for human learning in [98] may offer further candidates. Two of the primary objectives in studies of fitting learning curves are how well a model interpolates an empirical learning curve over an observed range of training set sizes and how well it can extrapolate beyond that range.

Studies investigating these parametric forms often find that the power law with offset (POW3 in the table) offers a good fit. The offset makes sure that a non-zero asymptotic error can be properly modeled, which seems a necessity in any challenging real-world setting. Surprisingly, even though Frey and Fisher [43] do not include this offset and use POW2, they find for decision trees that on 12 of the 14 datasets they consider, the power law fits best. Gu et al. [46] extend this work to datasets of larger sizes and, next to decision trees, also use logistic regression as a learner. They use an offset in their power law and consider other functional forms, notably VAP, MMF, and WBL. For extrapolation, the power law with bias performed best overall. Last [45] also trains decision trees and finds the power law to perform best. Kolachina et al. [47] give learning curves for machine translation in terms of BLEU score for 30 different settings. Their study considers several parametric forms (see Table 1), but they also find that the power law is to be preferred.

Boonyanunta and Zeephongsekul [97] perform no quantitative comparison and instead postulate that a differential equation models learning curves, leading them to an exponential form, indicated by EXPD in the table. [99] empirically finds exponential behavior of the learning curve for a perceptron trained with backpropagation on a toy problem with binary inputs, but does not perform an in-depth comparison either. In addition, the experimental setup is not described precisely enough: for example, it is not clear how step sizes are tuned or whether early stopping is used.

Three studies find more compelling evidence for deviations from the power law. The first, [100], can be seen as an in-depth extension of [99]. They train neural networks on four synthetic datasets and compare the learning curves using the goodness of fit. Two synthetic problems are linearly separable, the others require a hidden layer, and all can be modeled perfectly by the network. Whether a problem was linearly separable or not did not matter for the shape of the curve. For the two problems involving binary features, exponential learning curves were found, whereas for the problems with real-valued features a power law gave the best fit. However, they also note that it is not always clear that one fit is significantly better than the other.

The second study, [44], evaluates a diverse set of learners on four datasets and shows that the logarithm (LOG2) provides the best fit. The author has some reservations about the results and mentions that the fit focuses on the first part of the curve as a reason why the power law may seem to perform worse; besides, POW3 was not included. Given that many performance measures are bounded, parametric models that increase or decrease beyond any limit must eventually give an arbitrarily bad fit for increasing $n$. As such, LOG2 is suspect anyway.

The third study, [96], considers only the performance of the fit on training data, i.e., on already observed learning curve points. They only use learning curve models with a maximum of three parameters, employ a total of 121 datasets, and use C4.5 for learning. In 86 cases, the learning curve shape fits well with one of the functional forms; in 64 cases EXP3 gave the lowest overall MSE, and in 13 it was POW3. A signed rank test shows that the exponential outperforms all other models listed for [96] in Table 1. Concluding, the first and last of these works provide strong evidence against the power law in certain settings.

4.2 Power Laws and Eye-balling Deep Net Results

Studies of learning curves of deep neural networks mostly claim to find power-law behavior. However, initial works offer no quantitative comparisons to other parametric forms and only consider plots of the results, calling into question the reliability of such claims. Later contributions find power-law behavior over many orders of magnitude of data, offering substantially stronger empirical evidence.

Sun et al. [101] state that their mAP performance on a large-scale internal Google image dataset increases logarithmically in dataset size. This claim is called into question by [93], and we would agree: there seems to be little reason to believe this increase follows a logarithm. As for [44], who also found that the logarithm fits well, one should remark that the performance in terms of mAP is bounded from above and therefore the log model must eventually break down. As opposed to [101], [102] does observe diminishing returns in a similar large-scale setting. [103] also studies large-scale image classification and finds learning curves that level off more clearly in terms of accuracy over orders of magnitude. They presume that this is due to the maximum accuracy being reached, but note that this cannot explain the observations on all datasets. In the absence of any quantitative analysis, these results are possibly not more than suggestive of power-law behavior.

The first to offer much more convincing empirical evidence for the power law are Hestness et al. [93]. They show power laws over multiple orders of magnitude of training set sizes for a broad range of domains: machine translation (error rate), language modeling (cross entropy), image recognition (top-1 and top-5 error rate, cross entropy), and speech recognition (error rate). The exponent was found to depend mostly on the domain, while architecture and optimizer primarily determine the multiplicative constant. For small sample sizes, the power law supposedly no longer holds, as the neural network converges to a random-guessing solution. Overall, this law turns out to be so robust that they suggest one can search for new architectures at smaller sample sizes to speed up experiments. To uncover the power law, significant tuning of the hyperparameters and model size per sample size is necessary; otherwise deviations occur. [41] investigate robust curve fitting for the error rate using the power law with offset and use extrapolations to optimize design decisions. They generally find exponents of larger magnitude than [93].

Kaplan et al. [104] and Rosenfeld et al. [70] further corroborate power laws for image classification and natural language processing. [104] finds that if the model size and computation time are increased together with the sample size, the learning curve shows this particular behavior; if either is too small, this pattern disappears. They find that the test loss also behaves as a power law as a function of model size and training time and that the training loss can be modeled in this way as well. [70] reports that the test loss behaves as a power law in sample size when model size is fixed and vice versa. Both provide models of the generalization error that can be used to extrapolate performance to unseen sample and model sizes and that may reduce the amount of tuning required to get to optimal learning curves. They also propose a model for the transition of a power law to random-guessing performance.

4.3 What Determines the Parameters of the Fit?

Next to the parametric form as such, researchers have investigated what determines the parameters of the fits. [22], [42], and [41] provide evidence that the asymptotic value of the power law and its exponent could be related. Singh [44] investigates the relation between dataset and classifier but does not find any effect. They do find that neural networks and SVMs are more often well-described by a power law and that decision trees are best predicted by a logarithmic model. Only a limited number of datasets and models were tested, however. Perlich et al. [33] speculate that the Bayes error may be indicative of whether the curves of decision trees and logistic regression will cross or not: in case the Bayes error is small, decision trees will often be superior for large sample sizes. All in all, there are few results of this type and most are quite preliminary.

4.4 Shape Depends on Hypothesis Class

Turning to theoretical evidence, a first provably exponential learning curve can be traced back to the famous work on 1NN [105] by Cover and Hart. They point out that, in a two-class problem, if the classes are far enough apart, 1NN only misclassifies samples from one class if all training samples are from the other. In case both classes have equal priors, one can determine that the expected error rate equals $2^{-n}$. This seems to suggest that if a problem is well-separated, classifiers may converge exponentially fast.
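A small simulation (our own construction, not taken from [105]) illustrates this exponential behavior for two well-separated classes:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated 1D Gaussian classes (means -10 and +10, unit variance),
# equal class priors. A 1NN error then occurs essentially only when all
# training samples happen to come from the other class, giving roughly 2**-n.
rng = np.random.RandomState(0)

def sample(n):
    y = rng.randint(0, 2, size=n)
    x = rng.normal(loc=20.0 * y - 10.0, scale=1.0).reshape(-1, 1)
    return x, y

for n in [1, 2, 4, 6, 8]:
    errs = []
    for _ in range(1000):
        x_tr, y_tr = sample(n)
        x_te, y_te = sample(1000)
        clf = KNeighborsClassifier(n_neighbors=1).fit(x_tr, y_tr)
        errs.append(np.mean(clf.predict(x_te) != y_te))
    print(f"n={n}:  estimated error={np.mean(errs):.4f}   2**-n={2.0 ** -n:.4f}")
```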

Also in the context of classification, results from learning theory, especially in the form of Probably Approximately Correct (PAC) bounds [106, 107], have been invoked to justify power-law shapes in both the separable and non-separable case [41]. The PAC model is, however, pessimistic, since it considers the performance on a worst-case distribution, while the behavior on the actual, fixed $P$ can be much more favorable (see, for instance, [108, 109, 110, 111, 112]). Even more problematic, the worst-case distribution considered by PAC is not fixed and can depend on the training set size $n$, whereas for the learning curve $P$ is fixed [113].

A more fitting approach, pioneered by Schuurmans [114, 115] and inspired by the findings in [100], does characterize the shape of the learning curve for a fixed but unknown $P$. This has also been investigated in [116, 117] for specific learners. Recently, Bousquet et al. [113] gave a full characterization of all learning curve shapes for the realizable case and optimal learners. This last work shows that optimal learners can have only three shapes: an exponential shape, a power-law shape, or a learning curve that converges arbitrarily slowly. The optimal shape is determined by novel properties of the hypothesis class (not the VC dimension). The result on arbitrarily slow learning is, partially, a refinement of the no-free-lunch theorem of [94, Section 7.2] and concerns, for example, hypothesis classes that encompass all measurable functions. These results are a strengthening of a much earlier observation by Cover [118].

4.5 Shape Depends on the Problem (PA)

There are a number of works that find evidence for the power-law shape for PA learning curves under various assumptions. At the end of this section we discuss a work that also finds exponential PA curves.

Amari [119, 120] studies PA learning curves for a basic algorithm, the Gibbs learning algorithm, in terms of the cross entropy. The latter is equal to the logistic loss in the two-class setting and underlies logistic regression. [119] refers to it as the entropic error or entropy loss. The Gibbs algorithm is a stochastic learner that assumes a prior over all models considered and, at test time, samples from the posterior defined through the training data, to come to a prediction [121, 65]. Separable data is assumed and the model samples are therefore taken from version space [65].

For Gibbs learning, the expected cross entropy $\varepsilon_n$ can be shown to decompose using a property of the conditional probability [119]. Let $q(S_n)$ denote the probability of selecting a classifier from the prior that classifies all samples in $S_n$ correctly, and assume, in addition, that $S_n \subset S_{n+1}$. Then, while for general losses we end up with an expectation that is generally hard to analyze [122], the expected cross entropy simplifies into the difference of two expectations [119]:

$\varepsilon_n = \mathbb{E}\big[\log q(S_n)\big] - \mathbb{E}\big[\log q(S_{n+1})\big]$   (4)

Under some additional assumptions, which ensure that the prior is not singular, the behavior asymptotic in $n$ can be fully characterized. Amari [119] then demonstrates that

$\varepsilon_n \simeq \dfrac{d}{n}$   (5)

where $d$ is the number of parameters.

Amari and Murata [123] extend this work and consider labels generated by a noisy process, allowing for class overlap. Besides Gibbs, they study algorithms based on maximum likelihood estimation and on the Bayes posterior distribution. They find for Bayes and maximum likelihood that the entropic generalization error behaves as $H_0 + \frac{d}{2n}$, while the training error behaves as $H_0 - \frac{d}{2n}$, where $H_0$ is the best possible cross entropy loss. For Gibbs, the generalization error behaves as $H_0 + \frac{d}{n}$; the corresponding training error is characterized in [123] as well. In case of model mismatch, the maximum likelihood solution can also be analyzed. In that setting, the number of parameters changes to a quantity indicating the number of effective parameters, and $H_0$ becomes the loss of the model closest to the ground-truth density in terms of KL divergence.

In a similar vein, Amari et al. [122] analyze the 0-1 loss, i.e., the error rate, under a so-called annealed approximation [64, 65], which approximates the aforementioned hard-to-analyze risk. Four settings are considered, two of which are similar to those in [123, 120, 119]. The variation in them stems from differences in assumptions about how the labeling is realized, ranging from a unique, completely deterministic labeling function to multiple, stochastic labelings. Possibly the most interesting result is for the realizable case where multiple parameter settings give the correct outcome or, more precisely, where this set has nonzero measure. In that case, the asymptotic behavior is described by a power law that decays essentially faster than what the typical PAC bounds can provide, which have exponents of $\tfrac{1}{2}$ and 1 (a discussion of those results can be found in Section 4.4). This possibility of a richer analysis is sometimes mentioned as one of the reasons for studying learning curves [124, 111, 65].

For some settings, exact results can be obtained. If one considers a 2D input space where the marginal distribution of the inputs is a Gaussian with mean zero and identity covariance, and one assumes a uniform prior over the true linear labeling functions, the PA curve for the zero-one loss can be computed exactly and compared to what the annealed approximation gives [122]. Two decades earlier, in a rather different setting, Peterson [125] derived the exact learning curve of the nearest neighbor classifier (1NN) for a two-class problem in which the marginal is uniform on the unit interval and the posterior probability of one of the two classes equals the input value.

Schwartz et al. [126] use tools similar to Amari's to study the realizable case where all variables (features and labels) are binary and Bayes' rule is used. Under their approximations, the PA curve can be completely determined from a histogram of the generalization errors of models sampled from the prior. For large sample sizes, the theory predicts that the learning curve actually has an exponential shape. The exponent depends on the gap in generalization error between the best and second-best model. In the limit of a gap of size zero, the shape reverts to a power law. The same technique is used to study learning curves that plateau before dropping off a second time [127]. [128] proposes extensions dealing with label noise. [124] casts some doubt on the accuracy of these approximations, and the predictions of the theory of Schwartz et al. indeed deviate quite a bit from their simulations on toy data. What is particularly interesting about these works is that they find learning curve behavior both of power-law and of exponential type, something that [100] found experimentally.

4.6 Monotone Shape if Well-Specified (PA)

The PA learning curve is monotone if the prior and likelihood model are correct and Bayesian inference is employed. This is a consequence of the total evidence theorem [129, 130, 131]. It states, informally, that one obtains the maximum expected utility by taking into account all observations. However, a monotone PA curve does not rule out that the learning curve for individual problems can go up, even if the problem is well-specified, as the work covered in Section 6.7 points out. Thus, if we only evaluate in terms of the learning curve of a single problem, using all data is not always the rational strategy.

It may be of interest to note here that, next to Bayes’ rule, there are other ways of consistently updating one’s belief—in particular, so-called probability kinematics—that allow for an alternative decision theoretic setting in which a total evidence theorem also applies [132].

Of course, in reality, our model probably has some misspecification, which is a situation that has been considered for Gaussian process models and Bayesian linear regression. Some of the unexpected behavior this can lead to is covered in Subsection 6.5 for PA curves and Subsection 6.6 for the regular learning curve. Next, however, we cover further results in the well-behaved setting. We do this, in fact, specifically for Gaussian processes.

5 Gaussian Process Learning Curves

In Gaussian process (GP) regression [133], it is especially the PA learning curve (Subsection 2.1) for the squared loss, under the assumption of a Gaussian likelihood, that has been studied extensively. A reason for this is that many calculations simplify in this setting. We cover various approximations, bounds, and ways to compute the PA curve. We also discuss assumptions and their corresponding learning curve shapes and cover the factors that affect the shape. It appears difficult to say something universal about the shape, besides that it cannot decrease faster than $\mathcal{O}(1/n)$ asymptotically. The fact that these PA learning curves are monotone in the correctly specified setting can, however, be exploited to derive generalization bounds, which we briefly cover as well. This section is limited to correctly specified, well-behaved curves; Subsection 6.5 is devoted to ill-behaved learning curves for misspecified GPs.

The main quantity that is visualized in the PA learning curve of GP regression is the Bayes risk or, equivalently, the problem-averaged squared error. In the well-specified case, this is equal to the posterior variance [133, Equation (2.26)][134],

$\epsilon_n(x) = k(x, x) - k_n(x)^\top \big(K_n + \sigma^2 I\big)^{-1} k_n(x)$   (6)

Here $k$ is the covariance function or kernel of the GP and $K_n$ is the kernel matrix of the input training set $X_n$, with $(K_n)_{ij} = k(x_i, x_j)$. Similarly, $k_n(x)$ is the vector whose $i$th component is $k(x_i, x)$. Finally, $\sigma^2$ is the noise level assumed in the Gaussian likelihood.

The foregoing basically states that the averaging over all possible problems (as defined by the GP prior) and over all possible training outputs is already taken care of by Equation (6). Exploiting this equality, what is then left to do to get to the PA learning curve is an averaging over all test points $x$ according to their marginal distribution and an averaging over all possible input training samples $X_n$.

Finally, for this section, it turns out to be convenient to introduce a notion of the PA learning curve in which only the averaging over different input training sets has not been carried out yet. We denote this by $\epsilon(X_n)$ and suppress the learning algorithm (GP regression) in the notation.
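As an illustration, the sketch below Monte Carlo estimates such a PA learning curve for a 1D GP with a squared exponential kernel, by averaging the posterior variance of Equation (6) over random training inputs and test points; the kernel, length scale, noise level, and input distribution are illustrative choices.

```python
import numpy as np

# Monte Carlo estimate of the PA learning curve of GP regression: the Bayes
# risk equals the posterior variance (Equation (6)), so we only average it
# over training inputs X_n and test points x.
rng = np.random.RandomState(0)
length_scale, noise_var = 0.3, 0.05

def k(a, b):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)   # squared exponential kernel

def pa_risk(n, n_sets=50, n_test=200):
    risks = []
    for _ in range(n_sets):                       # average over training inputs X_n
        x_tr = rng.uniform(-1, 1, size=n)
        x_te = rng.uniform(-1, 1, size=n_test)    # average over test points
        K = k(x_tr, x_tr) + noise_var * np.eye(n)
        k_star = k(x_tr, x_te)                    # shape (n, n_test)
        post_var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
        risks.append(post_var.mean())
    return np.mean(risks)

for n in [2, 4, 8, 16, 32, 64, 128]:
    print(f"n={n:4d}   estimated PA risk = {pa_risk(n):.4f}")
```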

5.1 Fixed Training Set and Asymptotic Value

The asymptotic value of the learning curve and the value of the PA curve for a fixed training set can be expressed in terms of the eigendecomposition of the covariance function. This decomposition is often used when approximating or bounding GPs’ learning curves [135, 134, 136].

Let $p(x)$ be the marginal density and $k$ the covariance function. The eigendecomposition constitutes all eigenfunctions $\phi_i$ and eigenvalues $\lambda_i$ that solve, for all $x'$,

$\int k(x, x')\, \phi_i(x)\, p(x)\, \mathrm{d}x = \lambda_i \phi_i(x')$   (7)

Usually, the eigenfunctions are chosen so that

$\int \phi_i(x)\, \phi_j(x)\, p(x)\, \mathrm{d}x = \delta_{ij}$   (8)

where $\delta_{ij}$ is Kronecker's delta [133]. The eigenvalues are non-negative and assumed sorted from large to small ($\lambda_1 \geq \lambda_2 \geq \dots$). Depending on $p(x)$ and the covariance function $k$, the spectrum of eigenvalues may be either degenerate, meaning there are finitely many nonzero eigenvalues, or nondegenerate, in which case there are infinitely many nonzero eigenvalues. For some combinations, analytical expressions for the eigenvalues and eigenfunctions are known, e.g., for the squared exponential covariance function with a Gaussian marginal. For other cases the eigenfunctions and eigenvalues can be approximated [133].

Now, take the diagonal matrix $\Lambda$ with the eigenvalues $\lambda_i$ on the diagonal and let $\Phi_{ij} = \phi_j(x_i)$, where each $x_i$ comes from the training set $X_n$. The dependence on the training set is indicated by $X_n$. For a fixed training set, the squared loss averaged over all problems can then be written as

$\epsilon(X_n) = \operatorname{tr}\big[(\Lambda^{-1} + \sigma^{-2} \Phi^\top \Phi)^{-1}\big]$   (9)

which is exact [135]. The only remaining average to compute to come to the PA learning curve is with respect to $X_n$. This last average is typically impossible to calculate analytically (see [133, p. 168] for an exception). This leads one to consider the approximations and bounds covered in Section 5.3.

The asymptotic value of the PA learning curve is [137]

$\lim_{n \to \infty} \mathbb{E}\,\epsilon(X_n) = \sum_i \dfrac{\lambda_i \tau^2}{\lambda_i + \tau^2}$   (10)

where convergence is in probability and almost sure for degenerate kernels. Moreover, it is assumed that $\sigma^2 = n\tau^2$ for a constant $\tau^2$, which means that the noise level grows with the sample size. The assumption seems largely a technical one, and Le Gratiet and Garnier [137] claim it can still be reasonable in specific settings.

5.2 Two Regimes, Effect of Length Scale on Shape

For many covariance functions (or kernels), there is a characteristic length scale that determines over which distance in feature space the regression function can change significantly. Williams and Vivarelli [134] make the qualitative observation that this leads the PA curve to often have two regimes: an initial and an asymptotic one. If the length scale of the GP is not too large, the initial decrease of the learning curve is approximately linear in $n$ (the initial regime). They explain that, initially, the training points are far apart and can thus almost be considered as being observed in isolation. Therefore, each training point reduces the posterior variance by the same amount, and the decrease is initially linear. However, when $n$ gets larger, training points get closer together and their reductions in posterior variance interact. The reduction is then not linear anymore, because an additional point adds less information than the previous points. Thus there is an effect of diminishing returns and the decrease gets slower: the asymptotic regime [135, 134].

The smaller the length scale, the longer the initial linear trend, because points are comparatively further apart [134, 135]. Williams and Vivarelli further find that changing the length scale effectively rescales the amount of training data in the learning curve for a uniform input marginal. Furthermore, they find that a higher noise level and smoother covariance functions result in reaching the asymptotic regime earlier for the uniform distribution. It remains unclear how these results generalize to other marginal distributions.

Sollich and Halees [135] note that in the asymptotic regime, the noise level has a large influence on the shape of the curve since here the error is reduced by averaging out noise, while in the non-asymptotic regime the noise level hardly plays a role. They always assume that the noise level is much smaller than the prior variance (the expected fluctuations of the function before observing any samples). Under that assumption, they compare the error of the GP with the noise level to determine the regime: if the error is smaller than the noise, it indicates that one is reaching the asymptotic regime.

5.3 Approximations and Bounds

A simple approximation for evaluating the expectation of Equation (9) with respect to the training set is to replace the matrix $\Phi^\top \Phi$ by its expected value $nI$ (this is its expectation due to Equation (8)). This results in

$\epsilon_n \approx \operatorname{tr}\big[(\Lambda^{-1} + n\sigma^{-2} I)^{-1}\big] = \sum_i \dfrac{\lambda_i \sigma^2}{\sigma^2 + n\lambda_i}$   (11)

This approximation, which should be compared to Equation (10), is shown to be an upper bound on the training loss and a lower bound on the PA curve [138]. From asymptotic arguments, it can be concluded that the PA curve cannot decrease faster than $\sigma^2/n$ for large $n$ [133, p. 160]. Since asymptotically the training and test error coincide, Opper and Vivarelli [138] expect this approximation to give the correct asymptotic value, which, indeed, is the case [137].
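A small numerical sketch of this approximation, assuming the form of Equation (11) as reconstructed above, with an illustrative truncated power-law spectrum and noise level:

```python
import numpy as np

# Evaluate the approximation of Equation (11) for a power-law eigenvalue
# spectrum lambda_i ~ i**(-2); the spectrum, noise level, and truncation
# point are illustrative choices, not values from the cited works.
lam = 1.0 / np.arange(1, 10001) ** 2        # truncated eigenvalue spectrum
noise_var = 0.05

def eps_approx(n):
    return np.sum(lam * noise_var / (noise_var + n * lam))

for n in [10, 100, 1000, 10000]:
    print(f"n={n:6d}   approximate PA risk = {eps_approx(n):.5f}")
```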

The previous approximation works well in the asymptotic regime, but for non-asymptotic cases it is not accurate. Sollich [139] aims to approximate the learning curve in such a way that it characterizes both regimes well. To that end, he, also in collaboration with Halees [135], introduces three approximations based on a study of how the matrix inverse in Equation (9) changes when $n$ is increased. He derives recurrence relations that can be solved and lead to upper and lower approximations that typically enclose the learning curve. The approximations have a form similar to that of Equation (11). While [139] initially hypothesized that these approximations could be actual bounds on the learning curve, Sollich and Halees [135] gave several counterexamples and disproved this claim.

For the noiseless case, in which $\sigma^2 = 0$, Michelli and Wahba [140] give a lower bound for $\epsilon(X_n)$, which in turn provides a lower bound on the learning curve:

$\epsilon(X_n) \geq \sum_{i = n+1}^{\infty} \lambda_i$   (12)

Plaskota [141] extends this result to take into account noise in the GP, and Sollich and Halees [135] provide further extensions that hold under less stringent assumptions. Using a finite-dimensional basis, [142] develops a method to approximate GPs that scales better in $n$ and finds an upper bound. The latter immediately implies a bound on the learning curve as well.

Several alternative approximations exist. For example, Särkkä [143] uses numerical integration to approximate the learning curve, for which the eigenvalues do not need to be known at all. Opper [144] gives an upper bound for the entropic loss using techniques from statistical physics and derives the asymptotic shape of this bound for the Wiener process and for the squared exponential covariance function. [135] notes that in case the entropic loss is small, it approximates the squared error of the GP; thus, for large $n$, this also implies a bound on the learning curve.

5.4 Limits of the Eigenvalue Spectrum

Sollich and Halees [135] also explore the limits of bounds and approximations based on the eigenvalue spectrum. They create problems that have equal spectra but different learning curves, indicating that learning curves cannot be predicted reliably based on eigenvalues alone. Some works rely on more information than just the spectrum, such as [134, 142], whose bounds also depend on integrals involving weighted and squared versions of the covariance function. Sollich and Halees also show several examples where the lower bounds from [138] and [141] can be arbitrarily loose but, at the same time, cannot be significantly tightened. Also their own approximation, which empirically is found to be the best, cannot be further refined. These impossibility results lead them to question the use of the eigenvalue spectrum to approximate the learning curve and to ask what further information may be valuable.

Malzahn and Opper [145] re-derive the approximation from [135] using a different approach. Their variational framework may provide more accurate approximations to the learning curve, presumably also for GP classification [135, 133]. It has also been employed, among others, to estimate the variance of PA learning curves [146, 147].

5.5 Smoothness and Asymptotic Decay

The smoothness of a GP can, in 1D, be characterized by the Sacks-Ylvisaker conditions [148]. These conditions capture an aspect of the smoothness of a process in terms of its derivatives: the order $r$ used in these regularity conditions indicates, roughly, that the $r$th derivative of the stochastic process exists, while the $(r+1)$th does not. Under these conditions, Ritter [149] showed that the asymptotic decay rate of the PA learning curve is of order $n^{-(2r+1)/(2r+2)}$. Thus, the smoother the process, the faster the decay rate.

As an illustration: the squared exponential covariance induces a process that is very smooth, as derivatives of all orders exist. The smoothness of the so-called modified Bessel covariance function [133] is determined by its order, which relates directly to the order $r$ in the Sacks-Ylvisaker conditions. At its lowest order, the covariance function leads to the Ornstein–Uhlenbeck process, which is not even once differentiable; in the limit of large order, it converges to the squared exponential covariance function [134, 133]. For this roughest case the learning curve behaves asymptotically as $n^{-1/2}$. For the smoother SE covariance function the asymptotic rate is faster still [144]. For other rates see [133, Chapter 7] and [134].

In particular cases, the approximations in [135] have stronger implications for the rate of decay. If the eigenvalues decrease as a power law, $\lambda_i \propto i^{-r}$, then in the asymptotic regime (large $n$) the upper and lower approximations coincide and predict that the learning curve shape is given by $(\sigma^2/n)^{(r-1)/r}$. For example, for the covariance function of the classical Ornstein–Uhlenbeck process, i.e., $k(x, x') = \exp(-|x - x'|/\ell)$ with $\ell$ the length scale, we have $r = 2$ and thus an asymptotic decay of $n^{-1/2}$. The eigenvalues of the squared exponential covariance function decay faster than any power law; taking $r \to \infty$, this implies a shape of the form $\sigma^2/n$. These approximations are indeed in agreement with known exact results [135]. In the initial regime (small $n$), the lower approximation gives a power law $n^{-(r-1)}$. In this case, the suggested curve for the Ornstein–Uhlenbeck process takes on the form $n^{-1}$, which agrees with exact calculations as well. In the initial regime for the squared exponential, no direct shape can be computed without additional assumptions. Assuming, among others, a uniform input distribution, the approximation suggests a decay that is even faster than exponential [135].
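As a quick numerical sanity check of these predicted shapes, the short script below evaluates our reconstructed form of the approximation in Equation (11) for a spectrum decaying as $i^{-2}$ and confirms an asymptotic log-log slope close to $-1/2$; the noise level and truncation point are arbitrary choices.

```python
import numpy as np

def approx_error(n, lam, sigma2):
    # Spectrum-based approximation (our reconstruction of Equation (11)).
    return np.sum(sigma2 * lam / (sigma2 + n * lam))

lam = 1.0 / np.arange(1, 200_001) ** 2     # lambda_i ~ i^{-2} (Ornstein-Uhlenbeck-like)
sigma2 = 0.1

n1, n2 = 10_000, 100_000                   # deep in the asymptotic regime
e1, e2 = approx_error(n1, lam, sigma2), approx_error(n2, lam, sigma2)
slope = (np.log(e2) - np.log(e1)) / (np.log(n2) - np.log(n1))
print(f"empirical log-log slope: {slope:.3f} (predicted: -0.5)")
```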

5.6 Bounds through Monotonicity

Subsection 4.6 indicated that the PA learning curve is always monotone if the problem is well specified. Therefore, under the assumption of a well-specified GP, its learning curve is monotone when its average is considered over all possible training sets (as defined by the correctly specified prior). As it turns out, however, this curve is already monotone before averaging over all possible training sets, as long as the smaller training set is contained in the larger one, i.e., the training sets are nested for all $n$. This is because the GP's posterior variance decreases with every addition of a training object, which can be proven from standard results on the conditioning of multivariate Gaussians [134]. Such sample-based monotonicity of the posterior variance can generally be obtained if the likelihood function is modeled by an exponential family and a corresponding conjugate prior is used [150].

Williams and Vivarelli [134] use this result to construct bounds on the learning curve. The key idea is to take the $n$ training points and treat them as $n$ training sets of just a single point each. Each of these gives an estimate of the generalization error at a test point (Equation 6). Because the error always decreases when the training set is enlarged, the minimum over all training points yields an upper bound on the error of the whole training set at that test point. The minimizer is the training point closest to the test point.

To compute a bound on the generalization error, one needs to take the expectation with respect to the test point. In order to keep the analysis tractable, Williams and Vivarelli [134] limit themselves to 1D input spaces with a uniform input distribution and perform the integration numerically. They further refine the technique to training sets of two points and find that, as expected, the bounds become tighter, since larger training sets imply better generalization. Sollich and Halees [135] extend the technique to non-uniform 1D input distributions. By considering training points in a ball of a certain radius around a test point, Lederer et al. [136] derive a similar bound that converges to the correct asymptotic value. In contrast, the original bound from [134] becomes looser as $n$ grows. Experimentally, [136] show that their bound is relatively tight under a wide variety of conditions.
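To illustrate the idea behind this construction (not the exact experimental setup of [134]), the sketch below compares, for a 1D GP with a unit-variance squared exponential covariance, the exact posterior variance at a test point with the upper bound obtained from the single best training point; the kernel, noise level, and inputs are arbitrary choices of ours.

```python
import numpy as np

def se_kernel(a, b, length_scale=0.2):
    # Unit-variance squared exponential covariance between two sets of 1D inputs.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale ** 2)

rng = np.random.default_rng(0)
sigma2 = 0.01                           # observation noise variance
X = rng.uniform(0.0, 1.0, size=20)      # training inputs in 1D
x_test = np.array([0.37])

# Exact posterior variance using all training points.
K = se_kernel(X, X) + sigma2 * np.eye(len(X))
k_star = se_kernel(X, x_test)           # shape (n, 1)
full_var = 1.0 - k_star.T @ np.linalg.solve(K, k_star)

# Upper bound: smallest posterior variance over all single-point training sets,
# attained by the training point closest to the test point.
single_vars = 1.0 - se_kernel(X, x_test)[:, 0] ** 2 / (1.0 + sigma2)
bound = single_vars.min()

print(f"posterior variance (all points): {full_var.item():.5f}")
print(f"single-point upper bound       : {bound:.5f}")
assert full_var.item() <= bound + 1e-12   # adding data never increases the variance
```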

6 Ill-Behaved Learning Curves

It is important to understand that learning curves do not always behave well and that this is not necessarily an artifact of the finite sample or the way an experiment is set up. Deterioration with more training data can obviously occur when considering the curve for a particular training set, because for every $n$ we can be unlucky with our draw of the training set. That such ill-behavior can also occur in expectation, i.e., for the expected learning curve, may be less obvious.

In the authors’ experience, most researchers expect improved performance of their learner with more data. Less anecdotal evidence can be found in the literature. [107, page 153] states that when $n$ surpasses the VC-dimension, the curve must start decreasing. [18, Subsection 9.6.7] claims that for many real-world problems learning curves decay monotonically. [151] calls it expected that performance improves with more data and [46] makes a similar claim, [152] and [153] consider it conventional wisdom, and [97] considers it widely accepted. Others assume well-behaved curves [34], which usually means that curves are smooth and monotone [154]. [122] states that the generalization error decreases as the training set size increases. Note, further, that all works in Subsection 4.1 only consider monotone parametric models.

Figure 3: Qualitative overview of various learning curve shapes placed in different categories with references to their corresponding subsections. All have the sample size on the horizontal axis. Dotted lines indicate the transition from under- to overparametrized models. Abbreviations: error: classification error; sq. loss: squared loss; NLL: negative log likelihood; abs. loss: absolute loss; PA indicates that the problem-average learning curve is shown.

Before this section addresses actual bad behavior, we cover phase transitions, which are at the brink of becoming ill-behaved. Possible solutions to nonmonotonic behavior are discussed at the end. Figure 3 provides an overview of types of ill-behaved learning curve shapes with subsection references. The code reproducing these curves (all based on actual experiments) can be retrieved from https://github.com/tomviering/ill-behaved-learning-curves.

6.1 Phase Transitions

As in physical systems, in a phase transition particular learning curve properties change relatively abruptly, (almost) discontinuously. Figure 3 gives an example of how this can manifest itself. In learning, techniques from statistical physics can be employed to model and analyze these transitions, which are typically studied in the limit of large samples and high input dimensionality [65]. Most theoretical insights are limited to relatively simple learners, like the perceptron, and apply to PA curves.

Let us point out that abrupt changes also seem to occur in human learning curves [10, 155], in particular when the task is complex and has a hierarchical structure [156]. A first mention of the occurrence of phase transitions, explicitly in the context of learning curves, can be found in [157]. It indicates the transition from memorization to generalization, which occurs, roughly, around the time that the full capacity of the learner has been used. Györgyi [158] provides a first, more rigorous demonstration within the framework of statistical physics—notably, the so-called thermodynamic limit [65]. In this setting, actual transitions happen for single-layer perceptrons where weights take on binary values only.

The perceptron and its thermodynamic limit are considered in many later studies as well. The general finding is that phase transitions can occur when discrete parameter values, most often binary weights, are used [159, 124, 160]. The behavior is often characterized by a long plateau on which the perceptron cannot learn at all and performs at the level of random guessing, typically in the overparametrized, memorization phase where $n$ is small relative to the number of weights, until $n$ is large enough to reach the underparametrized regime, at which point a discontinuous jump to non-trivial performance occurs.

Phase transitions are also found in two-layer networks with binary weights and activations [161, 160, 162]. This happens, for instance, for the so-called parity problem, where the aim is to detect the parity of a binary string [163] and for which Opper [164] found phase transitions in approximations of the learning curve. Learning curve bounds may display phase transitions as well [165, 111], though [124, 165] question whether these will also occur in the actual learning curve. Both Sompolinsky [166] and Opper [164] note that the sharp phase transitions predicted by theory will be more gradual in real-world settings. Indeed, when studying this literature, one should be careful in interpreting theoretical results, as the transitions may occur only under particular assumptions or in limiting cases.

For unsupervised learning, phase transitions have been shown to occur as well [167, 168] (see the latter for additional references). Ipsen and Hansen [169] extend these analyses to PCA with missing data. They also show phase transitions in experiments on real-world data sets. [170] provides one of the few real application papers where a distinct, intermediate plateau is visible in the learning curve.

For Figure 3, we constructed a simple phase transition based on a two-class classification problem in 100 dimensions, with the first 99 features standard normal and the 100th feature carrying the class information. PFLD’s performance shows a transition in the error rate at a particular training set size.

6.2 Peaking and Double Descent

The term peaking indicates that the learning curve takes on a maximum, typically in the form of a cusp, see Figure 3. Unlike many other ill behaviors, peaking can occur in the realizable setting. Its cause seems related to instability of the model. This peaking should not be confused with peaking for feature curves as covered in Subsection 2.6, which is related to the curse of dimensionality. Nevertheless, the same instability that causes peaking in learning curves can also lead to a peak in feature curves, see Figure 2. The latter phenomenon has gained quite some renewed attention in recent years under the name double descent [62].

By now, (sample-wise) double descent has become the standard term for the peak in the learning curve of deep neural networks [53, 171]. Related terms are model-wise double descent, which describes a peak in the plot of performance versus model size, and epoch-wise double descent, which describes a peak in the training curve [53].

Peaking was first observed for the PFLD [59]. The PFLD is the classifier minimizing the squared loss, using minimum-norm or ridgeless linear regression based on the pseudo-inverse. PFLD often peaks at $n = d$, both for the squared loss and in terms of the classification error. A first theoretical model explaining this behavior in the thermodynamic limit is given in [63]. In such works, originating from statistical physics, the usual quantity of interest is $\alpha = n/d$, which controls the relative sizes of $n$ and $d$ as both go to infinity [121, 64, 65].

Raudys and Duin [76] investigate this behavior in the finite sample setting where each class is a Gaussian. They approximately decompose the generalization error into three terms. The first term measures the quality of the estimated means and the second the effect of reducing the dimensionality due to the pseudo-inverse. These terms reduce the error when $n$ increases. The third term measures the quality of the estimated eigenvalues of the covariance matrix. This term increases the error when $n$ increases, because more eigenvalues need to be estimated at the same time as $n$ grows, reducing the quality of their overall estimation. These eigenvalues are often small and, as the model depends on their inverse, small estimation errors can have a large effect, leading to a large instability [172] and a peak in the learning curve around $n = d$. Using an analysis similar to [76], [173] studies the peaking phenomenon in semi-supervised learning (see [20]) and shows that unlabeled data can both mitigate and worsen it.

Peaking of the PFLD can be avoided through regularization, e.g., by adding a regularization term $\lambda I$ to the estimate of the covariance matrix [76, 172]. The performance of the model is, however, very sensitive to the correct tuning of the ridge parameter [174, 172]. Assuming the data is isotropic, [175] shows that peaking disappears for the optimal setting of the regularization parameter. Other, more heuristic solutions change the training procedure altogether. For example, [176] uses an iterative procedure that decides which objects PFLD should be trained on, thereby reducing and removing the peak. [177] adds copies of objects with noise, increasing $n$, or increases the dimensionality by adding noise features, increasing $d$. Experiments show this can remove the peak and improve performance.
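The following self-contained sketch illustrates both points on synthetic Gaussian data: a minimum-norm least-squares classifier, standing in for the PFLD, shows a peak in the error around $n = d$, while a small ridge term tames it. The data distribution, dimensionality, and ridge value are our own choices and not the setup of [76] or [175].

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_test, n_repeats = 50, 2000, 50
mean = np.full(d, 0.15)                      # class means at +/- mean, identity covariance

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    X = rng.normal(size=(n, d)) + y[:, None] * mean
    return X, y

def error(w, X, y):
    return np.mean(np.sign(X @ w) != y)

ns = [10, 25, 40, 50, 60, 75, 100, 200, 500]
for n in ns:
    err_pinv, err_ridge = 0.0, 0.0
    for _ in range(n_repeats):
        X, y = sample(n)
        Xt, yt = sample(n_test)
        err_pinv += error(np.linalg.pinv(X) @ y, Xt, yt)          # minimum-norm solution
        w_ridge = np.linalg.solve(X.T @ X + 1.0 * np.eye(d), X.T @ y)
        err_ridge += error(w_ridge, Xt, yt)
    print(f"n = {n:4d}  pseudo-inverse: {err_pinv / n_repeats:.3f}"
          f"  ridge: {err_ridge / n_repeats:.3f}")
```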

Duin [60] illustrates experimentally that the SVM may not suffer from peaking in the first place. Opper [164] suggests a similar conclusion based on a simple thought experiment. For specific learning problems, both [63] and [159] already give a theoretical underpinning for the absence of double descent for the perceptron of optimal (or maximal) stability, a classifier closely related to the SVM. Opper [178] studies the behavior of the SVM in the thermodynamic limit and finds no peaking either. Spigler et al. [179] show, however, that double descent can occur for feature curves when using the (squared) hinge loss, where the peak is typically not located exactly at $n = d$.

Further insight into when peaking can occur may be gleaned from recent works like [180] and [181], which perform a rigorous analysis of the case of Fourier features with PFLD using random matrix theory. Results should, however, be interpreted with care, as these are typically derived in an asymptotic setting where both $n$ and $d$ (or some more appropriate measure of complexity) go to infinity, i.e., a setting similar to the earlier mentioned thermodynamic limit. Furthermore, [182] shows that a peak can occur not only when the training set size $n$ equals the input dimensionality $d$, but also when $n$ matches the number of parameters of the learner, depending on the latter’s degree of nonlinearity. Multiple peaks are also possible [175].

6.3 Dipping and Objective Mismatch

In dipping, the learning curve may initially improve with more samples, but the performance eventually deteriorates and never recovers, even in the limit of infinite data [183], see Figure 3. Thus, the best expected performance is reached at a finite training set size. By constructing an explicit problem, Devroye et al. [94, page 106] already showed that the nearest neighbor classifier is not always "smart" in their terminology, meaning that its learning curve can go up locally. A similar claim is made for kernel rules [94, Problems 6.14 and 6.15]. A 1D toy problem for which many well-known linear classifiers (e.g., SVM, logistic regression, LDA, PFLD) dip is given in Figure 4. In a different context, Ben-David et al. [184] provide an even stronger example in which all linear classifiers optimizing a convex surrogate loss converge, in the limit, to the worst possible classifier, whose error rate approaches 1.

Figure 4: A two-class problem that causes dipping of the learning curve for various linear classifiers (cf. [183]). The sample data (the two stars) illustrates that, with small samples, the optimal linear model in terms of error rate can be obtained. However, due to the surrogate loss that many classifiers optimize, the decision boundary they find in the limit of infinite sample size ends up at a suboptimal location.

Another example, Lemma 15.1 in [94], gives an insightful case of dipping for likelihood estimation.

What is essential for dipping to occur is that the classification problem at hand is misspecified and that the learner optimizes something other than the evaluation metric of the learning curve. Such objective misspecification is standard, since many evaluation measures, such as the error rate, AUC, and F-measure, are notoriously hard to optimize directly (see, e.g., [107, page 119]). If classification-calibrated loss functions are used and the hypothesis class is rich enough to contain the true model, then minimizing the surrogate loss will also minimize the error rate [185, 184]. Thus, a more complex hypothesis class may fix the dipping problem.
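As a concrete, hypothetical illustration of such an objective mismatch (a construction in the spirit of [183] and [184], not the exact problem of Figure 4), the sketch below builds a 1D problem on which logistic regression tends to get worse with more data: small samples rarely contain the rare, far-away negative points and yield an essentially optimal threshold, whereas large samples expose the log-loss to those points and pull the boundary to a location with a much higher error rate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def sample(n):
    """Class +1 sits at x=2; class -1 sits at x=0 (90%) or far away at x=100 (10%)."""
    y = rng.choice([-1, 1], size=n)
    x = np.where(y == 1, 2.0, np.where(rng.random(n) < 0.9, 0.0, 100.0))
    return x.reshape(-1, 1) + 0.01 * rng.normal(size=(n, 1)), y

x_test, y_test = sample(20_000)

for n in [5, 10, 50, 200, 1000, 5000]:
    errs = []
    for _ in range(100):
        x_tr, y_tr = sample(n)
        if len(np.unique(y_tr)) < 2:                 # need both classes to fit
            continue
        clf = LogisticRegression(C=1e6, max_iter=1000).fit(x_tr, y_tr)  # (nearly) unregularized
        errs.append(np.mean(clf.predict(x_test) != y_test))
    print(f"n = {n:5d}   mean error rate: {np.mean(errs):.3f}")
```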

Other works also show dipping of some sort. For example, Frey and Fisher [43] fit C4.5 to a synthetic dataset with binary features for which the parity of all features determines the label. When fitting C4.5, the test error increases with the number of training samples. They attribute this to the fact that C4.5 uses a greedy approach to minimize the error, which closely relates it to objective misspecification. Brumen et al. [96] also show an ill-behaved curve of C4.5 that seems to go up. They note that 34 more curves could not be fitted well with their parametric models, where possibly something similar is going on. In [186], we find another potential example of dipping, as in its Figure 6 the accuracy goes down with increasing sample size.

Anomaly or outlier detection using $k$-nearest neighbors ($k$NN) can also show dipping behavior [153] (referred to there as gravity-defying learning curves). Also here there is a mismatch between the measure used for evaluation, i.e., the AUC, and the objective of $k$NN, which does not optimize the AUC. Hess and Wei [29] also show $k$NN learning curves that deteriorate in terms of AUC in the standard supervised setting.

Also in active learning [21] for classification, where the test error rate is often plotted against the size of the (actively sampled) training set, learning curves are regularly reported to dip [187, 188]. In that case, active learners provide optimal performance for a number of labeled samples that is smaller than the complete training set. This could be interpreted as a great success for active learning: it implies that even in regular supervised learning, one should perhaps use an active learner to pick a subset of one’s complete training set, as this can improve performance. It cannot be ruled out, however, that the active learner merely uses an objective that matches better with the evaluation measure [189].

Meng and Xie [152] construct a dipping curve in the context of time series modeling with ordinary least squares. In their setting, they use an adequate parametric model, but the distribution of the noise changes at every time step, which leads least squares to dip. In this case, using the likelihood to fit the model resolves the non-monotonicity.

Finally, so-called negative transfer [190], as it occurs in transfer learning and domain adaptation [191, 192], can be interpreted as dipping as well. In this case, more source data deteriorates the performance on the target and the objective mismatch stems from training on the combination of source and target data instead of on the latter only.

6.4 Risk Monotonicity and ERM

Several novel examples of non-monotonic behavior for density estimation, classification, and regression by means of standard empirical risk minimization (ERM) are shown in [193]. Similar to dipping, the squared loss increases with $n$, but in contrast to dipping it does eventually recover, see Figure 3. These examples can be explained neither in terms of dipping nor in terms of peaking. Dipping is ruled out because, in ERM, the learner optimizes the very loss that is used for evaluation. In addition, non-monotonicity can be demonstrated for any $n$, so there is no direct link with the capacity of the learner, ruling out an explanation in terms of peaking.

Proofs of non-monotonicity are given for the squared, absolute, and hinge loss. It is demonstrated that likelihood estimators suffer from the same deficiency. Two learners are reported to be provably monotonic: mean estimation (under a particular loss) and the memorize algorithm. The latter algorithm does not really learn but outputs the majority-voted classification label of each object if it has been seen before. Memorize is not PAC learnable [107, 65], illustrating that monotonicity and PAC learnability are essentially different concepts. It is shown experimentally that regularization can actually worsen the non-monotonic behavior. In contrast, Nakkiran [175] shows that optimal tuning of the regularization parameter can guarantee monotonicity in certain settings. A final experiment from [193] shows a surprisingly jagged curve for the absolute loss, see Figure 3.

6.5 Misspecified Gaussian Processes

Gaussian process misspecification has been studied in the regression setting, where a so-called teacher model provides the data, while the student model learns, assuming a covariance or noise model different from the teacher’s. If the two are equal, the PA curve is monotone (Subsection 4.6).

Sollich [194] analyzes the PA learning curve using the eigenvalue decomposition covered earlier. He assumes that both student and teacher use kernels with the same eigenfunctions but possibly differing eigenvalues. Subsequently, he considers various synthetic distributions for which the eigenfunctions and eigenvalues can be computed analytically and finds that, for a uniform distribution on the vertices of a hypercube, multiple overfitting maxima and plateaus may be present in the learning curve (see Figure 3), even if the student uses the teacher’s noise level. For a uniform distribution in one dimension, there may be arbitrarily many overfitting maxima if the student’s noise level is small enough. In addition, the convergence rates change and may become logarithmically slow.

The above analysis is extended by Sollich in [195], where hyperparameters such as the length scale and the noise level are now optimized during learning based on evidence maximization. Among other things, he finds that for the hypercube the arbitrarily many overfitting maxima no longer arise and the learning curve becomes monotone. All in all, Sollich concludes that optimizing the hyperparameters through evidence maximization can alleviate non-monotonicity.

6.6 Misspecified Bayesian Regression

Grünwald and Van Ommen [196] show that a (hierarchical) Bayesian linear regression model can give a broad peak in the learning curve of the squared risk, see Figure 3. One way this can happen is when the homogeneous noise assumption is violated, while the estimator is otherwise consistent.

Specifically, let data be generated as follows. For each sample, a fair coin is flipped. Heads means the sample is generated according to the ground truth probabilistic model contained in the hypothesis class. Misspecification happens when the coin comes up tails and a sample is generated in a fixed location without noise.

According to Grünwald and Van Ommen, the peak in the learning curve cannot be explained by dipping, peaking, or the known sensitivity of the squared loss to outliers. The peak is fairly broad and occurs in various experiments. As no approximations are to blame either, the authors conclude that Bayes’ rule itself is at fault, as it cannot handle the misspecification. The non-monotonicity can occur if the probabilistic model class is not convex.

Following their analysis, a modified Bayes rule is introduced in which the likelihood is raised to some power $\eta$. This parameter cannot be learned in a Bayesian way, which leads to their SafeBayes approach. Their technique alleviates the broad peak in the learning curve and is empirically shown to make the curves generally more well-behaved.
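To give a feel for what raising the likelihood to a power $\eta$ does, the toy snippet below works out the tempered update for a conjugate beta-Bernoulli model; this is our own minimal example, not the hierarchical regression setting of [196], but it shows that $\eta < 1$ simply discounts the observed counts.

```python
# Tempered ("generalized") Bayesian update for a Bernoulli likelihood with a
# Beta(a, b) prior: multiplying the prior by likelihood^eta yields
# Beta(a + eta * heads, b + eta * tails), so eta < 1 discounts the data.
def tempered_beta_posterior(a, b, heads, tails, eta=1.0):
    return a + eta * heads, b + eta * tails

print(tempered_beta_posterior(1, 1, heads=8, tails=2, eta=1.0))   # standard Bayes: (9.0, 3.0)
print(tempered_beta_posterior(1, 1, heads=8, tails=2, eta=0.5))   # tempered:       (5.0, 2.0)
```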

6.7 The Perfect Prior

As we have seen in Subsection 4.6, the PA learning curve is always monotone if the problem is well specified and a Bayesian decision-theoretic approach is followed. Nonetheless, the fact that the PA curve is monotone does not mean that the curve for every individual problem is. [197] offers an insightful example (see also Figure 3): consider a fair coin and let us estimate its probability of heads using Bayes’ rule. We measure performance using the negative log-likelihood on an unseen coin flip and adopt a uniform Beta(1,1) prior on the probability of heads. This prior, i.e., without any training samples, already achieves the optimal loss since it assigns the same probability to heads and tails. After a single flip, $n = 1$, the posterior is updated and assigns predictive probabilities of $2/3$ and $1/3$ to the two outcomes, so the expected loss must increase. Eventually, as $n \to \infty$, the optimal loss is recovered, forming a bump in the learning curve. Note that this construction is rather versatile and can create non-monotonic behavior for practically any Bayesian estimation task. In a similar way, any type of regularization can lead to comparable learning curve shapes (see also [95, 193] and Subsection 6.4).
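The expected curve for this coin example can be computed exactly. The short script below is our own verification of the construction in [197]: the expected negative log-likelihood starts at $\log 2$, rises at $n = 1$, and then decays back towards $\log 2$.

```python
import math

def expected_nll(n, a=1.0, b=1.0, p=0.5):
    """Expected negative log-likelihood of the next flip of a coin with bias p,
    under the Beta(a, b)-posterior predictive after n observed flips."""
    total = 0.0
    for k in range(n + 1):                      # k = number of observed heads
        prob_k = math.comb(n, k) * p**k * (1 - p)**(n - k)
        p_head = (a + k) / (a + b + n)          # posterior predictive P(heads)
        nll = -(p * math.log(p_head) + (1 - p) * math.log(1 - p_head))
        total += prob_k * nll
    return total

for n in [0, 1, 2, 3, 5, 10, 100]:
    print(f"n = {n:3d}   expected NLL = {expected_nll(n):.4f}")
# n=0 gives log(2) ~ 0.6931; n=1 gives ~0.7520; the curve then decays back to log(2).
```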

A related example can be found in [150]. It shows that the posterior variance can also increase for a single problem, unless the likelihood belongs to the exponential family and a conjugate prior is used. GPs fall into this last class.

6.8 Monotonicity: a General Fix?

This section has noted a few particular approaches to restore monotonicity of a learning curve. One may wonder, however, whether generally applicable approaches exist that can turn any learner into a monotone one. A first attempt is made in [198], which proposes a wrapper that, with high probability, makes any classifier monotone in terms of the error rate. The main idea is to consider $n$ as a variable over which model selection is performed. When $n$ is increased, a model trained with more data is compared to the previously best model on validation data. Only if the new model is judged to be significantly better according to a hypothesis test is the older model discarded. If the original learning algorithm is consistent and the size of the validation data grows, the resulting algorithm is consistent as well. It is empirically observed that the monotone version may learn more slowly, giving rise to the question whether there will always be a trade-off between monotonicity and speed (refer to the learning curve in Figure 3).
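A schematic version of such a wrapper could look as follows; the particular significance test (a sign test on the validation points where the two models disagree) and the threshold are placeholder choices of ours, not the exact procedure of [198].

```python
import numpy as np
from scipy.stats import binomtest

def monotone_wrapper(train, Xs, ys, X_val, y_val, ns, alpha=0.05):
    """Model selection over n: only switch to a model trained on more data if it
    is significantly better on the validation set.

    `train(X, y)` is assumed to return a function mapping inputs to predicted labels.
    """
    best = None
    for n in ns:
        candidate = train(Xs[:n], ys[:n])
        if best is None:
            best = candidate
            continue
        cand_wrong = candidate(X_val) != y_val
        best_wrong = best(X_val) != y_val
        # Sign test on the validation points where exactly one of the two models errs.
        wins = int(np.sum(best_wrong & ~cand_wrong))     # candidate right, incumbent wrong
        losses = int(np.sum(~best_wrong & cand_wrong))   # candidate wrong, incumbent right
        if wins + losses > 0:
            p = binomtest(wins, wins + losses, p=0.5, alternative="greater").pvalue
            if p < alpha:                                # candidate significantly better
                best = candidate
    return best
```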

Recently, [199] extended this idea, proposing two algorithms that guarantee monotonicity without needing to set aside validation data. To this end, they assume that the Rademacher complexity of the hypothesis class composed with the loss is finite. This allows them to determine when to switch to a model trained with more data. In contrast to [198], they argue that their second algorithm does not learn more slowly, as its generalization bound coincides with a known lower bound for regular supervised learning.

7 Discussion and Conclusion

Generally, there is strong empirical evidence that learning curves follow power laws. For some models, however, an exponential shape cannot be ruled out. Theory supports this and indicates that the problem and the model class determine the shape. For the non-realizable case, little is known theoretically. GP learning curves have been analyzed for several special cases, but their general shape remains hard to characterize and offers rich behavior, such as different regimes. The ill-behaved learning curves we surveyed show that many further, hard-to-characterize shapes are possible as well. What is perhaps most surprising is that such behavior can even occur in the well-specified case and in the realizable setting. It should be clear that, currently, there is no theory that covers all these different aspects of learning curves.

Much of the work on learning curves seems scattered and incidental. Starting with Hughes, Foley, and Raudys, some initial contributions appeared around 1970. The most organized efforts, starting around 1990, come from the field of statistical physics, with important contributions from Amari, Tishby, Opper, and others. These efforts have found their continuation within GP regression, to which Opper again contributed significantly. For GPs, Sollich probably offered some of the most complete work.

The usage of the learning curve as a tool for the analysis of learning algorithms has varied throughout the past decades. In line with Langley, Perlich, Hoiem, et al. [39, 33, 41], we would like to suggest a more consistent use. We specifically agree with Perlich et al. [33] that, without a study of the learning curves, claims of superiority of one approach over another are perhaps only valid for very particular sample sizes. Reporting learning curves in empirical studies can also help the field move away from its fixation on bold numbers, besides accelerating learning curve research.

In the years to come, we expect investigations of parametric models and their performance in terms of extrapolation. Insights into these problems become more and more important—particularly within the context of deep learning—to enable the intelligent use of computational resources. In the remainder, we highlight some specific aspects that we see as important.

7.1 Averaging Curves and The Ideal Parametric Model

Especially for extrapolation, a learning curve should be predictable, which, in turn, asks for a good parametric model. It seems impossible to find a generally applicable parametric model that covers all aspects of learning curve shapes, in particular those of ill-behaving curves. Nevertheless, we can try to find a model class that gives us sufficient flexibility and extrapolative power. Power laws and exponentials should probably be part of that class, but does that suffice?

To get close to the true learning curve, some studies average hundreds or thousands of individual learning curves [23, 25]. Averaging, however, can mask interesting characteristics of individual curves [4, 200]. This has been extensively debated in psychonomics, where cases have been made for exponentially shaped individual curves but power-law-like average curves [201, 202]. In applications, we may need to model individual, single-training-set curves or curves formed on the basis of relatively small samples. As such, we see potential in studying and fitting individual curves to better understand their behavior.

7.2 How to Robustly Fit Learning Curves

A technical problem that has received little attention is how to properly fit a learning curve model. To the extent that current studies mention how the fitting is carried out at all, they often seem to rely on simple least-squares fitting of log values, assuming independent Gaussian noise at every $n$. Given that this noise model is unbounded, while a typical loss cannot become negative, this choice seems disputable. At the very least, derived confidence intervals and $p$-values should be taken with a grain of salt.

A possibly more reasonable choice, especially when fitting power laws, is to consider a linear least-squares fit in a log-log plot while also modeling the asymptotic value. In that case, the Gaussian noise assumption at least leads to non-negative losses, though it results in skewed errors in the original space. All in all, a probabilistic model with assumptions that more closely match the learning curve seems a promising direction to investigate. Given the tricky nature of extrapolation, particularly when dealing with relatively small samples, further investigation of robust estimation methods that match well with the intended purpose should be worthwhile.
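As an example of the kind of fitting procedure we have in mind, the snippet below fits a three-parameter power law with an asymptote, $a n^{-b} + c$, by nonlinear least squares on synthetic data; the model is just one of the parametric families discussed in Subsection 4.1, and the data and initial values are our own choices.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

rng = np.random.default_rng(3)
ns = np.array([16, 32, 64, 128, 256, 512, 1024])
true = power_law(ns, a=1.2, b=0.45, c=0.08)
observed = true * np.exp(0.05 * rng.normal(size=ns.size))   # multiplicative noise

params, cov = curve_fit(power_law, ns, observed, p0=(1.0, 0.5, 0.05),
                        bounds=([0, 0, 0], [np.inf, 2, 1]))
a, b, c = params
print(f"fitted: a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
print(f"extrapolated error at n = 10000: {power_law(10_000, *params):.4f}")
```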

7.3 Bounds and Alternative Statistics

One should be careful in interpreting theoretical results when it comes to the shape of learning curves. Generalization bounds, such as those provided by PAC analyses, that hold uniformly over all $n$ may not correctly characterize the curve shape. Similarly, a strictly decreasing bound does not imply a monotone learning curve and thus does not rule out ill-behavior. Furthermore, PA learning curves, which require an additional average over problems, can show behavior substantially different from that for a single problem, because here, again, averaging can mask characteristics.

Another incompatibility between typical generalization bounds and learning curves is that the former are constructed to hold with high probability with respect to the sampling of the training set, whereas the latter consider the mean performance over all training sets. Though bounds of one type can be transformed into the other [203], this conversion can change the actual shape of the bound, so such high-probability bounds may also fail to correctly characterize the learning curve shape for this reason.

The preceding can also be a motivation to study learning curves for statistics other than the average. For example, in Equation 2, instead of the expectation we could look at the curves of the median or other percentiles. These quantities are more closely related to high-probability learning bounds. Of course, we would not have to choose one learning curve over the other. They offer different types of information and, depending on our goal, may be worthwhile to study next to each other. Along the same line of thought, we could include quartiles in our plots, rather than the common error bars based on the standard deviation. Ultimately, we could even try to visualize the full loss distribution at every sample size and, potentially, uncover much richer and more unexpected behavior.
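Computing such alternative summaries requires little more than keeping the individual repetitions around; below is a minimal sketch, with dummy error numbers standing in for actual train/test repetitions.

```python
import numpy as np

def percentile_learning_curves(errors, qs=(25, 50, 75)):
    """errors: array of shape (n_repeats, n_sizes) with one loss per repetition
    and training set size. Returns one curve per requested percentile."""
    return {q: np.percentile(errors, q, axis=0) for q in qs}

# Example with dummy numbers: 100 repetitions, 4 training set sizes.
rng = np.random.default_rng(4)
errors = 0.1 + 0.5 / np.sqrt([10, 20, 40, 80]) * (1 + 0.3 * rng.normal(size=(100, 4)))
curves = percentile_learning_curves(errors)
print({q: np.round(c, 3) for q, c in curves.items()})
```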

A final estimate that we think should be investigated more extensively is the training loss. Not only can this quantity aid in identifying overfitting and underfitting issues [200, 2], it is also interesting to study in its own right or, say, in combination with the true risk. Their difference, a simple measure of overfitting, could, for example, turn out to behave more regularly than the two individual measures.

7.4 Research into Ill-Behavior and Meta-learning

We believe better understanding is needed regarding the occurrence of peaking, dipping, and otherwise non-monotonic or phase-transition-like behavior: when and why does this happen? Certainly, a sufficiently valid reason to investigate these phenomena is to quench one’s scientific curiosity. We should be careful, however, not to mindlessly dismiss such behavior as a mere oddity. Granted, these uncommon and unexpected learning curves have often been demonstrated in artificial or unrealistically simple settings, but this is done precisely to make it clear that there is a problem at all.

The simple fact is that, at this point, we do not know what role these phenomena play in real-world problems. Now that many benchmark datasets are readily available, this issue can be studied more rigorously. Properly summarizing (see Section 2.4) and openly sharing learning curve data can further support this research. Automated techniques may then be developed to find curious learning curve phenomena and possibly predict them.

Given the success of meta-learning for curve extrapolation and model selection this seems a promising possibility. Such meta-learning studies on large amounts of datasets could, in addition, shed more light on what determines the parameters of learning curve models, a topic that has been investigated relatively little up to now. Predicting these parameters robustly from very few points along the learning curve will prove valuable for virtually all applications.

7.5 Open Theoretical Questions

There are two rather specific theoretical questions that we would like to draw attention to. Both are concerned with the monotonicity of a learner.

The first one asks whether maximum likelihood estimators for well-specified models behave monotonically. Likelihood estimation, being a century-old, classical technique [204, 205], has been heavily studied, both theoretically and empirically. In much of the theory developed, the assumption that one is dealing with a correctly specified model is common, but we are not aware of any results that demonstrate that better models are obtained with more data. The question is interesting for the likelihood exactly because this estimator has been extensively studied already and still plays a central role in statistics and abutting fields.

The second question is broader: for standard classification and regression problems, among the consistent learners are there monotonic ones? We saw that we could make them more monotonic in some settings, but the question is whether making them strictly monotonic is possible as well. Is there a general solution? Interestingly, specifically for universally consistent classification rules, Devroye et al. [94, page 106] conjecture that this is not possible.

7.6 Concluding

More than a century of learning curve research has brought us quite some insightful and surprising results. What is more striking, however, at least to us, is that there is still so much that we actually do not understand about learning curve behavior. Most theoretical results are restricted to relatively basic learners, while much of the empirical research that has been carried out is quite limited in scope. In the foregoing, we identified some specific challenges already, but we are convinced that many more open and interesting problems can be discovered. In this, the current review should help both as a guide and as a reference.

Acknowledgments

We wholeheartedly acknowledge the valuable input and feedback received from Peter Grünwald, Alexander Mey, Jesse Krijthe and Robert-Jan Bruintjes.

References

  • [1] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. Duncker & Humblot, 1885.
  • [2] Anil K Jain, Robert P. W. Duin, and Jianchang Mao. Statistical pattern recognition: A review. TPAMI, 22(1):4–37, 2000.
  • [3] Marco Loog. Supervised classification: Quite a brief overview. In Machine Learning Techniques for Space Weather, pages 113–145. Elsevier, 2018.
  • [4] Rosa A Schiavo and David J Hand. Ten more years of error rate research. International Statistical Review, 68(3):295–310, 2000.
  • [5] Claudia Perlich. Learning curves in machine learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning, pages 577–488. Springer, 2010.
  • [6] Claude Sammut and Geoffrey I Webb. Encyclopedia of machine learning. Springer Science & Business Media, 2011.
  • [7] Martin Osborne. A modification of Veto logic for a committee of threshold logic units and the use of 2-class classifiers for function estimation. PhD thesis, Oregon State University, 1975.
  • [8] Les E Atlas, David A Cohn, and Richard E Ladner. Training connectionist networks with queries and selective sampling. In NeurIPS, pages 566–573, 1990.
  • [9] Haim Sompolinsky, Naftali Tishby, and H Sebastian Seung. Learning from examples in large neural networks. Physical Review Letters, 65(13):1683, 1990.
  • [10] William Lowe Bryan and Noble Harter. Studies in the physiology and psychology of the telegraphic language. Psychological Review, 4(1):27, 1897.
  • [11] Kenneth W Spence. The differential response in animals to stimuli varying within a single dimension. Psychological Review, 44(5):430, 1937.
  • [12] Edgar James Swift. Studies in the psychology and physiology of learning. The American Journal of Psychology, 14(2):201–251, 1903.
  • [13] Otto Lipmann. Der einfluss der einzelnen wiederholungen auf verschieden starke und verschieden alte associationen. Zeitschrift für Psychologie und Physiologie der Sinnesorgane, 35:195–233, 1904.
  • [14] AD Ritchie. Thinking and machines. Philosophy, 32(122):258, 1957.
  • [15] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
  • [16] Donald H. Foley. Considerations of Sample and Feature Size. IEEE Transactions on Information Theory, 18(5):618–626, 1972.
  • [17] Thomas M. Cover. Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, 1965.
  • [18] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.
  • [19] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
  • [20] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 2010.
  • [21] Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences, 2009.
  • [22] Sayan Mukherjee, Pablo Tamayo, Simon Rogers, Ryan Rifkin, Anna Engle, Colin Campbell, Todd R Golub, and Jill P Mesirov. Estimating dataset size requirements for classifying dna microarray data. Journal of computational biology, 10(2):119–142, 2003.
  • [23] Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, and Long H Ngo. Predicting sample size required for classification performance. BMC medical informatics and decision making, 12(1):8, 2012.
  • [24] Aaron N Richter and Taghi M Khoshgoftaar. Learning curve estimation with large imbalanced datasets. In ICMLA, pages 763–768, 2019.
  • [25] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In NeurIPS, pages 841–848, 2002.
  • [26] https://waikato.github.io/weka-wiki/experimenter/learning_curves/.
  • [27] https://github.com/scikit-learn/scikit-learn/blob/0fb307bf3/sklearn/model_selection/_validation.py#L1105.
  • [28] Ji-Hyun Kim. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Computational statistics & data analysis, 53(11):3735–3745, 2009.
  • [29] Kenneth R Hess and Caimiao Wei. Learning curves in classification with microarray data. In Seminars in oncology, volume 37, pages 65–68. Elsevier, 2010.
  • [30] http://prtools.tudelft.nl/.
  • [31] Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American statistical association, 78(382):316–331, 1983.
  • [32] Anil K Jain, Richard C Dubes, and Chaur-Chin Chen. Bootstrap techniques for error estimation. TPAMI, (5):628–633, 1987.
  • [33] Claudia Perlich, Foster Provost, and Jeffrey S Simonoff. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis. JMLR, 4(1):211–255, 2003.
  • [34] Foster Provost, David Jensen, and Tim Oates. Efficient progressive sampling. In ACM SIGKDD, pages 23–32, 1999.
  • [35] Eduardo Perez and Larry A Rendell. Using multidimensional projection to find relations. In Machine Learning Proceedings 1995, pages 447–455. Elsevier, 1995.
  • [36] Dominic Mazzoni and Kiri Wagstaff. Active learning in the presence of unlabelable examples. Technical report, NASA/JPL, 2004.
  • [37] Burr Settles and Mark Craven. An analysis of active learning strategies for sequence labeling tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 1070–1079, 2008.
  • [38] Edwin E Ghiselli. A comparison of methods of scoring maze and discrimination learning. The Journal of General Psychology, 17(1):15–28, 1937.
  • [39] P. Langley. Machine Learning as an Experimental Science. Machine Learning, 3(1):5–8, 1988.
  • [40] Nicola Bertoldi, Mauro Cettolo, Marcello Federico, and Buck Christian. Evaluating the learning curve of domain adaptive statistical machine translation systems. In Workshop on Statistical Machine Translation, pages 433–441, 2012.
  • [41] Derek Hoiem, Tanmay Gupta, Zhizhong Li, and Michal M. Shlapentokh-Rothman. Learning curves for analysis of deep networks, 2020.
  • [42] Corinna Cortes, Lawrence D Jackel, Sara A Solla, Vladimir Vapnik, and John S Denker. Learning curves: Asymptotic values and rate of convergence. In NeurIPS, pages 327–334, 1994.
  • [43] Lewis J Frey and Douglas H Fisher. Modeling decision tree performance with the power law. In AISTATS, 1999.
  • [44] Sameer Singh. Modeling performance of different classification methods: deviation from the power law. Project Report, Department of Computer Science, Vanderbilt University, USA, 2005.
  • [45] Mark Last. Predicting and optimizing classifier utility with the power law. In ICDMW, pages 219–224. IEEE, 2007.
  • [46] Baohua Gu, Feifang Hu, and Huan Liu. Modelling classification performance for large data sets. In International Conference on Web-Age Information Management, pages 317–328. Springer, 2001.
  • [47] Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. Prediction of Learning Curves in Machine Translation. In ACL, pages 22–30, Jeju Island, Korea, 2012.
  • [48] Gordon Hughes. On the mean accuracy of statistical pattern recognizers. IEEE Trans. IT, 14(1):55–63, 1968.
  • [49] A.K. Jain and B. Chandrasekaran. Dimensionality and Sample Size Considerations in Pattern Recognition Practice. Handbook of Statistics, 2:835–855, 1982.
  • [50] SJ Raudys and AK Jain. Small sample size effects in statistical pattern recognition: recommendations for practitioners. TPAMI, 13(3):252–264, 1991.
  • [51] Robert PW Duin, Dick de Ridder, and David MJ Tax. Experiments with a featureless approach to pattern recognition. Pattern Recognition Letters, 18(11-13):1159–1166, 1997.
  • [52] Luís Torgo. Kernel regression trees. In ECML, pages 118–127, 1997.
  • [53] Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2019.
  • [54] RPW Duin and DMJ Tax. Statistical pattern recognition. In Chi Hau Chen and Patrick S P Wang, editors, Handbook Of Pattern Recognition And Computer Vision, pages 3–24. World Scientific, 2005.
  • [55] Lin Chen, Yifei Min, Mikhail Belkin, and Amin Karbasi. Multiple descent: Design your own generalization curve. arXiv preprint arXiv:2008.01036, 2020.
  • [56] Jan M Van Campenhout. On the peaking of the hughes mean recognition accuracy: the resolution of an apparent paradox. IEEE Transactions on SMC, 8(5):390–395, 1978.
  • [57] Anil K Jain and Balakrishnan Chandrasekaran. Dimensionality and sample size considerations in pattern recognition practice. In P.R. Krishnaiah and L.N. Kanal, editors, Handbook of statistics, volume 2, chapter 39, pages 835–855. Elsevier, 1982.
  • [58] Sarunas Raudys and Vitalijus Pikelis. On dimensionality, sample size, classification error, and complexity of classification algorithm in pattern recognition. TPAMI, (3):242–252, 1980.
  • [59] F Vallet, J-G Cailton, and Ph Refregier. Linear and nonlinear extension of the pseudo-inverse solution for learning boolean functions. EPL (Europhysics Letters), 9(4):315, 1989.
  • [60] Robert P.W. Duin. Classifiers in almost empty spaces. In ICPR, volume 15, pages 1–7, 2000.
  • [61] Amin Zollanvari, Alex Pappachen James, and Reza Sameni. A theoretical analysis of the peaking phenomenon in classification. Journal of Classification, pages 1–14, 2019.
  • [62] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. PNAS, 116(32):15849–15854, 2019.
  • [63] M Opper, W Kinzel, J Kleinz, and R Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581, 1990.
  • [64] Timothy LH Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics of learning a rule. Rev. of Modern Physics, 65(2):499, 1993.
  • [65] Andreas Engel and Christian Van den Broeck. Statistical mechanics of learning. Cambridge University Press, 2001.
  • [66] Anil K Jain and William G Waller. On the optimal number of features in the classification of multivariate gaussian data. Pattern recognition, 10(5-6):365–374, 1978.
  • [67] Robert Pieter Wilhelm Duin. On the accuracy of statistical pattern recognizers. PhD thesis, Technische Hogeschool Delft, 1978.
  • [68] Marina Skurichina and Robert PW Duin. Stabilizing classifiers for very small sample sizes. In ICPR, volume 2, pages 891–896. IEEE, 1996.
  • [69] R.P.W. Duin. On the choice of smoothing parameters for Parzen estimators of probability density functions. IEEE Transactions on Computers, 25(11):1175–1179, 1976.
  • [70] Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit. A constructive prediction of the generalization error across scales. arXiv:1909.12673, 2019.
  • [71] https://paperswithcode.com/sota.
  • [72] David Sculley, Jasper Snoek, Alex Wiltschko, and Ali Rahimi. Winner’s curse? on pace, progress, and empirical rigor. In ICLR, 2018.
  • [73] Kathrin Blagec, Georg Dorffner, Milad Moradi, and Matthias Samwald. A critical analysis of metrics used for measuring progress in artificial intelligence. arXiv:2008.02577, 2020.
  • [74] S Raudys. On the problems of sample size in pattern recognition (in Russian). In Proceedings of the 2nd All-Union Conference on Statistical Methods in Control Theory. Nauka, 1970.
  • [75] Laveen Kanal and B Chandrasekaran. On dimensionality and sample size in statistical pattern classification. Pattern recognition, 3(3):225–234, 1971.
  • [76] Sarunas Raudys and R.P.W. Duin. Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6):385–392, 1998.
  • [77] Ron Kohavi. Scaling up the accuracy of naive-bayes classifiers: A decision-tree hybrid. In KDD, volume 96, pages 202–207, 1996.
  • [78] Jorg Bornschein, Francesco Visin, and Simon Osindero. Small data, big decisions: Model selection in the small-data regime. arXiv:2009.12583, 2020.
  • [79] Benjamin Strang, Peter van der Putten, Jan N van Rijn, and Frank Hutter. Don’t rule out simple models prematurely: a large scale benchmark comparing linear and non-linear classifiers in openml. In IDA, pages 303–315. Springer, 2018.
  • [80] Niels Mørch, Lars K Hansen, Stephen C Strother, Claus Svarer, David A Rottenberg, Benny Lautrup, Robert Savoy, and Olaf B Paulson. Nonlinear versus linear models in functional neuroimaging: Learning curves and generalization crossover. In IPMI, pages 259–270. Springer, 1997.
  • [81] Jude W Shavlik, Raymond J Mooney, and Geoffrey G Towell. Symbolic and neural learning algorithms: An experimental comparison. Machine learning, 6(2):111–143, 1991.
  • [82] Pedro Domingos and Michael Pazzani. On the optimality of the simple bayesian classifier under zero-one loss. Machine learning, 29(2-3):103–130, 1997.
  • [83] Chirs Harris-Jones and Troy L Haines. Sample size and misclassification: Is more always better. AMS Center for Advanced Technologies, 1997.
  • [84] RPW Duin and EM Pekalska. Pattern Recognition: Introduction and Terminology. 37 Steps, 2016.
  • [85] S Jones, S Carley, and M Harrison. An introduction to power and sample size estimation. Emergency medicine journal: EMJ, 20(5):453, 2003.
  • [86] George H John and Pat Langley. Static versus dynamic sampling for data mining. In KDD, volume 96, pages 367–370, 1996.
  • [87] Rui Leite and Pavel Brazdil. Improving progressive sampling via meta-learning on learning curves. In ECML, pages 250–261. Springer, 2004.
  • [88] Christopher Meek, Bo Thiesson, and David Heckerman. The learning-curve sampling method applied to model-based clustering. JMLR, 2(Feb):397–418, 2002.
  • [89] Katrin Tomanek and Udo Hahn. Approximating learning curves for active-learning-driven annotation. In LREC, volume 8, pages 1319–1324, 2008.
  • [90] Rui Leite and Pavel Brazdil. Predicting Relative Performance of Classifiers from Samples. In ICML, pages 497—-503, Bonn, Germany, 2005.
  • [91] Rui Leite and Pavel Brazdil. An iterative process for building learning curves and predicting relative performance of classifiers. In Portuguese Conference on Artificial Intelligence, pages 87–98. Springer, 2007.
  • [92] Jan N. van Rijn, Salisu Mamman Abdulrahman, Pavel Brazdil, and Joaquin Vanschoren. Fast algorithm selection using learning curves. In LNCS, volume 9385, pages 298–309. Springer Verlag, oct 2015.
  • [93] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep Learning Scaling is Predictable, Empirically. arXiv:1712.00409, 2017.
  • [94] Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31. Springer, New York, NY, USA, 1996.
  • [95] Tom Viering, Alexander Mey, and Marco Loog. Open problem: Monotonicity of learning. In Conference on Learning Theory, pages 3198–3201, 2019.
  • [96] Boštjan Brumen, Ivan Rozman, Marjan Heričko, Aleš Černezel, and Marko Hölbl. Best-fit learning curve model for the c4. 5 algorithm. Informatica, 25(3):385–399, 2014.
  • [97] Natthaphan Boonyanunta and Panlop Zeephongsekul. Predicting the relationship between the size of training sample and the predictive power of classifiers. In KES, pages 529–535. Springer, 2004.
  • [98] Michel Jose Anzanello and Flavio Sanson Fogliatto. Learning curve models and applications: Literature review and research directions. Int. Journal of Industr. Ergonomics, 41(5):573–583, 2011.
  • [99] S. Ahmad and G. Tesauro. Study of scaling and generalization in neural networks. Neural Networks, 1(1):3, 1988.
  • [100] David Cohn and Gerald Tesauro. Can neural networks do better than the vapnik-chervonenkis bounds? In NeurIPS, pages 911–917, 1991.
  • [101] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In ICCV, pages 843–852, 2017.
  • [102] Armand Joulin, Laurens Van Der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, pages 67–84. Springer, 2016.
  • [103] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, pages 181–196, 2018.
  • [104] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  • [105] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Trans. IT, 13(1):21–27, 1967.
  • [106] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer, Berlin, 1982.
  • [107] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • [108] Wray L Buntine. A critique of the valiant model. In IJCAI, pages 837–842, 1989.
  • [109] Wendy E Sarrett and Michael J Pazzani. Average case analysis of empirical and explanation-based learning algorithms. Technical Report 89-35, Department of Information & Computer Science, University of California, Irvine, 1989.
  • [110] David Haussler and Manfred Warmuth. The probably approximately correct (pac) and other learning models. In Foundations of Knowledge Acquisition, pages 291–312. Springer, 1993.
  • [111] David Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby. Rigorous learning curve bounds from statistical mechanics [longer version]. Machine Learning, 25(2-3):195–236, 1996.
  • [112] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv:1712.00409, 2017.
  • [113] Olivier Bousquet, Steve Hanneke, Shay Moran, Ramon van Handel, and Amir Yehudayoff. A theory of universal learning. arXiv preprint arXiv:2011.04483, 2020.
  • [114] Dale Schuurmans. Characterizing rational versus exponential learning curves. In European Conference on Computational Learning Theory, pages 272–286. Springer, 1995.
  • [115] Dale Schuurmans. Characterizing rational versus exponential learning curves. journal of computer and system sciences, 55(1):140–160, 1997.
  • [116] Hanzhong Gu and Haruhisa Takahashi. Exponential or polynomial learning curves? case-based studies. Neural computation, 12(4):795–809, 2000.
  • [117] Hanzhong Gu and Haruhisa Takahashi. How bad may learning curves be? IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10):1155–1167, 2000.
  • [118] Thomas M Cover. Rates of convergence for nearest neighbor procedures. In Proceedings of the Hawaii International Conference on Systems Sciences, volume 415, 1968.
  • [119] Shun-ichi Amari. A universal theorem on learning curves. Neural Networks, 6(2):161–166, jan 1993.
  • [120] S Amari. Universal property of learning curves under entropy loss. In IJCNN, volume 2, pages 368–373 vol.2, 1992.
  • [121] Manfred Opper and David Haussler. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In COLT, volume 91, pages 75–87, 1991.
  • [122] Shun-ichi Amari, Naotake Fujita, and Shigeru Shinomoto. Four types of learning curves. Neural Computation, 4(4):605–618, 1992.
  • [123] Shun-ichi Amari and Noboru Murata. Statistical Theory of Learning Curves under Entropic Loss Criterion. Neural Computation, 5(1):140–153, 1993.
  • [124] Hyunjune Sebastian Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.
  • [125] D Peterson. Some convergence properties of a nearest neighbor decision rule. IEEE Trans. IT, 16(1):26–31, 1970.
  • [126] D. B. Schwartz, V. K. Samalam, Sara A. Solla, and J. S. Denker. Exhaustive Learning. Neural Computation, 2(3):374–385, 1990.
  • [127] Naftali Tishby, Esther Levin, and Sara A Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In IJCNN, volume 2, pages 403–409, 1989.
  • [128] Esther Levin, Naftali Tishby, and Sara A. Solla. A statistical approach to learning and generalization in layered neural networks. In COLT, pages 245–260, 1989.
  • [129] Leonard J Savage. The foundations of statistics. John Wiley & Sons, Inc., 1954.
  • [130] Irving John Good. On the principle of total evidence. The British Journal for the Philosophy of Science, 17(4):319–321, 1967.
  • [131] Peter D Grünwald and Joseph Y Halpern. When ignorance is bliss. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 226–234, 2004.
  • [132] Paul R Graves. The total evidence theorem for probability kinematics. Philosophy of Science, 56(2):317–324, 1989.
  • [133] Carl Edward Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. Adaptive computation and machine learning. MIT Press, 2006.
  • [134] Christopher K I Williams and Francesco Vivarelli. Upper and Lower Bounds on the Learning Curve for Gaussian Processes. Machine Learning, 40(1):77–102, 2000.
  • [135] Peter Sollich and Anason Halees. Learning curves for Gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393–1428, 2002.
  • [136] Armin Lederer, Jonas Umlauft, and Sandra Hirche. Posterior Variance Analysis of Gaussian Processes with Application to Average Learning Curves. arXiv:1906.01404, 2019.
  • [137] Loic Le Gratiet and Josselin Garnier. Asymptotic analysis of the learning curve for Gaussian process regression. Machine Learning, 98(3):407–433, 2015.
  • [138] Manfred Opper and Francesco Vivarelli. General Bounds on Bayes Errors for Regression with Gaussian Processes. In NeurIPS, pages 302–308, 1998.
  • [139] Peter Sollich. Learning Curves for Gaussian Processes. In NeurIPS, pages 344–350, Denver, Colorado, USA, 1998.
  • [140] C. A. Micchelli and G. Wahba. Design problems for optimal surface interpolation. In Approximation Theory and Applications, pages 329–348, 1981.
  • [141] Leszek Plaskota. Noisy information and computational complexity, volume 95. Cambridge University Press, 1996.
  • [142] Giancarlo Ferrari Trecate, Christopher K.I. Williams, and Manfred Opper. Finite-dimensional approximation of Gaussian processes. In NeurIPS, pages 218–224, 1999.
  • [143] Simo Särkkä. Learning curves for Gaussian processes via numerical cubature integration. In ICANN, pages 201–208. Springer, 2011.
  • [144] Manfred Opper. Regression with Gaussian processes: Average case performance. Theoretical aspects of neural computation: A multidisciplinary perspective, pages 17–23, 1997.
  • [145] Dörthe Malzahn and Manfred Opper. Learning curves for Gaussian processes regression: A framework for good approximations. In NeurIPS, 2001.
  • [146] Dörthe Malzahn and Manfred Opper. Learning curves for Gaussian processes models: Fluctuations and Universality. In LNCS, volume 2130, pages 271–276. Springer Verlag, 2001.
  • [147] Dörthe Malzahn and Manfred Opper. A Variational Approach to Learning Curves. In NeurIPS, pages 463–469, 2001.
  • [148] Klaus Ritter, Grzegorz W Wasilkowski, and Henryk Wozniakowski. Multivariate integration and approximation for random fields satisfying Sacks-Ylvisaker conditions. The Annals of Applied Probability, 5(2):518–540, 1995.
  • [149] Klaus Ritter. Almost optimal differentiation using noisy data. Journal of Approximation Theory, 86(3):293–309, 1996.
  • [150] M Fraiwan Al-Saleh and Fawaz A Masoud. A note on the posterior expected loss as a measure of accuracy in Bayesian methods. Applied Mathematics and Computation, 134(2-3):507–514, 2003.
  • [151] David MJ Tax and Robert PW Duin. Learning curves for the analysis of multiple instance classifiers. In S+SSPR, pages 724–733. Springer, 2008.
  • [152] Xiao-Li Meng and Xianchao Xie. I Got More Data, My Model is More Refined, but My Estimator is Getting Worse! Am I Just Dumb? Econometric Reviews, 33(1-4):218–250, 2014.
  • [153] Kai Ming Ting, Takashi Washio, Jonathan R. Wells, and Sunil Aryal. Defying the gravity of learning curve: a characteristic of nearest neighbour anomaly detectors. Machine Learning, 106(1):55–91, 2017.
  • [154] Gary M Weiss and Alexander Battistin. Generating well-behaved learning curves: An empirical study. In ICDATA, 2014.
  • [155] William Lowe Bryan and Noble Harter. Studies on the telegraphic language: the acquisition of a hierarchy of habits. Psychological Review, 6(4):345, 1899.
  • [156] Günter Vetter, Michael Stadler, and John D Haynes. Phase transitions in learning. The Journal of Mind and Behavior, pages 335–350, 1997.
  • [157] Stefano Patarnello and Paolo Carnevali. Learning networks of neurons with Boolean logic. EPL (Europhysics Letters), 4(4):503, 1987.
  • [158] Géza Györgyi. First-order transition to perfect generalization in a neural network with binary synapses. Physical Review A, 41(12):7097–7100, 1990.
  • [159] Timothy L H Watkin, Albrecht Rau, and Michael Biehl. The Statistical Mechanics of Learning a Rule. Reviews of Modern Physics, 65(2):499–556, 1993.
  • [160] Kukjin Kang, Jong-Hoon Oh, Chulan Kwon, and Youngah Park. Generalization in a two-layer neural network. Physical Review E, 48(6):4805, 1993.
  • [161] Manfred Opper. Statistical Mechanics of Learning : Generalization. The Handbook of Brain Theory and Neural Networks, page 20, 1995.
  • [162] Holm Schwarze and John A Hertz. Statistical Mechanics of Learning in a Large Committee Machine. NeurIPS, pages 523–530, 1993.
  • [163] D. Hansel, G. Mato, and C. Meunier. Memorization without generalization in a multilayered neural network. EPL (Europhysics Letters), 20(5):471–476, 1992.
  • [164] Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2):763–775, 2001.
  • [165] H Seung. Annealed theories of learning. In Neural Networks: The Statistical Mechanics Perspective, Proceedings of the CTP-PRSRI Joint Workshop on Theoretical Physics. World Scientific, Singapore, 1995.
  • [166] H Sompolinsky. Theoretical issues in learning from examples. In NEC Research Symposium, pages 217–237, 1993.
  • [167] M Biehl and A Mietzner. Statistical mechanics of unsupervised learning. EPL (Europhysics Letters), 24(5):421, 1993.
  • [168] David C Hoyle and Magnus Rattray. Statistical mechanics of learning multiple orthogonal signals: asymptotic theory and fluctuation effects. Physical Review E, 75(1):016101, 2007.
  • [169] Niels Ipsen and Lars Kai Hansen. Phase transition in PCA with missing data: Reduced signal-to-noise ratio, not sample size! In ICML, pages 2951–2960, 2019.
  • [170] Riyaz Ahmad Bhat, Naman Jain, Ashwini Vaidya, Martha Palmer, Tafseer Ahmed, Dipti Misra Sharma, and James Babani. Adapting predicate frames for Urdu PropBanking. In Workshop on Language Technology for Closely Related Languages and Language Variants, pages 47–55, 2014.
  • [171] Preetum Nakkiran. More data can hurt for linear regression: Sample-wise double descent. arXiv:1912.07242, 2019.
  • [172] Marina Skurichina and Robert P.W. Duin. Stabilizing classifiers for very small sample sizes. In ICPR, volume 2, pages 891–896. IEEE, 1996.
  • [173] Jesse H Krijthe and Marco Loog. The peaking phenomenon in semi-supervised learning. In S+SSPR, pages 299–309. Springer, 2016.
  • [174] Volker Tresp. The equivalence between row and column linear regression. Technical report, Siemens, 2002.
  • [175] Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv:2003.01897, 2020.
  • [176] R P W Duin. Small sample size generalization. In 9th Scandinavian Conference on Image Analysis, pages 1–8, 1995.
  • [177] Marina Skurichina and Robert P W Duin. Regularisation of Linear Classifiers by Adding Redundant Features. Pattern Anal. Appl., 2(1):44–52, 1999.
  • [178] Manfred Opper and Robert Urbanczik. Universal learning curves of support vector machines. Physical Review Letters, 86(19):4410, 2001.
  • [179] Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under- to over-parametrization affects generalization in deep learning. Journal of Physics A, 52(47):474001, 2019.
  • [180] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.
  • [181] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv:1903.08560, 2019.
  • [182] Stéphane d’Ascoli, Levent Sagun, and Giulio Biroli. Triple descent and the two kinds of overfitting: Where & why do they appear? arXiv:2006.03509, 2020.
  • [183] Marco Loog and Robert P W Duin. The dipping phenomenon. In S+SSPR, pages 310–317, Hiroshima, Japan, 2012.
  • [184] Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In ICML, pages 1863–1870, 2012.
  • [185] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Large margin classifiers: convex loss, low noise, and convergence rates. In NeurIPS, pages 1173–1180, 2004.
  • [186] Joaquin Vanschoren, Bernhard Pfahringer, and Geoffrey Holmes. Learning from the past with experiment databases. In Pacific Rim International Conference on Artificial Intelligence, pages 485–496. Springer, 2008.
  • [187] Greg Schohn and David Cohn. Less is more: Active learning with support vector machines. In ICML, volume 2, page 6, 2000.
  • [188] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Introducing geometry in active learning for image segmentation. In CVPR, pages 2974–2982, 2015.
  • [189] Marco Loog and Yazhou Yang. An empirical investigation into the inconsistency of sequential active learning. In ICPR, pages 210–215. IEEE, 2016.
  • [190] Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In CVPR, pages 11293–11302, 2019.
  • [191] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):9, 2016.
  • [192] Wouter Marco Kouw and Marco Loog. A review of domain adaptation without target labels. TPAMI, 2019.
  • [193] Marco Loog, Tom Viering, and Alexander Mey. Minimizers of the empirical risk and risk monotonicity. In NeurIPS, pages 7478–7487, 2019.
  • [194] Peter Sollich. Gaussian Process Regression with Mismatched Models. In NeurIPS, pages 519–526, 2002.
  • [195] Peter Sollich. Can Gaussian Process Regression Be Made Robust Against Model Mismatch? In Deterministic and Statistical Methods in Machine Learning, pages 199–210, 2004.
  • [196] Peter Grünwald and Thijs van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.
  • [197] Peter D. Grünwald and Wojciech Kotłowski. Bounds on individual risk for log-loss predictors. JMLR, 19:813–816, 2011.
  • [198] Tom Julian Viering, Alexander Mey, and Marco Loog. Making learners (more) monotone. In IDA, pages 535–547. Springer, 2020.
  • [199] Zakaria Mhammedi and Hisham Husain. Risk-monotonicity in statistical learning. arXiv preprint arXiv:2011.14126, 2020.
  • [200] David J Hand. Construction and assessment of classification rules. Wiley, 1997.
  • [201] Richard B Anderson and Ryan D Tweney. Artifactual power curves in forgetting. Memory & Cognition, 25(5):724–730, 1997.
  • [202] Andrew Heathcote, Scott Brown, and Douglas JK Mewhort. The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2):185–207, 2000.
  • [203] Alexander Mey. A note on high-probability versus in-expectation guarantees of generalization bounds in machine learning. arXiv:2010.02576, 2020.
  • [204] Ronald A. Fisher. An absolute criterion for fitting frequency curves. Messenger of Mathematics, 41:155–160, 1912.
  • [205] Stephen M. Stigler. The epic story of maximum likelihood. Statistical Science, 22(4):598–620, 2007.