Learning curves are an important diagnostic tool that provides researchers and practitioners with insight into a learner’s generalization behavior [2]. Learning curves plot the (estimated) true performance against the number of training samples. Among other things, they can be used to compare different learners to each other. This can highlight the differences due to their complexity, with the simpler learners performing better in the small sample regime, while the more complex learners perform best with large sample sizes. In combination with a plot of their (averaged) resubstitution error (or training error), they can also be employed to diagnose underfitting and overfitting. Moreover, by extrapolating them to sample sizes beyond the ones available, they can aid in deciding whether or not to collect more data.
It seems intuitive that learners become better (or at least do not deteriorate) with more training data. With a bit more reservation, Shalev-Shwartz and Ben-David [2] state, for instance, that “it must start decreasing once the training set size is larger than the VC-dimension” (page 153). The large majority of researchers and practitioners (that we talked to) indeed take it for granted that learning curves show improved performance with more data. Any deviations from this they attribute to the way the experiments are set up, to the finite sample sizes one is dealing with, or to the limited number of cross-validation or bootstrap repetitions one carried out. It is expected that if one could sample a training set ad libitum and measure the learner’s true performance over all data, such behavior disappears. That is, if one could indeed get to the performance in expectation over all test data and over all training samples of a particular size, performance supposedly improves with more data.
We formalize this behavior of expected improved performance in Section 3. As we will typically express a learner’s efficiency in terms of the expected loss, we will refer to this notion as risk monotonicity. Section 4 then continues with the main contribution of this work and demonstrates that various well-known empirical risk minimizers can display nonmonotonic behavior. Moreover, we show that for these learners this behavior can persist indefinitely, i.e., it can occur at any sample size. Note: all proofs can be found in the supplement. Section 5 provides some experimental evidence for some cases of interest that have, up to now, resisted any deeper theoretical analysis. Section 6 then provides a discussion and concludes the work. In this last section, among others, we contrast our notion of risk monotonicity to that of PAC-learnability and consider various research questions of interest to further refine our understanding of learning curves. Though many will find our findings surprising, counterintuitive behavior of the learning curve has been reported before in various other settings. Section 2 goes through these and other related works and puts our contribution in perspective.
2 Earlier Work and Its Relation to the Current
We split up our overview into the more regular works that characterize monotonic behavior and those that identify the existence of nonmonotonic behavior.
2.1 The Monotonic Character of Learning Curves
These early investigations were done in the context of neural networks and in their analyses typically make use of tools from statistical mechanics. A statistical inference approach is studied by Amari et al. [9] and Amari and Murata [10], who demonstrate the typical power-law behavior of the learning curve that is obtained asymptotically. Haussler et al. [11] bring together many of the techniques and results from the aforementioned works. At the same time, they advance the theory for learning curves and provide an overview of the rather diverse, though still monotonic, behavior they can exhibit. In particular, the curve may display multiple steep and sudden drops in the risk.
Already in 1979, Micchelli and Wahba [12] provide a lower bound for learning curves of Gaussian processes. Only at the end of the 1990s and beginning of the 2000s did the overall attention shift from neural networks to Gaussian processes. In this period, various works were published that introduce approximations and bounds [13, 14, 15, 16, 17]. Different types of techniques were employed in these analyses, some of which, again, came from statistical mechanics. The main caveat, when it comes to the results obtained, is the assumption that the model is correctly specified.
The focus of [18] is on support vector machines. They develop efficient procedures for a reliable extrapolation of the learning curve, so that if only limited computational resources are available, these can possibly be assigned to the most promising approaches. It is assumed that, for large enough training set sizes, the error rate converges towards a stable value following a power-law. This behavior was established to hold in many of the aforementioned works. The ideas put forward have found use in specific applications (see, for instance, [19]) and can count on renewed interest these days, especially in combination with computation-time devouring neural networks [20].
All of the aforementioned works study and derive learning curve behavior that shows no deterioration with growing training set sizes, even though they may be described as “learning curves with rather curious and dramatic behavior” . Our work identifies aspects that, in a sense, are more curious and more dramatic: with a larger training set, performance can deteriorate, even in expectation.
2.2 Early Noted Nonmonotonic Behavior
Possibly the first to point out that learning curves can show such nonmonotonic behavior was Duin [21], who looked at the error rate of the so-called Fisher’s linear discriminant. In this context, Fisher’s linear discriminant is used as a classifier and is equivalent to the two-class linear classifier that is obtained by optimizing the squared loss over the training set. This can be solved by regressing the input feature vectors on a $-1$/$+1$ encoding of the class labels. In case the number of training samples is smaller than or equal to the number of input dimensions, one needs to deal with the inverse of singular matrices and typically resorts to the use of the Moore-Penrose pseudo-inverse. In this way, the minimum norm solution is obtained [22]. It is exactly in this undersampled setting, as the number of training samples approaches the dimensionality, that the error rate will be increasing. Around the same time, Opper and Kinzel [23] show that in the context of neural networks a similar behavior is observed for small samples. In particular, the error rate for the single layer perceptron is demonstrated to increase when the training set size goes towards the dimensionality of the data.
Since these two early works, various other examples of this type of nonmonotonic behavior have been reported. Worth mentioning are classifiers built based on the lasso [25] and two rather recent works that have triggered renewed attention to this subject in the neural networks community [26, 27]. The classifier reaching a maximum error rate when the sample size transits from an underspecified to an overspecified setting is referred to as peaking in early works. The two recent works mentioned refer to it as double descent and jamming, respectively.
A completely different phenomenon, and yet another way in which learning curves can be nonmonotonic, is described by Loog and Duin [28]. Their work shows that there are learning problems for which specific classifiers attain their optimal expected 0-1 loss at a finite sample size. For such problems, these classifiers therefore perform essentially worse with an infinite amount of training data compared to their finite sample performance. The behavior is referred to as dipping, as it creates a dip in the learning curve of the error rate. Loog [29] then argues that if one cannot even guarantee improvements in 0-1 loss when receiving more labeled data, such improvements by means of unlabeled data are certainly impossible. He then shows that for generative models, when focusing on the loss the model optimizes, one can get to demonstrable improvements (see also [30]). Our work shows, however, that also when one looks at the loss the learner optimizes, there are not necessarily performance guarantees.
The dipping behavior hinges both on the fact that the model is misspecified and that the classifier does not optimize what it is ultimately evaluated with. That this setting can cause problems, e.g. convergence to the wrong solution, had already been demonstrated for maximum likelihood estimation by Devroye et al. [31, Lemma 15.1]. If the model class is flexible enough, this discrepancy disappears in many a setting. This happens, for instance, for the class of classification-calibrated surrogate losses [32]. Ben-David et al. [33] analyze the consequence of the mismatch between surrogate and zero-one loss in some more detail and provide another example of a problem distribution on which such classifiers would dip.
Our results strengthen or extend the above findings in the following ways. First of all, we show that nonmonotonic behavior can occur in the setting where the complexity of the learner is small compared to the training set size. Therefore, the reported behavior is not due to jamming or peaking. Secondly, we are going to evaluate our learners by means of the loss they actually optimize for. If we look at the linear classifier that optimizes the hinge loss, for instance, we will study its learning curve for the hinge loss as well. In other words, there is no discrepancy between the objective used during training and the loss used at test time. Therefore, possibly odd behavior cannot be explained by dipping. Thirdly, we do not only look at classification and regression but also consider density estimation and (negative) log-likelihood estimation in particular.
3 Risk Monotonicity
With this section, the more formal part of this work starts. We formalize the intuition that with one additional instance a learner should improve its performance in expectation over the training set. In the next section, we then study various learners with the notions developed here. First, however, some notation and prior definitions.
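We let $S_n = (z_1, \ldots, z_n)$ be a training set of size $n$, sampled i.i.d. from a distribution $D$ over a general domain $\mathcal{Z}$. Also given is a hypothesis class $\mathcal{H}$ and a loss function $\ell : \mathcal{Z} \times \mathcal{H} \to \mathbb{R}$ through which the performance of a hypothesis $h \in \mathcal{H}$ is measured. The objective is to minimize the expected loss or risk under the distribution $D$, which is given by
$$R_D(h) := \mathbb{E}_{z \sim D}\, \ell(z, h).$$
A learner $A$ is a particular mapping from the set of all samples $\mathcal{S} := \bigcup_{n=1}^{\infty} \mathcal{Z}^n$ to elements from the prespecified hypothesis class $\mathcal{H}$. That is, $A : \mathcal{S} \to \mathcal{H}$. We are particularly interested in learners $A_{\mathrm{erm}}$ that provide a solution which minimizes the empirical risk over the training set:
$$A_{\mathrm{erm}}(S_n) := \operatorname*{argmin}_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} \ell(z_i, h).$$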
Most common classification, regression, and density estimation problems can be formulated in such terms. Examples are the earlier mentioned Fisher’s linear discriminant, support vector machines, and Gaussian processes, but also maximum likelihood estimation, linear regression, and the lasso can be cast in these terms. We consider all of these learners.
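As a minimal sketch of empirical risk minimization as described above, the following Python fragment searches a finite candidate set instead of a full hypothesis class. The function names, the one-dimensional linear model, the toy sample, and the grid of slopes are all assumptions made for illustration; none of them come from the paper.

```python
def empirical_risk(h, sample, loss):
    """Average loss of hypothesis h over the training sample."""
    return sum(loss(z, h) for z in sample) / len(sample)


def erm(sample, hypotheses, loss):
    """Return a candidate hypothesis that minimizes the empirical risk.

    Assumption: the hypothesis class is replaced by a finite list of
    candidates so that the minimization can be done by enumeration.
    """
    return min(hypotheses, key=lambda h: empirical_risk(h, sample, loss))


def squared_loss(z, h):
    """Squared loss of a 1D linear model without intercept; h is the slope."""
    x, y = z
    return (h * x - y) ** 2


# Hypothetical toy data: (input, output) pairs.
sample = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
grid = [i / 10 for i in range(0, 41)]  # candidate slopes 0.0, 0.1, ..., 4.0
best = erm(sample, grid, squared_loss)
```

For this toy sample the exact least squares slope is 27.9/14, so the grid search settles on the nearest candidate, 2.0.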
3.2 Degrees of Monotonicity
The basic definition is the following.
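Definition 1 (local monotonicity)
A learner $A$ is $(D, \ell, n)$-monotonic with respect to a distribution $D$, a loss $\ell$, and an integer $n \in \mathbb{N} := \{1, 2, \ldots\}$ if
$$\mathbb{E}_{S_{n+1} \sim D^{n+1}} \bigl[ R_D(A(S_{n+1})) \bigr] - \mathbb{E}_{S_n \sim D^n} \bigl[ R_D(A(S_n)) \bigr] \le 0,$$
where $R_D$ denotes the risk under $D$ and $S_n$ is an i.i.d. training set of size $n$.
This expresses exactly how we would expect a learner to behave locally (i.e., at a specific training sample size $n$): given one additional training instance, we expect the learner to improve. Based on our definition of local monotonicity, we can construct stronger desiderata that are, potentially, of more interest.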
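The local condition can be probed numerically. The sketch below estimates, by Monte Carlo, the difference between the expected risk at sample sizes $n+1$ and $n$ for a distribution with finite support, so that the risk of each learned hypothesis can be computed exactly. The helper names, the toy distribution, and the choice of learner (the sample mean under the squared loss) are assumptions for illustration only, and a simulation can suggest but not prove monotonicity.

```python
import random


def risk(h, support, probs, loss):
    """Exact risk of hypothesis h under a discrete distribution."""
    return sum(p * loss(z, h) for z, p in zip(support, probs))


def expected_risk(learner, n, support, probs, loss, trials=20000, seed=0):
    """Monte Carlo estimate of the risk in expectation over training sets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        sample = rng.choices(support, weights=probs, k=n)
        total += risk(learner(sample), support, probs, loss)
    return total / trials


def risk_difference(learner, n, support, probs, loss, **kwargs):
    """Estimated expected risk at n + 1 minus that at n.

    A value <= 0 is consistent with local monotonicity at n.
    """
    return (expected_risk(learner, n + 1, support, probs, loss, **kwargs)
            - expected_risk(learner, n, support, probs, loss, **kwargs))


# Hypothetical example: estimating a mean under the squared loss.
def sample_mean(sample):
    return sum(sample) / len(sample)


def squared_loss(z, h):
    return (z - h) ** 2


delta = risk_difference(sample_mean, 1, [0.0, 1.0], [0.5, 0.5], squared_loss)
```

For this fair two-point distribution the exact difference is $-0.125$ (the expected risk drops from $0.5$ to $0.375$), and the estimate should land close to that value.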
The two entities we would like to get rid of in the above definition are the $n$ and the $D$. The former, because we would like our learner to act monotonically irrespective of the sample size. The latter, because we typically do not know the underlying distribution. For now, getting rid of the loss is maybe too much to ask for. First of all, not all losses are compatible with one another, as they may act on different types of $z$ and $h$. But even if they take the same types of input, a learner is typically designed to minimize one specific loss and there seems to be no direct reason for it to be monotonic in terms of another. It seems less likely, for example, that an SVM is risk monotonic in terms of the squared loss. (We will nevertheless return to this matter in Section 6.) We focus precisely on the empirical risk minimizers as they seem to be the most appropriate candidates to behave monotonically in terms of their own loss.
Though we typically do not know $D$, we do know in which domain $\mathcal{Z}$ we are operating. Therefore, the following definition is suitable.
Definition 2 (local $\mathcal{Z}$-monotonicity)
A learner $A$ is (locally) $(\mathcal{Z}, \ell, n)$-monotonic with respect to a loss $\ell$ and an integer $n \in \mathbb{N}$ if, for all distributions $D$ on $\mathcal{Z}$, it is $(D, \ell, n)$-monotonic.
When it comes to $n$, the peaking phenomenon shows that, for some learners, it may be hopeless to demand local monotonicity for all $n \in \mathbb{N}$. What we still can hope to find is an $N \in \mathbb{N}$, such that for all $n \ge N$, we find the learner to be locally risk monotonic. As properties like peaking may change with the dimensionality (the complexity of the classifier is generally dependent on it), the choice for $N$ will typically have to depend on the domain.
Definition 3 (weak $\mathcal{Z}$-monotonicity)
A learner $A$ is weakly $(\mathcal{Z}, \ell, N)$-monotonic with respect to a loss $\ell$ if there is an integer $N \in \mathbb{N}$ such that for all $n \ge N$, the learner is locally $(\mathcal{Z}, \ell, n)$-monotonic.
Given the domain, one may of course be interested in the smallest $N$ for which weak $\mathcal{Z}$-monotonicity is achieved. If it does turn out that $N$ can be set to $1$, the learner is said to be globally $\mathcal{Z}$-monotonic.
Definition 4 (global $\mathcal{Z}$-monotonicity)
A learner $A$ is globally $\mathcal{Z}$-monotonic with respect to a loss $\ell$ if for every integer $n \in \mathbb{N}$, the learner is locally $(\mathcal{Z}, \ell, n)$-monotonic.
4 Theoretical Results
We consider the hinge loss, the squared loss, and the absolute loss and linear models that optimize the corresponding empirical loss. In essence, we demonstrate that there are various domains $\mathcal{Z}$ for which, for any choice of $N$, these learners are not weakly $(\mathcal{Z}, \ell, N)$-monotonic. For the log-likelihood, we basically prove the same: there are standard learners for which the (negative) log-likelihood is not weakly $(\mathcal{Z}, \ell, N)$-monotonic for any $N$. The first three losses can all be used to build classifiers: the first is at the basis of SVMs, while the second gives rise to Fisher’s linear discriminant in combination with linear hypothesis classes. The second and third loss are of course also employed in regression. The log-likelihood is standard in density estimation.
4.1 Learners that Do Behave Monotonically
Before we actually move to our negative results, we first provide some examples that point in a positive direction. The first learner is provably risk monotonic over a large collection of domains. For the second learner we do not offer a rigorous proof of monotonicity, but we make a simple argument that shows that it is reasonable to expect that it does behave as such. We finally recall the memorize algorithm [34]: a monotonic learner similar to the second one.
Fitting a normal distribution with fixed covariance and unknown mean.
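Theorem 1
Let $\Sigma$ be an invertible $d \times d$-matrix, $\mathcal{H} := \{ z \mapsto \mathcal{N}(z \mid \mu, \Sigma) : \mu \in \mathbb{R}^d \}$, and take the loss to equal the negative log-likelihood. If $\mathcal{Z} \subset \mathbb{R}^d$ is bounded, the learner $A_{\mathrm{erm}}$ is globally $\mathcal{Z}$-monotonic.
Remark 1
Using similar arguments, one can show that the learner $A_{\mathrm{erm}}$ with $\mathcal{H} := \mathbb{R}^d$ and Mahalanobis loss $\ell(z, h) = (z - h)^\top M (z - h)$, with $M$ a positive semi-definite matrix, is globally $\mathcal{Z}$-monotonic as well, as long as $\mathcal{Z}$ is bounded.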
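A short worked calculation may make this positive result plausible. It is a sketch under stated assumptions, not the proof from the supplement: we assume that the empirical risk minimizer is the sample mean $\hat\mu_n$, and we introduce, purely for illustration, the symbols $C$ for the covariance matrix of the data distribution (which exists when the domain is bounded) and $M$ for a fixed positive semi-definite matrix. A test point $z$ is independent of $\hat\mu_n$ and both have the same mean, so the cross term vanishes:

```latex
\begin{align*}
\mathbb{E}_{S_n}\,\mathbb{E}_{z}\bigl[(z-\hat\mu_n)^\top M\,(z-\hat\mu_n)\bigr]
  &= \operatorname{tr}\bigl(M\,\operatorname{Cov}(z)\bigr)
   + \operatorname{tr}\bigl(M\,\operatorname{Cov}(\hat\mu_n)\bigr) \\
  &= \operatorname{tr}(M C) + \tfrac{1}{n}\operatorname{tr}(M C)
   = \Bigl(1+\tfrac{1}{n}\Bigr)\operatorname{tr}(M C).
\end{align*}
```

Since $\operatorname{tr}(MC) \ge 0$, this expression does not increase with $n$. Taking $M = \tfrac{1}{2}\Sigma^{-1}$ gives the negative log-likelihood of the fixed-covariance normal model up to an additive constant that does not depend on $n$.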
Fitting a categorical distribution through maximum likelihood.
Consider a random variable $X$ that can take on $C$ categorical values, which we just take to be the integers $1$ to $C$. We need to estimate the $C$ discrete probabilities $p_i := P(X = i)$ with $i \in \{1, \ldots, C\}$. If $n_i$ is the number of times $i$ has been observed among the $n$ training samples, the maximum likelihood estimate for $p_i$ equals $n_i / n$.
We get a risk, i.e., the overall negative log-likelihood, that is positive infinite if there is at least one category that was not observed at training time. In all other cases the log-likelihood is finite. Now, the probability of not observing a single sample for at least one of the categories equals one minus the probability of observing at least one sample from every category. The latter probability will only go up with more training samples, and so the probability of obtaining estimates that result in a negative log-likelihood that is infinite will decrease with increasing sample sizes. It therefore seems reasonable to expect that this learner acts risk monotonically.
One route towards making a rigorous argument could be to consider well-behaved estimates based on Laplace smoothing [35] and to let the additive smoothing parameters shrink to zero.
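The probability that drives this argument can be computed exactly by inclusion-exclusion. The sketch below does so for a given vector of category probabilities; the function name and the example probabilities are hypothetical. It only illustrates that the chance of an infinite negative log-likelihood shrinks with the sample size and is not a substitute for a proof of risk monotonicity.

```python
from itertools import combinations


def prob_some_category_unobserved(probs, n):
    """Probability that at least one category is missing from n i.i.d. draws.

    Inclusion-exclusion: P(all seen) = sum over subsets S of categories of
    (-1)^{|S|} * (1 - total probability of S)^n.
    """
    num_categories = len(probs)
    p_all_seen = 0.0
    for size in range(num_categories + 1):
        for subset in combinations(range(num_categories), size):
            missing_mass = sum(probs[i] for i in subset)
            p_all_seen += (-1) ** size * (1.0 - missing_mass) ** n
    return 1.0 - p_all_seen


# Hypothetical three-category distribution.
probs = [0.7, 0.2, 0.1]
curve = [prob_some_category_unobserved(probs, n) for n in range(1, 21)]
```

The enumeration is exponential in the number of categories, which is acceptable for a small illustration but not for large alphabets.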
The memorize algorithm [34].
When evaluated on a test input object that is also present in the training set, this classifier returns the label of said training object. In case multiple training examples share the same input, the majority voted label is returned. In case the test object is not present in the training set, a default label is returned. This learner is monotonic for any distribution under the zero-one loss as it only updates its decision on points that it observes.
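The description above translates almost directly into code. The following is a minimal sketch, assuming hashable inputs; the class name, method names, and the tie-breaking rule (the label seen first among tied labels wins) are assumptions for illustration, as the text does not specify them.

```python
from collections import Counter, defaultdict


class Memorize:
    """Memorizing classifier: majority label for seen inputs, default otherwise."""

    def __init__(self, default_label):
        self.default_label = default_label
        self.votes = defaultdict(Counter)  # input -> label counts

    def fit(self, inputs, labels):
        """Record one vote per training example."""
        for x, y in zip(inputs, labels):
            self.votes[x][y] += 1
        return self

    def predict(self, x):
        """Return the majority label of x, or the default label if x is unseen."""
        if x not in self.votes:
            return self.default_label
        # Assumption: ties go to the label that was encountered first.
        return self.votes[x].most_common(1)[0][0]


# Hypothetical usage with string inputs and integer labels.
clf = Memorize(default_label=-1).fit(["a", "a", "a", "b"], [1, 1, 0, 0])
```

Here `clf.predict("a")` returns the majority label 1, while an unseen input such as `"c"` falls back to the default label.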
4.2 Learners that Don’t Behave
To show for various learners that, to a large extent, they do not behave risk monotonically, we construct specific discrete distributions for which we can explicitly prove nonmonotonicity. What leads to the sought-after counterexamples in our case is a distribution where a small part of the density is located relatively far away. In particular, shrinking the probability of this part towards 0 leads us to the lemma below. It is used in the subsequent proofs, but is also of some interest in itself.
Let be a domain with two elements from , let
be a training set with samples, and let . If
then is not locally -monotonic.
For many losses, we have, in fact, that , which further simplifies the difference of interest to .
Linear hypotheses, squared loss, absolute loss, and hinge loss.
We consider linear models without bias in $d$ dimensions, so take $\mathcal{Z} = \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \mathbb{R}$ and $\mathcal{H} = \mathbb{R}^d$. Though not crucial to our argument, we select the minimum-norm solution in the underdetermined case. $A_{\mathrm{erm}}$ is the general minimizer of the empirical risk in this setting. For the squared loss, we have $\ell(z, h) = (x^\top h - y)^2$ for any $z = (x, y) \in \mathcal{Z}$. The absolute loss is given by $\ell(z, h) = |x^\top h - y|$ and the hinge loss is defined as $\ell(z, h) = \max(0, 1 - y\, x^\top h)$. Both the absolute loss and the squared loss can be used for regression and classification. The hinge loss is appropriate only for the classification setting. Therefore, though the rest of the setup remains the same, outputs are limited to the set $\{-1, +1\}$ for the hinge loss.
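For concreteness, the three losses for a linear model without bias can be written as follows. This is a sketch that assumes the standard textbook definitions of the squared, absolute, and hinge loss; the function names and the representation of an example as a pair `(x, y)` with `x` a tuple of features are illustrative choices.

```python
def dot(x, h):
    """Inner product of a feature vector x and a weight vector h."""
    return sum(xi * hi for xi, hi in zip(x, h))


def squared_loss(z, h):
    """Squared difference between the linear prediction and the output."""
    x, y = z
    return (dot(x, h) - y) ** 2


def absolute_loss(z, h):
    """Absolute difference between the linear prediction and the output."""
    x, y = z
    return abs(dot(x, h) - y)


def hinge_loss(z, h):
    """Hinge loss; assumes the output y is encoded as -1 or +1."""
    x, y = z
    return max(0.0, 1.0 - y * dot(x, h))


# Hypothetical example in two dimensions.
z = ((1.0, 2.0), 1.0)
h = (0.5, 0.0)
losses = (squared_loss(z, h), absolute_loss(z, h), hinge_loss(z, h))
```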
Theorem 2
Consider a linear $A_{\mathrm{erm}}$ without intercept and assume it either optimizes the squared, the absolute, or the hinge loss. Assume $\mathcal{Y}$ contains at least one nonzero element. If there exists an open ball $B_0 \subset \mathbb{R}^d$ that contains the origin, such that $B_0 \subset \mathcal{X}$, then this risk minimizer is not weakly $(\mathcal{Z}, \ell, N)$-monotonic for any $N \in \mathbb{N}$.
Fitting a normal distribution with fixed mean and unknown variance (in one dimension).
We follow up on the example where we fitted a normal distribution with fixed covariance and unknown mean, but we limit ourselves now to one dimension only. More importantly, we now fix the mean (to 0, arbitrarily) and have the variance as the unknown. Specifically, let $\mathcal{Z} \subset \mathbb{R}$, $\mathcal{H} := \{ z \mapsto \mathcal{N}(z \mid 0, \sigma^2) : \sigma > 0 \}$, and take the loss to equal the negative log-likelihood.
Theorem 3
If there exists an open ball $B_0 \subset \mathbb{R}$ that contains the origin, such that $B_0 \subset \mathcal{Z}$, then estimating the variance of a one-dimensional normal density is not weakly $(\mathcal{Z}, \ell, N)$-monotonic for any $N \in \mathbb{N}$.
5 Experimental Evidence
Our results from the previous section already show cogently that the behavior of the learning curve can be an interesting object to study. Nevertheless, we want to complement our theoretical findings with a few illustrative experiments to further strengthen this point. This also involves a setting that we were unable to theoretically analyze at this point. On different distributions, we show nonmonotonic behavior for the squared loss and the absolute loss in a regression setting. The particular results are on display in Figure 1.
For these numerical examples we consider a one-dimensional input space. For Figures 1(a), 1(b), and 1(c), we have a distribution with two points: and , with the first coordinate the input and the second the corresponding output. We set and . For Figure 1(a), , for Figure 1(b), , for Figure 1(c), . For Figure 1(c) we also used a small amount of regularization, with set to . The distribution for Figure 1(d) is supported on three points, , and . We have , , and , with again the first coordinate as the input and the second the corresponding output. Furthermore, in this case, , , and .
The first thing to note is perhaps the rather serrated and completely nonmonotonic behavior of the learning curve for the absolute loss in Figure 1(b). Also very interesting is that regularization does not solve the problem. In fact, it can make the problem worse, as shown in Figure 1(c). When we add a small regularizer of the form to the empirical risk to obtain the learner , this learner shows nonmonotonic behavior, while is monotonic under the same distribution. Figure 1(a) shows very clearly how the expected squared loss can initially grow with more data.
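For a two-point distribution, the expected risk of least squares regression without intercept can be computed exactly rather than simulated, because a training set is fully described by how often the second point occurs. The sketch below does this in one dimension. The two points and the probability used here are hypothetical values chosen for illustration and are not the settings behind Figure 1; the function names are likewise assumptions.

```python
from math import comb


def least_squares_slope(sample):
    """Minimum-norm least squares slope for a 1D linear model without intercept."""
    sxx = sum(x * x for x, _ in sample)
    sxy = sum(x * y for x, y in sample)
    return 0.0 if sxx == 0.0 else sxy / sxx


def true_risk(h, a, b, p_b):
    """Expected squared loss of slope h when point b has probability p_b."""
    (xa, ya), (xb, yb) = a, b
    return (1.0 - p_b) * (h * xa - ya) ** 2 + p_b * (h * xb - yb) ** 2


def expected_risk(n, a, b, p_b):
    """Exact risk in expectation over all training sets of size n.

    The number k of copies of b in the sample is binomially distributed,
    and the learned slope depends on the sample only through k.
    """
    total = 0.0
    for k in range(n + 1):
        weight = comb(n, k) * p_b ** k * (1.0 - p_b) ** (n - k)
        h = least_squares_slope([a] * (n - k) + [b] * k)
        total += weight * true_risk(h, a, b, p_b)
    return total


# Hypothetical setting: a rare point that lies far from the common one.
a, b, p_b = (1.0, 1.0), (10.0, 1.0), 0.001
curve = [expected_risk(n, a, b, p_b) for n in range(1, 6)]
```

With these hypothetical values, the expected squared loss grows over the first few sample sizes (roughly 0.0817 at $n = 1$ against 0.0824 at $n = 2$), which matches the qualitative picture of a small probability mass located relatively far away.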
In the final example in Figure 1(d), we consider the distribution supported on three points and we consider linear regression with the squared loss that includes a bias term. This is a setting we have not been able to analyze theoretically. It is of interest because the usual configuration for standard learners includes a bias and one could get the impression from our theoretical results (and maybe in particular the proofs) that the origin plays a major role in the bad behavior of some learners. But as can be observed here, adding an intercept, and taking away the possibly special status of the origin, does not make risk nonmonotonicity go away.
6 Discussion and Conclusion
It should be clear that this paper does not get to the bottom of the learning-curve issue. In fact, one of the reasons for this work is to bring it to the attention of the community. We are convinced that it raises a lot of interesting and interrelated problems that may go far beyond the initial analyses we offer here. Further study should bring us to a better understanding of how learning curves can actually act, which, in turn, should enable practitioners to better interpret and anticipate their behavior.
What this work does convey is that learning curves can (provably) show some rather counterintuitive and surprising behavior. In particular, we have demonstrated that least squares regression, regression with the absolute loss, linear models trained with the hinge loss, and likelihood estimation of the variance of a normal distribution can all suffer from nonmonotonic behavior, even when evaluated with the loss they optimize for. All of these are standard learners, using standard loss functions.
Anyone familiar with the theory of PAC learning may wonder how our results can be reconciled with the bounds that come from this theory. At first glance, our observations may seem to contradict this theory. Learning theory dictates that if the hypothesis class has finite VC-dimension, the excess risk of ERM will drop as $O(1/n)$ in the realizable case and as $O(1/\sqrt{n})$ in the agnostic case [36, 2]. Thus PAC bounds give an upper bound on the excess risk that will be tighter given more samples. PAC bounds hold with a particular probability, but we are concerned with the risk in expectation. Even bounds that hold in expectation over the training sample will, however, not rule out nonmonotonic behavior. This is because in the end the guarantees from PAC learning are indeed merely bounds. Our analysis shows that within those bounds, we cannot always expect risk monotonic behavior.
In light of the learning rates mentioned above, we wonder whether there are deeper links with Lemma 1 (see also Remark 2). Rewrite Equation (7) to find that we do not have local monotonicity at $n$ in case
With $n$ large enough, we can ignore the first term in the numerator. So if a learner, in this particular setting, does not learn an instance at least at a rate of in terms of the loss, it will display nonmonotonic behavior. According to learning theory, for agnostic learners, the fraction between two subsequent losses is of the order , which is always larger than for . Can one therefore generally expect nonmonotonic behavior for any agnostic learner? Our normal mean estimation problem shows one cannot. But then, what is the link, if any?
As already hinted at in the introduction, our findings may also warrant revisiting the results obtained by Loog [29] and Krijthe and Loog [30]. These works show that there are some semi-supervised learners that allow for essentially improved performance over the supervised learner. In [29], added unlabeled data can even be exploited to enforce almost sure improvements (i.e., with probability one) in terms of the log-likelihood. Though this is the transductive setting, this may in a sense just show how strong their results are. In the end, their estimation procedures really are rather different from empirical risk minimization, but it does raise the question whether similar constructs can be used to get to risk monotonic procedures.
Another question, related to the remark in the previous sentence, seems of interest: could it be that the use of particular losses at training time leads to monotonic behavior at test time? Or can regularization still lead to more monotonic behavior, e.g. by explicitly limiting ? Maybe particular (upper-bounding) convex losses could turn out to behave risk monotonically in terms of specific nonconvex losses? Dipping seems to show, however, that this may very well not be the case. So should we expect it to be the other way round? That nonconvex losses can bring us to monotonicity guarantees on convex ones? And of course, monotonicity properties of nonconvex learners are also of interest to study in their own right.
An ultimate goal would of course be to fully characterize when one can have risk monotonic behavior and when not. At this point we do not have a clear idea to what extent this would at all be possible. We were, for instance, not able to analyze some standard, seemingly simple cases, e.g. simultaneously estimating the mean and the variance of a normal model. And maybe we can only get to rather weak results. Knowledge about the domain alone may turn out to be insufficient, and we may need to make assumptions on the class of distributions we are dealing with (leading to some notion of weakly -monotonicity?). For a start, we could study likelihood estimation under correctly specified models, for which generally there turn out to be remarkably few finite-sample results.
All in all, we are convinced that our theoretical results, strengthened by some illustrative examples, indicate that monotonicity is an interesting problem to study and that it can have important implications for both theorists and practitioners.
-  Tom Viering, Alexander Mey, and Marco Loog. Open problem: Monotonicity of learning. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 3198–3201, Phoenix, USA, 25–28 Jun 2019.
-  Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
-  Naftali Tishby, Esther Levin, and Sara A. Solla. Consistent inference of probabilities in layered networks: Predictions and generalization. In International Joint Conference on Neural Networks, volume 2, pages 403–409, 1989.
-  Esther Levin, Naftali Tishby, and Sara A. Solla. A statistical approach to learning and generalization in layered neural networks. Proceedings of the IEEE, 78(10):1568–1574, 1990.
-  Haim Sompolinsky, Naftali Tishby, and H. Sebastian Seung. Learning from examples in large neural networks. Physical Review Letters, 65(13):1683, 1990.
Manfred Opper and David Haussler.
Calculation of the learning curve of bayes optimal classification
algorithm for learning a perceptron with noise.
Proceedings of the fourth annual workshop on Computational learning theory, pages 75–87. Morgan Kaufmann Publishers Inc., 1991.
-  H.S. Seung, Haim Sompolinsky, and Naftali Tishby. Statistical mechanics of learning from examples. Physical Review A, 45(8):6056, 1992.
David Haussler, Michael Kearns, Manfred Opper, and Robert Schapire.
Estimating average-case learning curves using bayesian, statistical physics and vc dimension methods.In Advances in Neural Information Processing Systems, pages 855–862, 1992.
-  Shun-ichi Amari, Naotake Fujita, and Shigeru Shinomoto. Four types of learning curves. Neural Computation, 4(4):605–618, 1992.
-  Shun-Ichi Amari and Noboru Murata. Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5(1):140–153, 1993.
-  David Haussler, Michael Kearns, H. Sebastian Seung, and Naftali Tishby. Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25(2-3):195–236, 1996.
Charles A. Micchelli and Grace Wahba.
Design problems for optimal surface interpolation.Technical Report 565, Department of Statistics, Wisconsin University, 1979.
-  Manfred Opper. Regression with gaussian processes: Average case performance. In Theoretical aspects of neural computation: A multidisciplinary perspective, pages 17–23. Springer, 1998.
-  Peter Sollich. Learning curves for gaussian processes. In Advances in Neural Information Processing Systems, pages 344–350, 1999.
-  Manfred Opper and Francesco Vivarelli. General bounds on bayes errors for regression with gaussian processes. In Advances in Neural Information Processing Systems, pages 302–308, 1999.
-  Christopher K.I. Williams and Francesco Vivarelli. Upper and lower bounds on the learning curve for gaussian processes. Machine Learning, 40(1):77–102, 2000.
-  Peter Sollich and Anason Halees. Learning curves for gaussian process regression: Approximations and bounds. Neural Computation, 14(6):1393–1428, 2002.
-  Corinna Cortes, Lawrence D. Jackel, Sara A. Solla, Vladimir N. Vapnik, and John S. Denker. Learning curves: Asymptotic values and rate of convergence. In Advances in Neural Information Processing Systems, pages 327–334, 1994.
-  Prasanth Kolachina, Nicola Cancedda, Marc Dymetman, and Sriram Venkatapathy. Prediction of learning curves in machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 22–30. Association for Computational Linguistics, 2012.
-  Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Mostofa Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
-  Robert P.W. Duin. Small sample size generalization. In Proceedings of the Scandinavian Conference on Image Analysis, volume 2, pages 957–964, 1995.
-  Alexander J. Smola, Peter J. Bartlett, Dale Schuurmans, and Bernhard Schölkopf. Advances in Large Margin Classifiers. MIT Press, 2000.
-  Manfred Opper and Wolfgang Kinzel. Statistical mechanics of generalization. In Models of Neural Networks III, pages 151–209. Springer, 1996.
-  Manfred Opper. Learning to generalize. Frontiers of Life, 3(part 2):763–775, 2001.
-  Nicole Krämer. On the peaking phenomenon of the lasso in model selection. arXiv preprint arXiv:0904.4416, 2009.
-  Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018.
-  Stefano Spigler, Mario Geiger, Stéphane d’Ascoli, Levent Sagun, Giulio Biroli, and Matthieu Wyart. A jamming transition from under-to over-parametrization affects loss landscape and generalization. arXiv preprint arXiv:1810.09665, 2018.
Marco Loog and Robert P.W. Duin.
The dipping phenomenon.
Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 310–317. Springer, 2012.
-  Marco Loog. Contrastive pessimistic likelihood estimation for semi-supervised classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):462–475, 2016.
-  Jesse H. Krijthe and Marco Loog. The pessimistic limits and possibilities of margin-based losses in semi-supervised learning. In Advances in Neural Information Processing Systems, pages 1790–1799, 2018.
-  Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.
-  Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
-  Shai Ben-David, David Loker, Nathan Srebro, and Karthik Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning, pages 83–90, 2012.
-  Shai Ben-David, Nathan Srebro, and Ruth Urner. Universal learning vs. no free lunch results. In Philosophy and Machine Learning Workshop NIPS, 2011.
-  David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
-  Vladimir N. Vapnik. Statistical Learning Theory. Wiley, 1998.
The minimizing hypothesis of the empirical risk is attained for the mean that equals . Equivalently, we have for the parameter value that defines . Let
be the true cumulative distribution function for a single observation and let be the true cumulative distribution function for . For simplicity, in what follows, all integrals are taken over and the density outside of is simply taken to be equal to 0. The negative log-likelihood in expectation over the samples equals
Following Equation (4), we consider the difference between the above term and the one corresponding to training samples. See Equation (11); as only the last term differs in the expressions for and samples, we find that this difference equals
is bounded, so the (noncentral) second moment matrix exists and the difference simplifies to
This proves that the learner is globally -monotonic.
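The monotone behavior established here can be checked numerically in a toy case. The following is a minimal sketch under assumptions that are not taken from the paper: a one-dimensional Bernoulli variable (bounded, so its second moment exists), a normal model with fixed variance `sigma2` whose mean is set to the sample mean, and hypothetical helper names. Under these assumptions the expected negative log-likelihood has the closed form const + Var(z)(1 + 1/n)/(2 sigma2), which the code compares against an exact enumeration over all training sets of size n.

```python
import math

def expected_nll_exact(n, p, sigma2=1.0):
    """Exact expected negative log-likelihood of N(sample mean, sigma2).

    Assumption for this sketch: z is Bernoulli(p), so a training set of
    size n is summarized by k, its number of ones (binomially distributed).
    """
    const = 0.5 * math.log(2.0 * math.pi * sigma2)
    total = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        mean_hat = k / n  # empirical risk minimizer for the mean
        # Expected squared deviation of a fresh test point from mean_hat.
        sq_err = p * (1.0 - mean_hat) ** 2 + (1.0 - p) * mean_hat**2
        total += prob_k * (const + sq_err / (2.0 * sigma2))
    return total

def expected_nll_closed_form(n, p, sigma2=1.0):
    """Closed form: const + Var(z) * (1 + 1/n) / (2 * sigma2)."""
    const = 0.5 * math.log(2.0 * math.pi * sigma2)
    return const + p * (1.0 - p) * (1.0 + 1.0 / n) / (2.0 * sigma2)

# Expected risk for training sizes 1..10; each entry should be smaller
# than the one before it.
risks = [expected_nll_exact(n, p=0.3) for n in range(1, 11)]
```

Because only the 1/n term depends on the sample size, the difference between sizes n + 1 and n is negative for every n in this toy setting, in line with the global monotonicity shown above.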
Let and . The expected risk over then equals
The derivative with respect to of the above equals
Taking the limit , all terms become zero for . For , we get and, for , we get . Similarly, for a training sample size of , the only nonzero terms we get are for , as the expression for the derivative is essentially the same.
This shows that the -derivative, evaluated at , of the difference in expected risk from Equation (4) equals , which can be further simplified to , as and .
If this derivative is strictly larger than , continuity in implies that there is a such that the actual risk difference becomes positive. This shows that is not locally -monotonic.
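The argument can be written out as a short worked derivation. The notation below is introduced for this sketch only and need not match the paper's symbols: the distribution is assumed to put mass q on a point a and mass 1 - q on a point b, and \ell_a(k,n), \ell_b(k,n) denote the (finite) losses on a and b of the hypothesis trained on k copies of a and n - k copies of b.

```latex
% Notation assumed for this sketch: P(a) = q, P(b) = 1 - q.
\begin{align*}
R_n(q) &= \sum_{k=0}^{n} \binom{n}{k} q^k (1-q)^{n-k}
          \bigl[\, q\,\ell_a(k,n) + (1-q)\,\ell_b(k,n) \,\bigr], \\
\frac{dR_n}{dq}\Big|_{q=0}
       &= \underbrace{\ell_a(0,n) - (n+1)\,\ell_b(0,n)}_{k=0}
          \;+\; \underbrace{n\,\ell_b(1,n)}_{k=1}, \\
\frac{d(R_{n+1}-R_n)}{dq}\Big|_{q=0}
       &= (n+1)\,\ell_b(1,n+1) - n\,\ell_b(1,n) - \ell_b(0,n)
          \quad \text{if } \ell_a(0,n+1)=\ell_a(0,n),\ \ell_b(0,n+1)=\ell_b(0,n).
\end{align*}
```

All terms with k of at least 2 carry a factor q^2 and therefore vanish in the derivative at q = 0. Since R_{n+1}(0) - R_n(0) = 0 under the same two equalities, a strictly positive value of the last expression means the risk difference is positive for some small q > 0.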
Let us first consider the squared loss. Take and , such that the input vectors, and , which constitute the first coordinates, are in . The variables and constitute the outputs. Let both first input coordinates and be nonzero. All other input coordinates equal 0. In this case, all (minimum-norm) hypotheses are finite and Remark 2 applies to this setting. So we study whether in order to be able to invoke Lemma 1. To do so, we exploit that we can determine in closed form. As all input variation occurs in the first coordinate only, we have that , which implies that . In the same way, we find that . Now take the limit of to to obtain . For any bounded away from 0, this shows that for all there is a , small enough, such that . This shows in turn that there exists a and a corresponding , such that under the squared loss is not locally -monotonic. As this holds for all , we conclude that it is also not weakly -monotonic for any .
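The nonmonotonic behavior for the squared loss can be made concrete with an exact computation. This is a minimal sketch, not the construction used in the proof: it assumes a distribution with mass q on a point a and mass 1 - q on a point b, one-dimensional inputs (all other coordinates being 0), and hypothetical numerical values chosen for illustration.

```python
import math

def min_norm_slope(points):
    """Least-squares slope through the origin for 1-D inputs.

    With nonzero inputs the minimizer is unique, so it is also the
    minimum-norm solution.
    """
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, _ in points)
    return num / den

def expected_risk(n, a, b, q):
    """Exact expected squared loss for training size n.

    Assumption for this sketch: the distribution puts mass q on point a
    and mass 1 - q on point b, so a training set is summarized by k, the
    number of copies of a it contains.
    """
    (xa, ya), (xb, yb) = a, b
    risk = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * q**k * (1.0 - q) ** (n - k)
        w = min_norm_slope([a] * k + [b] * (n - k))
        loss_a = (xa * w - ya) ** 2
        loss_b = (xb * w - yb) ** 2
        risk += prob_k * (q * loss_a + (1.0 - q) * loss_b)
    return risk

# Hypothetical values: a rare point a with a large input and a frequent
# point b with a small input that prefer different slopes.
a, b, q = (1.0, 1.0), (0.1, 1.0), 0.001
curve = [expected_risk(n, a, b, q) for n in (1, 2, 3)]
```

For these values the exact expected risk increases over n = 1, 2, 3: a single copy of a pulls the fitted slope far away from the one that b prefers, and with more samples it becomes more likely that a training set contains such a copy.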
For the absolute loss, we consider the same setting as for the squared loss, and the proof initially proceeds along exactly the same lines. It starts to deviate at the calculation of and . As for the squared loss, all input variation occurs in the first coordinate, so we only have to study what happens in that subspace. This means that all other elements of the minimum-norm solutions we consider will be 0. As is the empirical risk minimizer for one and s, we have
where is the first element of . We can rewrite the main part of the objective function as
From this, one readily sees that the first coordinate of the minimizer equals if and if . If , then it picks as we are looking for the minimum-norm solution. For that same reason, all other entries of equal 0. Similar expressions, with substituted for , hold for . If we take , then we get , which is larger than 0 if . Again along the same lines as for the squared loss, this shows that regression using the absolute loss is not locally -monotonic and, as this holds for all , we conclude that it is not weakly -monotonic for any .
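The case distinction for the absolute loss can be sketched in code. The symbols were lost from the text above, so the thresholds below are a reconstruction under stated assumptions rather than the paper's own expressions: a training set with one copy of a = (xa, ya) and m copies of b = (xb, yb), one-dimensional inputs that are nonzero, and hypothetical function names.

```python
def lad_min_norm_slope(a, b, m):
    """Minimum-norm least-absolute-deviation slope for one a and m copies of b.

    Minimizes |xa*w - ya| + m*|xb*w - yb| over w (1-D, no intercept).
    The objective is piecewise linear with kinks at ya/xa and yb/xb, so a
    minimizer sits at the kink carrying the larger weight, |xa| or m*|xb|.
    """
    (xa, ya), (xb, yb) = a, b
    w_a, w_b = ya / xa, yb / xb
    weight_a, weight_b = abs(xa), m * abs(xb)
    if weight_a > weight_b:
        return w_a
    if weight_a < weight_b:
        return w_b
    # Tie: every w between the two kinks is optimal; return the one
    # closest to zero, i.e. the minimum-norm solution.
    lo, hi = min(w_a, w_b), max(w_a, w_b)
    return min(max(0.0, lo), hi)

def loss_on_b(a, b, m):
    """Absolute loss on b of the hypothesis trained on one a and m copies of b."""
    xb, yb = b
    return abs(xb * lad_min_norm_slope(a, b, m) - yb)

# Hypothetical values with |xa| > n*|xb|: the single a dictates the slope
# for both sample sizes, so the loss on b stays the same.
a, b, n = (1.0, 1.0), (0.1, 1.0), 3
gap = (n + 1) * loss_on_b(a, b, n) - n * loss_on_b(a, b, n - 1)
```

Here `gap` equals the loss on b itself, which is positive whenever a and b prefer different slopes. This matches the positivity condition used in the argument above under the assumed setup.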
Finally, we consider the hinge loss. As we are necessarily dealing with a classification setting now, and are in . Now, take , , and . Any choice of can only classify either or correctly, as both and are positive. With this, the empirical risk becomes and only solutions for which the first coordinate is in need to be considered, as values outside of this interval will only increase the loss for either or , while the loss remains the same for the other value. Being limited to the interval implies . So we will find exactly the same solutions as we found for the absolute loss, but with and limited to .
Take and to be in . As opposed to the proof for Theorem 2, we now cannot use the suggestion from Remark 2, as for the log-likelihood it does not hold that . Therefore, we need to look at the full expression of Lemma 1: . The sigma that belongs to the empirical risk minimizing hypothesis equals . For it is and for we get . Therefore, we come to the following negative log-likelihoods:
Now, consider the limit of going to 0. The last two negative log-likelihoods are finite in that case, while will go to minus infinity. This implies that for small enough, we have that (because of the term ). In conclusion, our density estimator is not locally -monotonic and, as this holds for all , we conclude that it is not weakly -monotonic for any .