1 Introduction
“Overfitting” is one of the biggest interests and concerns in the machine learning community (Ng, 1997; Caruana et al., 2000; Belkin et al., 2018; Roelofs et al., 2019; Werpachowski et al., 2019). One way to identify whether overfitting is happening is to check whether the generalization gap, the test loss minus the training loss, is increasing (Goodfellow et al., 2016). We can further decompose the behavior of the generalization gap into two stages. In the first stage, both the training and test losses are decreasing, but the training loss decreases faster than the test loss ([A] in Fig. 1(a)). In the second stage, the training loss is decreasing while the test loss is increasing; this tends to occur after the first stage ([B] in Fig. 1(a)). Within stage [B], after training for even more epochs, the training loss will continue to decrease and may become (near-)zero ([C] in Fig. 1(a)). If we continue training even after the model has memorized (Zhang et al., 2017; Arpit et al., 2017; Belkin et al., 2018) the training data completely with zero error, the training loss can easily become (near-)zero, especially with overparametrized models. Recent works on overparametrization and double descent curves (Belkin et al., 2019; Nakkiran et al., 2020) have shown that training until zero training error is meaningful for achieving a lower generalization error. However, whether zero training loss is necessary after achieving zero training error remains an open issue.

In this paper, we propose a method that makes the training loss float around a small constant value, in order to prevent the training loss from approaching zero. This is analogous to flooding the bottom area with water, and we refer to the constant value as the flooding level. Note that even with flooding, we can still memorize the training data. Our proposal only forces the training loss to become positive, which does not necessarily mean the training error will become positive, as long as the flooding level is not too large. The idea of flooding is shown in Fig. 1(b), and we show learning curves before and after flooding with benchmark experiments in Fig. 1(c) and Fig. 1(d).¹

¹For the details of these experiments, see Appendix D.
Algorithm and implementation
Our flooding algorithm is surprisingly simple. If the original learning objective is $J(\theta)$, the proposed modified learning objective $\tilde{J}(\theta)$ with flooding is

$\tilde{J}(\theta) = |J(\theta) - b| + b, \qquad (1)$

where $b > 0$ is the flooding level specified by the user, and $\theta$ is the model parameter.
The gradient of $\tilde{J}$ w.r.t. $\theta$ will point in the same direction as that of $J$ when $J(\theta) > b$, but in the opposite direction when $J(\theta) < b$. This means that when the learning objective is above the flooding level, there is a “gravity” effect with gradient descent, but when the learning objective is below the flooding level, there is a “buoyancy” effect with gradient ascent. In practice, this will be performed with a mini-batch and is compatible with any stochastic optimizer. It can also be used along with other regularization methods.
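Writing $J(\theta)$ for the original objective and $b$ for the flooding level as above, the sign flip can be made explicit; wherever $J(\theta) \neq b$, the chain rule applied to the flooded objective gives

```latex
\nabla_{\theta}\,\tilde{J}(\theta)
  \;=\; \operatorname{sign}\!\bigl(J(\theta) - b\bigr)\, \nabla_{\theta}\,J(\theta),
```

so gradient descent on $\tilde{J}$ reduces to descent on $J$ above the flooding level and to ascent on $J$ below it (a short derivation from the stated objective, under the assumption that $J$ is differentiable at $\theta$).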
During flooding, the training loss will repeatedly cross the flooding level from above and below. The model will continue to “random walk” with the same non-zero training loss, and we expect it to drift into an area with a flat loss landscape that leads to better generalization (Chaudhari et al., 2017; Keskar et al., 2017; Li et al., 2018).²

²In Appendix F, we show that during this period of random walk, there is an increase in the flatness of the loss function.
Since it is a simple solution, this modification can be incorporated into existing machine learning code easily: add one line of code for Eq. (1) after evaluating the original objective function $J(\theta)$. A minimal working example with a mini-batch in PyTorch (Paszke et al., 2019) is demonstrated below to show the additional one line of code.

It may be hard to set the flooding level without expert knowledge of the domain or task. We can circumvent this situation easily by treating the flooding level as a hyperparameter. We may use a naive search, which exhaustively evaluates the accuracy for predefined hyperparameter candidates with a validation dataset. This procedure can be performed in parallel.
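A minimal sketch of the one-line modification in PyTorch follows; the tiny linear model, the random mini-batch, and the flooding level `b = 0.1` are illustrative assumptions, not the authors' exact listing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 2)                     # illustrative tiny model
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 10)                  # one random mini-batch
labels = torch.randint(0, 2, (8,))
b = 0.1                                      # flooding level (user-specified)

outputs = model(inputs)
loss = criterion(outputs, labels)
flood = (loss - b).abs() + b                 # the additional one line, Eq. (1)
optimizer.zero_grad()
flood.backward()                             # descends when loss > b, ascends when loss < b
optimizer.step()
```

When `loss` is above `b`, `flood` has the same gradient as `loss`; below `b`, the gradient flips sign, implementing the buoyancy effect.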
Previous regularization methods
Many previous regularization methods also aim at avoiding training too much, in various ways: restricting the parameter norm by decaying the parameter weights (Hanson and Pratt, 1988), raising the difficulty of training by dropping activations of neural networks (Srivastava et al., 2014), smoothing the training labels (Szegedy et al., 2016), or simply stopping training at an earlier phase (Morgan and Bourlard, 1990). These methods can be considered indirect ways to control the training loss, and they introduce additional assumptions, e.g., that the optimal model parameters are close to zero. Although making the regularization effect stronger would make it harder for the training loss to approach zero, it is still hard to maintain the right level of training loss until the end of training. In fact, for overparametrized deep networks, a small regularization parameter would not stop the training loss from becoming (near-)zero, making it even harder to choose a hyperparameter that corresponds to a specific level of loss.

Flooding, on the other hand, is a direct solution to the issue of the training loss becoming (near-)zero. Flooding intentionally prevents further reduction of the training loss when it reaches a reasonably small value, and the flooding level corresponds to the level of training loss that the user wants to keep.
Contributions
Our proposed regularizer, called flooding, makes the training loss float around a small constant value instead of heading towards zero loss. Flooding is a regularizer that is domain-, task-, and model-independent. Theoretically, we find that the mean squared error can be reduced with flooding under certain conditions. Not only do we show the test accuracy improving with flooding, we also observe that even after we avoid zero training loss, memorization with zero training error still takes place.
2 Background
In this section, we review regularization methods (summarized in Table 1), recent works on overparametrization and double descent curves, and the area of weakly supervised learning, where techniques similar to flooding have been explored.
2.1 Regularization Methods
The name “regularization” dates back at least to Tikhonov regularization for the ill-posed linear least-squares problem (Tikhonov, 1943; Tikhonov and Arsenin, 1977). One example is to modify $A^{\top}A$ (where $A$ is the design matrix) so that it becomes “regular,” by adding a term to the objective function. $\ell_2$ regularization is a generalization of the above example and can be applied to non-linear models. These methods implicitly assume that the optimal model parameters are close to zero.
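For concreteness, the Tikhonov/ridge instance of this idea can be sketched as follows; this is the standard textbook form under a squared-norm penalty, stated here as an example rather than quoted from the text:

```latex
\min_{\theta}\; \|A\theta - y\|_2^2 + \lambda \|\theta\|_2^2
\quad\Longrightarrow\quad
\theta^{\star} = \bigl(A^{\top}A + \lambda I\bigr)^{-1} A^{\top} y,
```

where choosing $\lambda > 0$ guarantees that $A^{\top}A + \lambda I$ is invertible (“regular”) even when $A^{\top}A$ itself is singular, and simultaneously shrinks $\theta^{\star}$ towards zero.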
It is known that weight decay (Hanson and Pratt, 1988), dropout (Srivastava et al., 2014), and early stopping (Morgan and Bourlard, 1990) are equivalent to $\ell_2$ regularization under certain conditions (Loshchilov and Hutter, 2019; Bishop, 1995; Goodfellow et al., 2016; Wager et al., 2013), implying a similar assumption on the optimal model parameters. There are other penalties based on different assumptions, such as $\ell_1$ regularization (Tibshirani, 1996), based on the sparsity assumption that the optimal model has only a few non-zero parameters.
Modern machine learning tasks are applied to complex problems where the optimal model parameters are not necessarily close to zero or may not be sparse, and it would be ideal if we can properly add regularization effects to the optimization stage without such assumptions. Our proposed method does not have assumptions on the optimal model parameters and can be useful for more complex problems.
More recently, “regularization” has further evolved to a more general meaning, including various methods that alleviate overfitting, but do not necessarily have a step to regularize a singular matrix or add a regularization term to the objective function. For example, Goodfellow et al. (2016) defines regularization as “any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.” In this paper, we adopt this broader meaning of “regularization.”
Examples of the more general regularization category include mixup (Zhang et al., 2018) and data augmentation methods like cropping, flipping, and adjusting brightness or sharpness (Shorten and Khoshgoftaar, 2019). These methods have been adopted in many papers to obtain state-of-the-art performance (Verma et al., 2019; Berthelot et al., 2019) and are becoming essential regularization tools for developing new systems. However, they have the drawback of being domain-specific: they are designed for the vision domain and require some effort to apply to other domains (Guo et al., 2019; Thulasidasan et al., 2019). Other regularizers are task-specific: label smoothing (Szegedy et al., 2016) is used for problems with class labels and is harder to use with regression or ranking. Batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) are designed for neural networks and are model-specific.

Although these regularization methods, both the special and the general ones, already work well in practice and have become de facto standard tools (Bishop, 2011; Goodfellow et al., 2016), we provide an alternative that is even more general in the sense that it is domain-, task-, and model-independent.
That being said, we want to emphasize that the most important difference between flooding and other regularization methods is whether it is possible to target a specific level of training loss other than zero. While flooding allows the user to choose the level of training loss directly, this is hard to achieve with other regularizers.
Regularization and other methods  |  Main assumption
$\ell_2$ regularization (Tikhonov, 1943)  |  Optimal model params are close to 0
Weight decay (Hanson and Pratt, 1988)  |  Optimal model params are close to 0
Early stopping (Morgan and Bourlard, 1990)  |  Overfitting occurs in later epochs
$\ell_1$ regularization (Tibshirani, 1996)  |  Optimal model has to be sparse
Dropout (Srivastava et al., 2014)  |  Weight scaling inference rule
Batch normalization (Ioffe and Szegedy, 2015)  |  Existence of internal covariate shift
Label smoothing (Szegedy et al., 2016)  |  True posterior is not a one-hot vector
Mixup (Zhang et al., 2018)  |  Linear relationship between $x$ and $y$
Image augment. (Shorten and Khoshgoftaar, 2019)  |  Input is invariant to translations
Flooding (proposed method)  |  Learning until zero loss is harmful
2.2 Double Descent Curves with Overparametrization
Recently, there has been increasing attention on the phenomenon of “double descent,” named by Belkin et al. (2019) to explain the two regimes of deep learning. The first (underparametrized) regime occurs when the model complexity is small compared to the number of training samples: the test error, as a function of model complexity, decreases at low complexity but starts to increase once the complexity is large enough. This follows the classical view of machine learning that excessive complexity leads to poor generalization. The second (overparametrized) regime occurs when an even larger model complexity is considered: increasing the complexity only decreases the test error, which leads to a double descent shape. The phase of decreasing test error often occurs after the training error becomes zero. This follows the modern view of machine learning that bigger models lead to better generalization.
As far as we know, the discovery of double descent curves dates back at least to Krogh and Hertz (1992), who theoretically showed the double descent phenomenon in a linear regression setup. Recent works (Belkin et al., 2019; Nakkiran et al., 2020) have shown empirically that a similar phenomenon can be observed with deep learning methods. Nakkiran et al. (2020) observed that double descent curves can be shown not only as a function of model complexity, but also as a function of the epoch number.

We also want to note a byproduct of our flooding method: we were able to produce the epoch-wise double descent curve for the test loss in about 100 epochs. Investigating the connection between our accelerated double descent curves and previous double descent curves (Krogh and Hertz, 1992; Belkin et al., 2019; Nakkiran et al., 2020) is out of the scope of this paper but is an important future direction.
2.3 Lower-Bounding the Empirical Risk
Lower-bounding the empirical risk has been used in the area of weakly supervised learning. There was a common phenomenon where the empirical risk goes below zero (Kiryo et al., 2017) when an equivalent form of the risk, expressed with the given weak supervision, was used instead (Natarajan et al., 2013; Cid-Sueiro et al., 2014; du Plessis et al., 2014, 2015; Patrini et al., 2017; van Rooyen and Williamson, 2018). A gradient ascent technique was used to force the empirical risk to become non-negative in Kiryo et al. (2017). This idea has been generalized and applied to other weakly supervised settings (Han et al., 2018; Ishida et al., 2019; Lu et al., 2020).
Although we also set a lower bound on the empirical risk, the motivation is different. First, while Kiryo et al. (2017) and others aim to fix the negative empirical risk so that it becomes lower-bounded by zero, our empirical risk already has a lower bound of zero; instead, we aim to sink the original empirical risk by placing a positive lower bound. Second, the problem settings are different: weakly supervised learning methods require certain loss corrections or sample corrections (Han et al., 2018) before the non-negative correction, whereas we work on the original empirical risk without any setting-specific modifications.
3 Flooding: How to Avoid Zero Training Loss
In this section, we propose our regularization method, flooding. Note that this section and the following sections only consider multi-class classification for simplicity.
3.1 Preliminaries
Consider input variable $x \in \mathbb{R}^d$ and output variable $y \in [K] := \{1, \ldots, K\}$, where $K$ is the number of classes. They follow an unknown joint probability distribution with density $p(x, y)$. We denote the score function by $g: \mathbb{R}^d \to \mathbb{R}^K$. For any test data point $x_0$, our prediction of the output label will be given by $\hat{y}_0 := \arg\max_{z \in [K]} g_z(x_0)$, where $g_z(x_0)$ is the $z$-th element of $g(x_0)$, and in case of a tie, $\arg\max$ returns the largest argument. Let $\ell: \mathbb{R}^K \times [K] \to \mathbb{R}_{\geq 0}$ denote a loss function. $\ell$ can be the zero-one loss,

$\ell_{01}(v, z') := \begin{cases} 0 & \text{if } \arg\max_{z \in [K]} v_z = z', \\ 1 & \text{otherwise}, \end{cases} \qquad (2)$

where $v := (v_1, \ldots, v_K)^{\top} \in \mathbb{R}^K$, or a surrogate loss such as the softmax cross-entropy loss,

$\ell_{\mathrm{CE}}(v, z') := -\log \frac{\exp(v_{z'})}{\sum_{z \in [K]} \exp(v_z)}. \qquad (3)$

For a surrogate loss $\ell$, we denote the classification risk by

$R(g) := \mathbb{E}_{p(x, y)}\bigl[\ell(g(x), y)\bigr], \qquad (4)$

where $\mathbb{E}_{p(x, y)}$ is the expectation over $(x, y) \sim p(x, y)$. We use $R_{01}(g)$ to denote Eq. (4) when $\ell = \ell_{01}$ and call it the classification error.
The goal of multi-class classification is to learn $g$ that minimizes the classification error $R_{01}(g)$. In optimization, we instead consider the minimization of the risk with an almost surely differentiable surrogate loss $\ell$ to make the problem more tractable. Furthermore, since $p(x, y)$ is usually unknown and there is no way to exactly evaluate $R(g)$, we minimize its empirical version calculated from the training data:

$\widehat{R}(g) := \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i), \qquad (5)$

where $\{(x_i, y_i)\}_{i=1}^{n}$ are i.i.d. sampled from $p(x, y)$. We call $\widehat{R}(g)$ the empirical risk.
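As a concrete illustration of the definitions above, the following sketch evaluates the softmax cross-entropy loss of Eq. (3) and the empirical risk of Eq. (5) with NumPy; the score matrix and labels are made-up toy values.

```python
import numpy as np

def softmax_cross_entropy(v: np.ndarray, y: int) -> float:
    # Eq. (3): -log( exp(v_y) / sum_z exp(v_z) ), shifted by max for stability.
    v = v - v.max()
    return float(-(v[y] - np.log(np.exp(v).sum())))

def empirical_risk(scores: np.ndarray, labels: np.ndarray) -> float:
    # Eq. (5): average loss over the n training points.
    return float(np.mean([softmax_cross_entropy(v, y)
                          for v, y in zip(scores, labels)]))

scores = np.array([[2.0, 0.5], [0.1, 1.2], [1.5, 1.5]])  # g(x_i) for n = 3, K = 2
labels = np.array([0, 1, 0])
risk = empirical_risk(scores, labels)                    # roughly 0.394
```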
We would like to clarify some of the undefined terms used in the title and the introduction. The “training/test loss” is the empirical risk with respect to the surrogate loss function $\ell$ over the training/test data, respectively. The “training/test error” is the empirical risk with respect to $\ell_{01}$ over the training/test data, respectively (which is equal to one minus accuracy) (Zhang, 2004).
Finally, we formally define the Bayes risk as

$R^{*} := \inf_{g} R(g), \qquad (6)$

where the infimum is taken over all vector-valued measurable functions $g: \mathbb{R}^d \to \mathbb{R}^K$. The Bayes risk is often referred to as the Bayes error if the zero-one loss is used:

$\inf_{g} R_{01}(g). \qquad (7)$
3.2 Algorithm
With flexible models, $\widehat{R}(g)$ w.r.t. a surrogate loss can easily become small if not zero, as we mentioned in Section 1; see [C] in Fig. 1(a). We propose a method that “floods the bottom area and sinks the original empirical risk,” as in Fig. 1(b), so that the empirical risk cannot go below the flooding level. More technically, if we denote the flooding level by $b$, our proposed training objective with flooding is a simple fix:
Definition 1.
The flooded empirical risk is defined as

$\widetilde{R}(g) := |\widehat{R}(g) - b| + b. \qquad (8)$

Note that when $b = 0$, we have $\widetilde{R}(g) = \widehat{R}(g)$. The gradient of $\widetilde{R}(g)$ w.r.t. the model parameters points in the same direction as that of $\widehat{R}(g)$ when $\widehat{R}(g) > b$, but in the opposite direction when $\widehat{R}(g) < b$. This means that when the learning objective is above the flooding level, we perform gradient descent as usual (gravity zone), but when it is below the flooding level, we perform gradient ascent instead (buoyancy zone).
The issue is that, in general, we seldom know the optimal flooding level in advance. This issue can be mitigated by searching for the optimal flooding level with a hyperparameter optimization technique. In practice, we can search for it by performing an exhaustive search in parallel.
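The exhaustive search over candidate flooding levels can be sketched as below; `validation_accuracy` is a hypothetical stand-in for “train with flooding level b and evaluate on held-out data,” so its numbers are purely illustrative.

```python
# Hypothetical stand-in: in practice this trains a model with the flooded
# objective |J - b| + b and returns accuracy on a validation set.
def validation_accuracy(b: float) -> float:
    return 0.90 + 0.05 * b - 0.1 * b * b   # deterministic placeholder, peaks at b = 0.25

candidates = [0.00, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30]
# Each candidate is evaluated independently, so the loop parallelizes trivially
# (e.g., one worker per candidate).
results = {b: validation_accuracy(b) for b in candidates}
best_b = max(results, key=results.get)
```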
3.3 Implementation
For large-scale problems, we can employ mini-batched stochastic optimization for efficient computation. Suppose that we have $N$ disjoint mini-batch splits. We denote the empirical risk (5) with respect to the $t$-th mini-batch by $\widehat{R}_t(g)$ for $t \in \{1, \ldots, N\}$. Then, our mini-batched optimization performs gradient descent updates in the direction of the gradient of $\widetilde{R}_t(g) := |\widehat{R}_t(g) - b| + b$. By the convexity of the absolute value function and Jensen's inequality, we have

$\widetilde{R}(g) = \left| \frac{1}{N} \sum_{t=1}^{N} \widehat{R}_t(g) - b \right| + b \;\leq\; \frac{1}{N} \sum_{t=1}^{N} \left( |\widehat{R}_t(g) - b| + b \right). \qquad (9)$

This indicates that mini-batched optimization simply minimizes an upper bound of the full-batch version $\widetilde{R}(g)$.
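A quick numerical check of the Jensen bound in Eq. (9), with made-up mini-batch risks standing in for the $\widehat{R}_t(g)$:

```python
import numpy as np

rng = np.random.default_rng(0)
b = 0.1                                  # flooding level
batch_risks = rng.uniform(0.0, 0.3, 8)   # hypothetical mini-batch risks, t = 1..N

full = abs(batch_risks.mean() - b) + b        # flooded full-batch risk (left side)
mini = np.mean(np.abs(batch_risks - b) + b)   # mean flooded mini-batch risk (right side)
# Jensen's inequality guarantees mini >= full for any batch_risks and b.
```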
3.4 Theoretical Analysis
In the following theorem, we show that under a condition on the flooding level, the mean squared error (MSE) of the proposed risk estimator with flooding is no greater than that of the original risk estimator without flooding.
Theorem 1.
Fix any measurable vector-valued function $g$. If the flooding level $b$ satisfies $b \leq R(g)$, we have

$\mathrm{MSE}\bigl(\widehat{R}(g)\bigr) \;\geq\; \mathrm{MSE}\bigl(\widetilde{R}(g)\bigr). \qquad (10)$

If $b > R(g)$, we have

$\mathrm{MSE}\bigl(\widehat{R}(g)\bigr) \;\geq\; \mathrm{MSE}\bigl(\widetilde{R}(g)\bigr) - 2\bigl(b - R(g)\bigr)\,\mathbb{E}\!\left[\, |\widehat{R}(g) - b| - \bigl(\widehat{R}(g) - b\bigr) \right]. \qquad (11)$
A proof is given in Appendix A. If we regard $\widehat{R}(g)$ as the training loss and $R(g)$ as the test loss, we would want $b$ to be between those two for the MSE to improve.
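The MSE reduction can be checked numerically. The sketch below assumes exponentially distributed per-example losses with true risk $R(g) = 1$ and a flooding level $b = 0.9 < R(g)$; these distributional choices are illustrative assumptions, not part of the theorem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000
losses = rng.exponential(1.0, size=(trials, n))  # per-example losses, true risk R = 1
R = 1.0
b = 0.9                                          # flooding level with b < R(g)

R_hat = losses.mean(axis=1)                      # empirical risk, Eq. (5)
R_flood = np.abs(R_hat - b) + b                  # flooded risk, Definition 1

mse_plain = np.mean((R_hat - R) ** 2)            # MSE of the plain estimator
mse_flood = np.mean((R_flood - R) ** 2)          # MSE of the flooded estimator
```

With $b$ below the true risk, the flooded estimator trades a small bias for a variance reduction whenever the empirical risk dips below $b$, so its MSE comes out smaller, matching Eq. (10).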
4 Experiments
In this section, we show experimental results with synthetic and benchmark datasets. The implementation is based on PyTorch (Paszke et al., 2019), and demo code will be available at https://github.com/takashiishida/flooding. Experiments were carried out with an NVIDIA GeForce GTX 1080 Ti, an NVIDIA Quadro RTX 5000, and an Intel Xeon Gold 6142.
4.1 Synthetic Experiments
The aim of our synthetic experiments is to study the behavior of flooding with a controlled setup. We use three types of synthetic data described below.
Two Gaussians Data: We perform binary classification with two multivariate Gaussian distributions with identity covariance matrix and differing means, where the distance between the means controls the difficulty. The Bayes risks for the two settings we use are 0.14 and 0.24 (see Table 2); proofs are shown in Appendix B. The training, validation, and test sample sizes are , , and per class, respectively.

Sinusoid Data: The sinusoid data (Nakkiran et al., 2019) are generated as follows. We first draw input data points uniformly from the inside of a -dimensional ball of radius 1. Then we put class labels based on

where and are any two -dimensional vectors such that . The training, validation, and test sample sizes are , , and , respectively.
Spiral Data: The spiral data (Sugiyama, 2015) are two-dimensional synthetic data. Let be equally spaced points in the interval , and be equally spaced points in the interval . Let the positive and negative input data points be

for and , where controls the magnitude of the noise, and are i.i.d. distributed according to the two-dimensional standard normal distribution. Then, we make the data for classification by , where . The training, validation, and test sample sizes are , , and per class, respectively.

For Two Gaussians, we use a one-hidden-layer feedforward neural network with 500 units in the hidden layer and the ReLU activation function (Nair and Hinton, 2010). We train the network for 1000 epochs with the logistic loss and vanilla gradient descent with a learning rate of . The flooding level is chosen from . For Sinusoid and Spiral, we use a four-hidden-layer feedforward neural network with 500 units in each hidden layer, with the ReLU activation function (Nair and Hinton, 2010) and batch normalization (Ioffe and Szegedy, 2015). We train the network for 500 epochs with the logistic loss and the Adam (Kingma and Ba, 2015) optimizer with a mini-batch size of and a learning rate of . The flooding level is chosen from . Note that training with $b = 0$ is identical to the baseline method without flooding. We report the test accuracy of the flooding level with the best validation accuracy. We first conduct experiments without early stopping, meaning that the last epoch was chosen for all flooding levels.

(A) Without Early Stopping  |  (B) With Early Stopping
Data  Setting  Without Flooding  With Flooding  Chosen $b$  Without Flooding  With Flooding  Chosen $b$
Two Gaussians  : 1.0, BR: 0.14  87.96%  92.25%  0.28  91.63%  92.25%  0.27 
Two Gaussians  : 0.8, BR: 0.24  82.00%  87.31%  0.33  86.57%  87.29%  0.35 
Sinusoid  Label Noise: 0.01  93.84%  94.46%  0.01  92.54%  92.54%  0.00 
Sinusoid  Label Noise: 0.05  91.12%  95.44%  0.10  93.26%  94.60%  0.01 
Sinusoid  Label Noise: 0.10  86.57%  96.02%  0.17  96.70%  96.70%  0.00 
Spiral  Label Noise: 0.01  98.96%  97.85%  0.01  98.60%  98.88%  0.01 
Spiral  Label Noise: 0.05  93.87%  96.24%  0.04  96.58%  95.62%  0.14 
Spiral  Label Noise: 0.10  89.70%  92.96%  0.16  89.70%  92.96%  0.16 
Results
The results are summarized in Table 2. It is worth noting that for Two Gaussians, the chosen flooding level is larger for the smaller distance between the two distributions, i.e., when the classification task is harder and the Bayes risk is larger because the two distributions are less separated. We see similar tendencies for the Sinusoid and Spiral data: a larger $b$ was chosen for a larger label-flipping probability, which is expected to increase the Bayes risk. This implies a positive correlation between the optimal flooding level and the Bayes risk, as is also partially suggested by Theorem 1. Another interesting observation is that the chosen $b$ is close to but higher than the Bayes risk for the Two Gaussians data. This may look inconsistent with Theorem 1. However, it makes sense to adopt a larger $b$ with a stronger regularization effect that allows some bias as a trade-off for reducing the variance of the risk estimator. In fact, Theorem 1 does not deny the possibility that some $b$ larger than the Bayes risk achieves even better estimation.

From (A) in Table 2, we can see that the method with flooding often improves test accuracy over the baseline method without flooding. As we mentioned in the introduction, it can be harmful to keep training a model until the end without flooding. However, with flooding, the model at the final epoch has good prediction performance according to the results, which implies that flooding helps the late-stage training improve test accuracy.
We also conducted experiments with early stopping, meaning that we chose the model that recorded the best validation accuracy during training. The results are reported in subtable (B) of Table 2. Compared with subtable (A), we see that early stopping improves the baseline method without flooding in many cases, which indicates that training longer without flooding was harmful in our experiments. On the other hand, the accuracy of flooding combined with early stopping is often close to that of flooding without early stopping, meaning that training until the end with flooding tends to be already as good as stopping early. The table also shows that flooding often improves or retains the test accuracy of the baseline method even after deploying early stopping: flooding does not hurt performance and can be beneficial for methods used with early stopping.
Dataset  tr/va split  Baseline  F  E  E+F  W  W+F  W+E  W+E+F
(W: weight decay, E: early stopping, F: flooding)
MNIST  0.8  98.32%  98.30%  98.51%  98.42%  98.46%  98.53%  98.50%  98.48%  
0.4  97.71%  97.70%  97.82%  97.91%  97.74%  97.85%  —  97.83%  
Fashion  0.8  89.34%  89.36%  N/A  N/A  —  —  N/A  N/A  
MNIST  0.4  88.48%  88.63%  88.60%  88.62%  —  —  —  —  
Kuzushiji  0.8  91.63%  91.62%  91.63%  91.71%  92.40%  92.12%  92.11%  91.97%  
MNIST  0.4  89.18%  89.18%  89.58%  89.73%  90.41%  90.15%  89.71%  89.88%  
CIFAR10  0.8  73.59%  73.36%  73.65%  73.57%  73.06%  73.44%  —  74.41%  
0.4  66.39%  66.63%  69.31%  69.28%  67.20%  67.58%  —  —  
CIFAR100  0.8  42.16%  42.33%  42.67%  42.45%  42.50%  42.36%  —  —  
0.4  34.27%  34.34%  37.97%  38.82%  34.99%  35.14%  —  —  
SVHN  0.8  92.38%  92.41%  93.20%  92.99%  92.78%  92.79%  —  93.42%  
0.4  90.32%  90.35%  90.43%  90.49%  90.57%  90.61%  91.16%  91.21% 
4.2 Benchmark Experiments
We next perform experiments with benchmark datasets. Not only do we compare with the baseline without flooding, we also compare with, and combine flooding with, other general regularization methods: early stopping and weight decay.
Settings
We use the following six benchmark datasets: MNIST, Fashion-MNIST, Kuzushiji-MNIST, CIFAR-10, CIFAR-100, and SVHN. The details of the benchmark datasets can be found in Appendix C.1. We split the original training dataset into training and validation data with different proportions: 0.8 or 0.4 (meaning 80% or 40% was used for training and the rest for validation, respectively). We perform an exhaustive hyperparameter search for the flooding level with candidates from . The number of epochs is 500. Stochastic gradient descent (Robbins and Monro, 1951) is used with a learning rate of 0.1 and momentum of 0.9. For MNIST, Fashion-MNIST, and Kuzushiji-MNIST, we use a one-hidden-layer feedforward neural network with 500 units and the ReLU activation function (Nair and Hinton, 2010). For CIFAR-10, CIFAR-100, and SVHN, we use ResNet-18 (He et al., 2016). We do not use any data augmentation or manual learning rate decay. We deploy early stopping in the same way as in Section 4.1.

We first ran experiments with the following candidates for the weight decay rate: . We chose the weight decay rate with the best validation accuracy for each dataset and each training/validation proportion. Then, fixing the weight decay rate to the chosen one, we ran experiments with flooding level candidates from , to investigate whether weight decay and flooding have complementary effects, or whether adding weight decay diminishes the accuracy gain of flooding.
Results
We show the results in Table 3 and the chosen flooding levels in Table 4 in Appendix C.2. We can observe that flooding gives better accuracy for most cases. We can also see that combining flooding with early stopping or with both early stopping and weight decay may lead to even better accuracy in some cases.
4.3 Memorization
Can we maintain memorization even after adding flooding? We investigate whether the trained model has zero training error (100% training accuracy) for the flooding level chosen with the validation data. We show the results for all benchmark datasets and all training/validation splits with proportions 0.8 and 0.4, both without early stopping (choosing the last epoch) and with early stopping (choosing the epoch with the highest validation accuracy). The results are shown in Fig. 3.

All figures show downward curves, implying that the model eventually gives up memorizing all the training data as the flooding level becomes higher. A more interesting and important observation is the position of the optimal flooding level (the one chosen by validation accuracy, marked in Fig. 3). We can observe that the marks are often plotted at zero error, and in some cases a mark lies on the highest flooding level that still maintains zero error. These results are consistent with recent empirical works implying that zero training error leads to lower generalization error (Belkin et al., 2019; Nakkiran et al., 2020), but we further demonstrate that zero training loss may be harmful under zero training error.
5 Conclusion
We proposed a novel regularization method called flooding that keeps the training loss around a small constant value, avoiding zero training loss. In our experiments, the optimal flooding level often maintained memorization of the training data with zero error. With flooding, we showed that the test accuracy improves for various benchmark datasets, and we theoretically showed that the mean squared error is reduced under certain conditions.
As a byproduct, we were able to produce a double descent curve for the test loss within relatively few epochs, e.g., around 100 epochs, as shown in Fig. 2 and Fig. 4 in Appendix D. An important future direction is to study the relationship between this and the double descent curves from previous works (Krogh and Hertz, 1992; Belkin et al., 2019; Nakkiran et al., 2020).
Acknowledgements
We thank Chang Xu, Genki Yamamoto, Kento Nozawa, Nontawat Charoenphakdee, Voot Tangkaratt, and Yoshihiro Nagano for the helpful discussions. TI was supported by the Google PhD Fellowship Program. MS and IY were supported by JST CREST Grant Number JPMJCR18A2 including AIP challenge program, Japan.
References
Arpit et al. (2017) Devansh Arpit, Stanisław Jastrzȩbski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks. In ICML, 2017.

Belkin et al. (2018) Mikhail Belkin, Daniel Hsu, and Partha P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In NeurIPS, 2018.

Belkin et al. (2019) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. PNAS, 116:15850–15854, 2019.

Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A. Raffel. MixMatch: A holistic approach to semi-supervised learning. In NeurIPS, 2019.

Bishop (2011) Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2011.
 Bishop (1995) Christopher Michael Bishop. Regularization and complexity control in feedforward networks. In ICANN, 1995.

Caruana et al. (2000) Rich Caruana, Steve Lawrence, and C. Lee Giles. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In NeurIPS, 2000.

Chaudhari et al. (2017) Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR, 2017.
Cid-Sueiro et al. (2014) Jesús Cid-Sueiro, Darío García-García, and Raúl Santos-Rodríguez. Consistency of losses for learning from weak labels. In ECML-PKDD, 2014.

Clanuwat et al. (2018) Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. Deep learning for classical Japanese literature. In NeurIPS Workshop on Machine Learning for Creativity and Design, 2018.
 du Plessis et al. (2014) Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Analysis of learning from positive and unlabeled data. In NeurIPS, 2014.
 du Plessis et al. (2015) Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. Convex formulation for learning from positive and unlabeled data. In ICML, 2015.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
 Guo et al. (2019) Hongyu Guo, Yongyi Mao, and Richong Zhang. Augmenting data with mixup for sentence classification: An empirical study. In arXiv:1905.08941, 2019.
 Han et al. (2018) Bo Han, Gang Niu, Jiangchao Yao, Xingrui Yu, Miao Xu, Ivor Tsang, and Masashi Sugiyama. Pumpout: A meta approach to robust deep learning with noisy labels. In arXiv:1809.11008, 2018.
 Hanson and Pratt (1988) Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with backpropagation. In NeurIPS, 1988.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 Ishida et al. (2019) Takashi Ishida, Gang Niu, Aditya Krishna Menon, and Masashi Sugiyama. Complementary-label learning for arbitrary losses and models. In ICML, 2019.
 Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy L. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kiryo et al. (2017) Ryuichi Kiryo, Gang Niu, Marthinus Christoffel du Plessis, and Masashi Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In NeurIPS, 2017.
 Krogh and Hertz (1992) Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NeurIPS, 1992.
 Lecun et al. (1998) Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.
 Li et al. (2018) Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
 Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
 Lu et al. (2020) Nan Lu, Tianyi Zhang, Gang Niu, and Masashi Sugiyama. Mitigating overfitting in supervised classification from two unlabeled datasets: A consistent risk correction approach. In AISTATS, 2020.
 Morgan and Bourlard (1990) N. Morgan and H. Bourlard. Generalization and parameter estimation in feedforward nets: Some experiments. In NeurIPS, 1990.
 Nair and Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
 Nakkiran et al. (2019) Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, and Boaz Barak. SGD on neural networks learns functions of increasing complexity. In NeurIPS, 2019.
 Nakkiran et al. (2020) Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In ICLR, 2020.
 Natarajan et al. (2013) Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep K. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, 2013.
 Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 Ng (1997) Andrew Y. Ng. Preventing “overfitting” of crossvalidation data. In ICML, 1997.
 Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
 Patrini et al. (2017) Giorgio Patrini, Alessandro Rozza, Aditya K. Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, 2017.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
 Roelofs et al. (2019) Rebecca Roelofs, Vaishaal Shankar, Benjamin Recht, Sara Fridovich-Keil, Moritz Hardt, John Miller, and Ludwig Schmidt. A meta-analysis of overfitting in machine learning. In NeurIPS, 2019.
 Shahriari et al. (2016) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104:148–175, 2016.
 Shorten and Khoshgoftaar (2019) Connor Shorten and Taghi M. Khoshgoftaar. A survey on image data augmentation for deep learning. Journal of Big Data, 6, 2019.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In NeurIPS, 2012.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
 Sugiyama (2015) Masashi Sugiyama. Introduction to Statistical Machine Learning. Morgan Kaufmann, 2015.

Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jon Shlens. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 Thulasidasan et al. (2019) Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks. In NeurIPS, 2019.
 Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
 Tikhonov and Arsenin (1977) A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. Winston, 1977.
 Tikhonov (1943) Andrey Nikolayevich Tikhonov. On the stability of inverse problems. Doklady Akademii Nauk SSSR, 39:195–198, 1943.

Torralba et al. (2008) Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008.
 van Rooyen and Williamson (2018) Brendan van Rooyen and Robert C. Williamson. A theory of learning with corrupted labels. Journal of Machine Learning Research, 18:1–50, 2018.
 Verma et al. (2019) Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. Interpolation consistency training for semi-supervised learning. In IJCAI, 2019.
 Wager et al. (2013) Stefan Wager, Sida Wang, and Percy Liang. Dropout training as adaptive regularization. In NeurIPS, 2013.
 Werpachowski et al. (2019) Roman Werpachowski, András György, and Csaba Szepesvári. Detecting overfitting via adversarial examples. In NeurIPS, 2019.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 Zhang et al. (2018) Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
 Zhang (2004) Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32:56–85, 2004.
Appendix A Proof of Theorem
Proof.
If the flooding level is b, then the proposed flooding estimator is
(12) R̃(g) = |R̂(g) − b| + b.
Since the absolute operator can be expressed with a max operator as |A| = max(A, −A), the proposed estimator can be re-expressed as,
(13) R̃(g) = max(R̂(g) − b, b − R̂(g)) + b = max(R̂(g), 2b − R̂(g)).
For convenience, we use the expression on the right-hand side. From the definition of MSE,
(14) MSE(R̂(g)) = E[(R̂(g) − R(g))²],
(15) MSE(R̃(g)) = E[(R̃(g) − R(g))²]
(16) = E[(|R̂(g) − b| + b − R(g))²]
(17) = E[(max(R̂(g), 2b − R̂(g)) − R(g))²].
We are interested in the sign of
(18) MSE(R̂(g)) − MSE(R̃(g)) = E[(R̂(g) − R(g))² − (max(R̂(g), 2b − R̂(g)) − R(g))²].
Define the inside of the expectation as A. A can be divided into two cases, depending on the outcome of the max operator:
(19) A = (R̂(g) − R(g))² − (R̂(g) − R(g))² = 0, if R̂(g) ≥ b;
(20) A = (R̂(g) − R(g))² − (2b − R̂(g) − R(g))², if R̂(g) < b
(21) = 4(b − R̂(g))(R(g) − b).
The latter case becomes positive when b < R(g). Therefore, when b < R(g),
(22) MSE(R̂(g)) − MSE(R̃(g)) = E[A]
(23) ≥ 0.
When, in addition, Pr[R̂(g) < b] > 0,
(24) MSE(R̂(g)) − MSE(R̃(g)) = E[A | R̂(g) < b] · Pr[R̂(g) < b]
(25) > 0.
∎
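The theorem can be sanity-checked numerically. The following stdlib-only sketch compares the MSE of a plain risk estimate with that of its flooded version |R̂ − b| + b, using a Gaussian noise model for the empirical risk (the noise model, the true risk of 0.5, and the flooding level of 0.3 are illustrative assumptions, not values from the paper):

```python
import random

def flooded(r_hat, b):
    # Flooding estimator: |R_hat - b| + b
    return abs(r_hat - b) + b

def mse(estimates, true_risk):
    # Mean squared error of risk estimates around the true risk
    return sum((e - true_risk) ** 2 for e in estimates) / len(estimates)

random.seed(0)
true_risk = 0.5   # hypothetical true risk R(g)
b = 0.3           # flooding level chosen below the true risk
# Simulate noisy empirical risk estimates R_hat around the true risk.
r_hats = [random.gauss(true_risk, 0.3) for _ in range(100_000)]

mse_plain = mse(r_hats, true_risk)
mse_flood = mse([flooded(r, b) for r in r_hats], true_risk)
```

With b below the true risk and a sizable probability that R̂ falls below b, the flooded estimator's MSE comes out strictly smaller, matching the strict-inequality case of the proof.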
Appendix B Bayes Risk for Gaussian Distributions
In this section, we explain in detail how we derived the Bayes risk with respect to the surrogate loss in the experiments with Gaussian data in Section 4.1. Since we are using the logistic loss in the synthetic experiments, the loss as a function of the margin z = yg(x) is
(26) ℓ(z) = log(1 + exp(−z)),
where g(x) is a scalar instead of the vector definition that was used previously, because the synthetic experiments only consider binary classification. The expected loss can be minimized pointwise; taking the derivative of the conditional risk with respect to g(x), we derive
(27) ∂/∂g(x) [ p(y = +1 | x) log(1 + exp(−g(x))) + p(y = −1 | x) log(1 + exp(g(x))) ]
(28) = p(y = +1 | x) · [−exp(−g(x)) / (1 + exp(−g(x)))] + p(y = −1 | x) · [exp(g(x)) / (1 + exp(g(x)))]
(29) = −p(y = +1 | x) exp(−g(x)) / (1 + exp(−g(x))) + p(y = −1 | x) exp(g(x)) / (1 + exp(g(x)))
(30) = −p(y = +1 | x) / (1 + exp(g(x))) + p(y = −1 | x) exp(g(x)) / (1 + exp(g(x)))
(31) = [p(y = −1 | x) exp(g(x)) − p(y = +1 | x)] / (1 + exp(g(x))).
Set this to zero, multiply by (1 + exp(g(x))), and divide by p(y = −1 | x) to obtain,
(32) exp(g(x)) − p(y = +1 | x) / p(y = −1 | x) = 0,
(33) exp(g(x)) = p(y = +1 | x) / p(y = −1 | x),
(34) g*(x) = log [ p(y = +1 | x) / p(y = −1 | x) ].
Since we are interested in the surrogate loss under this classifier, we plug this into the logistic loss, to obtain the Bayes risk,
(35) E_{(x,y)}[ ℓ(y g*(x)) ] = E_{(x,y)}[ log(1 + exp(−y g*(x))) ].
In the experiments in Section 4.1, we report the empirical version of this with the test dataset as the Bayes risk.
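As a concrete illustration, the Bayes risk above can be estimated by Monte Carlo. The sketch below assumes equal class priors and unit-variance Gaussian class-conditionals N(±μ, 1) (illustrative assumptions; not necessarily the parameters used in Section 4.1), for which the log-ratio g*(x) = log p(y = +1 | x) / p(y = −1 | x) simplifies to 2μx:

```python
import math
import random

def logistic_loss(z):
    # Logistic loss of the margin z = y * g(x)
    return math.log1p(math.exp(-z))

mu = 1.0  # class-conditional mean (illustrative)

def bayes_scorer(x):
    # g*(x) = log p(y=+1|x)/p(y=-1|x); for equal priors and unit-variance
    # Gaussians N(+mu, 1) and N(-mu, 1), this log-ratio equals 2*mu*x.
    return 2.0 * mu * x

random.seed(0)
n = 200_000
total = 0.0
for _ in range(n):
    y = 1 if random.random() < 0.5 else -1   # equal priors
    x = random.gauss(y * mu, 1.0)            # x ~ N(y*mu, 1)
    total += logistic_loss(y * bayes_scorer(x))
bayes_risk = total / n
```

The estimate lies strictly below log 2, the surrogate risk of the trivial scorer g ≡ 0, since g* is the pointwise minimizer derived above.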
Appendix C Details of Experiments
C.1 Benchmark Datasets
In the experiments in Section 4.2, we use six benchmark datasets explained below.

MNIST (http://yann.lecun.com/exdb/mnist/) [Lecun et al., 1998] is a 10-class dataset of handwritten digits from 0 to 9. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

Fashion-MNIST (https://github.com/zalandoresearch/fashion-mnist) [Xiao et al., 2017] is a 10-class dataset of fashion items: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, and Ankle boot. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

Kuzushiji-MNIST (https://github.com/rois-codh/kmnist) [Clanuwat et al., 2018] is a 10-class dataset of cursive Japanese (“Kuzushiji”) characters. Each sample is a 28×28 grayscale image. The number of training and test samples are 60,000 and 10,000, respectively.

CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 10-class dataset of various objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each sample is a 32×32 colored image in RGB format. It is a subset of the 80 million tiny images dataset [Torralba et al., 2008]. There are 6,000 images per class, where 5,000 are for training and 1,000 are for test.

CIFAR-100 (https://www.cs.toronto.edu/~kriz/cifar.html) is a 100-class dataset of various objects. Each class has 600 samples, where 500 samples are for training and 100 samples are for test. This is also a subset of the 80 million tiny images dataset [Torralba et al., 2008].

SVHN (http://ufldl.stanford.edu/housenumbers/) [Netzer et al., 2011] is a 10-class dataset of house numbers from Google Street View images, in RGB format. 73,257 digits are for training and 26,032 digits are for testing.
C.2 Chosen Flooding Levels
In Table 4, we report the chosen flooding levels for our experiments with benchmark datasets.
Early stopping  
Weight decay  
Flooding  
MNIST (0.8)            0.02  0.02  0.03  0.02
MNIST (0.4)            0.02  0.03  0.00  0.02
Fashion-MNIST (0.8)    0.00  0.00  —     —
Fashion-MNIST (0.4)    0.00  0.00  —     —
Kuzushiji-MNIST (0.8)  0.01  0.03  0.03  0.03
Kuzushiji-MNIST (0.4)  0.03  0.02  0.04  0.03
CIFAR-10 (0.8)         0.04  0.04  0.00  0.01
CIFAR-10 (0.4)         0.03  0.03  0.00  0.00
CIFAR-100 (0.8)        0.02  0.02  0.00  0.00
CIFAR-100 (0.4)        0.03  0.03  0.00  0.00
SVHN (0.8)             0.01  0.01  0.00  0.02
SVHN (0.4)             0.01  0.09  0.03  0.03
Appendix D Learning Curves
In Figure 4, we visualize learning curves for all datasets (including CIFAR-10, which we already visualized in Figure 2). We only show the learning curves for the training/validation proportion of 0.8, since the results for 0.4 were similar to those for 0.8. Note that Figure 0(c) shows the learning curves of the first 80 epochs for CIFAR-10 without flooding, and Figure 0(d) shows the learning curves with flooding at the chosen flooding level.
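The flooding update that produces such learning curves amounts to performing gradient descent on |loss − b| + b, i.e., flipping the sign of the gradient whenever the batch loss falls below the flooding level b. Below is a minimal stdlib-only sketch on a toy one-dimensional logistic-regression problem (the data, learning rate, epoch count, and flooding level are all illustrative, not those of the benchmark experiments):

```python
import math
import random

random.seed(0)
# Linearly separable toy data: x = y * u with u in [0.5, 1.5], labels y in {-1, +1}.
# Without flooding, the training loss can be driven toward zero as w grows.
data = []
for _ in range(200):
    y = random.choice((-1, 1))
    data.append((y * random.uniform(0.5, 1.5), y))

def batch_loss_and_grad(w, batch):
    # Full-batch logistic loss and its gradient w.r.t. the scalar weight w.
    loss = grad = 0.0
    for x, y in batch:
        z = y * w * x
        loss += math.log1p(math.exp(-z))
        grad += -y * x / (1.0 + math.exp(z))
    n = len(batch)
    return loss / n, grad / n

def train(flood_level=None, lr=0.5, epochs=300):
    w = 0.0
    for _ in range(epochs):
        loss, grad = batch_loss_and_grad(w, data)
        # Flooding minimizes |loss - b| + b: descend above b, ascend below it.
        if flood_level is not None and loss < flood_level:
            grad = -grad
        w -= lr * grad
    return batch_loss_and_grad(w, data)[0]

plain_loss = train(flood_level=None)
flood_loss = train(flood_level=0.3)
```

Without flooding the training loss keeps shrinking toward zero; with flooding it settles into a narrow band around the flooding level, mirroring the shape of the curves in Figure 0(d).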
Appendix E Relationship between Performance and Gradients
Settings
We visualize the relationship between test performance (loss or accuracy) and the gradient amplitude of the training/test loss in Figure 5, where the gradient amplitude is the norm of the filter-normalized gradient of the loss. The filter-normalized gradient is the gradient scaled according to the magnitude of the corresponding convolutional filter, similarly to Li et al. [2018]. More specifically, for each filter of every convolutional layer, we multiply the corresponding elements of the gradient by the norm of the filter. Note that a fully connected layer is a special case of a convolutional layer and is subject to this scaling. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with a training/validation split ratio of 0.8.
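The filter-wise scaling described above can be sketched as follows; this simplified illustration treats each filter as a flat list of weights, and the function names are ours:

```python
import math

def filter_normalized_gradient(grad, weights):
    # Scale each filter's gradient by the L2 norm of that filter's weights.
    # `grad` and `weights` are lists of filters, each a flat list of floats.
    normalized = []
    for g_f, w_f in zip(grad, weights):
        filter_norm = math.sqrt(sum(v * v for v in w_f))
        normalized.append([g * filter_norm for g in g_f])
    return normalized

def amplitude(grad):
    # Gradient amplitude: L2 norm over all entries of the gradient.
    return math.sqrt(sum(g * g for g_f in grad for g in g_f))

# Toy example: two "filters" with very different scales.
weights = [[3.0, 4.0], [0.1, 0.0]]  # filter norms 5.0 and 0.1
grad = [[1.0, 0.0], [1.0, 0.0]]
normalized = filter_normalized_gradient(grad, weights)
```

The same unit gradient entry contributes 5.0 through the large filter but only 0.1 through the small one, so the amplitude measures perturbations relative to each filter's own scale.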
Results
For the figures with the gradient amplitude of the training loss on the horizontal axis, the markers for the method with flooding are often plotted to the right of those for the method without flooding, which implies that flooding prevents the model from getting stuck at a local minimum. For the figures with the gradient amplitude of the test loss on the horizontal axis, we can observe that the method with flooding improves performance while the gradient amplitude becomes smaller. On the other hand, the performance of the method without flooding degrades while the gradient amplitude of the test loss keeps increasing.
Appendix F Flatness
Settings
We follow Li et al. [2018] and give a one-dimensional visualization of flatness for each dataset in Figure 6. We exclude Fashion-MNIST because the optimal flooding level was zero. We used the results with a training/validation split ratio of 0.8. We compare the flatness of the model right after the empirical risk with respect to a mini-batch becomes smaller than the flooding level for the first time (dotted blue line) with the model after training (solid blue line). We also compare these with the model trained by the baseline method without flooding after training has finished (solid red line).
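The one-dimensional visualization of Li et al. [2018] amounts to evaluating the loss along a line through the trained parameters. A minimal sketch with toy quadratic losses standing in for the network loss (all names here are illustrative):

```python
def loss_along_direction(loss_fn, theta, direction, alphas):
    # Evaluate loss_fn(theta + alpha * direction) for each alpha,
    # tracing a 1-D slice of the loss landscape through theta.
    return [loss_fn([t + a * d for t, d in zip(theta, direction)])
            for a in alphas]

def flat_loss(p):
    return 0.5 * sum(v * v for v in p)

def sharp_loss(p):
    return 10.0 * sum(v * v for v in p)

alphas = [-1.0, -0.5, 0.0, 0.5, 1.0]
theta = [0.0, 0.0]        # the "trained" parameters (a minimum of both losses)
direction = [1.0, 0.0]    # a probe direction (filter-normalized in Li et al.)
flat_curve = loss_along_direction(flat_loss, theta, direction, alphas)
sharp_curve = loss_along_direction(sharp_loss, theta, direction, alphas)
```

Both slices bottom out at alpha = 0, but the sharp one rises twenty times faster away from the minimum, which is the kind of contrast the solid and dotted lines in Figure 6 display.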
Results
According to Figure 6, the test loss becomes lower and flatter during training with flooding. Note that the training loss, on the other hand, continues to float around the flooding level until the end of training after it enters the flooding zone. We expect that the model performs a random walk and escapes regions with sharp loss landscapes during this period. This may be a possible reason for the better generalization results with our proposed method.