I Introduction
Machine learning has achieved great success in many areas. However, machine learning models suffer from overfitting caused by a lack of data [1, 2]. To tackle this problem, research on semi-supervised learning [3, 4] and on regularization [5, 6] has been very active. The main idea of semi-supervised learning is to solve supervised learning problems with few labels by utilizing unlabeled data. In real-world machine learning problems, labeled data is often scarce while unlabeled data is abundant. Therefore, semi-supervised learning methods that make good use of unlabeled data are essential.
We focus on leveraging the class density of the entire dataset as prior knowledge about the labeled and unlabeled data; that is, we assume that the density of each class is available as prior knowledge. This assumption is natural in many practical machine learning problems. Based on this idea, we propose a framework of regularization methods, called density-fixing, that can be used in both supervised and semi-supervised settings. Our proposed density-fixing regularization improves generalization performance by forcing the model to approximate the prior distribution of the classes, i.e., their frequency of occurrence. The regularization term of density-fixing is naturally derived from the formula for maximum likelihood estimation and is theoretically justified. We further investigate the asymptotic behavior of density-fixing and how the regularization term behaves under several concrete class prior distributions. Experimental results on multiple benchmark datasets support our argument, and we suggest that this simple and effective regularization method is useful in real-world problems.
Contribution: We propose the density-fixing regularization, which has the following properties:

Simplicity: density-fixing is very simple to implement and has almost no computational overhead.

Naturalness: density-fixing is derived naturally from the formula for maximum likelihood estimation and comes with a theoretical guarantee.

Versatility: density-fixing is applicable to a wide range of problem settings.
In a nutshell, density-fixing forces the predicted class density to match the prior:
(1) $\mathcal{L}(\theta) = \ell\big(f_\theta(x), y\big) + \gamma\, D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big),$
where $\ell$ is some loss function (e.g., the cross-entropy loss), $p(y)$ is the true class distribution, $p_\theta(y)$ is the class density predicted by the model, and $\gamma$ is the weight parameter of the regularization term. If the true class distribution is given as prior knowledge, we can use it directly; otherwise, we average the frequency of occurrence of the labels in the training sample and use it as an estimator:
(2) $\hat{p}(y = k) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}[y_i = k], \quad k = 1, \dots, K.$
The sample mean is an unbiased and consistent estimator of the class occurrence frequencies, so it is sufficient to use it.
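For illustration, the estimator in Eq. (2) is just the normalized label counts. A minimal sketch in PyTorch (the function and variable names here are ours, not taken from the released code):

import torch

def empirical_class_prior(labels, num_classes):
    # \hat{p}(y = k) = (1/n) * sum_i 1[y_i = k], as in Eq. (2)
    counts = torch.bincount(labels, minlength=num_classes).float()
    return counts / counts.sum()

# e.g., empirical_class_prior(torch.tensor([0, 2, 2, 1]), 3) gives [0.25, 0.25, 0.5]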
The source code necessary to replicate our CIFAR-10 experiments is available at https://github.com/nocotan/density_fixing.
II Related Works
In this section, we review prior work relevant to ours.
II-A Overfitting and Regularization
Machine learning models suffer from overfitting caused by a lack of data. In order to avoid overfitting, various regularization methods have been proposed. For example, Dropout [6] is a powerful regularization method that introduces ensemble-learning-like behavior by randomly removing connections between neurons of a deep neural network. Another recently proposed family of simple regularization methods is mixup and its variants [5, 7, 8], which take linear combinations of training examples as new inputs. There are also many regularization methods for specific models (e.g., for Generative Adversarial Networks [9, 10]).
II-B Semi-Supervised Learning
Representative semi-supervised approaches include pseudo-labeling [11] and GAN-based semi-supervised learning [12], which exploit unlabeled data to improve classification.
III Notations and Problem Formulation
Let $\mathcal{X}$ be the input space, $\mathcal{Y}$ be the output space, $K$ be the number of classes, and $\mathcal{C}$ be a set of concepts we may wish to learn. We assume that each input vector $x \in \mathcal{X}$ is of dimension $d$. We also assume that examples are independently and identically distributed (i.i.d.) according to some fixed but unknown distribution $\mathcal{D}$. Then, the learning problem is formulated as follows: we consider a fixed set of possible concepts $\mathcal{H}$, called the hypothesis set. We receive a sample $S = (x_1, \dots, x_n)$ drawn i.i.d. according to $\mathcal{D}$ as well as the labels $(c(x_1), \dots, c(x_n))$, which are based on a specific target concept $c \in \mathcal{C}$. In the semi-supervised learning problem, we additionally have access to an unlabeled sample $S' = (x'_1, \dots, x'_m)$ drawn i.i.d. according to $\mathcal{D}$. Our task is to use the labeled sample $S$ and the unlabeled sample $S'$ to find a hypothesis $h \in \mathcal{H}$ that has a small generalization error with respect to the concept $c$. The generalization error is defined as follows.
Definition 1.
(Generalization error) Given a hypothesis $h \in \mathcal{H}$, a target concept $c \in \mathcal{C}$, and an underlying distribution $\mathcal{D}$, the generalization error of $h$ is defined by
(3) $R(h) = \Pr_{x \sim \mathcal{D}}\big[h(x) \neq c(x)\big] = \mathbb{E}_{x \sim \mathcal{D}}\big[\mathbb{1}_{h(x) \neq c(x)}\big],$
where $\mathbb{1}_{\omega}$ is the indicator function of the event $\omega$.
The generalization error of a hypothesis is not directly accessible since both the underlying distribution $\mathcal{D}$ and the target concept $c$ are unknown. Then, we have to measure the empirical error of the hypothesis $h$ on the observable labeled sample $S$. The empirical error is defined as follows.
Definition 2.
(Empirical error) Given a hypothesis $h \in \mathcal{H}$, a target concept $c \in \mathcal{C}$, and a sample $S = (x_1, \dots, x_n)$, the empirical error of $h$ is defined by
(4) $\hat{R}_S(h) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{h(x_i) \neq c(x_i)}.$
In learning problems, we are interested in how much difference there is between the empirical and generalization errors. Therefore, in general, we consider the relative generalization error $R(h) - \hat{R}_S(h)$.
IV Density-Fixing Regularization
In this paper, we assume that $\mathcal{H}$ is a class of functions mapping input vectors to class densities:
(5) $h_\theta(x) = p_\theta(y \mid x), \quad h_\theta \in \mathcal{H}.$
Therefore, we can replace the learning problem with the problem of approximating the true distribution $p(y \mid x)$ with the estimated distribution $p_\theta(y \mid x)$.
We assume that the class-conditional probability $p_l(x \mid y)$ for labeled data and that for unlabeled data (or test data), $p_u(x \mid y)$, are the same:
(6) $p_l(x \mid y) = p_u(x \mid y) = p(x \mid y).$
Then, our goal is to estimate $p_\theta(y \mid x)$ from labeled data drawn i.i.d. from $p_l(x, y)$ and unlabeled data drawn i.i.d. from $p_u(x)$.
Theorem 1.
Let $p_\theta(y \mid x)$ be the estimated distribution parameterized by $\theta$, and $p(y \mid x)$ be the true distribution. Then, we can write the sum of the log-likelihood function as follows:
(7) $\sum_{i=1}^{n} \log p_\theta(y_i \mid x_i) = \sum_{i=1}^{n}\left[\log p(y_i \mid x_i) + \log\frac{p(x_i)}{p_\theta(x_i)}\right] - n\, D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big),$
where $D_{\mathrm{KL}}(p \,\|\, p_\theta)$ is the Kullback-Leibler divergence [13] from $p$ to $p_\theta$:
(8) $D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big) = \sum_{y} p(y)\log\frac{p(y)}{p_\theta(y)}$
(9) $\qquad = -\sum_{y} p(y)\log\frac{p_\theta(y)}{p(y)}.$
This means that when we consider maximum likelihood estimation, we can decompose the objective function into two terms: a term depending on $x$ and a term depending only on $y$.
Proof.
From Bayes' theorem, we can obtain
(10) $p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)},$
(11) $p_\theta(y \mid x) = \frac{p_\theta(x \mid y)\, p_\theta(y)}{p_\theta(x)}.$
Then, combining Eqs. (6), (10) and (11), using the assumption that the class-conditional $p(x \mid y)$ is shared so that $p_\theta(x \mid y) = p(x \mid y)$,
(12) $p_\theta(y \mid x) = p(y \mid x)\cdot\frac{p(x)}{p_\theta(x)}\cdot\frac{p_\theta(y)}{p(y)}.$
Considering maximum likelihood estimation, we can write the log-likelihood function as follows:
(13) $\log p_\theta(y \mid x) = \log p(y \mid x) + \log\frac{p(x)}{p_\theta(x)} + \log\frac{p_\theta(y)}{p(y)}.$
Finally, we compute the sum of the log-likelihood function,
(14) $\sum_{i=1}^{n}\log p_\theta(y_i \mid x_i) = \sum_{i=1}^{n}\left[\log p(y_i \mid x_i) + \log\frac{p(x_i)}{p_\theta(x_i)}\right] + \sum_{i=1}^{n}\log\frac{p_\theta(y_i)}{p(y_i)},$
and since the last sum equals $-n\, D_{\mathrm{KL}}(p(y)\,\|\,p_\theta(y))$ when $p(y)$ is taken to be the class frequency of the sample as in Eq. (2), we have Eq. (7). ∎
Considering that we maximize Eq. (7), it is clear that $p_\theta(y)$ should be close to $p(y)$. The Kullback-Leibler divergence $D_{\mathrm{KL}}(p \,\|\, p_\theta)$ is defined only if $p_\theta(y) = 0$ implies $p(y) = 0$ for all $y$; this property is called absolute continuity.
From the above theorem, if the probability of class occurrence $p(y)$ is known in advance, it can be used to perform regularization. We call this term the density-fixing regularization. Regularization is performed so that the density of each class in the inference results for the unlabeled sample approximates $p(y)$. In addition, the KL divergence has the following property: the best approximation $p_\theta^{*}$ satisfies
(15) $p_\theta^{*}(y) = 0 \quad \text{for } y \text{ at which } p(y) = 0.$
This property is called zero-forcing, and we can see that our regularization behaves as if the probabilities of classes we do not observe remain zero.
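To make this mechanism concrete, the following is a minimal sketch of how the regularization term can be evaluated on an unlabeled batch (our own illustration, assuming that $p_\theta(y)$ is estimated by averaging the model's softmax outputs over the batch; the helper name is hypothetical):

import torch
import torch.nn.functional as F

def density_fixing_penalty(model, unlabeled_x, prior, eps=1e-12):
    # Estimate p_theta(y) as the mean predicted class distribution
    # over the unlabeled batch.
    p_model = F.softmax(model(unlabeled_x), dim=1).mean(dim=0)
    # D_KL( p(y) || p_theta(y) ) = sum_y p(y) * log( p(y) / p_theta(y) )
    return torch.sum(prior * (torch.log(prior + eps) - torch.log(p_model + eps)))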
V Asymptotic Normality
In this section, we discuss how the density-fixing regularization behaves asymptotically.
Theorem 2.
Let $\hat{\theta}_n$ denote the maximum likelihood estimator obtained from $n$ samples and $\theta_0$ the true parameter. The asymptotic variance of the maximum likelihood estimator applying the density-fixing regularization is given by $\{n\,(I(\theta_0) + \gamma\, g(\theta_0))\}^{-1}$. Here, $I(\theta)$ is the Fisher information and $g(\theta)$ is a function that always takes a positive value, parameterized by $\theta$.
Proof.
For the maximum likelihood estimator with $n$ samples, we can obtain the following by a Taylor expansion of the score function around the true parameter $\theta_0$:
(16) $0 = \frac{\partial}{\partial\theta}\log L(\hat{\theta}_n) \approx \frac{\partial}{\partial\theta}\log L(\theta_0) + (\hat{\theta}_n - \theta_0)\,\frac{\partial^2}{\partial\theta^2}\log L(\theta_0),$
here, we assume that the log-likelihood has a third-order derivative with respect to the parameter $\theta$ and that it is bounded. From Eq. (16) and the central limit theorem, we can obtain
(17) $\sqrt{n}\,(\hat{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big)$
when $n$ is sufficiently large. Here, $I(\theta)$ is the Fisher information matrix:
(18) $I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p_\theta(y \mid x)\right].$
Then, letting $L(\theta)$ be the original likelihood function and $\tilde{L}(\theta)$ the regularized one, we can obtain
(19) $\log\tilde{L}(\theta) = \log L(\theta) - \gamma\, n\, D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big),$
(20) $-\frac{1}{n}\,\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log\tilde{L}(\theta)\right] = -\frac{1}{n}\,\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log L(\theta)\right] + \gamma\,\frac{\partial^2}{\partial\theta^2} D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big)$
(21) $\qquad = I(\theta) + \gamma\, g(\theta),$
(22) $g(\theta) = \frac{\partial^2}{\partial\theta^2} D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big).$
Therefore, the maximum likelihood estimator $\tilde{\theta}_n$ applying the density-fixing regularization satisfies the following:
(23) $\sqrt{n}\,(\tilde{\theta}_n - \theta_0) \xrightarrow{d} \mathcal{N}\Big(0,\, \big(I(\theta_0) + \gamma\, g(\theta_0)\big)^{-1}\Big).$
Since $D_{\mathrm{KL}}(p(y)\,\|\,p_\theta(y))$ is minimized at $p_\theta(y) = p(y)$, its second derivative with respect to $\theta$ around the optimum is positive. Therefore, we can obtain the proof of Theorem 2 with $g(\theta) > 0$. ∎
This theorem implies that, by applying the density-fixing regularization, the asymptotic variance of the maximum likelihood estimator shrinks from $\{n\, I(\theta_0)\}^{-1}$ to $\{n\,(I(\theta_0) + \gamma\, g(\theta_0))\}^{-1}$, i.e., the estimator converges faster. Figure 1 illustrates the asymptotic behavior of the estimator under our regularization.
VI Some Examples
In this section, we investigate the behavior of our proposed method by assuming some class distributions as examples. To summarize our results:

For the discrete uniform distribution, the effect of regularization becomes weaker as the number of classes increases;

For the Bernoulli distribution, our regularization gives a stronger penalty when there is a class imbalance.
Figure 2 shows the behavior of the regularization terms under each distribution.
VI-A Discrete Uniform Distribution
We assume that the probability density function of the classes $y$ is as follows:
(24) $p(y = k) = \frac{1}{K}, \quad k = 1, \dots, K,$
here $K$ is the number of classes. This is the discrete uniform distribution over $K$ classes. Then, our regularization term is
(25) $D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big) = \sum_{k=1}^{K}\frac{1}{K}\log\frac{1/K}{p_\theta(y=k)} = -\log K - \frac{1}{K}\sum_{k=1}^{K}\log p_\theta(y=k).$
Thus, we can see that when the classes follow a discrete uniform distribution, the effect of regularization becomes weaker as the number of classes $K$ increases, since the contribution of each class to Eq. (25) is weighted by $1/K$.
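As a quick numerical check of this claim (our own example, not taken from the paper), consider a model that predicts one class twice as often as the uniform prior; the penalty shrinks roughly in proportion to $1/K$:

import math

def kl_uniform_prior(p_model):
    # D_KL( Uniform(K) || p_model ) in nats
    K = len(p_model)
    return sum((1.0 / K) * math.log((1.0 / K) / q) for q in p_model)

for K in (10, 100):
    p = [(1.0 - 2.0 / K) / (K - 1)] * K  # remaining mass spread evenly
    p[0] = 2.0 / K                       # one class predicted twice too often
    print(K, round(kl_uniform_prior(p), 4))
# prints roughly 0.037 for K = 10 and 0.003 for K = 100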
VI-B Bernoulli Distribution
We assume that $K = 2$ and that the probability density function of the classes is as follows:
(26) $p(y) = \mu^{y}(1-\mu)^{1-y}, \quad y \in \{0, 1\},$
here $\mu \in [0, 1]$, and this is the Bernoulli distribution. Then, our regularization term is
(27) $D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big) = \mu\log\frac{\mu}{p_\theta(y=1)} + (1-\mu)\log\frac{1-\mu}{p_\theta(y=0)}.$
Thus, we can see that the regularization is stronger when $\mu$ is away from $1/2$. This means that in binary classification, it gives a strong regularization when there is a class imbalance.
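As a quick numerical illustration (our own example), suppose the model currently predicts the balanced distribution $p_\theta = \mathrm{Bern}(1/2)$. Then
$D_{\mathrm{KL}}\big(\mathrm{Bern}(0.6)\,\|\,\mathrm{Bern}(0.5)\big) = 0.6\log\frac{0.6}{0.5} + 0.4\log\frac{0.4}{0.5} \approx 0.02$, while $D_{\mathrm{KL}}\big(\mathrm{Bern}(0.9)\,\|\,\mathrm{Bern}(0.5)\big) \approx 0.37$,
so the penalty under the imbalanced prior $\mu = 0.9$ is roughly eighteen times larger than under the nearly balanced prior $\mu = 0.6$.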
VII Experimental Results
In this section, we present our experimental results. We implement the density-fixing regularization as follows:
(28) $\mathcal{L}(\theta) = \ell_{\mathrm{CE}}\big(f_\theta(x), y\big) + \gamma\, D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big),$
where $\ell_{\mathrm{CE}}$ is the cross-entropy loss and $\gamma$ is the weight parameter for the regularization term. The implementation of density-fixing regularization is straightforward; Figure 6 shows the few lines of code necessary to implement density-fixing regularization in PyTorch [14]. The datasets we use are CIFAR-10 [15], CIFAR-100 [15], STL-10 [16] and SVHN [17]. We determined the prior distribution of the classes from the number of examples in each class of the dataset, and we used ResNet-18 [18] as the baseline model.
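Figure 6 itself is not reproduced here; the following is a minimal PyTorch sketch in the same spirit, under the assumption that $p_\theta(y)$ is estimated by batch-averaging the softmax outputs (our own code, not the authors' exact implementation):

import torch
import torch.nn.functional as F

def density_fixing_loss(logits, targets, prior, gamma, eps=1e-12):
    # Eq. (28): cross-entropy plus gamma * D_KL( p(y) || p_theta(y) ).
    ce = F.cross_entropy(logits, targets)
    p_model = F.softmax(logits, dim=1).mean(dim=0)  # batch estimate of p_theta(y)
    kl = torch.sum(prior * (torch.log(prior + eps) - torch.log(p_model + eps)))
    return ce + gamma * kl

# Example usage with a uniform prior over 10 classes (e.g., CIFAR-10):
# prior = torch.full((10,), 0.1)
# loss = density_fixing_loss(model(images), labels, prior, gamma=0.1)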
VII-A Supervised Classification
In this experiment, we assumed a discrete uniform distribution for the class distribution.
Figure 3 shows the experimental results for CIFAR-10 with density-fixing regularization. As seen in the left of this figure, the baseline model and density-fixing converge at a similar speed to their best test errors. Later in training, a second reduction of the loss, known as Deep Double Descent [19], can be observed, but this phenomenon is not disturbed by density-fixing. From the right, we can see that increasing the parameter $\gamma$ reduces the generalization gap.
Table I also shows the contribution of density-fixing to the reduction of test errors.
VII-B Semi-Supervised Classification
In our experiments, we assumed a discrete uniform distribution for the class distribution and treated a fraction of the training data as labeled and the remainder as unlabeled.
Figure 4 shows the test loss and train-test differences for each $\gamma$ in the semi-supervised setting. We can see that increasing the parameter $\gamma$ reduces the generalization gap. In addition, CIFAR-10 and CIFAR-100, which consist of images from the same domain, have 10 and 100 classes, respectively, but the experimental results show that CIFAR-10 has a more significant regularization effect than CIFAR-100. This result supports our example in Eq. (25).
Table II shows a comparison of the classification error for each $\gamma$. These experimental results show that our regularization improves the error on the test data.
TABLE I
Dataset | Model | Top-1 Error | Top-5 Error
CIFAR-10 | ResNet-18 | 12.720% | 0.812%
CIFAR-10 | ResNet-18 + density-fixing () | 12.230% | 0.779%
CIFAR-10 | ResNet-18 + density-fixing () | |
CIFAR-100 | ResNet-18 | 25.562% | 6.710%
CIFAR-100 | ResNet-18 + density-fixing () | |
CIFAR-100 | ResNet-18 + density-fixing () | 25.965% | 6.887%
VII-C Stabilization of Generative Adversarial Networks
Generative Adversarial Networks (GANs) [20] are one of the most powerful generative modeling paradigms and are currently successful in various tasks. However, GANs have the problem that their training is very unstable. We suggest that regularization by density-fixing contributes to improving the stability of GANs. The density-fixing formulation of GANs is:
(29) $\mathcal{L}_D = \mathrm{BCE}\big(D(x), 1\big) + \mathrm{BCE}\big(D(G(z)), 0\big) + \gamma\, D_{\mathrm{KL}}\big(p(y)\,\|\,p_\theta(y)\big),$
where $D$ is the discriminator, $G$ is the generator, $\mathrm{BCE}$ is the binary cross-entropy, and $p(y)$ is the prior over the real/fake labels.
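One plausible reading of Eq. (29) as a discriminator loss, sketched in PyTorch (our own illustration; we assume a sigmoid-output discriminator and a balanced real/fake prior of 1/2 each, which is not stated explicitly in the text):

import torch
import torch.nn.functional as F

def discriminator_density_fixing_loss(disc, real_x, fake_x, gamma, eps=1e-12):
    d_real = disc(real_x)   # probabilities of "real" (sigmoid outputs)
    d_fake = disc(fake_x)
    bce = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) \
        + F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    # Average predicted probability of "real" over the real and fake batches.
    p_real = torch.cat([d_real, d_fake]).mean()
    p_model = torch.stack([p_real, 1.0 - p_real])
    prior = torch.full((2,), 0.5, device=p_model.device)  # assumed balanced prior
    kl = torch.sum(prior * (torch.log(prior) - torch.log(p_model + eps)))
    return bce + gamma * kl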
Figure 5 illustrates the stabilizing effect of density-fixing on the training of a GAN when modeling a toy dataset (blue samples). The neural networks in these experiments are fully connected and have three hidden layers of ReLU units. We can see that density-fixing contributes to stabilizing the training of GANs.
TABLE II
Dataset
CIFAR-10 | 28.235 | 28.510 | 29.086 | 30.964 | 30.892
CIFAR-100 | 66.622 | 66.723 | 66.861 | 66.895 | 67.007
STL-10 | 59.770 | 60.110 | 60.124 | 60.405 | 60.897
SVHN | 27.937 | 28.028 | 30.110 | 32.025 | 32.879
VIII Conclusion and Discussion
In this paper, we proposed a framework of regularization methods that can be used for both supervised and semi-supervised learning. Our proposed regularization method improves generalization performance by forcing the model to approximate the prior distribution of the classes. We proved that this regularization term is naturally derived from the formula of maximum likelihood estimation. We further investigated the asymptotic behavior of the proposed method and how the regularization term behaves under several concrete class prior distributions. Our experimental results demonstrate the effectiveness of our proposed method.
References
[1] D. M. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, no. 1, pp. 1–12, 2004.
[2] S. Lawrence, C. L. Giles, and A. C. Tsoi, "Lessons in neural network training: Overfitting may be harder than expected," in AAAI/IAAI, 1997, pp. 540–545.
[3] X. Zhu and A. B. Goldberg, "Introduction to semi-supervised learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
[4] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, "Semi-supervised learning with deep generative models," in Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[5] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, "mixup: Beyond empirical risk minimization," in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=r1Ddp1Rb
[6] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[7] M. Kimura, "Mixup training as the complexity reduction," arXiv preprint arXiv:2006.06231, 2020.
[8] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, "CutMix: Regularization strategy to train strong classifiers with localizable features," in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6023–6032.
[9] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann, "Stabilizing training of generative adversarial networks through regularization," in Advances in Neural Information Processing Systems, 2017, pp. 2018–2028.
[10] M. Kimura and T. Yanagihara, "Anomaly detection using GANs for visual inspection in noisy training data," in Asian Conference on Computer Vision. Springer, 2018, pp. 373–385.
[11] D.-H. Lee, "Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks," in Workshop on Challenges in Representation Learning, ICML, vol. 3, no. 2, 2013.
[12] A. Kumar, P. Sattigeri, and T. Fletcher, "Semi-supervised learning with GANs: Manifold invariance with improved inference," in Advances in Neural Information Processing Systems, 2017, pp. 5534–5544.
[13] S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics, vol. 22, pp. 79–86, 1951.
[14] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8026–8037.
[15] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," 2009.
[16] A. Coates, A. Ng, and H. Lee, "An analysis of single-layer networks in unsupervised feature learning," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
[17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," 2011.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[19] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, "Deep double descent: Where bigger models and more data hurt," in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=B1g5sA4twr
[20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.