Consistent Batch Normalization for Weighted Loss in Imbalanced-Data Environment

01/06/2020 ∙ by Muneki Yasuda, et al.

In this study, we consider classification problems based on neural networks in a data-imbalanced environment. Learning from an imbalanced dataset is one of the most important and practical problems in the field of machine learning. A weighted loss function (WLF) based on a cost-sensitive approach is a well-known and effective method for imbalanced datasets. We consider a combination of WLF and batch normalization (BN) in this study. BN is considered a powerful standard technique in the recent developments of deep learning. A simple combination of the two methods leads to a size-inconsistency problem, owing to a mismatch between the interpretations of the effective size of the dataset in the two methods. We propose a simple modification to BN, called weighted batch normalization (WBN), to correct the size mismatch. The idea of WBN is simple and natural. Using numerical experiments, we demonstrate that our method is effective in a data-imbalanced environment.


1 Introduction

Learning from an imbalanced dataset is one of the most important and practical problems in the field of machine learning, and it has been actively studied [8, 9]. This study focuses on classification problems based on neural networks with imbalanced datasets. Several methods have been proposed for these problems [9], for example, under/over-sampling, SMOTE [2], and cost-sensitive (CS) methods. We consider a CS-based method in this study. A simple type of CS method is obtained by assigning a weight to the data loss of each data point in the loss function [9, 4]. The weights in the CS method can be viewed as the importance of the corresponding data points. As discussed in Sec. 2, the weighting changes the interpretation of the effective size of each data point in the resultant weighted loss function (WLF). In an imbalanced dataset, the weights of data points that belong to the majority classes are often set to be smaller than those of data points that belong to the minority classes. For imbalanced datasets, two different effective settings of the weights are known: inverse class frequency (ICF) [11, 15, 12] and class-balanced loss (CBL) [3].

Batch normalization (BN) is a powerful regularization method for neural networks [13], and it has been one of the most important techniques in deep learning [7]. In BN, the affine signals from the lower layer are normalized over a specific mini-batch. As discussed in this paper, a simple combination of WLF and BN causes a size-inconsistency problem, which degrades the classification performance. As mentioned above, the interpretation of the effective size of data points is changed in WLF. However, in the standard scenario of BN, one data point is counted as just one data point. This inconsistency gives rise to the size-inconsistency problem.

The aim of this study is not to find a better learning method in a data-imbalanced environment, but to resolve the size-inconsistency problem in the simple combination of WLF and BN. In this study, we propose a consistent BN to resolve the size-inconsistency problem and demonstrate, using numerical experiments, that the proposed consistent BN improves the classification performance. This paper is an extension of our previous study [16], and it includes some new improvements: (i) the theory is improved (i.e., the definition of the signal variance in the proposed method is improved) and (ii) more experiments are conducted (i.e., for a new dataset and for a new WLF setting).

The remainder of this paper is organized as follows. In Sec. 2, we briefly explain WLF and the two different settings of the weights, i.e., ICF and CBL. In Sec. 3, we give a brief explanation of BN and show the performance degradation in a simple combination of WLF and BN (caused by the size-inconsistency problem) using numerical experiments for the Iris and MNIST databases. In Sec. 4, we propose a consistent BN, which can resolve the size-inconsistency problem, and justify the proposed method by using numerical experiments. As shown in the numerical experiments in Sec. 4.2, our method improves the classification performance. Finally, the summary and some future works are presented in Sec. 5. In the following sections, the definition sign "$:=$" is used only when the expression does not change throughout the paper. In the definitions of parameters that are redefined somewhere in the paper, we do not use the definition sign.

2 Weighted Loss Function in Imbalanced Data Environment

Let us consider a classification model that classifies an $n$-dimensional input vector $\boldsymbol{x} \in \mathbb{R}^n$ into $K$ different classes $C_1, C_2, \ldots, C_K$. It is convenient to use a 1-of-$K$ vector (or one-hot vector) to identify each class [1]. Here, each class corresponds to a $K$-dimensional vector whose elements are zero and one, i.e., a vector in which only one element is one and the remaining elements are zero. When $K = 3$, for example, $(0, 1, 0)$ indicates class $C_2$. For simplicity of notation, we denote the 1-of-$K$ vector whose $k$th element is one by $\boldsymbol{1}_k$, so that $\boldsymbol{1}_k$ corresponds to class $C_k$. We denote the size of the data points belonging to class $C_k$ in the training dataset introduced below by $N_k$, which is defined as $N_k := \sum_{\mu=1}^{N} \delta(\boldsymbol{t}_\mu, \boldsymbol{1}_k)$, where $\delta(\cdot, \cdot)$ is the Kronecker delta. By the definition, $\sum_{k=1}^{K} N_k = N$.
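As a small illustration of the 1-of-$K$ representation and the class sizes $N_k$, the following minimal NumPy sketch (with our own toy labels, not taken from the paper) encodes integer class labels as one-hot vectors and counts the data points per class.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer class labels (0, ..., K-1) as 1-of-K (one-hot) vectors."""
    T = np.zeros((labels.size, num_classes))
    T[np.arange(labels.size), labels] = 1.0
    return T

labels = np.array([0, 2, 1, 2, 2])   # toy targets for K = 3 classes
T = one_hot(labels, num_classes=3)   # shape (N, K)
N_k = T.sum(axis=0)                  # class sizes N_k = [1., 1., 3.]
assert N_k.sum() == labels.size      # sum_k N_k = N
```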

Suppose that a training dataset comprising $N$ data points, $D := \{(\boldsymbol{x}_\mu, \boldsymbol{t}_\mu) \mid \mu = 1, 2, \ldots, N\}$, is given, where $\boldsymbol{x}_\mu$ and $\boldsymbol{t}_\mu$ are the $\mu$th input and the corresponding target-class label represented by the 1-of-$K$ vector, respectively. In the standard machine-learning scenario, we minimize a loss function given by

$$ L(\theta) := \frac{1}{N} \sum_{\mu=1}^{N} \ell_\mu(\theta) \qquad (1) $$

with respect to $\theta$ using an appropriate back-propagation algorithm, where $\ell_\mu(\theta)$ denotes the data loss (e.g., the cross-entropy loss) for the $\mu$th data point and $\theta$ denotes the set of learning parameters in the classification model.

WLF is introduced as

$$ L_w(\theta) := \frac{1}{W} \sum_{\mu=1}^{N} w_\mu \ell_\mu(\theta), \qquad W := \sum_{\mu=1}^{N} w_\mu, \qquad (2) $$

where $w_\mu > 0$ is the weight of the $\mu$th data point and $W$ is the normalization factor. It can be viewed as a simple type of CS approach [9, 4]. In WLF, the $\mu$th data point is effectively replicated to $w_\mu$ data points. Hence, the relative occupancy (or the effective size) of the data points in each class is changed according to the weights. In WLF, the "effective" size of $N_k$ is then expressed by

$$ \tilde{N}_k := \sum_{\mu=1}^{N} w_\mu\, \delta(\boldsymbol{t}_\mu, \boldsymbol{1}_k). \qquad (3) $$

In general,

$$ \sum_{k=1}^{K} \tilde{N}_k = \sum_{\mu=1}^{N} w_\mu = W \qquad (4) $$

holds.
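To make the role of the weights and the normalization factor concrete, here is a minimal NumPy sketch of a weighted cross-entropy loss in the spirit of Eqs. (2) and (3); the function names and the per-sample cross-entropy form are illustrative assumptions, not the authors' code.

```python
import numpy as np

def cross_entropy(P, T):
    """Per-sample data losses ell_mu for predicted class probabilities P (N x K)
    and one-hot targets T (N x K)."""
    return -np.sum(T * np.log(P + 1e-12), axis=1)

def weighted_loss(P, T, w):
    """Weighted loss function (WLF), Eq. (2): sum_mu w_mu ell_mu / sum_mu w_mu."""
    return np.sum(w * cross_entropy(P, T)) / np.sum(w)

def effective_sizes(T, w):
    """Effective class sizes, Eq. (3): tilde{N}_k = sum_mu w_mu delta(t_mu, 1_k)."""
    return (w[:, None] * T).sum(axis=0)
```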

The weights should be individually set according to the task. For an imbalanced dataset (i.e., some $N_k$ are very large (majority classes) and some are very small (minority classes)), there are two known settings: ICF [11, 15, 12] and CBL [3]. In the following sections, we briefly explain the two settings.

2.1 Inverse class frequency

Here, we assume that $D$ is an imbalanced dataset, implying that the sizes $N_k$ are imbalanced. In ICF, the weights are set to

$$ w_\mu = \frac{N}{N_k} \quad \text{for } \boldsymbol{t}_\mu = \boldsymbol{1}_k, \qquad (5) $$

where $N_k \geq 1$ is assumed. The ICF weights effectively correct the imbalance in the loss function due to the imbalanced dataset. In WLF with ICF, the $\mu$th data point that belongs to class $C_k$ is replicated to $N/N_k$ data points. Therefore, in this setting, the effective sizes of $N_k$ are equal for any $k$:

$$ \tilde{N}_k = \sum_{\mu=1}^{N} \frac{N}{N_k}\, \delta(\boldsymbol{t}_\mu, \boldsymbol{1}_k) = N, $$

where Eqs. (3) and (5) were used. This means that the effective sizes of $N_k$ (i.e., $\tilde{N}_k$) are completely balanced in WLF (see Fig. 1).

Figure 1: Example of the two-class case: $C_1$ (minority class) and $C_2$ (majority class). The effective sizes are the same, $\tilde{N}_1 = \tilde{N}_2$, in ICF.

From Eq. (4), $W = \sum_{k=1}^{K} \tilde{N}_k = KN$ in ICF.
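A corresponding sketch of the ICF weighting under the convention written above ($w_\mu = N/N_k$ for a data point of class $C_k$; any overall constant factor would be absorbed by the normalization factor $W$):

```python
import numpy as np

def icf_weights(T):
    """Inverse-class-frequency weights, Eq. (5): w_mu = N / N_k for a data point
    of class k, where T is the (N x K) matrix of one-hot targets."""
    N = T.shape[0]
    N_k = T.sum(axis=0)
    return (N / N_k)[np.argmax(T, axis=1)]

# With these weights, (w[:, None] * T).sum(axis=0) equals N for every class,
# i.e., the effective class sizes are completely balanced.
```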

2.2 Class-balanced loss

Recently, a new type of WLF (referred to as class-balanced loss (CBL)), which is effective for imbalanced datasets, was proposed [3]. In CBL, the weights in Eq. (2) are set to

$$ w_\mu = \frac{1-\beta}{1-\beta^{N_k}} \quad \text{for } \boldsymbol{t}_\mu = \boldsymbol{1}_k, \qquad (6) $$

where $0 \leq \beta < 1$. In CBL, $N_k \geq 1$ is also assumed. The CBL weights correct the effective sizes of the data points by taking the overlaps of data points into account. Here, $\beta$ is a hyperparameter that smoothly connects the loss function in Eq. (1) and WLF with ICF. WLF with CBL is equivalent to Eq. (1) when $\beta = 0$, and it is equivalent to WLF with ICF in the limit $\beta \to 1$, except for the difference in a constant factor of the weights (which is irrelevant owing to the normalization factor $W$):

$$ \lim_{\beta \to 1} \frac{1-\beta}{1-\beta^{N_k}} = \frac{1}{N_k}. $$

Hence, CBL can be viewed as an extension of ICF. In Ref. [3], it is recommended that $\beta$ has a value close to one. In CBL, the effective sizes of $N_k$ are given by

$$ \tilde{N}_k = \frac{N_k (1-\beta)}{1-\beta^{N_k}}. \qquad (7) $$

From Eqs. (4) and (7),

$$ W = \sum_{k=1}^{K} \tilde{N}_k = \sum_{k=1}^{K} \frac{N_k (1-\beta)}{1-\beta^{N_k}} $$

in CBL.

As mentioned, the CBL weights converge to the ICF weights (up to a constant factor) in the limit $\beta \to 1$; therefore, the effective sizes $\tilde{N}_k$ are completely balanced in this limit. In fact, $\tilde{N}_k \to 1$ when $\beta \to 1$. The complete balance gradually begins to break as $\beta$ decreases. When $\epsilon := 1 - \beta$ is very small (i.e., $\beta$ is very close to one), Eq. (7) is expanded as

$$ \tilde{N}_k = \frac{N_k \epsilon}{1 - (1-\epsilon)^{N_k}} \approx 1 + \frac{N_k - 1}{2}\,\epsilon + O(\epsilon^2). $$

Therefore, when $N_1 \ll N_2$ (e.g., in a two-class case),

$$ \frac{\tilde{N}_2}{\tilde{N}_1} \approx 1 + \frac{N_2 - N_1}{2}\,\epsilon \ll \frac{N_2}{N_1}. $$

This means that $\tilde{N}_1$ and $\tilde{N}_2$ are almost balanced as compared with $N_1$ and $N_2$ when $\epsilon$ is very small.
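Analogously, a sketch of the CBL weights in Eq. (6); the value of β below is only an illustrative placeholder.

```python
import numpy as np

def cbl_weights(T, beta=0.999):
    """Class-balanced weights, Eq. (6): w_mu = (1 - beta) / (1 - beta**N_k)."""
    N_k = T.sum(axis=0)
    w_class = (1.0 - beta) / (1.0 - beta ** N_k)
    return w_class[np.argmax(T, axis=1)]

# beta = 0 gives uniform weights (equivalent to Eq. (1)); beta -> 1 gives
# weights proportional to 1/N_k, i.e., ICF up to a constant factor.
```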

3 Batch Normalization

Consider a standard neural network for classification whose $l$th layer consists of $n_l$ units ($l = 0, 1, \ldots, L$). The zeroth layer ($l = 0$) is the input layer, i.e., $n_0 = n$, and the network output, $\boldsymbol{y}$, is determined from the output signals of the $L$th layer. In the standard scenario for the feed-forward propagation of input $\boldsymbol{x}$ in the neural network, for $l \geq 1$, the $j$th unit in the $l$th layer receives an affine signal from the $(l-1)$th layer as

$$ \lambda_j^{(l)} := \sum_{i=1}^{n_{l-1}} w_{i,j}^{(l)} z_i^{(l-1)}, \qquad (8) $$

where $w_{i,j}^{(l)}$ is the directed-connection parameter from the $i$th unit in the $(l-1)$th layer to the $j$th unit in the $l$th layer, and $z_i^{(l-1)}$ is the output signal of the $i$th unit in the $(l-1)$th layer, where $z_i^{(0)}$ is identified as $x_i$. After receiving the affine signal, the $j$th unit in the $l$th layer outputs

$$ z_j^{(l)} := a_l\bigl(\lambda_j^{(l)} + b_j^{(l)}\bigr) \qquad (9) $$

to the upper layer, where $b_j^{(l)}$ is the bias parameter of the unit and $a_l$ is the specific activation function of the $l$th layer. In the classification, for input $\boldsymbol{x}$, the network output (the 1-of-$K$ vector) is usually determined through the class probabilities, $P(k \mid \boldsymbol{x})$, which are obtained via the softmax operation for $z_k^{(L)}$ ($k = 1, 2, \ldots, K$, with $n_L = K$):

$$ P(k \mid \boldsymbol{x}) := \frac{\exp\bigl(z_k^{(L)}\bigr)}{\sum_{k'=1}^{K} \exp\bigl(z_{k'}^{(L)}\bigr)}. \qquad (10) $$

The network output is the 1-of-$K$ vector whose $k^*$th element is one, where $k^* := \arg\max_k P(k \mid \boldsymbol{x})$. The feed-forward propagation is illustrated in Fig. 2.

Figure 2: Schematic illustration of the feed-forward propagation for classification.
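For reference, a minimal NumPy sketch of the feed-forward pass in Eqs. (8)–(10) for a small network; the layer sizes and variable names here are illustrative assumptions, not the paper's.

```python
import numpy as np

def affine(Z_prev, W):
    """Eq. (8): affine signals lambda_j = sum_i w_ij z_i (rows index data points)."""
    return Z_prev @ W

def layer_output(Lam, b, activation):
    """Eq. (9): z_j = a(lambda_j + b_j)."""
    return activation(Lam + b)

def softmax(Z_last):
    """Eq. (10): class probabilities P(k | x) from the last-layer output signals."""
    e = np.exp(Z_last - Z_last.max(axis=1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

relu = lambda u: np.maximum(u, 0.0)
identity = lambda u: u

X = np.random.randn(4, 3)                     # four inputs of dimension n = 3
W1, b1 = np.random.randn(3, 5), np.zeros(5)   # first layer (5 hidden units)
W2, b2 = np.random.randn(5, 2), np.zeros(2)   # second layer (K = 2 output units)
H = layer_output(affine(X, W1), b1, relu)
P = softmax(layer_output(affine(H, W2), b2, identity))
y = np.argmax(P, axis=1)                      # predicted classes k*
```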

Learning with BN is based on mini-batch-wise stochastic gradient descent (SGD) [13]. In BN, the training dataset, $D$, is divided into $n_B$ mini-batches: $B_1, B_2, \ldots, B_{n_B}$. In the following sections, we briefly explain the feed-forward and back propagations during training in the $l$th layer for a mini-batch $B$. Although all the signals appearing in the explanation depend on the index of the mini-batch, we omit the explicit description of this dependence unless there is a particular reason.

3.1 Feed-forward propagation

In the feed-forward propagation for the inputs in $B$ through the $l$th layer with BN, the form of the affine signal in Eq. (8) is replaced by

$$ \hat{\lambda}_{j,\mu}^{(l)} := \gamma_j^{(l)}\, \frac{\lambda_{j,\mu}^{(l)} - m_j^{(l)}}{\sqrt{v_j^{(l)} + \varepsilon}}, \qquad (11) $$

where $\lambda_{j,\mu}^{(l)}$ denotes the affine signal in Eq. (8) for the $\mu$th data point ($\mu \in B$), $\hat{\lambda}_{j,\mu}^{(l)}$ is used instead of $\lambda_{j,\mu}^{(l)}$ in Eq. (9), and

$$ m_j^{(l)} := \frac{1}{M} \sum_{\mu \in B} \lambda_{j,\mu}^{(l)} \qquad (12) $$

and

$$ v_j^{(l)} := \frac{1}{M-1} \sum_{\mu \in B} \bigl(\lambda_{j,\mu}^{(l)} - m_j^{(l)}\bigr)^2 \qquad (13) $$

are the mean and (unbiased) variance of the affine signals over $B$, respectively. Here, $M := |B|$ denotes the size of $B$ and $M \geq 2$ is assumed. The constant $\varepsilon$ in Eq. (11) is set to a negligibly small positive value to avoid division by zero. In BN, the distribution of the affine signals over $B$ is standardized. This means that the mean and (unbiased) variance of $(\lambda_{j,\mu}^{(l)} - m_j^{(l)})/\sqrt{v_j^{(l)} + \varepsilon}$ over $B$ are always zero and one (when $\varepsilon = 0$), respectively:

$$ \frac{1}{M} \sum_{\mu \in B} \frac{\lambda_{j,\mu}^{(l)} - m_j^{(l)}}{\sqrt{v_j^{(l)} + \varepsilon}} = 0, \qquad \frac{1}{M-1} \sum_{\mu \in B} \Biggl(\frac{\lambda_{j,\mu}^{(l)} - m_j^{(l)}}{\sqrt{v_j^{(l)} + \varepsilon}}\Biggr)^2 = 1. \qquad (14) $$

The factors $\gamma_j^{(l)}$ in Eq. (11) are also learning parameters, together with $w_{i,j}^{(l)}$ and $b_j^{(l)}$, which are determined via an appropriate back-propagation algorithm.

It is noteworthy that in the inference stage (after training), we use $\bar{m}_j^{(l)} := \frac{1}{n_B}\sum_{r=1}^{n_B} m_j^{(l)}(B_r)$ and $\bar{v}_j^{(l)} := \frac{1}{n_B}\sum_{r=1}^{n_B} v_j^{(l)}(B_r)$ (i.e., the sample average values over all the mini-batches) instead of $m_j^{(l)}$ and $v_j^{(l)}$, respectively, in Eq. (11), where $m_j^{(l)}(B_r)$ and $v_j^{(l)}(B_r)$ are the mean and variance, respectively, computed for mini-batch $B_r$, namely, they are identified with Eqs. (12) and (13), respectively.
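A minimal NumPy sketch of the batch-normalized affine signals in Eqs. (11)–(13), including the distinction between mini-batch statistics (training) and averaged statistics (inference); the bookkeeping of the averaged statistics is an assumption of ours.

```python
import numpy as np

def bn_forward(Lam, gamma, eps=1e-8, stats=None):
    """Normalize the affine signals Lam (M x n_l) over the mini-batch, Eq. (11).
    During training (stats is None), the mini-batch mean (Eq. (12)) and the
    unbiased variance (Eq. (13)) are used; in the inference stage, the averaged
    statistics (m_bar, v_bar) collected during training are passed in instead."""
    if stats is None:
        m = Lam.mean(axis=0)            # Eq. (12)
        v = Lam.var(axis=0, ddof=1)     # Eq. (13), unbiased (divides by M - 1)
    else:
        m, v = stats
    return gamma * (Lam - m) / np.sqrt(v + eps)
```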

3.2 Back propagation and gradients of loss

In this section, we briefly explain the back-propagation rule with or without BN. For expressing the back-propagation rule, we define the back-propagating signal of the $j$th unit in the $l$th layer for the $\mu$th data point ($\mu \in B$) as

$$ \delta_{j,\mu}^{(l)} := \frac{\partial L_B}{\partial \lambda_{j,\mu}^{(l)}}, \qquad (15) $$

where $L_B := \frac{1}{Z_B}\sum_{\mu \in B} \epsilon_\mu$ is the loss over mini-batch $B$ and $\epsilon_\mu$ is the loss for the $\mu$th data point, i.e., $\epsilon_\mu = \ell_\mu(\theta)$ for Eq. (1) or $\epsilon_\mu = w_\mu \ell_\mu(\theta)$ for Eq. (2). Here, $Z_B$ is the normalization factor that is $M$ for Eq. (1) or is $W_B$ for Eq. (2), where

$$ W_B := \sum_{\mu \in B} w_\mu. \qquad (16) $$

First, we show the back-propagation rule for the $l$th layer "without" BN, i.e., the case in which the form of the affine signal is defined by Eq. (8). In this case, the back-propagation rule is obtained by using the chain rule and Eqs. (8) and (9):

$$ \delta_{i,\mu}^{(l-1)} = a_{l-1}'\bigl(\lambda_{i,\mu}^{(l-1)} + b_i^{(l-1)}\bigr) \sum_{j=1}^{n_l} w_{i,j}^{(l)}\, \delta_{j,\mu}^{(l)} \qquad (17) $$

for $l = L, L-1, \ldots, 2$, where $a_{l-1}'$ denotes the derivative of the activation function $a_{l-1}$. Note that $\delta_{j,\mu}^{(L)}$ is obtained by directly differentiating $L_B$ by $\lambda_{j,\mu}^{(L)}$. By using the back-propagating signals, the gradients of $L_B$ with respect to $w_{i,j}^{(l)}$ and $b_j^{(l)}$ are expressed as

$$ \frac{\partial L_B}{\partial w_{i,j}^{(l)}} = \sum_{\mu \in B} \delta_{j,\mu}^{(l)}\, z_{i,\mu}^{(l-1)}, \qquad (18) $$
$$ \frac{\partial L_B}{\partial b_j^{(l)}} = \sum_{\mu \in B} \delta_{j,\mu}^{(l)}, \qquad (19) $$

respectively, for $l = 1, 2, \ldots, L$.

Next, we show the expression of the back-propagation rule for the $l$th layer "with" BN, i.e., the case in which the form of the affine signal is defined by Eq. (11). Using a manipulation similar to that for Eq. (17) (i.e., using the chain rule and Eqs. (11) and (9)), we obtain

(20)

where additional factors appear because the mini-batch mean and variance in Eqs. (12) and (13) depend on all the affine signals in $B$. In this case, the gradients of $L_B$ with respect to $w_{i,j}^{(l)}$, $\gamma_j^{(l)}$, and $b_j^{(l)}$ are as follows. The gradient with respect to $w_{i,j}^{(l)}$ has the same expression as Eq. (18). The gradients with respect to $\gamma_j^{(l)}$ and $b_j^{(l)}$ are

(21)
(22)

respectively, for $l = 1, 2, \ldots, L$.

3.3 Numerical experiment

In the experiments in this section, we used two different datasets, namely Iris and MNIST. Iris is a database for the classification of three types of irises; each data point consists of four-dimensional input data (sepal length, sepal width, petal length, and petal width) and the corresponding target iris label (setosa, versicolor, or virginica). MNIST is a database of handwritten digits (0–9). Each data point in MNIST consists of the input data, a digit image, and the corresponding target digit label.

For the two datasets, we considered several imbalanced two-class classification problems. For Iris, (i) the classification problem of "versicolor" and "virginica" and (ii) the classification problem of "versicolor" and "setosa" were considered. For MNIST, (I) the classification problem of "one" and "zero", (II) the classification problem of "one" and "eight", and (III) the classification problem of "four" and "seven" were considered. The numbers of training and test data points used in these experiments are shown in Tabs. 1 and 2; they were randomly picked from Iris and MNIST, respectively. In the experiments for MNIST, the minority data accounts for less than 1% of all the data. All the input data were standardized in the preprocessing. For the experiments, we used a three-layered neural network: an input layer, a first hidden layer with rectified linear unit (ReLU) activation [6], and a second hidden layer. The network output is computed from the output signals of the second hidden layer via the softmax operation, as explained in the first part of this section. We adopted BN for the first and second hidden layers. In the training of the neural network, we used the He initialization [10] for the first hidden layer and the Xavier initialization [5] for the second hidden layer, and used the Adam optimizer [14]. The cross-entropy loss,

$$ \ell_\mu(\theta) := -\sum_{k=1}^{K} \delta(\boldsymbol{t}_\mu, \boldsymbol{1}_k) \ln P(k \mid \boldsymbol{x}_\mu), $$

was used, where $P(k \mid \boldsymbol{x}_\mu)$ is the class probability defined in Eq. (10). The remaining hyperparameters, as well as the mini-batch sizes, were set separately for Iris and for MNIST.

(i) (ii)
versicolor (minority) virginica (majority) versicolor (minority) setosa (majority)
train 6 50 5 50
test 50 50 50 50
Table 1: Number of data points used in experiments (i) and (ii) for Iris.
(I) (II) (III)
1 (minority) 0 (majority) 1 (minority) 8 (majority) 4 (minority) 7 (majority)
train 45 5923 45 5851 39 6265
test 1135 980 1135 974 982 1028
Table 2: Number of data points used in experiments (I), (II), and (III) for MNIST.

The results of the experiments for the Iris dataset ((i) and (ii)) and for the MNIST dataset ((I), (II), and (III)) are shown in Tabs. 3 and 4, respectively. For each dataset, we used three different methods: (a) the standard loss function in Eq. (1) combined with BN (LF+BN), (b) WLF with ICF combined with BN (WLF(ICF)+BN), and (c) WLF with CBL combined with BN (WLF(CBL)+BN). The accuracies shown in Tabs. 3 and 4 are the average values over 10 experiments (in all the experiments, the training and test datasets were fixed). In each experiment, we chose the best model, from the perspective of the classification accuracy for the test set, obtained during 100-epoch training. The classification accuracies for the majority classes are very good and those for the minority classes are poor, as we expected. However, almost all the results of WLF are worse than those of the standard loss function. This means that the correction for the effective data size in WLF does not work well in imbalanced datasets.

The results of (a) and (b) in Tab. 4 are slightly worse than those in the previous study [16], because some experimental settings were different. For example, in the previous study, we constructed the mini-batches in such a way that all of them certainly include data points in the minority class, but we did not do so in the current experiments. Furthermore, the results shown in the previous study were not average values but the best ones obtained from several experiments. However, this performance difference is not essential for our main claim in this paper. Because our aim is to resolve the size-inconsistency problem and to improve the classification performance, the difference in the baseline is not important.

(i) (ii)
versicolor virginica overall versicolor setosa overall
(a) LF+BN 39.4% 100% 69.7% 61.4% 100% 80.7%
(b) WLF(ICF)+BN 30.8% 100% 65.4% 47.4% 100% 73.7%
(c) WLF(CBL)+BN 40.2% 100% 70.1% 56.8% 100% 78.4%
Table 3: Classification accuracies (of each iris and of overall) for the test set in Iris.
(I) (II) (III)
1 0 overall 1 8 overall 4 7 overall
(a) LF+BN 66.9% 100% 82.2% 69.7% 100% 83.7% 34.4% 100% 68.0%
(b) WLF(ICF)+BN 63.8% 100% 80.6% 71.2% 100% 84.5% 27.6% 100% 64.7%
(c) WLF(CBL)+BN 64.8% 100% 81.1% 73.1% 100% 85.5% 28.1% 100% 64.8%
Table 4: Classification accuracies (of each digit and of overall) for the test set in MNIST.
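For concreteness, the following PyTorch-style sketch shows how a baseline of the kind evaluated above (WLF combined with standard BN, i.e., methods (b) and (c)) can be assembled. The layer sizes, learning rate, and class counts below are placeholders rather than the paper's settings, and torch.nn.BatchNorm1d implements the standard (unweighted) statistics of Eqs. (12) and (13), not the WBN proposed in Sec. 4.

```python
import torch
import torch.nn as nn

n_in, n_1, K = 784, 64, 2            # placeholder sizes (not the paper's values)

class BaselineNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1, self.bn1 = nn.Linear(n_in, n_1), nn.BatchNorm1d(n_1)
        self.fc2, self.bn2 = nn.Linear(n_1, K), nn.BatchNorm1d(K)
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')  # He init
        nn.init.xavier_normal_(self.fc2.weight)                        # Xavier init

    def forward(self, x):
        h = torch.relu(self.bn1(self.fc1(x)))
        return self.bn2(self.fc2(h))   # logits; the softmax is applied in the loss

net = BaselineNet()
class_counts = torch.tensor([5923.0, 45.0])   # illustrative (majority, minority) counts
class_weights = 1.0 / class_counts            # ICF-like per-class weights
# With reduction='mean', CrossEntropyLoss normalizes the weighted sum of the
# per-sample losses by the sum of the sample weights, matching Eq. (2).
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
```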

4 Weighted Batch Normalization

As demonstrated in Sec. 3.3, the simple combination of WLF and BN does not perform well. This is considered to be due to the inconsistency in the effective data size between WLF and BN. As mentioned in Sec. 2, in WLF, the interpretation of the effective sizes of the data points is changed in accordance with the corresponding weights. However, in BN, one data point is treated as just one data point in the computation of the mean and variance of the affine signals (cf. Eqs. (12) and (13)). This fact causes a size-inconsistency problem, which results in the degradation of the classification performance, as shown in Sec. 3.3. To resolve this problem, we propose a modified BN (referred to as weighted batch normalization (WBN)) for WLF.

4.1 Feed-forward and back propagations for WBN

The idea of our method is simple and natural. To maintain consistency with the interpretation of the effective data size, the statistics over the mini-batches should be re-evaluated according to the corresponding weights in BN. This implies that

$$ m_j^{(l)} := \frac{1}{W_B} \sum_{\mu \in B} w_\mu\, \lambda_{j,\mu}^{(l)} \qquad (23) $$

and

$$ v_j^{(l)} := \frac{1}{\Omega_B} \sum_{\mu \in B} w_\mu \bigl(\lambda_{j,\mu}^{(l)} - m_j^{(l)}\bigr)^2 \qquad (24) $$

should be used in Eq. (11) instead of Eqs. (12) and (13), where

$$ \Omega_B := W_B - \frac{1}{W_B} \sum_{\mu \in B} w_\mu^2 \qquad (25) $$

is the normalization factor for the variance. The normalization factor $W_B$ is already defined in Eq. (16). Eqs. (23) and (24) are the modified versions of Eqs. (12) and (13), weighted in the same manner as WLF. The normalization factors in Eqs. (23) and (24) ensure that they are unbiased; see Appendix A. In the previous study [16], a different normalization factor was used for the variance. However, such a definition is not appropriate from the perspective of the unbiasedness of the variance. When the weights $w_\mu$ are constant, WBN is equivalent to BN, because Eqs. (23) and (24) are reduced to Eqs. (12) and (13) in this case.

In BN, the distribution of the normalized affine signals over $B$ is standard, as shown in Eq. (14). On the other hand, in WBN, it is not standard, but its weighted distribution is standard:

$$ \frac{1}{W_B} \sum_{\mu \in B} w_\mu\, \frac{\lambda_{j,\mu}^{(l)} - m_j^{(l)}}{\sqrt{v_j^{(l)} + \varepsilon}} = 0, \qquad \frac{1}{\Omega_B} \sum_{\mu \in B} w_\mu \Biggl(\frac{\lambda_{j,\mu}^{(l)} - m_j^{(l)}}{\sqrt{v_j^{(l)} + \varepsilon}}\Biggr)^2 = 1 $$

when $\varepsilon = 0$. This standardization property reflects the effective data size in WLF. In WLF, the $\mu$th data point is replicated according to the corresponding weight $w_\mu$, as mentioned in Sec. 2. Therefore, the signal $\lambda_{j,\mu}^{(l)}$, which corresponds to the $\mu$th data point, should also be replicated in the same manner. The above standardization property embodies this notion.
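A minimal NumPy sketch of the weighted statistics in Eqs. (23)–(25), reflecting our reading of the proposed modification (the variable names are ours, not the authors' code):

```python
import numpy as np

def wbn_forward(Lam, w, gamma, eps=1e-8):
    """Weighted batch normalization of the affine signals Lam (M x n_l) over a
    mini-batch, using the per-sample WLF weights w (length M)."""
    W_B = w.sum()                              # Eq. (16)
    m = (w @ Lam) / W_B                        # weighted mean, Eq. (23)
    Omega_B = W_B - (w ** 2).sum() / W_B       # normalization factor, Eq. (25)
    v = (w @ (Lam - m) ** 2) / Omega_B         # weighted unbiased variance, Eq. (24)
    return gamma * (Lam - m) / np.sqrt(v + eps)

# When all weights equal a constant c, W_B = M * c and Omega_B = c * (M - 1),
# so the weighted statistics reduce to the standard BN statistics of Eqs. (12)-(13).
```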

In WBN, the back-propagation rule in Eq. (20) is modified as

(26)

because the weighted statistics in Eqs. (23) and (24) are used in place of Eqs. (12) and (13). The gradients of $L_B$ with respect to $w_{i,j}^{(l)}$ and $\gamma_j^{(l)}$ have the same expressions as Eqs. (18) and (21), respectively. The gradient with respect to $b_j^{(l)}$ in Eq. (22) is modified as

(27)

4.2 Numerical experiment

In this section, we show the experimental results of the proposed method for the imbalanced classification problems ((i) and (ii) for Iris and (I)–(III) for MNIST) described in Sec. 3.3. The detailed settings of the experiments were basically the same as those in Sec. 3.3. For each experiment, we used two different methods: (d) WLF with ICF combined with WBN (WLF(ICF)+WBN) and (e) WLF with CBL combined with WBN (WLF(CBL)+WBN). The results of the experiments for Iris ((i) and (ii)) and for MNIST ((I)–(III)) are shown in Tabs. 5 and 6, respectively. The results in the tables are the average values obtained over 10 experiments. We observe that the classification accuracies for the minority classes in all the experiments are largely improved (cf. Tabs. 3 and 4). This means that the correction for the effective data size in WLF works well, because the size-inconsistency problem is resolved by WBN. The results of CBL are better than those of ICF in all the experiments.

We executed the same experiments using the WBN proposed in the previous study, in which the normalization factor for the variance was defined differently [16]. The results obtained from these experiments were worse than those shown in Tabs. 5 and 6. This means that the definition of the normalization factor in Eq. (25) is better from the perspective not only of the unbiasedness of the variance but also of the classification performance.

(i) (ii)
versicolor virginica overall versicolor setosa overall
(d) WLF(ICF)+WBN 59.6% 100% 79.8% 79.6% 100% 89.8%
(e) WLF(CBL)+WBN 77.2% 100% 88.6% 84.4% 100% 92.2%
Table 5: Classification accuracies (of each iris and of overall) for the test set in Iris.
(I) (II) (III)
1 0 overall 1 8 overall 4 7 overall
(d) WLF(ICF)+WBN 85.7% 100% 92.3% 79.6% 100% 89.0% 48.7% 100% 75.0%
(e) WLF(CBL)+WBN 87.6% 100% 93.4% 80.6% 100% 89.6% 49.9% 100% 75.5%
Table 6: Classification accuracies (of each digit and of overall) for the test set in MNIST.

5 Summary and Future Works

In this paper, we proposed a new BN method for learning based on WLF. The idea of the proposed method is simple but essential. The proposed BN (i.e., WBN) can resolve the size-inconsistency problem in the combination of WLF and BN, and it improves the classification performance in data-imbalanced environments, as demonstrated in the numerical experiments. We verified the validity of WBN for two different databases, Iris and MNIST, and for two different weight settings, ICF and CBL, through numerical experiments. However, it is important to check the validity of WBN for other databases and other weight settings. Furthermore, deepening the mathematical aspects of WBN, such as the internal covariate shift in WBN, is also important. We will address them in our future studies.

We have considered only the classification problem in this study. The size inconsistency in the combination of WLF and BN will also arise in other types of problems, e.g., regression problems. Our idea is presumably applicable to those cases, because the idea of our method is independent of the style of the output. We will also address the application of WBN to problems other than the classification problem in our future studies.

Acknowledgments

This work was partially supported by JSPS KAKENHI (Grant Numbers: 15H03699, 18K11459, and 18H03303), JST CREST (Grant Number: JPMJCR1402), and the COI Program from the JST (Grant Number JPMJCE1312).

Appendix A Unbiasedness of Estimators in Equations (23) and (24)

In this appendix, we show that the mean and variance in Eqs. (23) and (24) are unbiased estimators. By omitting the indices ($j$ and $l$) unrelated to this analysis, they are expressed as

$$ m = \frac{1}{W_B} \sum_{\mu \in B} w_\mu \lambda_\mu, \qquad v = \frac{1}{\Omega_B} \sum_{\mu \in B} w_\mu (\lambda_\mu - m)^2. \qquad (28) $$

Here, we assume that $\{\lambda_\mu \mid \mu \in B\}$ are i.i.d. samples drawn from a distribution $p(\lambda)$. The mean and variance of $p(\lambda)$ are denoted by $m_0$ and $\sigma^2$, respectively. The expectation of $m$ in Eq. (28) over $p$ is

$$ \mathbb{E}[m] = \frac{1}{W_B} \sum_{\mu \in B} w_\mu \mathbb{E}[\lambda_\mu] = m_0. $$

Therefore, $m$ is an unbiased estimator. Similarly, the expectation of $v$ in Eq. (28) is

$$ \mathbb{E}[v] = \frac{1}{\Omega_B}\Bigl( \sum_{\mu \in B} w_\mu \mathbb{E}[\lambda_\mu^2] - W_B \mathbb{E}[m^2] \Bigr) = \frac{1}{\Omega_B}\Bigl( W_B(\sigma^2 + m_0^2) - W_B\Bigl(\frac{\sigma^2}{W_B^2}\sum_{\mu \in B} w_\mu^2 + m_0^2\Bigr) \Bigr) = \sigma^2. $$

Here, we used

$$ \sum_{\mu \in B} w_\mu (\lambda_\mu - m)^2 = \sum_{\mu \in B} w_\mu \lambda_\mu^2 - W_B m^2, \qquad \mathbb{E}[m^2] = \frac{\sigma^2}{W_B^2} \sum_{\mu \in B} w_\mu^2 + m_0^2. $$

Therefore, $v$ is also an unbiased estimator.
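A quick Monte Carlo sanity check of the unbiasedness shown above (our own check, assuming i.i.d. Gaussian samples and fixed positive weights):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.1, 2.0, size=20)        # fixed positive weights w_mu
W_B = w.sum()
Omega_B = W_B - (w ** 2).sum() / W_B      # normalization factor, Eq. (25)

m_samples, v_samples = [], []
for _ in range(100000):
    lam = rng.normal(loc=1.5, scale=2.0, size=w.size)  # i.i.d., mean 1.5, variance 4
    m = (w * lam).sum() / W_B                          # Eq. (23) (single unit)
    v = (w * (lam - m) ** 2).sum() / Omega_B           # Eq. (24)
    m_samples.append(m)
    v_samples.append(v)

print(np.mean(m_samples))   # close to 1.5
print(np.mean(v_samples))   # close to 4.0
```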

References

  • [1] C. M. Bishop (2006) Pattern recognition and machine learning. Springer-Verlag New York. External Links: ISBN 978-0-387-31073-2 Cited by: §2.
  • [2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §1.
  • [3] Y. Cui, M. Jia, T. Lin, Y. Song, and S. Belongie (2019) Class-balanced loss based on effective number of samples. arXiv preprint arXiv:1901.05555. Cited by: §1, §2.2, §2.
  • [4] J. P. Dmochowski, P. Sajda, and L. C. Parra (2010) Maximum likelihood in cost-sensitive learning: model specification, approximations, and upper bounds. The Journal of Machine Learning Research 3, pp. 3313–3332. Cited by: §1, §2.
  • [5] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proc. of the 13th International Conference on Artificial Intelligence and Statistics 9, pp. 249–256. Cited by: §3.3.
  • [6] X. Glorot, A. Bordes, and Y. Bengio (2011) Deep sparse rectifier neural networks. In Proc. of the 14th International Conference on Artificial Intelligence and Statistics 15, pp. 315–323. Cited by: §3.3.
  • [7] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §1.
  • [8] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017) Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications 73, pp. 220–239. External Links: ISSN 0957-4174 Cited by: §1.
  • [9] H. He and E. A. Garcia (2009) Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering 21 (9), pp. 1263–1284. Cited by: §1, §2.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proc. of the 2015 IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §3.3.
  • [11] C. Huang, Y. Li, C. C. Loy, and X. Tang (2016) Learning deep representation for imbalanced classification. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5375–5384. Cited by: §1, §2.
  • [12] C. Huang, Y. Li, C. C. Loy, and X. Tang (2018) Deep imbalanced learning for face recognition and attribute prediction. arXiv preprint arXiv:1806.00194. Cited by: §1, §2.
  • [13] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proc. of the 32nd International Conference on International Conference on Machine Learning 37, pp. 448–456. Cited by: §1, §3.
  • [14] D. P. Kingma and L. J. Ba (2015) Adam: a method for stochastic optimization. In Proc. of the 3rd International Conference on Learning Representations, pp. 1–13. Cited by: §3.3.
  • [15] Y. Wang, D. Ramanan, and M. Hebert (2017) Learning to model the tail. In Proc. of the Advances in Neural Information Processing Systems 30, pp. 7029–7039. Cited by: §1, §2.
  • [16] M. Yasuda and S. Ueno (2019) Improvement of batch normalization in imbalanced data. In Proc. of the 2019 International Symposium on Nonlinear Theory and its Applications, pp. 146–149. Cited by: §1, §3.3, §4.1, §4.2.