Deep NNs have achieved remarkable success in a variety of application domains such as Computer Vision [he2017mask, ronneberger2015unet, teichmann2018multinet] and other fields [saufi2019challanges, shone2018deep], [anderson2018bottom, devlin2018bert]. The training success of NN models mainly relies on three fundamental requirements: (i) the amount of available training data; (ii) the theoretical capacity of the trainable model; and (iii) the learning strategy including the hyperparameters. All requirements are interdependent and have to be considered as a whole.
The capacity of the model is mainly defined by its architecture, which has to be big enough to learn potentially complex data relations without over-adapting to the training data, which would result in poor generalization. Given inappropriate combinations of network size and training data, special measures have to be taken to prevent overfitting [zhang2017understanding]. Popular countermeasures include (i) artificially increasing the amount of training data [cubuk2019autoaugment, guo2019mixup]; (ii) reducing the network’s capacity; and (iii) changing the learning strategy [yaguchi2018adam]. The reduction of the capacity can either be done explicitly by reducing the number of learnable parameters (i.e. pruning) [goodfellow2016deep, liu2017learning, klemm2019deploying] or implicitly by regularizing the trainable parameters of a model [nowlan1992simplifying, Ioffe2015]. While a reduction of the NN’s capacity can lead to less powerful networks, synthetically generated training data can induce a variety of side effects [geirhos2018imagenet]. Different training strategies change neither the network’s capacity nor the training data. However, the effect of these strategies is not yet completely understood, so they still require trial and error [reddi2018convergence].
In this work, we propose a different approach to tackle overfitting, namely targeted sparsity regularization. In contrast to existing work in which sparsity is often considered a desirable network property [ide2017improvement, papyan2018theoretical], we demonstrate that sparse activations in intermediate layers are in fact a reliable indicator for overfitting. By exploiting this concept, we derive a novel visualization strategy to diagnose overfitting in individual layers of convolutional neural networks during training. Furthermore, these insights are used to introduce a targeted per-layer regularization strategy which avoids the under-utilization of the network’s capacity while increasing the overall performance. Moreover, we demonstrate how targeted sparsity regularization makes it possible to train larger NN architectures on comparatively small datasets while preventing overfitting entirely, even when trained excessively. As sparsity also limits the amount of information stored in a given number of activations [pezzotti2018deepeyes], we furthermore demonstrate that the predictive power of a NN with a fixed size increases when information from multiple extracted features is combined.
In summary, we present four central contributions:
(C1) We provide an interactive visualization method based on the sparsity of activations that enables the identification of overfitting on a per-layer basis and can be embedded directly into TensorBoard.
(C2) We introduce a novel targeted regularization strategy exploiting the sparsity of the activations in combination with decorrelated convolutional filters, which can prevent overfitting even for very long trainings.
(C3) We demonstrate that this regularization strategy significantly increases image classification performance on well-known datasets using different NN architectures while outperforming both dropout and batch normalization.
(C4) We provide novel insights into the seemingly contradicting concepts of activation sparsity versus network capacity by demonstrating that deep NNs can be regularized in order to learn distinctive and salient features without inducing low or redundant activations.
2 Related Work
Work on regularization against overfitting can be separated into two categories, namely topology-based and loss-based regularization. As the name suggests, topology-based regularization changes the neural connections or incorporates additional layers into the network’s architecture. The two most common examples are dropout [Srivastava2014] and batch normalization [Ioffe2015]. Dropout layers temporarily switch off random neurons during the training process, inducing a noisy input to the subsequent (hidden) layers [Srivastava2014]. In contrast, batch normalization layers were initially designed to reduce the internal covariate shift to accelerate the training process [Ioffe2015]. Since the normalization is performed by subtracting the mean and dividing by the standard deviation of the current batch activations, it also induces noise in subsequent layers, which results in a regularizing side effect. The common principle of these methods is to artificially modify the training input and hence reduce the chance of over-adaptation. While this approach has proven to be effective in many applications, the introduced noise can also prevent optimal performance, as can be seen when combining dropout and batch normalization, which in general leads to degraded performance [Li2019, Klambauer2017].
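The batch dependence described above can be illustrated with a minimal NumPy sketch. It is a simplified normalization (no learnable scale or shift, and the helper name `batch_normalize` is hypothetical), intended only to show that the same sample is mapped to different values depending on the batch it is placed in.

```python
import numpy as np

def batch_normalize(batch, eps=1e-5):
    """Normalize each feature with the mean and standard deviation of the batch."""
    mean = batch.mean(axis=0)
    std = batch.std(axis=0)
    return (batch - mean) / (std + eps)

rng = np.random.default_rng(0)
sample = np.array([1.0, -0.5, 2.0])

# The same sample is placed into two different random batches.
batch_a = np.vstack([sample, rng.normal(size=(7, 3))])
batch_b = np.vstack([sample, rng.normal(size=(7, 3))])

out_a = batch_normalize(batch_a)[0]
out_b = batch_normalize(batch_b)[0]

# The normalized activations of the sample differ because the batch statistics differ.
print(out_a)
print(out_b)
```

From the point of view of the subsequent layer, this variation acts like noise on its input, which is the regularizing side effect mentioned in the text.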
In loss-based regularization strategies neither the training input nor the network’s architecture is changed. Instead, the loss is extended by an explicit regularization term to cope with overfitting. Loss-based regularization can again be separated into two common paradigms, differing in the domain of the regularizer, which can either be the hidden activations / weights or the output distribution. Based on the observation that large deep neural networks often contain redundancies [ayinde2019correlation], many of the activation- / weight-based strategies aim to reduce the model complexity by using weight decay [nowlan1992simplifying, ayinde2017deep] or by pruning the network directly [molchanov2016pruning, liu2017learning]. Similarly, it has been shown that classical L1 regularization induces sparsity while L2 regularization causes the decay of weights over time [yaguchi2018adam, mehta2019implicit]. Even though these techniques can improve the generalizability and performance of the network, they underutilize the potential capacity of the model [ayinde2019regularizing].
In contrast, decorrelation-based regularizers aim to improve the networks while attempting to employ the given capacity. Both hidden features [bengio2009slow] and activations [bao2013incoherent, cogswell2015reducing] are utilized to reduce the redundancy within the model. Others try to avoid correlations by reducing the cosine similarity among feature vectors to avoid overfitting while improving the overall performance [ayinde2019regularizing].
In the second category of loss-based regularization, the output distribution of the network is used to reduce overfitting. In particular, maximum entropy-based regularization has long been studied to regulate the model’s behavior [miller1996global]. In a more recent approach, Pereyra et al. propose to penalize highly confident (low-entropy) Softmax outputs [pereyra2017regularizing]. However, since the entropy is derived from the predictions of the last layer, this regularization does not directly target the layers that cause the loss of information.
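An output-entropy regularizer of this kind can be sketched in NumPy as follows. This is a minimal illustration, not the reference implementation of the cited work: the function name `confidence_penalized_loss` and the weight `beta=0.1` are assumptions chosen for the example. The loss is the cross-entropy minus a weighted entropy of the Softmax output, so confident predictions are penalized.

```python
import numpy as np

def softmax(logits):
    """Numerically stable Softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_penalized_loss(logits, labels, beta=0.1):
    """Cross-entropy minus beta times the entropy of the output distribution."""
    p = softmax(logits)
    n = logits.shape[0]
    cross_entropy = -np.log(p[np.arange(n), labels] + 1e-12).mean()
    entropy = -(p * np.log(p + 1e-12)).sum(axis=-1).mean()
    return cross_entropy - beta * entropy

# Two samples, four classes; the second prediction is far more confident.
logits = np.array([[0.5, 0.2, 0.1, 0.0],
                   [9.0, 0.0, 0.0, 0.0]])
labels = np.array([0, 0])
print(confidence_penalized_loss(logits, labels))
```

Note that the entropy is computed from the final output only, which is why such a term cannot single out an individual hidden layer.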
In fact, all the regularization strategies mentioned above have in common that they regularize the network in a non-targeted manner: Instead of identifying the layers which are responsible for overfitting, the regularization is applied to either all layers (e.g. decorrelation-based regularization) or to randomly selected entities within the network (e.g. dropout). Furthermore, none of the above-mentioned techniques explicitly addresses the contradicting trade-off between sparse but salient activations (i.e. sparsity) and low or redundant feature responses (i.e. network capacity). In contrast, our proposed method allows a targeted regularization and decorrelation of layers, thus avoiding overfitting of the NN.
3.1 Measuring Sparsity
Sparse activations in CNNs are inspired by mammalian visual cortex cells and have been studied extensively for feature extraction purposes[Olshausen1996]. Based on the idea that each filter of a convolutional layer is trained to identify a particular feature of a given input [Zeiler2014], this work analyzes the perplexity [pezzotti2018deepeyes] of receptive field activations to quantify sparsity (C1). The main reason for analyzing perplexity is to determine whether a filter is trained in such a way that it returns an overconfident output distribution for a given receptive field.
In this work, the proximate receptive field is defined as the region of a layer’s direct input that affects a filter’s response (see Figure 1). This definition differs from the conventional definition of receptive fields, which usually refers to the corresponding area in the network’s input space. We will further focus on 2D convolutions, common for image data. All reasoning, however, also applies to other dimensionalities and dense layers.
A layer with $K$ filters may create an $X \times Y \times K$ shaped feature tensor. Thus, the input of this layer consists of $X \cdot Y$ receptive fields. Let $r^{(n)}_{i,j,c}$ be a pixel at position $(i, j)$ and channel $c$ of receptive field $R_n$. The corresponding weight of filter $F_k$ which affects this pixel is given as $w^{(k)}_{i,j,c}$. The number of pixels in $R_n$ is equal to the number of weights in $F_k$. Hence, the linear activation created by $F_k$ applied to $R_n$ is defined as
$$a_{n,k} = \sum_{i,j,c} r^{(n)}_{i,j,c} \, w^{(k)}_{i,j,c}. \tag{1}$$
To improve readability, we denote the results of this linear filtering as receptive field activation vectors (RFAV) $\mathbf{a}_n = (a_{n,1}, \dots, a_{n,K})$, where $n$ is a linear index over all receptive fields (see Figure 1).
We encode the sparsity of receptive field activations with the help of perplexity. Perplexity is defined as
$$\mathrm{PP}(p) = 2^{H(p)}, \qquad H(p) = -\sum_{x} p(x) \log_2 p(x), \tag{2}$$
where $p$ is a discrete probability distribution and $H(p)$ the corresponding entropy of $p$. Since perplexity is strictly monotonically increasing w.r.t. entropy, our approach focuses on entropy only. In order to calculate the entropy of an RFAV $\mathbf{a}_n$, its components have to be transformed into a probability-like distribution. This can be done by applying the Softmax function:
$$p_{n,k} = \frac{e^{a_{n,k}}}{\sum_{k'=1}^{K} e^{a_{n,k'}}}, \tag{3}$$
where $a_{n,k}$ is the $k$-th component of $\mathbf{a}_n$. For every receptive field $R_n$ we obtain the corresponding RFAV entropy by
$$H_n = -\sum_{k=1}^{K} p_{n,k} \log_2 p_{n,k}. \tag{4}$$
The values of $H_n$ correlate with the respective sparsities, such that sparse activations result in RFAV entropy values close to zero and dense activations in large values. Encoding and re-ordering all $H_n$ of a layer in a 2D heat map makes it possible to visualize the localized sparsity of activations for a given input (see Figure 1 and contribution C1).
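The computation of an RFAV entropy heat map can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: a single convolutional layer with stride 1 and no padding, no bias term, entropy measured in bits, and the hypothetical helper names `rfav` and `rfav_entropy`. It is not the authors' TensorBoard implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def rfav(layer_input, filters):
    """Linear activation of every filter for every proximate receptive field.

    layer_input: (height, width, channels); filters: (kh, kw, channels, K).
    Returns an array of shape (out_h, out_w, K), i.e. one RFAV per position.
    """
    kh, kw = filters.shape[:2]
    # fields has shape (out_h, out_w, channels, kh, kw)
    fields = sliding_window_view(layer_input, (kh, kw), axis=(0, 1))
    return np.einsum("xycij,ijck->xyk", fields, filters)

def rfav_entropy(activations):
    """Softmax over the filter axis, followed by the Shannon entropy in bits."""
    z = activations - activations.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log2(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))        # direct input of the layer
w = rng.normal(size=(3, 3, 3, 16))    # 16 filters of size 3x3x3

heat_map = rfav_entropy(rfav(x, w))   # one entropy value per receptive field
perplexity = 2.0 ** heat_map          # monotonically related to the entropy
print(heat_map.shape)
```

With 16 filters the entropy of each receptive field lies between 0 (a single dominant filter response) and 4 bits (all filters respond equally), so low values in the heat map mark sparse activations.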
3.2 Regularizing Sparsity
As we will show in subsection 4.1, when overfitted, neurons (i.e. filters) in a NN have been adapted to particular features and only cover individual observations including noise and fluctuations. Thus, sparse activations become much more common when a neural network overfits (see Figure 12). Instead of preventing overfitting by artificially modifying training data, we propose a penalty term that prevents the generation of sparse activations directly (C2).
As entropy is differentiable, it can be used as an activity regularizer to control RFAV sparsity. In order to regularize sparsity, the loss function is extended by a penalty term
$$\mathcal{L}_{\mathrm{S}} = -\sum_{l} \lambda^{(l)}_{\mathrm{S}} \, \frac{1}{N_l} \sum_{n=1}^{N_l} H^{(l)}_n, \tag{5}$$
where $l$ loops through all layers of the NN, $N_l$ is the number of receptive fields in the respective layer and $H^{(l)}_n$ refers to the RFAV entropy $H_n$ as defined in Equation 4 w.r.t. layer $l$. In the following this regularization is referred to as SparsityReg. Furthermore, $\lambda^{(l)}_{\mathrm{S}}$ can be used to toggle the sparsity regularizer of layer $l$.
As entropy reaches its maximum when all filters are activated in the same way, this regularizer can have the tendency to produce highly correlated filters. The trivial solution for $\mathcal{L}_{\mathrm{S}}$ to be minimized is therefore to generate identical filter responses. As identical filters reduce the predictive power of NNs, this effect has to be counterbalanced by preventing high filter correlations.
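The behavior of such a penalty, including its trivial minimum, can be illustrated with a small NumPy sketch. It assumes the penalty is the negative mean RFAV entropy of each layer, weighted by a per-layer factor; the function names and the example weights are hypothetical. In an actual training setup this term would be written in an automatic differentiation framework so that gradients can flow through it.

```python
import numpy as np

def mean_rfav_entropy(activations):
    """activations: (N, K) array holding the N RFAVs of one layer; entropy in bits."""
    z = activations - activations.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log2(p + 1e-12)).sum(axis=-1).mean())

def sparsity_penalty(layer_activations, lambdas):
    """Negative mean RFAV entropy, summed over layers with per-layer weights."""
    return -sum(lam * mean_rfav_entropy(a)
                for a, lam in zip(layer_activations, lambdas))

# Sparse responses: a single filter dominates each receptive field.
sparse = 50.0 * np.eye(8)[[0, 3, 5, 1, 7]]
# Identical responses: all 8 filters respond in the same way.
identical = np.ones((5, 8))

print(sparsity_penalty([sparse], [1.0]))      # close to 0
print(sparsity_penalty([identical], [1.0]))   # close to -3, the minimum for K = 8
print(sparsity_penalty([sparse], [0.0]))      # a weight of 0 switches the layer off
```

The second case shows why the regularizer needs a counterweight: identical filter responses minimize the penalty although they carry no additional information.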
3.3 Regularizing Filter Correlations
Considering a layer with $K$ filters (or neurons), each filter $F_u$ consists of $M$ weights $\mathbf{w}_u = (w_{u,1}, \dots, w_{u,M})$. If the weights of a neuron $F_u$ strongly correlate (or anti-correlate) with the weights of another neuron $F_v$, they create redundant feature maps. The correlation coefficient of these neurons is calculated as follows (Pearson correlation):
$$\rho_{u,v} = \frac{(\mathbf{w}_u - \bar{\mathbf{w}}_u) \cdot (\mathbf{w}_v - \bar{\mathbf{w}}_v)}{\lVert \mathbf{w}_u - \bar{\mathbf{w}}_u \rVert \, \lVert \mathbf{w}_v - \bar{\mathbf{w}}_v \rVert}, \tag{6}$$
with $\bar{\mathbf{w}}_u$ being a vector holding the mean of all $w_{u,m}$’s in each component. Calculating pairwise correlation coefficients of all filters in a layer results in a $K \times K$ correlation matrix. Correlations can be visualized using a 2D histogram (see Figure 17). Here, the $x$-axis corresponds to the epoch and the $y$-axis to the correlation coefficient. Due to symmetry it suffices to analyze the lower triangle of the correlation matrix. Color encodes how often a correlation coefficient appears in the lower triangle of the correlation matrix for the corresponding epoch. In this visualization one can observe how the correlation coefficients are distributed and how this distribution changes in the course of training.
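One column of such a 2D histogram can be computed as sketched below. This is a minimal NumPy illustration with hypothetical helper names; the filter layout `(kh, kw, channels, K)` and the number of bins are assumptions made for the example.

```python
import numpy as np

def filter_correlations(filters):
    """filters: (kh, kw, channels, K). Returns the K x K Pearson correlation matrix."""
    k = filters.shape[-1]
    flat = filters.reshape(-1, k).T          # one row of weights per filter
    return np.corrcoef(flat)

def lower_triangle(corr):
    """Correlation coefficients below the diagonal, i.e. each filter pair once."""
    rows, cols = np.tril_indices(corr.shape[0], k=-1)
    return corr[rows, cols]

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 3, 3, 16))
w[..., 1] = -2.0 * w[..., 0]                 # filter 1 is perfectly anti-correlated with filter 0

coeffs = lower_triangle(filter_correlations(w))
# Histogram of the coefficients for a single epoch; stacking one such
# histogram per epoch yields the 2D histogram described in the text.
counts, edges = np.histogram(coeffs, bins=20, range=(-1.0, 1.0))
print(counts)
```

Recomputing `counts` after every epoch and placing the results side by side gives the epoch-versus-coefficient view, with the counts mapped to color.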
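Since the Pearson correlation coefficient is a continuous differentiable function, a correlation regularizer can be added to the training loss in order to prevent high correlations (C2). The correlation regularizer is given as
$$\mathcal{L}_{\mathrm{C}} = \sum_{l} \lambda^{(l)}_{\mathrm{C}} \sum_{u > v} \left( \rho^{(l)}_{u,v} \right)^2, \tag{7}$$
where $\lambda^{(l)}_{\mathrm{C}}$ is used to control the strength of correlation regularization of layer $l$ and $\rho^{(l)}_{u,v}$ denotes the Pearson correlation coefficient between the weights of filters $u$ and $v$ of that layer. In the following this regularization is referred to as DecorrReg. The overall loss used for training can now be written as:
$$\mathcal{L} = \mathcal{L}_{\mathrm{inf}} + \mathcal{L}_{\mathrm{S}} + \mathcal{L}_{\mathrm{C}}, \tag{8}$$
where $\mathcal{L}_{\mathrm{inf}}$ denotes the loss used to measure inference quality (e.g. categorical cross-entropy loss), $\mathcal{L}_{\mathrm{S}}$ the SparsityReg penalty, and $\mathcal{L}_{\mathrm{C}}$ the DecorrReg penalty.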
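A decorrelation term and its combination with the other loss terms can be sketched as follows. The sketch assumes that the penalty is the sum of squared pairwise Pearson coefficients of a layer, which penalizes correlation and anti-correlation alike; this aggregation, the function names, and the example values are assumptions for illustration and not necessarily the exact form used in the experiments.

```python
import numpy as np

def decorrelation_penalty(filters, lam=1.0):
    """lam times the sum of squared pairwise Pearson coefficients of one layer.

    filters: (kh, kw, channels, K); each filter is flattened to a weight vector.
    """
    k = filters.shape[-1]
    corr = np.corrcoef(filters.reshape(-1, k).T)
    rows, cols = np.tril_indices(k, k=-1)    # each filter pair once
    return float(lam * np.sum(corr[rows, cols] ** 2))

def total_loss(inference_loss, sparsity_term, decorrelation_term):
    """Overall training loss as the sum of the three terms."""
    return inference_loss + sparsity_term + decorrelation_term

# Two uncorrelated 1x1x4 filters versus two identical ones.
uncorrelated = np.zeros((1, 1, 4, 2))
uncorrelated[0, 0, :, 0] = [1.0, -1.0, 1.0, -1.0]
uncorrelated[0, 0, :, 1] = [1.0, 1.0, -1.0, -1.0]
redundant = uncorrelated.copy()
redundant[..., 1] = redundant[..., 0]

print(decorrelation_penalty(uncorrelated))   # 0 for uncorrelated filters
print(decorrelation_penalty(redundant))      # 1 for one fully redundant pair
print(total_loss(1.0, -0.5, 0.25))
```

Setting `lam` to zero for a layer switches its decorrelation term off, which mirrors the per-layer control described in the text.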
First, we evaluate our method using a common baseline architecture depicted in Figure 2. The basic network comprises two conv-conv-pool blocks followed by two dense (or fully connected) layers. ReLU is used as activation across all layers. In the following, this network is referred to as vanillaNet. A second architecture, where every conv block is extended by a batch normalization layer, is referred to as normNet. A vanillaNet with added dropout in every conv block is called dropNet. Information on the different architecture sizes can be found in Table 1. If not stated otherwise, categorical cross-entropy loss and an SGD optimizer with a learning rate of 0.01 and no momentum are used for evaluation. The networks are trained for 100 epochs with a batch size of 64. During training, a total of 1,000 samples, which are part of neither the training nor the validation dataset, are used to calculate each layer’s RFAV sparsity. In total, three datasets (cifar-10, cifar-100, tiny-imagenet) are considered for training. Throughout all training runs, the per-layer weights of the correlation regularizer were set to one of two values to switch correlation regularization of the respective layer off or on. Sparsity regularization was controlled in the same way by setting the per-layer weight of the sparsity regularizer to one of two values. The value used for active regularization was chosen manually and offered favorable results throughout our experiments.
4.1 Monitoring Layer Sparsity
Initially, small, medium-sized, and xxl vanillaNets were trained (see Table 1). Figure 6 depicts the evolution of their validation losses. All vanillaNet trainings show a significant increase of the validation loss starting in early stages of training. To visualize the localized sparsity of activations, we encode the entropy of each proximate receptive field of a given layer in a heat map (see Figure (a)). Here, prominent features of the input image are recognizable by a lower entropy, which means that some filters of the respective layers generate a stronger activation for these features than the other filters. The visualized network starts overfitting from the 8th epoch (see Figure (c), vanillaNet-xxl). In the heat maps, starting from the 10th epoch, a decrease of entropy becomes apparent throughout the entire image in all layers. This change is analyzed in more detail in Figure 12. Here, the mean of 1,000 entropy heat maps is plotted for all epochs. Especially the entropies of conv3, conv4, and fc1 undergo a rapid change shortly before and after the onset of overfitting. We were able to observe this effect consistently throughout our trainings.
| sr1Net | vanillaNet + SparsityReg (conv1 - fc1) | 0.2663 | 0.4917 | 0.5325 | 0.3960 | 0.8079 | 0.2714 |
| sr2Net | vanillaNet + SparsityReg (conv3 - fc1) | 0.3460 | 0.5023 | 0.5387 | 0.4025 | 0.8042 | 0.3123 |
| sr3Net | sr2Net + DecorrReg | 0.3781 | 0.5219 | 0.5423 | 0.4079 | 0.8040 | 0.3052 |
When applying dropout, overfitting in dropNet-s and dropNet-m is prevented over the entire course of training (see Figure (a) and (b)). The dropNet-m and dropNet-xxl networks achieve a considerably lower loss than the corresponding vanillaNets. However, dropNet-xxl still overfits starting around epoch 30 (see Figure (c)). Again, the corresponding heat maps reveal a noticeable change in entropy around this point of training (see Figure (b)). Similar to vanillaNet, distinct features are visible due to lower entropy, but they are not as apparent as in the unregularized counterpart. The mean entropy plot also reveals a drop of entropy around the onset of overfitting, but not as strong as in vanillaNet (see Figure 12). Nevertheless, dropNet shows a significantly higher performance for the xxl network training compared to vanillaNet (see Table 2).
With the help of batch normalization, overfitting in the normNet-xxl network can be reduced. Nevertheless, the loss starts increasing slowly around the 20th epoch (see Figure (c)). In contrast to vanillaNet and dropNet, the sparsity heat maps of all layers in normNet remain stable except for the first epoch. The range of observed entropy values per layer remains small. The mean entropy hardly changes throughout the whole training (see Figure 12). Compared to vanillaNet and dropNet, normNet achieves the best classification accuracy for all network sizes (see Table 2).
We observed that the use of dropout and batch normalization has almost always led to a lower RFAV sparsity and higher accuracies than the unregularized trainings (C1). Layer fc2 is not considered for the entropy analysis. Here, the entropy decreases in all trainings due to the use of the categorical cross-entropy loss function. A low entropy in the last fully-connected layer of a NN implies that the number of labels that are considered for classification is decreasing. This is an expected observation and is therefore not considered to be related to the overfitting phenomenon here.
4.2 Targeted Sparsity Regularization
In subsection 4.1, we observed that overfitting can be associated with a strong change in the RFAVs’ mean entropy and therefore with a higher sparsity in the respective layers. To get a better understanding of how this affects our trainings, we apply SparsityReg to force the network to prevent sparsity and counteract overfitting (C2, C4).
First, the RFAV entropy of all layers is regularized (sr1Net). The validation loss of these runs decreases or converges throughout the entire training (see Figure 6). The sr1Net-m and sr1Net-xxl trainings achieve a smaller loss than their vanillaNet, dropNet, and normNet counterparts. The accuracies of these regularized trainings can be seen in Table 2. We can see that the larger the network, the higher the accuracy that can be obtained. While this seems to be perfectly in line with expectations, it should be noted that vanillaNet and dropNet cannot benefit from the increased size. The sr1Net-xxl network achieves a significantly higher accuracy than the unregularized (+0.1669), dropout (+0.1123), and batch normalized (+0.0266) equivalents.
Layer-wise analysis of entropy has revealed that especially the layers conv3, conv4, and fc1 show a strong change in entropy (see Figure 12 and C1). When applying a targeted regularization to only these layers (sr2Net), validation loss reaches the lowest values across all network sizes. This targeted regularization also ensures that the networks do not overfit. Here, the accuracy is improved in nearly all trainings compared to sr1Net (see Table 2).
In subsection 3.2, we pointed out that there is a risk of learning redundant filters when maximizing RFAV entropies. In order to improve the understanding of redundant filters, the pairwise correlation coefficients of the neuron weights of the respective layers are plotted in a histogram (see Figure 17 and C1). Comparing vanillaNet-xxl and sr1Net-xxl trainings, we observe that the layers’ correlations hardly differ. However, regardless of our regularization, the first layer stands out in all trainings by revealing a higher correlation among the individual neurons than in the other layers. Therefore, we used DecorrReg on the conv1 layer plus targeted SparsityReg (sr3Net) and achieved the highest accuracy throughout all experiments and network sizes on cifar-100 when trained for 100 epochs (see Table 2). Furthermore, as shown in Figure (c), DecorrReg can effectively remove the correlations in the conv1 layer.
The analysis of sparsity-regularized losses has shown that even large networks no longer exhibit an increase of validation loss during training (see Figure (c)). In another experiment, we tried to push sparsity-regularized networks into overfitting. Here, we trained the xxl networks for 200 epochs. It turned out that all runs, except for the sparsity-regularized trainings, show a deterioration of the validation losses. The validation losses of all sparsity-regularized trainings, on the other hand, decrease continuously and converge toward a stable value. Again, the sparsity-regularized runs achieved the lowest losses and the highest accuracies (up to 0.5513 by sr3Net).
In addition to the cifar-100 dataset, we also trained the xxl networks on cifar-10 and tiny-imagenet. The classification accuracies can be seen in Table 2. Again, applying sparsity regularization we are able to outperform vanillaNet (+0.1115 / +0.0925), dropNet (+0.0217 / +0.0677), and normNet (+0.0179 / +0.0462) on cifar-10 / tiny-imagenet.
Besides our baseline architectures, we also trained LeNet and AlexNet with and without sparsity regularization. The LeNet training does not show a significant improvement of the classification accuracy. Our results so far have shown that sparsity regularization does not have a positive effect on the training of small networks. Due to the small size of the LeNet network, this can also be observed here. In contrast, AlexNet showed a significant improvement of the accuracy throughout all datasets when applying sparsity regularization.
As an alternative to SGD we also evaluated our regularization strategy with Adam optimization (Table 2). In alignment with the reported tendency of this adaptive method [yaguchi2018adam], our visualization clearly indicates high sparsity (Figure A.1). We further demonstrate that sparsity regularization can cause strongly correlating features if applied to method-induced sparsity (Figure A.2). Interestingly, explicit decorrelation can counteract filter correlation but does not induce significant performance improvements during Adam optimization, which requires further investigation into the nature of sparsity and network capacity. However, as shown in Table 2, our targeted regularization strategy achieves comparable performance on cifar-100 using a much smaller architecture (Adam xxl vs. sr3Net-m) and outperforms Adam with batch normalization if equally sized networks are used (Adam xxl vs. sr3Net-xxl).
Throughout this paper we have studied sparsity-induced overfitting using novel visualizations. We conclude that sparse layer responses can be encoded by the entropy of the receptive field activation vector. When a network is overfitted, its filters have been trained in such a way that only a few neurons have learned particular features of their respective inputs. These overconfident output distributions can be directly measured using our method. With the help of visualizations of the proximate receptive field entropy we are able to identify the layers in which the overfitting takes place and when it happens (C1). Sparsity heat maps are able to encode sparsity locally in the layers’ input space. Plotting the mean of several heat maps over all training epochs, we were able to identify which layers create overconfident responses when a network is overfitted.
The analysis of proximate receptive field entropy has shown that networks trained with common regularizers such as dropout or batch normalization exhibit higher entropies than their unregularized counterparts for all layers. Instead of maximizing entropy implicitly by inserting additional layers into the network, we have developed a loss-based regularizer that explicitly maximizes the proximate receptive field entropy. With our novel regularization we are able to utilize the potential of large networks to learn cooperative features, pushing NNs to a higher generalization (C4). Applying our regularizers we are able to outperform otherwise identical dropout and batch-normalized networks (C3). As a result of our visualization, we are able to identify problematic layers in particular and thus apply regularizations in a targeted manner. By regularizing the NN only where it is needed, we obtain the highest accuracies throughout all our experiments. Using our regularizer, we are able to avoid overfitting for at least 200 epochs on datasets and network sizes for which the unregularized counterparts start overfitting before the 10th epoch.
One potential risk of sparsity regularization (i.e. rewarding high activations across all filters) is that all filters converge towards the most salient feature and thus maximize the entropy. Using the correlation between filters, we could however demonstrate that targeted sparsity regularization (i.e. reduced sparsity values) does not induce correlations across filters when trained with SGD. Nevertheless, we demonstrated that such side effects can be addressed by decorrelation regularization if necessary. Moreover, DecorrReg has successfully been used to decorrelate the first layer, which has further improved our network accuracy (cf. sr3Net). In summary, we showed that especially a combination of targeted sparsity and decorrelation regularizers prevents overfitting and outperforms all other configurations in our experiments (C2).
In future work, further experiments will be carried out, for example to find the best possible hyperparameters for the regularizers. The impact of different activation functions, optimizers and other common NN hyperparameters also needs more testing. Finally, we plan to test our methods on other tasks besides image classification.