## 1 Introduction

Deep neural networks have made remarkable progress in a variety of fields including computer vision, natural language processing, medical imaging, speech recognition, and computer graphics. Since many tasks in these fields require understanding high-level semantics, neural networks tend to become deeper and over-parametrized. Such deep and large networks are prone to overfitting, so proper regularization becomes a critical factor in improving their generalization performance. A popular type of regularization for deep neural networks is to inject random noise into the networks during training, e.g., applying a binary random mask to hidden activations (hinton2012improving) or weights (wan2013regularization), or skipping layers (huang2016deep) by forwarding activations via random identity connections. Due to its simplicity and effectiveness, stochastic regularization is widely used for training deep neural networks.

We propose a novel regularization technique referred to as StochasticBranch, which decomposes an ordinary linear layer into one with multiple stochastic branches. By factorizing the original weight matrix of the layer into a set of matrices with their own random binary masks, the stochastic branches effectively regularize a network during training. As a generalization of Dropout, its rich ensemble property with decomposed models allows training to investigate exponentially many distinct models and to explore diverse regions of the parameter space, resulting in better local optima. At inference time, the multiple branches collapse back into a single branch, thus requiring no additional complexity compared to normal linear layers. Figure 1 illustrates the comparison to Dropout (hinton2012improving) and DropConnect (wan2013regularization). An extensive set of experiments shows the effectiveness of the proposed technique as well as its wide applicability together with other popular regularizers, including Dropout (hinton2012improving) and Batch Normalization (Ioffe:2015:BNA:3045118.3045167).

## 2 Related Work

Regularization is a common and essential technique to combat overfitting in training. While one common form of the technique is to penalize the weight tensor with a constant (NIPS1991_563; srebro2005rank), a popular method for deep neural networks is to inject random noise during training. The most well-known example is Dropout (hinton2012improving), which stochastically zeroes out activations of neural networks to avoid co-adaptation of neurons. Several follow-ups of Dropout have been proposed. wan2013regularization propose a generalization of Dropout, called DropConnect, which zeroes out weights rather than activations. ba2013adaptive propose adaptive Dropout, where the drop rate of each activation is determined by a binary belief network overlaid on the neural network. li2016improved introduce an efficient evolutionary Dropout that computes sampling probabilities on-the-fly from a mini-batch of examples. bulo2016dropout develop Dropout distillation for better approximating the average predictor without sacrificing the computational efficiency of standard Dropout. kang2016shakeout propose Shakeout, which randomly enhances or reverses the contribution of each unit to the next layer, resulting in a combination of L1 and L2 regularization. gal2016dropout introduce a theoretical framework for casting Dropout as Bayesian inference to approximate the uncertainty of neural networks. zhai2018adaptive propose a framework to adaptively adjust the drop rates based on a Rademacher complexity bound.

Other types of regularizers using stochastic noise have also been proposed to further improve generalization. zeiler2013stochastic introduce stochastic pooling, which randomly picks an activation within each pooling region according to a multinomial distribution. smith2016gradual develop a dynamically growing neural network, Dropin, that gradually decreases the probability of skipping layers to train a network from shallow layers to deeper layers. huang2016deep propose to train a very deep ResNet by stochastically skipping ResNet blocks via identity connections. ma2016dropout introduce expectation-linear Dropout, which regularizes the training objective with a measured inference gap. noh2017regularizing reduce the gap between a marginal likelihood and a training objective with stochastic noise injection.

Recent methods also leverage multiple branches to improve generalization performance. lee2015m propose multi-head ensemble learning that shares early convolutional layers. han2017branchout introduce a regularized ensemble method for single object tracking, which branches out intermediate layers to learn different target representations. Unlike our method, these approaches aim at proposing a specific type of architecture for ensemble learning rather than a generic regularization method for neural networks. goodfellow2013maxout propose the Maxout activation function, which merges the outputs of branches of a single layer by max-pooling. Maxout only helps a model with a stochastic regularizer better approximate ensemble results by model averaging, whereas our method itself is an effective stochastic regularizer.

## 3 Stochastic Branch

This section presents the details of StochasticBranch and relates the method to Dropout (hinton2012improving; srivastava2014dropout). For ease of explanation, we only discuss a fully-connected (fc) layer in this section. Note, however, that StochasticBranch is applicable to any type of linear operator, including convolution.

### 3.1 Stochastic Branch Layer

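Let us consider an fc layer with input $\mathbf{x} \in \mathbb{R}^{D}$ and output $\mathbf{y} \in \mathbb{R}^{M}$:

$$\mathbf{y} = \sigma(\mathbf{W}\mathbf{x}), \tag{1}$$

where $\mathbf{W} \in \mathbb{R}^{M \times D}$ is a weight matrix and $\sigma(\cdot)$ is an element-wise nonlinear activation function such as ReLU and tanh.

We decompose the weight matrix into a sum of $K$ matrices, i.e., $\mathbf{W} = \sum_{k=1}^{K} \mathbf{W}^{(k)}$, $\mathbf{W}^{(k)} \in \mathbb{R}^{M \times D}$, such that the pre-activation for each output unit is given by the sum of $K$ linear projections:

$$y_i = \sigma\Big(\sum_{k=1}^{K} \mathbf{w}_i^{(k)} \mathbf{x}\Big), \tag{2}$$

where $\mathbf{w}_i^{(k)}$ denotes the $i$-th row of $\mathbf{W}^{(k)}$. This network structure can be interpreted as integrating multiple branches, where an output node is computed from the sum of $K$ branches.

In order to make the branches stochastic, we now introduce a random binary variable, $m_i^{(k)} \sim \mathrm{Bernoulli}(p)$, to each branch in Eq. (2), thus resulting in

$$y_i = \sigma\Big(\sum_{k=1}^{K} m_i^{(k)} \mathbf{w}_i^{(k)} \mathbf{x}\Big). \tag{3}$$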

As shown in Figure 1, each output node of the stochastic branch layer is obtained using $K$ stochastic branch units. All input nodes of the layer are connected to the $K$ branch units, producing $K$ distinct values. The random binary masks then zero out a subset of the values, and the corresponding output activation is computed from the sum of the masked values.
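The forward pass described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: it assumes independent Bernoulli masks drawn per output unit and per branch, uses ReLU as the activation, and the names `sb_forward_train`, `branches`, and `keep_prob` are hypothetical.

```python
import random

def relu(v):
    return max(0.0, v)

def sb_forward_train(x, branches, keep_prob, rng):
    """One stochastic forward pass of a StochasticBranch fc layer (sketch).

    x         : list of D input values
    branches  : K weight matrices, each a list of M rows with D entries
    keep_prob : probability that a branch mask equals 1 (assumed name)
    rng       : random.Random instance used to sample the masks
    """
    num_out = len(branches[0])
    y, masks = [], []
    for i in range(num_out):
        pre, unit_masks = 0.0, []
        for w_k in branches:
            # Independent binary mask for this (output unit, branch) pair.
            m = 1 if rng.random() < keep_prob else 0
            unit_masks.append(m)
            if m:
                pre += sum(w * xj for w, xj in zip(w_k[i], x))
        y.append(relu(pre))       # activation of the summed, masked branches
        masks.append(unit_masks)  # kept so a backward pass could reuse them
    return y, masks

rng = random.Random(0)
D, M, K = 4, 3, 5  # toy sizes chosen for illustration
branches = [[[rng.uniform(-1, 1) for _ in range(D)] for _ in range(M)]
            for _ in range(K)]
x = [rng.uniform(-1, 1) for _ in range(D)]
y, masks = sb_forward_train(x, branches, keep_prob=0.5, rng=rng)
```

Returning the sampled masks alongside the output reflects the description that the same masks are used in both the forward and backward passes.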

Any linear layer parametrized by a weight matrix $\mathbf{W}$ can be transformed into a StochasticBranch layer with $K$ branches, e.g., by choosing branch weight matrices $\mathbf{W}^{(k)}$ whose sum is $\mathbf{W}$. In training a neural network, the stochastic branch layers act as a regularizer. A set of random binary masks is sampled for each training example and used in both the forward and backward passes. When the weight matrices in the layer are updated, e.g., via stochastic gradient descent (SGD), only the branches that were active in the forward pass are updated. Note that this stochastic training procedure induces the branches to be distinct from each other. We present this effect in Section 4.

At inference time, instead of sampling random masks, we compute the output by taking the expectation of the pre-activations:

$$y_i = \sigma\Big(\mathbb{E}\Big[\sum_{k=1}^{K} m_i^{(k)} \mathbf{w}_i^{(k)} \mathbf{x}\Big]\Big) = \sigma(\bar{\mathbf{w}}_i \mathbf{x}), \tag{4}$$

where $m_i^{(k)}$ and $\mathbf{w}_i^{(k)}$ are the binary mask and the weight vector of the $k$-th branch of output unit $i$, $\bar{\mathbf{w}}_i = p \sum_{k=1}^{K} \mathbf{w}_i^{(k)}$, and $p = \mathbb{E}[m_i^{(k)}]$ is the probability that a mask equals 1.

Note that the multiple units branched for training now merge back into a single unit; there is no additional computational cost for inference compared to an ordinary fc layer, as shown in Figure 1d.
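The merge at inference time can be checked numerically for a single output unit. The sketch below assumes independent Bernoulli masks with keep probability `keep_prob`, so that the expected pre-activation equals the projection by `keep_prob` times the sum of the branch weights; the function names and the toy numbers are illustrative only.

```python
import itertools

def merged_weights(branches_i, keep_prob):
    """Collapse the K weight vectors of one output unit into a single vector."""
    dim = len(branches_i[0])
    return [keep_prob * sum(w_k[j] for w_k in branches_i) for j in range(dim)]

def expected_preactivation(x, branches_i, keep_prob):
    """Exact expectation over all 2^K binary mask patterns of one unit."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(branches_i)):
        prob, pre = 1.0, 0.0
        for m, w_k in zip(mask, branches_i):
            prob *= keep_prob if m else 1.0 - keep_prob
            if m:
                pre += sum(w * xj for w, xj in zip(w_k, x))
        total += prob * pre
    return total

x = [0.5, -1.0, 2.0]
branches_i = [[0.2, 0.1, -0.3], [-0.4, 0.6, 0.1], [0.3, -0.2, 0.5]]
p = 0.5

w_bar = merged_weights(branches_i, p)
merged = sum(w * xj for w, xj in zip(w_bar, x))     # single-branch projection
expected = expected_preactivation(x, branches_i, p)  # average over all masks
```

Both quantities agree up to floating-point error, which is why a single merged weight vector suffices at inference.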

### 3.2 Generalized Dropout

StochasticBranch is a generalization of Dropout (hinton2012improving; srivastava2014dropout) and DropConnect (wan2013regularization), as shown below by imposing additional constraints on the StochasticBranch formulation.

If we impose a group masking constraint that an identical mask is used for all branches with the same output unit $i$, i.e., $m_i^{(k)} = m_i$ for all $k$, the multiple branches of Eq. (3) collapse into a single branch with a random mask variable:

$$y_i = \sigma\Big(m_i \sum_{k=1}^{K} \mathbf{w}_i^{(k)} \mathbf{x}\Big) = \sigma(m_i \mathbf{w}_i \mathbf{x}), \tag{5}$$

where $\mathbf{w}_i = \sum_{k=1}^{K} \mathbf{w}_i^{(k)}$ is the merged weight vector of output unit $i$. For any activation function with $\sigma(0) = 0$, such as ReLU and tanh, we can move the binary mask variable out of the activation function so that it becomes equivalent to the Dropout regularizer (hinton2012improving; srivastava2014dropout):

$$y_i = m_i \, \sigma(\mathbf{w}_i \mathbf{x}). \tag{6}$$

This shows that Dropout is StochasticBranch under the constraint of group masks. Dropout either removes or retains an entire activation, whereas StochasticBranch rejects parts of the activation by masking out a subset of the decomposed weights in multiple branches.
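The reduction to Dropout under group masking can be verified on a toy unit. This sketch assumes ReLU as the activation and uses made-up weights; `sb_unit` and `dropout_unit` are hypothetical helper names.

```python
def relu(v):
    return max(0.0, v)

def dot(w, x):
    return sum(a * b for a, b in zip(w, x))

def sb_unit(x, branches_i, masks):
    """Output of one StochasticBranch unit for given per-branch masks."""
    return relu(sum(m * dot(w_k, x) for m, w_k in zip(masks, branches_i)))

def dropout_unit(x, w_i, m):
    """Output of a Dropout unit: the mask multiplies the activation."""
    return m * relu(dot(w_i, x))

x = [1.0, -2.0, 0.5]
branches_i = [[0.3, 0.1, 0.2], [0.5, -0.4, 0.0], [-0.1, 0.2, 0.6]]
w_i = [sum(col) for col in zip(*branches_i)]  # merged weight vector

checks = []
for m in (0, 1):
    shared = [m] * len(branches_i)  # group masking: one mask for all branches
    checks.append((sb_unit(x, branches_i, shared), dropout_unit(x, w_i, m)))
```

Each pair in `checks` holds the StochasticBranch output and the Dropout output for the same mask value; the two entries coincide for both `m = 0` and `m = 1`.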

If we impose a one-to-one branching constraint that each input is paired with exactly one branch, each input-output connection involves a mask variable. For example, consider $K = D$ branches, where $D$ is the number of inputs, with $w_{ij}^{(k)} = w_{ij}$ for $j = k$ and $w_{ij}^{(k)} = 0$ for $j \neq k$. Then, StochasticBranch of Eq. (3) reduces to

$$y_i = \sigma\Big(\sum_{k=1}^{D} m_i^{(k)} w_{ik} x_k\Big), \tag{7}$$

which has exactly the same form as another generalized Dropout called DropConnect (wan2013regularization):

$$y_i = \sigma\Big(\sum_{j=1}^{D} m_{ij} w_{ij} x_j\Big). \tag{8}$$

This in turn shows that DropConnect is StochasticBranch under the constraint of one-to-one branches. DropConnect (wan2013regularization) samples each input-output connection, whereas StochasticBranch maintains different weights across multiple branches and samples a branch, rather than a connection, for each unit.
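The one-to-one branching constraint can likewise be checked on a toy unit. The sketch assumes ReLU and one branch per input dimension, with every branch holding a single nonzero weight; all names and numbers are illustrative.

```python
import itertools

def relu(v):
    return max(0.0, v)

def sb_unit(x, branches_i, masks):
    """One StochasticBranch unit: masked sum of branch projections, then ReLU."""
    pre = 0.0
    for m, w_k in zip(masks, branches_i):
        pre += m * sum(w * xj for w, xj in zip(w_k, x))
    return relu(pre)

def dropconnect_unit(x, w_i, masks):
    """One DropConnect unit: an independent mask on every connection."""
    return relu(sum(m * w * xj for m, w, xj in zip(masks, w_i, x)))

def one_to_one_branches(w_i):
    """Branch k keeps only the k-th weight of w_i; every other entry is zero."""
    dim = len(w_i)
    return [[w_i[j] if j == k else 0.0 for j in range(dim)] for k in range(dim)]

x = [1.0, -2.0, 0.5]
w_i = [0.7, -0.1, 0.8]
branches_i = one_to_one_branches(w_i)

# Compare both units on all 2^D mask patterns.
max_gap = max(
    abs(sb_unit(x, branches_i, masks) - dropconnect_unit(x, w_i, masks))
    for masks in itertools.product([0, 1], repeat=len(w_i))
)
```

A `max_gap` of zero (up to rounding) over all mask patterns indicates that, under this constraint, masking a branch is the same as masking one connection.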

As a generalization of Dropout and DropConnect, StochasticBranch plays the role of a strong stochastic regularizer, as will be discussed in the following subsection. Note, however, that a combination with other methods is also possible and may become a better regularizer as an extension. For example, if the random mask $m_i$ of Dropout is added to StochasticBranch, the combination can be represented as

$$y_i = m_i \, \sigma\Big(\sum_{k=1}^{K} m_i^{(k)} \mathbf{w}_i^{(k)} \mathbf{x}\Big), \tag{9}$$

which is a further generalization of StochasticBranch with the additional group masking of Dropout. Dropout is designed to mitigate the problem of co-adaptation, in which neurons excessively rely on other neurons (hinton2012improving; srivastava2014dropout). By turning off neurons with probability $1 - p$, where $p$ is the probability that a mask equals 1, Dropout encourages neurons to co-adapt less with each other and induces the layer to produce sparse activations. In contrast, StochasticBranch activates neurons via combinations of branch outputs, and induces the layer to produce diverse activations across examples. Moreover, its turn-off chance, which is $(1 - p)^{K}$ for $K$ branches, is significantly smaller than that of Dropout. Considering these differences, the two techniques may complement each other in practice. We demonstrate such combination effects in Section 4.
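To give a sense of scale for the turn-off chance, the short calculation below uses the default setting reported in Section 4 (drop rate 0.5 and 10 branches). It assumes that branch masks are independent and that a unit counts as turned off only when all of its branches are masked; the variable names are illustrative.

```python
keep_prob = 0.5    # probability that a mask equals 1 (drop rate 0.5)
num_branches = 10  # number of branches per unit

# Dropout: a unit is silenced whenever its single mask is 0.
dropout_off = 1.0 - keep_prob

# StochasticBranch: a unit receives no input only if all branch masks are 0
# (assumes independent masks).
sb_off = (1.0 - keep_prob) ** num_branches

print(f"Dropout: {dropout_off:.4f}, StochasticBranch: {sb_off:.6f}")
```

Under these assumptions a Dropout unit is switched off half of the time, while a StochasticBranch unit is fully switched off roughly once in a thousand samples.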

### 3.3 Discussion

#### Ensemble learning.

From an ensemble learning point of view (hinton2012improving; baldi2014dropout), Dropout and its generalizations can be interpreted as learning an exponentially large ensemble of networks, where each model of the ensemble is given training examples in different orders via mini-batching during training. Each method approximates an ensemble from a different class of networks. Previous methods such as Dropout and DropConnect draw such models only from within the original neural network. For example, Dropout approximates geometric ensemble averaging of $2^{N}$ models (baldi2014dropout), where $N$ denotes the number of units with Dropout. In contrast, StochasticBranch draws models from a richer class of networks augmented by branching, so that the models in the ensemble are parameterized with different weights from distinct combinations of branches. It thus approximates an ensemble of $2^{NK}$ models, where $N$ and $K$ are the numbers of StochasticBranch units and branches, respectively. This rich ensemble with decomposed models allows training to explore different regions of the parameter space and may find a diverse set of local optima.
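The difference in ensemble size can be made concrete by counting binary mask configurations. The sketch below assumes one mask per unit for Dropout and one mask per unit and branch for StochasticBranch, and plugs in 1,024 units and 10 branches as in the experimental setup; the helper name is hypothetical.

```python
def num_mask_configs(num_units, num_branches=1):
    """Number of distinct binary mask patterns over units x branches."""
    return 2 ** (num_units * num_branches)

N, K = 1024, 10  # layer width and branch count used for illustration

dropout_models = num_mask_configs(N)
sb_models = num_mask_configs(N, K)

# The counts are astronomically large, so compare them on a log2 scale.
# For an exact power of two, bit_length() - 1 equals the exponent.
log2_dropout = dropout_models.bit_length() - 1
log2_sb = sb_models.bit_length() - 1
```

On the log2 scale the StochasticBranch count is `K` times larger, i.e. the number of sub-models is raised to the `K`-th power.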

#### Data augmentation.

Dropout can also be seen as an implicit form of sophisticated data augmentation that increases training data coverage (konda2015dropout), comparable to, in the case of images, translations, rotations, and scaling. Noise induced by a random mask has an effect similar to augmenting the data with a set of such transformations; in the case of a single-layer model, a forward pass with a random mask can be matched by a forward pass on a transformed input. Here, a model in the Dropout ensemble can correspond to a transformation for data augmentation. From this perspective, StochasticBranch creates a larger set of fine-grained transformations by decomposing the transformations of Dropout, resulting in a richer data augmentation effect.

#### Batch normalization.

It is widely known that Dropout and Batch Normalization do not work well when used together. Recent research (li2018understanding) shows that a cause of this disharmony is a variance shift of activations in Dropout between training and testing, and suggests reducing the variance shift by placing Batch Normalization before the random noise injection of Dropout. For the same reason, when Batch Normalization is used together with StochasticBranch, we place Batch Normalization before the noise injection of StochasticBranch. Compared to Dropout, we observe that StochasticBranch has a lower variance shift, and is thus more compatible with Batch Normalization. The corresponding experiments are reported in Section 4.1.

#### Maxout.

Maxout networks (goodfellow2013maxout) also have a similar branching structure, where a layer merges the outputs of multiple branches. Despite its apparent similarity to StochasticBranch, its goal and structure are significantly different from ours. Maxout is a non-linear activation function based on max-pooling that is designed to improve the ensemble approximation obtained by model averaging with stochastic regularizers. It can thus also be used together with the proposed stochastic regularizer. Moreover, all branches of Maxout networks need to maintain their parameters even at inference due to the max-pooling operation, whereas the multiple branches of StochasticBranch merge back into a single unit at inference.

#### Time Complexity.

StochasticBranch (SB) introduces additional time complexity only within the few layers where it is applied. Therefore, the increase in overall complexity is not significant in most cases. For instance, the use of SB on ResNet-110 in our experiments (refer to Table 4) increases the time complexity by only 3.62% (2.48G vs. 2.57G FLOPs). Note that SB increases time complexity only during training, while the inference complexity remains the same as that of ordinary networks.

| Method | MNIST MLP3 | MNIST MLP5 | MNIST CNN | FMNIST MLP3 | FMNIST MLP5 | FMNIST CNN |
|---|---|---|---|---|---|---|
| Baseline | 1.79 / 0.06 | 1.95 / 0.11 | 0.88 / 0.02 | 10.08 / 0.16 | 10.12 / 0.16 | 8.34 / 0.15 |
| +DO | 1.46 / 0.03 | 1.72 / 0.07 | 0.68 / 0.03 | 9.44 / 0.05 | 9.96 / 0.19 | 7.65 / 0.17 |
| +BN | 1.54 / 0.06 | 1.58 / 0.06 | 0.74 / 0.05 | 10.04 / 0.20 | 9.68 / 0.16 | 9.65 / 0.13 |
| +DO+BN | 1.42 / 0.05 | 1.42 / 0.04 | 0.74 / 0.05 | 9.37 / 0.12 | 9.55 / 0.18 | 9.05 / 0.13 |
| +SB | 1.47 / 0.03 | 1.55 / 0.05 | 0.73 / 0.04 | 9.60 / 0.14 | 9.64 / 0.03 | 8.04 / 0.17 |
| +SB+DO | 1.30 / 0.02 | 1.34 / 0.03 | 0.63 / 0.03 | 9.18 / 0.08 | 9.52 / 0.04 | 7.66 / 0.06 |
| +SB+BN | 1.25 / 0.03 | 1.19 / 0.02 | 0.45 / 0.03 | 9.25 / 0.09 | 9.19 / 0.16 | 7.36 / 0.20 |

Table 1: Averaged classification error [%] and standard deviation (error / stdev) on MNIST and FMNIST over five runs.

## 4 Experiments

We evaluate StochasticBranch on multiple image classification benchmarks. In the experiments, our method (SB) is compared with two of the most popular regularization methods: Dropout (DO) and Batch Normalization (BN). We set the drop rates of SB and DO to 0.5, and use 10 branches ($K = 10$) unless specified otherwise.

### 4.1 MNIST and Fashion-MNIST

We first conduct a set of experiments on MNIST (lecun1998gradient) and Fashion-MNIST (FMNIST) (xiao2017fashion). Both benchmarks consist of $28 \times 28$ grayscale images with 10 class labels. MNIST classes represent digits between 0 and 9, whereas FMNIST classes correspond to fashion items. In both benchmarks, the training and test sets contain 60,000 and 10,000 examples, respectively. We test our method on multi-layer perceptrons (MLP) and convolutional neural networks (CNN). We implement two MLPs with 3 and 5 layers (MLP3 and MLP5), where each intermediate layer has 1,024 hidden units. The CNN consists of two convolution (conv) layers and two fc layers. The output channels of the first and second conv layers are set to 32 and 64, respectively, while the number of hidden activations of the first fc layer is 1,024. In every network, we use the ReLU function for intermediate activations. We train the three models without any regularization technique as our baselines to reveal the improvements obtained by the regularization methods. SB is applied to every layer of MLP3 and CNN. For MLP5, we apply the technique to the first, third, and fifth layers. BN is applied to the pre-activations of every layer, and DO is placed between every pair of successive fc layers.

Table 1 summarizes the classification errors of the models with different regularization techniques on MNIST and FMNIST, where all results are obtained by averaging the errors of five independent runs. The proposed method reduces the baseline errors in all settings, by margins comparable to or often larger than those of the other techniques. Notably, our regularizer further reduces the errors when combined with other regularizers, achieving the largest error reductions in all settings. Interestingly, Table 1 reveals that the error reduction of SB with BN is significantly larger than that of DO with BN. This is because SB has a lower variance shift than DO. To quantify the variance shifts of SB and DO, we measure the variance shift ratio (VSR), which is given by

(10) |

where and represent the activation variances for each neuron at training and test time, respectively.

Note that a ratio of 0 is the ideal case, where there is no variance shift. Figure 2 presents the VSR distribution of the hidden activations of MLP3 on the MNIST test set. The figure clearly shows that the activations of SB have lower variance shifts than those of DO. On average, SB has a VSR of 0.21 while DO has a VSR of 0.62.

In addition, we conduct experiments with Maxout to show that it is distinct from and complementary to StochasticBranch as discussed in Section 3.3. We test DO and SB models of MLP3 with Maxout on MNIST by replacing the non-linearity function of the stochastic layers. The use of Maxout further reduces the classification error by 0.04% and 0.14% for DO and SB, respectively. Note that the additional error reduction by Maxout is larger with SB than with DO.

Figure 3 shows the effects of SB, DO, and SB+DO in terms of activation statistics on the MNIST test set. The statistics are measured on the activations at the second fc layer of MLP3. We present two histograms for each method: (1) a histogram of the mean activation of each neuron and (2) a histogram of the number of active neurons for each image. A neuron with zero mean activation is a dead neuron, and an image with a small number of active neurons corresponds to a sparse representation. Note that if the weights of a neuron converge to a point where its pre-activation is severely biased toward negative values, the neuron may become near-dead and almost never activate (maas2013rectifier). The comparison between the baseline (Figure 3a) and DO (Figure 3b) shows that DO reduces near-dead neurons as well as excessively active neurons, and also induces sparse activations. Similar effects have been observed by srivastava2014dropout. Interestingly, Figure 3c shows that SB is significantly more effective than DO in reducing near-dead neurons. Since the process of stochastic branching generates exponentially many weight combinations, some combinations without a negative bias may allow dead neurons to activate again by receiving gradients during training. Note that dead neurons with ReLU cannot become active again in either the baseline or DO, since they never receive a gradient signal during training. As a side effect, the input features become denser, as indicated by the large number of active neurons per image. Due to these two different aspects (DO encouraging sparse representations and SB focusing more on reducing dead neurons), SB and DO may be complementary to each other. Figure 3d demonstrates that the combination SB+DO lowers the number of active neurons per image and retains sparse activations while reducing near-dead neurons.
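The two statistics behind these histograms can be computed as sketched below. The activation values are made up for illustration and are not the paper's data; the function name and the `eps` threshold for treating a value as inactive are assumptions.

```python
def activation_stats(acts, eps=0.0):
    """Summarize post-ReLU activations.

    acts : list of per-image activation lists (num_images x num_neurons)
    eps  : values at or below this threshold count as inactive (assumed)
    """
    n, h = len(acts), len(acts[0])
    # (1) mean activation of each neuron over all images
    mean_per_neuron = [sum(a[j] for a in acts) / n for j in range(h)]
    # (2) number of active neurons for each image
    active_per_image = [sum(1 for v in a if v > eps) for a in acts]
    # neurons whose mean activation is (near) zero are treated as dead
    dead = [j for j, mu in enumerate(mean_per_neuron) if mu <= eps]
    return mean_per_neuron, active_per_image, dead

# Toy activations: 3 images, 4 neurons; neuron 0 never fires.
acts = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.8],
    [0.0, 2.0, 0.5, 0.1],
]
mean_per_neuron, active_per_image, dead = activation_stats(acts)
```

Histogramming `mean_per_neuron` and `active_per_image` would give the two kinds of plots described above, with dead neurons piling up at zero mean activation.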

Figure 4 shows the average cosine similarity of weight vectors between different branches, measured over epochs. For this experiment, we train three instances of MLP5 with different numbers of hidden units (64, 256, and 1,024), applying SB to all layers. We observe significantly low cosine similarities between the weights of branches in Figure 4a; this confirms that the branches of SB learn distinctive patterns and are capable of exploring various regions of the parameter space. As we decrease the number of hidden units in the layers, as in Figures 4b and 4c, the branches become more similar because it is difficult to decompose a pattern with a small number of output units into diverse yet useful patterns. The redundant patterns across branches in these small networks lower the regularization performance, resulting in relatively smaller gains. For example, the accuracy gain (0.17%) of the model with 64 activations is smaller than those (0.64% and 0.35%) of the models with 1,024 and 256 activations. This implies that SB may perform better in layers with more output neurons. Another observation is that the average similarity between branches tends to decrease when SB is applied to deeper layers. An exception is the last layer (fc5), whose branches have particularly high similarity in all three settings, since the last layer has only 10 output units corresponding to the number of classes.

### 4.2 CIFAR-10 and CIFAR-100

To validate the proposed method in more realistic settings, we conduct further experiments on CIFAR-10 and CIFAR-100 (krizhevsky2009learning), which contain images with 10 and 100 object classes, respectively. The sizes of the training and test splits are 50,000 and 10,000. For the experiments on CIFAR-10, we build a custom CNN that consists of two conv, one maxpool, two conv, one maxpool, and two fc layers. In this experiment, SB and DO are applied to the fc layers. BN is applied to all conv layers as well as fc layers, which produces a larger performance gain for BN. Table 2 summarizes the classification errors of the models on CIFAR-10, which show a tendency similar to that on MNIST and FMNIST. To study the effect of varying the number of branches, we also train another set of networks whose first fc layers are replaced by SB layers with 2, 4, or 8 branches.

We plot the curves of the training loss and the test accuracy in Figures 5a and 5b, respectively. While SB injects stochastic noise that diversifies hidden units, we notice that more branches help the network converge better than SB with fewer branches. More branches create a richer ensemble during training and enable better exploration of the parameter space, resulting in a higher chance of converging faster.

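| Method | Error / stdev |
|---|---|
| Baseline | 14.53 / 0.14 |
| +DO | 13.92 / 0.23 |
| +BN | 11.99 / 0.22 |
| +DO+BN | 12.00 / 0.08 |
| +SB | 13.84 / 0.14 |
| +SB+DO | 13.82 / 0.24 |
| +SB+BN | 11.83 / 0.29 |

Table 2: Classification error [%] and standard deviation on CIFAR-10.

| | Baseline | +DO | +SB | +SB+DO |
|---|---|---|---|---|
| Error | 60.68 | 59.63 | 56.14 | 55.24 |

Table 3: Classification error [%] on CIFAR-10 with 1% of the training examples.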

To show the richer data augmentation effect of SB discussed in Section 3.3, we conduct additional experiments on CIFAR-10 with fewer training examples. We train models with 1% of the training examples, randomly sampled for each class. Table 3 shows the classification errors in this setting. SB reduces the error more significantly than DO does: the relative error reduction of DO is 1.7%, whereas those of SB and SB+DO are 7.5% and 9.0%, respectively. These results imply that SB has a stronger data augmentation effect than DO.
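The relative error reductions quoted above follow directly from the Table 3 errors. The short check below recomputes them as the error decrease divided by the baseline error; the helper name `relative_reduction` is illustrative.

```python
# Classification errors [%] from Table 3 (CIFAR-10, 1% of training data).
errors = {"Baseline": 60.68, "+DO": 59.63, "+SB": 56.14, "+SB+DO": 55.24}

def relative_reduction(baseline, err):
    """Relative error reduction in percent with respect to the baseline."""
    return 100.0 * (baseline - err) / baseline

reductions = {
    name: round(relative_reduction(errors["Baseline"], err), 1)
    for name, err in errors.items()
    if name != "Baseline"
}
print(reductions)
```

The rounded values reproduce the 1.7%, 7.5%, and 9.0% figures stated in the text.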

For CIFAR-100, we train two advanced convolutional neural network architectures, ResNet-110 (he2016deep) and MobileNetV2 (sandler2018mobilenetv2). For ResNet-110, both DO and SB are applied to the third conv layer of the last bottleneck block. For MobileNetV2, both DO and SB are applied to the depthwise conv layer of the last two inverted residual blocks. The results are summarized in Table 4. While DO performs similarly to the baseline, SB outperforms both in the two networks. These results show the effectiveness of SB in conv layers. It is worth noting that SB is more effective for layers with a large number of channels or a large kernel size. This is why SB is applied to the conv layers in the last blocks in these experiments.

| Model | Error / stdev |
|---|---|
| ResNet-110 (Baseline) | 24.29 / 0.19 |
| ResNet-110 +DO | 24.28 / 0.97 |
| ResNet-110 +SB | 23.41 / 0.34 |
| MobileNetV2 (Baseline) | 27.82 / 0.39 |
| MobileNetV2 +DO | 27.63 / 0.25 |
| MobileNetV2 +SB | 27.16 / 0.32 |

Table 4: Classification error [%] and standard deviation on CIFAR-100.

### 4.3 PASCAL VOC

In the following experiments, we apply SB to pretrained networks. To apply SB to a pretrained layer, we initialize the weight matrix of each branch with a scaled copy of the pretrained weight matrix. Note that although the weight matrices are all initialized by scaling the same pretrained weights, the stochastic branch layer is still capable of regularizing the network, since each branch observes different training samples and, in consequence, the weight vectors of different branches become dissimilar to each other.

We conduct experiments on multiple tasks using the PASCAL VOC (Everingham10) benchmarks: multi-label object classification, object detection, and semantic segmentation. As evaluation metrics, mean average precision (mAP) is used for classification and detection, while mean intersection over union (mIoU) is used for semantic segmentation. For all the tasks, ImageNet-pretrained VGG-16 (DBLP:journals/corr/SimonyanZ14a) networks are used as our backbones; SB is applied to fc7, and DO is placed after the non-linear activations of fc6 and fc7.

| Method | Validation mAP / stdev | Test mAP |
|---|---|---|
| Baseline | 84.47 / 0.09 | 84.27 |
| +DO | 85.54 / 0.11 | 85.38 |
| +SB | 86.19 / 0.02 | 86.04 |
| +SB+DO | 86.50 / 0.03 | 86.36 |

Table 5: mAP [%] of multi-label classification on PASCAL VOC 2012.

For object classification, we fine-tune the pretrained network on the training set after replacing the classification layer (fc8) to match the number of classes in PASCAL VOC; the trained models are evaluated on both the validation and test sets. Table 5 shows the mAPs of the multi-label classification models on PASCAL VOC 2012. As in the other experiments, where we train networks from scratch, our method effectively improves the baseline performance by regularizing the network, even though the weight vectors of the branches are identically initialized.

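| | Baseline | +DO | +SB |
|---|---|---|---|
| mAP | 70.17 | 70.36 | 71.41 |

Table 6: mAP [%] of object detection on the PASCAL VOC 2007 test set.

| | Baseline | +DO | +SB |
|---|---|---|---|
| mIoU | 62.72 | 62.84 | 63.34 |

Table 7: mIoU [%] of semantic segmentation on PASCAL VOC.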

For the object detection and semantic segmentation tasks, which require more complex and structured prediction, we use Faster-RCNN (ren2015faster) and FCN-32S (long2015fully), respectively, as the original architectures are equipped with DO as a regularizer. Table 6 shows the mAPs of the object detection models, which are trained using the training and validation sets of PASCAL VOC 2007 and evaluated on the test set. SB effectively regularizes the object detection network and outperforms both the baseline and DO. Note that the Faster-RCNN architecture not only predicts class labels but also regresses bounding boxes. Table 7 presents the results of the PASCAL VOC semantic segmentation task. It shows that SB is also effective for semantic segmentation, resulting in an improved mIoU.

## 5 Conclusion

In this paper, we proposed a novel regularization technique called StochasticBranch, which is a generalization of Dropout. During training, the proposed method factorizes a single layer into multiple branches and sums up the outputs of the branches after random masking with binary noise. At inference time, the multiple branches are merged back into a single layer. We showed that the proposed method successfully regularizes various neural networks on multiple benchmarks and achieves significant improvements over the baseline performances. Moreover, a set of experimental results shows that our method can be applied together with other commonly used regularizers, such as Dropout or Batch Normalization, achieving even further performance improvements.

This work is supported by Samsung Advanced Institute of Technology and by Basic Science Research Program (NRF-2017R1E1A1A01077999) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT.
