Block-Cyclic Stochastic Coordinate Descent for Deep Neural Networks

11/20/2017 · by Kensuke Nakamura, et al.

We present a stochastic first-order optimization algorithm, named BCSC, that adds a cyclic constraint to stochastic block-coordinate descent. It uses different subsets of the data to update different subsets of the parameters, thus limiting the detrimental effect of outliers in the training set. Empirical tests on benchmark datasets show that our algorithm outperforms state-of-the-art optimization methods in both accuracy and convergence speed. The improvements are consistent across different architectures, and can be combined with other training techniques and regularization methods.


1 Introduction

The two workhorses of Deep Learning, responsible for much of the remarkable progress in traditionally challenging Computer Vision problems, are SGD (stochastic gradient descent) and GSD (graduate student descent). The latter has produced an ever-growing body of neural network architectures, starting from basic shallow convolutional ones [22] to non-Markov ones [9, 10, 2], and still growing deeper [14, 3, 15]. The former has been the subject of intense scrutiny, despite its simplicity, both in terms of unraveling the mysteries behind its unreasonable effectiveness, as well as fostering a cottage industry of modifications and improvements. Our work is squarely in the latter vein.

SGD [29, 31, 45] is a simple variant of classical gradient descent where the stochasticity comes from employing a random subset of the measurements (mini-batch) to compute the gradient at each step of descent. This reduces the per-iteration computational cost from linear in the total number of training examples, which is usually in the tens of thousands to millions, to linear in the mini-batch size. It also has implicit regularization effects, making it suited to highly non-convex loss functions, such as those entailed in training deep networks for classification.

The entire process is sensitive to outliers in the data, such as erroneous labels in the training set, since each mini-batch affects the update of the entire set of parameters. The mini-batch size is usually small, so the relative impact of an outlier can be large compared to its effect on the full-batch gradient. A number of techniques, such as adaptive learning rates, regularization, and gradient descent variants designed to weaken the impact of outliers, aim to normalize the variation across mini-batches, but they cannot handle training outliers explicitly. Stochastic methods such as stochastic randomized block-coordinate descent (SBC) [40, 47, 39], on the other hand, trade off accuracy against robustness to noise. Our objective is to develop an accurate optimization algorithm for deep learning that is not subject to such a strict tradeoff.

In the proposed algorithm, named BCSC, we leverage randomized methods based on stochastic randomized block-coordinate descent [40, 47, 39], but introduce a cyclic constraint in the selection of both measurements and model parameters, so that different mini-batches of data are used to update different subsets of the unknown parameters. We perform numerical experiments on popular benchmark datasets using neural networks ranging from shallow models to recently developed deeper ones, and demonstrate that our algorithm consistently outperforms state-of-the-art optimization techniques for all the network models under consideration.

In Sect. 2 we place our contribution in context; we formalize the problem of interest and review relevant algorithms in Sect. 3. The technical details of the proposed algorithm are presented in Sect. 4. In Sect. 5 we report experiments comparing with the state of the art, and we discuss limitations and potential extensions in Sect. 6.

2 Related Work

Adaptive step size methods    In SGD, the current parameter estimate is updated by subtracting the (approximate) gradient multiplied by a factor, the learning rate. Since SGD does not converge to a point estimate, the learning rate is usually decreased monotonically over the iterations to reduce fluctuation. While it is still common in practice to modulate the learning rate based on a fixed schedule, several adaptive learning-rate schemes have been studied to automate the scheduling [7]. Some of the best known methods include AdaGrad [5] and AdaDelta [33, 43], which reduce the learning rate by accumulating the gradient of the loss function globally [5] or parameter-wise [33, 43]. For adaptive scheduling of the learning rate, interpolation with a random sampling technique has been used to compute the step size [37, 4]. In an effort to reduce the variance of the gradients, adaptively changing the mini-batch size has been introduced in [4]. Our approach is not directly aimed at acceleration, and can be used in conjunction with an adaptive step-size selection. However, as we will show empirically, it outperforms adaptive step size methods in terms of both convergence speed and overall accuracy.

Regularization methods    There are a number of ways to impose regularity on the model in order to improve generalization, among which are data augmentation [1, 34], batch normalization [17, 12], and dropout [11, 35, 39, 16]. One can also incorporate regularization in the network architecture, including pooling [20], maxout [8], or skip connections [24, 15]. Regularization can also be made explicit in the objective function through classical weight decay [26, 21], lasso [38], group lasso [42], or Hessian penalties [28]. Our method acts in concert with, rather than as an alternative to, other forms of regularization.

Variants of gradient descent    Stochastic average gradient (SAG) [30] calculates the gradient on a randomly chosen subset of the examples and averages the stored gradients to estimate the full gradient. Stochastic variance reduced gradient (SVRG) [18] addresses the inherent variance of the gradient, namely the difference between the gradient of a mini-batch and the full gradient. Both SAG [30] and SVRG [18] approximate the standard gradient and are thus subject to its limitations in large-scale optimization problems with non-convex objective functions. A variety of first-order stochastic algorithms have been developed for parallel computation [44] or proximal operators [6]. Just as SGD randomly selects subsets of the data, randomized block coordinate descent (BCD) [25, 27] applies stochasticity to the selection of the subsets of parameters to update. Such a technique has been used to train neural networks with parallel computation in [23].

Our algorithm is closely related to stochastic (randomized) block coordinate descent (SBC) [40, 47, 39], which randomly chooses both parameters and examples in the optimization procedure. However, when the number of parameters is in the millions, there is a tradeoff between accuracy and robustness to outliers. To mitigate this issue, we introduce a cyclic procedure such that each parameter is updated only once with each sample within an epoch. This differs from classical cyclic coordinate descent [32], since we consider mini-batches of both the data and the parameters. Furthermore, our goal is not to approximate the full gradient, as in [40, 47]; instead, we aim to modify the stochastic procedure to achieve faster convergence and better regularization, hence better accuracy.

3 Preliminaries

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of training data, where $x_i$ is an input, typically an image, and $y_i$ is an output, typically a label. Let $h(\cdot\,; w)$ be a prediction function with associated model parameter $w \in \mathbb{R}^d$, where $d$ is the dimension of the feature space. The discrepancy between the predicted output $h(x_i; w)$ and the true output $y_i$ is measured by a loss function $\ell(h(x_i; w), y_i)$ for each training sample $(x_i, y_i)$. The goal is to find optimal parameters $w^*$, typically obtained by minimizing the empirical loss on the dataset:

$w^* = \arg\min_{w} F(w)$,   (1)
$F(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w)$,   (2)

where $f_i(w) = \ell(h(x_i; w), y_i)$ denotes the loss incurred by the parameter $w$ with sample $(x_i, y_i)$.
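To make the notation concrete, here is a minimal NumPy sketch of the empirical loss in Eq. (2), assuming a user-supplied per-sample loss `loss_fn(w, x, y)` (a hypothetical placeholder, not part of the paper's code):

```python
import numpy as np

def empirical_loss(w, data, loss_fn):
    """F(w) = (1/n) * sum_i f_i(w), averaged over all n samples (x_i, y_i)."""
    return np.mean([loss_fn(w, x, y) for x, y in data])
```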

3.1 Stochastic Gradient Descent

The minimization of $F$ in Eq. (1), assuming each $f_i$ is differentiable, involves the computation of the gradient over a large number of training data. Stochastic gradient descent (SGD) [29, 31, 45] achieves the dual objective of reducing the computational load and improving generalization due to its implicit regularization effect. The stochastic process of sampling subsets of data at each iteration leads to regularization in the estimation of the gradient of the expected loss. Let $\mathcal{I} = \{1, \dots, n\}$ be the index set of the training data and $B^t \subset \mathcal{I}$ be a random subset, called a mini-batch. SGD updates an initial (typically random) estimate $w^0$ of the weights recursively at each iteration $t$ via

$w^{t+1} = w^t - \eta \, \frac{1}{|B^t|} \sum_{i \in B^t} \nabla f_i(w^t)$,   (3)

where $\eta > 0$ is a positive scalar, called the learning rate. Manual scheduling of the learning rate is typical, although adaptive schemes based on the gradient or the iteration count are also considered [5, 43, 33].
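As a point of reference, a minimal sketch of the mini-batch update in Eq. (3), assuming a user-supplied per-sample gradient `grad_fn(w, x, y)` (a hypothetical helper, not the paper's code):

```python
import numpy as np

def sgd_step(w, batch, grad_fn, lr):
    """One SGD step, Eq. (3): w <- w - lr * (mean gradient over the mini-batch)."""
    g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
    return w - lr * g
```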

3.2 Random Coordinate Descent

In the optimization of deep neural networks, the loss function must be computed with respect to a large number of parameters, in addition to dealing with a large number of data. Randomized block coordinate descent (BCD) [25, 27] selects a subset $J \subset \{1, \dots, d\}$ of the index set of the feature space uniformly at random and computes the gradient restricted to the selected coordinates, using the loss over the whole dataset. Only the selected parameters are then updated based on the restricted gradient $\nabla_J F(w)$. The BCD algorithm proceeds at each iteration via

$w^{t+1}_J = w^t_J - \eta \, \nabla_J F(w^t)$, $\quad w^{t+1}_{J^c} = w^t_{J^c}$,   (4)

where $w_J$ denotes the coordinates of $w$ indexed by $J$ and $J^c$ is the complement of $J$.
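A sketch of one such block update under the same assumptions as before, where `block` holds the indices of the randomly selected coordinates:

```python
import numpy as np

def bcd_step(w, data, grad_fn, lr, block):
    """One BCD step, Eq. (4): only the coordinates in `block` are updated."""
    g = np.mean([grad_fn(w, x, y) for x, y in data], axis=0)  # full-data gradient
    w = w.copy()
    w[block] -= lr * g[block]   # restrict the update to the selected block
    return w
```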

3.3 Stochastic Random Coordinate Descent

It is natural to combine the random mini-batches of data used by SGD in Sect. 3.1 with the random subsets of coordinates used by BCD in Sect. 3.2. The resulting algorithm, called stochastic random block coordinate descent (SBC) [40, 47, 39], selects a subset $B^t \subset \mathcal{I}$ of the training data uniformly at random and computes the gradient of the objective function with respect to a randomly chosen subset $J$ of the parameters:

$w^{t+1}_J = w^t_J - \eta \, \frac{1}{|B^t|} \sum_{i \in B^t} \nabla_J f_i(w^t)$, $\quad w^{t+1}_{J^c} = w^t_{J^c}$.   (5)
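A corresponding SBC sketch, drawing the mini-batch and the coordinate block independently at random (both samplers here are illustrative, not the authors' implementation):

```python
import numpy as np

def sbc_step(w, data, grad_fn, lr, batch_size, block_size, rng):
    """One SBC step, Eq. (5): random mini-batch and random coordinate block."""
    batch = [data[i] for i in rng.choice(len(data), size=batch_size, replace=False)]
    block = rng.choice(w.size, size=block_size, replace=False)
    g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
    w = w.copy()
    w[block] -= lr * g[block]
    return w
```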

While it is reasonable to expect that the regularizing effects of mini-batching would compound the computational benefits of block descent, it is less obvious that coupling the random selections, so that different sets of data are used to update different sets of parameters, would be beneficial. Enter our proposed algorithm, which we derive in the following section.

4 Block-Cyclic Stochastic Coordinate Descent

The essential motivation of our algorithm is to combine the two types of algorithms, SGD and BCD: SGD feeds random subsets of the data into the computation of the gradient, while BCD selects random subsets of the parameters to update. Combining the two stochastic processes allows different subsets of data to be used to update different subsets of parameters. We also introduce a constraint ensuring that, within each epoch, every training example is used to update each block of model parameters and all the parameters are updated. We call the resulting algorithm block-cyclic stochastic coordinate descent (BCSC); it entails a doubly-stochastic process that randomizes both the mini-batches of data and the parameter blocks, subject to the cyclic block structure.

4.1 Cyclic Block Structure

We model the block structure of the coordinates by decomposing the feature space $\mathbb{R}^d$ into $C$ subspaces. Let $P$ be a random permutation of the $d \times d$ identity matrix and $P = [P_1, P_2, \dots, P_C]$ be its decomposition into a set of column blocks $P_c$ of size $d \times d_c$, where $\sum_{c=1}^{C} d_c = d$. Each column block $P_c$ of the permutation matrix acts as a random selection matrix: $P_c^\top w$ extracts $d_c$ randomly chosen elements of a feature vector $w$, while $P_c P_c^\top w$ retains those elements and sets all the others to zero. Any feature vector $w$ can therefore be uniquely written as $w = \sum_{c=1}^{C} P_c P_c^\top w$. Iterating the selection of the parameter block $P_c^\top w$ over $c = 1, \dots, C$ considers all the elements of $w$ exhaustively, with the blocks being mutually disjoint.
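In code, the cyclic block structure amounts to a random partition of the $d$ coordinate indices into $C$ disjoint groups; a minimal sketch with a small usage check:

```python
import numpy as np

def make_blocks(d, C, rng):
    """Randomly permute the d coordinate indices and split them into C
    mutually disjoint blocks that jointly cover every coordinate once."""
    return np.array_split(rng.permutation(d), C)

# Any w in R^d is recovered exactly from its C blocks (w = sum_c P_c P_c^T w).
rng = np.random.default_rng(0)
blocks = make_blocks(d=10, C=3, rng=rng)
w = rng.normal(size=10)
w_rec = np.zeros_like(w)
for b in blocks:
    w_rec[b] = w[b]
assert np.allclose(w, w_rec)
```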

4.2 Dual Cyclic Stochastic Process

In the optimization procedure, one can consider a single stochastic process in the selection of the mini-batch, combined with a given cyclic block structure for a random grouping of the elements of the parameter vector $w$. We refer to this as the random block coordinate descent (RBC) algorithm: the same mini-batch is used to update all the blocks of parameters sequentially. With a fixed mini-batch $B^t$ at iteration $t$, RBC iterates over each block $c = 1, \dots, C$ as follows:

$\nabla f_{B^t}(w) = \frac{1}{|B^t|} \sum_{i \in B^t} \nabla f_i(w)$,   (6)
$P_c^\top w^{t+1} = P_c^\top w^t - \eta \, P_c^\top \nabla f_{B^t}(w^t)$, $\quad c = 1, \dots, C$,   (7)

where $\nabla f_{B^t}$ denotes the gradient of the objective function based on the mini-batch $B^t$, and it is assumed that $\bigcup_t B^t = \mathcal{I}$ covers the whole index set of the training data with mutually disjoint mini-batches, $B^t \cap B^{t'} = \emptyset$ if $t \neq t'$. However, this approach is ineffective in the presence of outliers, which may corrupt the estimation of the gradient for the entire set of parameters.

The algorithm we propose, called block-cyclic stochastic coordinate descent (BCSC), is instead based on a dual cyclic stochastic process in the selection of both the mini-batch from the training set and the coordinate block from the parameters. It is designed so that each random block of parameters $P_c^\top w$ is updated using an independently selected mini-batch $B_c^t$, and so that, within an epoch, every element of the training data ends up being used to update all the parameters. Our BCSC algorithm proceeds with the dual stochastic process, selecting both $B_c^t$ and $P_c$, as follows:

$\nabla f_{B_c^t}(w) = \frac{1}{|B_c^t|} \sum_{i \in B_c^t} \nabla f_i(w)$,   (8)
$P_c^\top w^{t+1} = P_c^\top w^t - \eta \, P_c^\top \nabla f_{B_c^t}(w^t)$, $\quad c = 1, \dots, C$,   (9)

subject to $\bigcup_t B_c^t = \mathcal{I}$ being the whole index set of the training data and $B_c^t \cap B_c^{t'} = \emptyset$ if $t \neq t'$ at fixed $c$. Note that the permutation $P$ is randomly regenerated at every epoch, and the index sets of the training data are also randomly reshuffled at every epoch.
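A sketch of one BCSC sweep over the $C$ blocks under the same assumptions as before; each block consumes its own mini-batch drawn without replacement from a per-block pool of remaining sample indices (the bookkeeping names are illustrative):

```python
import numpy as np

def bcsc_sweep(w, data, grad_fn, lr, blocks, pools, batch_size, rng):
    """One BCSC iteration, Eqs. (8)-(9): each parameter block c is updated with
    an independent mini-batch drawn without replacement from its own pool, so
    within an epoch every sample touches every block exactly once."""
    w = w.copy()
    for c, block in enumerate(blocks):
        pos = rng.choice(len(pools[c]), size=batch_size, replace=False)
        batch = [data[i] for i in pools[c][pos]]
        pools[c] = np.delete(pools[c], pos)       # exclude the used samples
        g = np.mean([grad_fn(w, x, y) for x, y in batch], axis=0)
        w[block] -= lr * g[block]                 # update parameter block c only
    return w
```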

4.3 Our Proposed Algorithm

The central idea of the algorithm we propose is to use different subsets of data (mini-batches) to update different subsets (blocks) of parameters. This is illustrated in Fig. 1, where the same mini-batch is used to update all the parameters in SGD (left) and different mini-batches are used to update different blocks of parameters in BCSC (right).

More details are given in Algorithm 1, where C is the given number of partitions of the parameters. At each epoch, the algorithm initializes the index sets I_1, ..., I_C of the training data and the permutation matrix P. Then, different mini-batches are taken from the data to update different blocks of parameters, followed by an update of the index set I_c that excludes the used mini-batch.

Figure 1: Illustration of the proposed algorithm. SGD (left) simultaneously updates all the parameters using the same mini-batch. Our BCSC (right), on the other hand, uses different mini-batches to update different blocks of parameters.
for all epochs do
     I_1, ..., I_C: index sets of the data, obtained by random shuffling.
     P = [P_1, ..., P_C]: random permutation matrix.
    for all t: index of the mini-batch do
       for all c: index of the parameter block do
          Take mini-batch B_c^t from I_c.
          Take parameter block P_c^T w using P_c.
          Compute the gradient of the loss with respect to P_c^T w using B_c^t.
          Update the parameter block P_c^T w using Eq. (9).
          Update the index set: I_c <- I_c \ B_c^t.
       end for
    end for
end for
Algorithm 1 Block-Cyclic Stochastic Coordinate Descent
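Putting the pieces together, here is a self-contained NumPy sketch of Algorithm 1 under the assumptions above (generic `grad_fn`, per-block index pools reshuffled every epoch); it illustrates the control flow and is not the authors' released code:

```python
import numpy as np

def bcsc_train(w, data, grad_fn, lr, C, batch_size, epochs, seed=0):
    """Block-Cyclic Stochastic Coordinate Descent (Algorithm 1), sketched."""
    rng = np.random.default_rng(seed)
    n, d = len(data), w.size
    w = w.copy()
    for _ in range(epochs):
        blocks = np.array_split(rng.permutation(d), C)        # random permutation matrix P
        pools = [list(rng.permutation(n)) for _ in range(C)]  # shuffled index sets I_c
        for _ in range(n // batch_size):                      # mini-batches per epoch
            for c, block in enumerate(blocks):
                idx, pools[c] = pools[c][:batch_size], pools[c][batch_size:]
                g = np.mean([grad_fn(w, *data[i]) for i in idx], axis=0)
                w[block] -= lr * g[block]                     # update block c only
    return w
```

With C = 1 the inner loop has a single block covering all coordinates and a single mini-batch per step, so the sketch reduces to standard mini-batch SGD.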

5 Experimental results

We provide quantitative and qualitative evaluations of our algorithm in comparison to state-of-the-art optimization algorithms on the MNIST [22], Cifar10 [19], and Cifar100 [19] datasets. MNIST consists of 60,000 training and 10,000 test images with 10 labels. Cifar10 and Cifar100 are more challenging datasets that consist of 50,000 training and 10,000 test images with 10 and 100 labels, respectively.

In order to provide a better understanding of the effectiveness and robustness of our algorithm, we consider a variety of neural networks ranging from simple to deep and wide models: LeNet4 [22], VGG19 [34], GoogLeNet [36], ResNet18 [9, 10], ResNeXt29 [41], MobileNet [13], ShuffleNet [46], SENet18 [14], DPN92 [3], and DenseConv [15].

The performance of our BCSC algorithm is compared with other state-of-the-art optimization algorithms, including AdaGrad (AG) [5], AdaDelta (AD) [43, 33], stochastic gradient descent (SGD), stochastic randomized block-coordinate descent (SBC) [40, 47, 39], and the randomized block-coordinate descent (RBC) of Sect. 4.2. For each experiment we provide learning curves consisting of the training loss, the test loss, and the test accuracy. In addition, the standard deviation of the training loss computed over the mini-batches within each epoch is also presented. The learning curves are shown in color (training loss in blue, test loss in red, and test accuracy in green) and are plotted in log scale; the loss is displayed with respect to the left vertical axis and the accuracy with respect to the right vertical axis. As a quantitative comparison, the test accuracy is computed over the first half of the epochs, the last half of the epochs, all the epochs, and at the final epoch.

For the hyper-parameters associated with the optimization algorithms, we use customary values for the mini-batch size, momentum, weight decay, and total number of epochs. For the learning rate, we employ a manual schedule that is known to be effective, decreasing the rate in steps over three ranges of epochs, so that a staircase effect appears in the learning curves (note that the horizontal axis for the epochs is in log scale). These values are applied to all the algorithms throughout the experiments unless mentioned otherwise. For a fair comparison, the same values are used for the common hyper-parameters among the algorithms.
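The exact values are not reproduced here; purely as an illustration, a staircase schedule of the kind described can be written as follows (the breakpoints and rates are placeholders, not the paper's settings):

```python
def staircase_lr(epoch, base_lr=0.1, drops=((150, 0.01), (250, 0.001))):
    """Piecewise-constant (staircase) schedule; breakpoints and rates here
    are illustrative placeholders, not the paper's settings."""
    lr = base_lr
    for boundary, rate in drops:
        if epoch >= boundary:
            lr = rate
    return lr
```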

Figure 2: Effect of the number of parameter blocks. Learning curves for (a) LeNet4 [22], (b) VGG19 [34], and (c) ResNet18 [9, 10], optimized by our BCSC with a varying number of parameter blocks C on Cifar10. BCSC with C = 1 is equivalent to SGD.

Effectiveness of the number of parameter blocks    We first design an experiment to validate the behavior of our algorithm as a function of the number of parameter blocks C. We compare BCSC with varying C against SGD (C = 1) using the LeNet4, VGG19, and ResNet18 models on the Cifar10 dataset. In this experiment, we use a fixed learning rate across all the epochs to better isolate the behavior of BCSC relative to SGD. The learning curves obtained from the different network models with a varying number of parameter blocks are presented in Fig. 2, where it is clearly observed that both the training and test losses (blue and red curves) improve significantly with increasing C, in particular for the deeper models where the number of parameters is large, resulting in a notable improvement of the test accuracy (green curve). It is also observed that convergence becomes faster and the variation of the training loss decreases earlier as C increases.

Training Outlier (%) 0 5 10 15
SGD () 57.91 53.43 52.80 51.79
BCSC () 67.80 66.31 64.45 64.98
BCSC () 71.72 71.32 70.73 70.12
BCSC () 73.88 73.64 73.56 73.21
Table 1: Test accuracy with varying degree of outliers (%)

Robustness to training outliers    To demonstrate the robustness of BCSC to training outliers in comparison to SGD, we compute the test accuracy of BCSC with several numbers of parameter blocks C in the presence of arbitrarily corrupted training data, with the rate of outliers varying over 0% (original), 5%, 10%, and 15%, using the LeNet4 model on the Cifar10 dataset. Table 1 presents the test accuracy averaged over the epochs and clearly demonstrates the relative benefit of BCSC, which grows with C, in the presence of training outliers. SGD is shown to be more sensitive to outliers, whereas BCSC is essentially unaffected up to the percentage of outliers tested.

Figure 3: Comparison of BCSC with SGD under an adaptive learning rate (AdaGrad). Learning curves obtained by SGD with AdaGrad and BCSC with AdaGrad on Cifar10 for (a) VGG19 [34] and (b) ResNet18 [9, 10]; C = 8 is used for BCSC.

Results with adaptive learning rate    In order to demonstrate that the benefits of BCSC are not diminished when using an adaptive learning rate, we compare BCSC with C = 8 and SGD when both are combined with the learning rate given by AdaGrad [5], using the VGG19 and ResNet18 models on the Cifar10 dataset. The learning curves are presented in Fig. 3, where the training loss and the test loss are noticeably improved with BCSC in comparison to SGD. The results indicate that BCSC outperforms SGD consistently, regardless of whether the adaptive learning-rate scheme of AdaGrad is applied.

Dropout rate (%) 0 5 10 15
SGD () 98.98 99.04 99.09 99.04
BCSC () 99.00 99.10 99.14 99.13
BCSC () 99.04 99.10 99.17 99.16
BCSC () 99.02 99.13 99.17 99.19
Table 2: Test accuracy with varying degree of dropout (%)

Results with dropout    In this experiment, we demonstrate that the regularization effects of BCSC persist when additional regularization is employed, for instance dropout. We employ a simple network model, LeNet4, in which the effect of dropout is easy to observe, on the MNIST dataset. Table 2 summarizes the test accuracy, averaged over the epochs, of BCSC with different numbers of parameter blocks in comparison to SGD at dropout rates of 0%, 5%, 10%, and 15%. BCSC outperforms SGD regardless of dropout, although the benefit of a larger number of parameter blocks is weaker here, due to the relatively small number of parameters in the model.

Figure 4: Evaluation on Cifar10. Learning curves for (a) LeNet4 [22], (b) VGG19 [34], and (c) ResNet18 [9, 10], optimized by stochastic gradient descent (SGD), stochastic randomized block-coordinate descent (SBC), randomized block-coordinate descent (RBC), and our algorithm (BCSC).
(a) First half epochs
         AG     AD     SGD    SBC    RBC    BCSC
LeNet4   52.79  65.89  46.98  64.32  54.34  70.49
VGG19    82.42  86.60  75.28  85.40  80.09  89.22
ResNet18 58.43  87.58  79.43  87.01  81.34  90.64
(b) Last half epochs
         AG     AD     SGD    SBC    RBC    BCSC
LeNet4   62.15  69.37  70.33  72.90  72.74  77.17
VGG19    89.07  91.55  92.33  92.58  92.57  93.70
ResNet18 78.10  91.98  93.89  93.76  93.74  95.12
(c) All epochs
         AG     AD     SGD    SBC    RBC    BCSC
LeNet4   57.47  67.63  58.66  68.61  63.54  73.83
VGG19    85.75  89.07  83.81  88.99  86.33  91.46
ResNet18 68.26  89.78  86.66  90.38  87.54  92.88
(d) Final epoch
         AG     AD     SGD    SBC    RBC    BCSC
LeNet4   62.12  69.64  73.24  73.80  75.25  77.61
VGG19    89.16  91.86  93.62  92.69  93.58  94.09
ResNet18 82.20  92.40  94.90  93.87  94.34  95.19
Table 3: Test accuracy for Cifar10 (%)
Figure 5: Deep models on Cifar10. Learning curves for (a) GoogLeNet [36] and (b) DPN92 [3], optimized by SGD and BCSC.

Results with deep models on Cifar10    We compare the performance of our BCSC against other state-of-the-art optimization methods, including AdaGrad (AG), AdaDelta (AD), stochastic gradient descent (SGD), stochastic randomized block-coordinate descent (SBC), and randomized block-coordinate descent (RBC). In this comparative analysis, we provide learning curves and accuracy tables for LeNet4, VGG19, ResNet18, GoogLeNet, and DPN92 on the Cifar10 dataset. The experimental results for BCSC are obtained with a single representative number of parameter blocks; results with other values of C agree with the effectiveness and robustness of the number of parameter blocks demonstrated in the previous experiments. The learning curves obtained by SGD, SBC, RBC, and BCSC for LeNet4, VGG19, and ResNet18 are presented in Fig. 4, where BCSC outperforms all the other algorithms in accuracy, stability, and convergence speed regardless of the network model. For a more extensive comparison, learning curves obtained by SGD and BCSC for the deeper models GoogLeNet and DPN92 on Cifar10 are presented in Fig. 5, where the performance of BCSC is significantly better than that of SGD. In addition to the comparison by learning curves, we provide a quantitative evaluation of the test accuracy computed over (a) the first half of the epochs, (b) the last half of the epochs, (c) all the epochs, and (d) at the final epoch in Tables 3 and 4. These results indicate that our BCSC algorithm outperforms all five state-of-the-art optimization methods irrespective of the architecture and depth of the models, not only in final accuracy but also in convergence speed.

Figure 6: Deep models on Cifar100. Learning curves for (a) MobileNet [13], (b) ShuffleNet [46], (c) VGG19 [34], (d) ResNet18 [9, 10], (e) SENet18 [14], (f) DenseConv [15], (g) ResNeXt29 [41], (h) GoogLeNet [36], and (i) DPN92 [3], optimized by SGD and BCSC.

Results with deep models on Cifar100    We further validate the performance of BCSC in comparison with SGD on the more challenging Cifar100 dataset, using LeNet4, MobileNet, ShuffleNet, VGG19, ResNet18, SENet18, DenseConv, ResNeXt29, GoogLeNet, and DPN92. In this experiment, we use the smallest number of parameter blocks due to the heavy computational cost of optimizing deep models on Cifar100. The learning curves obtained by SGD and BCSC for the different network models are presented in Fig. 6, where better, faster, and more stable results are observed with BCSC irrespective of the architecture, even though the minimum partition number is used. The quantitative evaluation of BCSC in comparison to SGD is provided in Table 5, where the test accuracy is computed over (a) the first half of the epochs, (b) the last half of the epochs, (c) all the epochs, and (d) at the final epoch. These experiments further confirm that BCSC outperforms standard SGD irrespective of the architecture in accuracy, stability, and convergence speed, and that the benefit of the algorithm appears even with the minimum number of groupings of the model parameters. The performance of BCSC consistently improves with an increasing number of parameter blocks, in particular for deep network models where the number of parameters is large.

6 Discussion

We have presented a first-order optimization algorithm for large-scale problems in deep learning, where both the number of training data and the number of model parameters are large and the training data may be polluted with outliers. The proposed algorithm, named BCSC, is based on the intuition that using different subsets of the data to update different subsets of the parameters is beneficial in handling outliers. Experimental results on state-of-the-art network models with standard datasets indicate that the proposed dual stochastic process with the block-cyclic constraint leads to improved robustness to outliers during training. In addition, it has been empirically demonstrated that our algorithm outperforms the state of the art in optimizing a number of recent deep models in terms of accuracy, stability, and convergence speed. Our algorithm can be naturally extended to distributed and parallel computation, so as to mitigate the added computational cost of the dual stochastic process. Further variants of the sampling and cyclic schemes, as well as hyper-parameter tuning and the determination of optimal parameter-block sizes, are also subjects of future work.

Epoch (a) First half (b) Last half (c) All (d) Final
SGD BCSC SGD BCSC SGD BCSC SGD BCSC
GoogLeNet 77.66 89.97 94.04 95.56 85.85 92.77 94.78 95.61
DPN92 80.53 91.15 94.50 95.24 87.51 93.20 95.38 95.46
Table 4: Test accuracy of deep models for Cifar10 (%)
Epoch (a) First half (b) Last half (c) All (d) Final
SGD BCSC SGD BCSC SGD BCSC SGD BCSC
LeNet4 15.16 22.94 36.74 39.75 25.95 31.35 41.26 42.35
MobileNet 39.36 51.47 63.80 67.79 51.58 59.63 65.21 68.88
ShuffleNet 43.57 53.75 67.77 69.57 55.67 61.66 69.12 70.36
VGG19 38.47 51.48 69.48 72.38 53.98 61.93 72.14 74.19
ResNet18 52.14 60.21 74.35 76.79 63.24 68.50 76.08 77.28
SENet18 52.90 60.09 75.38 76.98 64.14 68.53 77.28 77.30
DenseConv 51.91 60.02 75.68 76.84 63.79 68.43 77.22 77.46
ResNeXt29 52.65 62.39 77.52 78.97 65.09 70.68 78.88 79.37
GoogLeNet 51.14 60.97 78.33 79.68 64.73 70.33 79.51 80.20
DPN92 54.58 63.88 78.30 79.48 66.44 71.68 79.98 80.23
Table 5: Test accuracy for Cifar100 (%)

References