A Proximal Block Coordinate Descent Algorithm for Deep Neural Network Training

03/24/2018 · by Tim Tsz-Kit Lau et al.

Training deep neural networks (DNNs) efficiently is a challenge due to the associated highly nonconvex optimization. The backpropagation (backprop) algorithm has long been the most widely used algorithm for computing gradients of DNN parameters, and it is used together with gradient descent-type algorithms for this optimization task. Recent work has shown empirically that block coordinate descent (BCD) type methods are efficient for training DNNs. In view of this, we propose a novel algorithm based on the BCD method for training DNNs and provide global convergence results built upon the powerful framework of the Kurdyka-Łojasiewicz (KL) property. Numerical experiments on standard datasets demonstrate its competitive efficiency against standard optimizers with backprop.


1 Introduction

Backprop (Rumelhart et al., 1986) is the most prevalent approach for computing gradients in DNN training. It is mostly used together with gradient descent-type algorithms, notably the classical stochastic gradient descent (SGD) (Robbins & Monro, 1951) and its variants such as Adam (Kingma & Ba, 2015). Regardless of the optimizer chosen for network training, backprop suffers from vanishing gradients (Bengio et al., 1994; Pascanu et al., 2013; Goodfellow et al., 2016). Various methods have been proposed to alleviate this issue, e.g., rectified linear units (ReLUs) (Nair & Hinton, 2010) and Long Short-Term Memory (Hochreiter & Schmidhuber, 1997), but these methods cannot completely resolve this problem inherent to backprop. One viable alternative to backprop with gradient-based optimizers that avoids vanishing gradients is to adopt gradient-free methods, including (but not limited to) the alternating direction method of multipliers (ADMM) (Taylor et al., 2016; Zhang et al., 2016) and the block coordinate descent (BCD) method (Carreira-Perpiñán & Wang, 2014; Zhang & Brand, 2017). The main idea of ADMM and BCD is to decompose the highly coupled and composite DNN training objective into several loosely coupled and almost separable simple subproblems. The efficiency of both ADMM and BCD has been demonstrated empirically by Taylor et al. (2016), Zhang et al. (2016) and Zhang & Brand (2017). Meanwhile, BCD has been studied extensively for nonconvex problems in machine learning (see, e.g., Jain & Kar, 2017).

In this paper, we propose a novel algorithm based on BCD of Gauss-Seidel type. We define the loss function using the quadratic penalty method (Nocedal & Wright, 1999), unrolling the nested structure of DNNs into separate "single-layer" training tasks. The algorithm simplifies commonly used activation functions into projections onto nonempty closed convex sets (a standard device in convex analysis (Rockafellar & Wets, 1998)), so that the overall loss function is block multiconvex (Xu & Yin, 2013). This property allows us to obtain global convergence guarantees under the framework of the KL property (Attouch et al., 2013; Bolte et al., 2014).
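To make the splitting idea concrete, consider a single hidden layer without biases or regularizers (an expository special case with notation simplified relative to Section 3): the nested problem
$$\min_{W_1, W_2}\ \ell\big(W_2\,\sigma(W_1 x),\, y\big)$$
is relaxed, via an auxiliary activation variable $v$ and a quadratic penalty, into
$$\min_{W_1, W_2, v}\ \ell(W_2 v,\, y) \;+\; \frac{\gamma}{2}\,\big\|v - \sigma(W_1 x)\big\|^2,$$
so that, with the other blocks held fixed, $W_2$ and $v$ each appear in a simple convex subproblem; the remaining coupling of $W_1$ through $\sigma$ is then handled by replacing the activation with a projection, as described in Section 3.2.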

2 Related work

Carreira-Perpiñán & Wang (2014) and Zhang & Brand (2017) also suggest the use of BCD for training DNNs and observe empirically its per-epoch efficiency, with the training loss dropping much faster than under SGD. Several related works consider schemes similar to ours. A very recent work (Frerix et al., 2018) implements proximal steps only for the model parameter updates and keeps gradient steps for updating the activation parameters and the output-layer parameters, whereas we also apply proximal steps to these parameters. In the problem formulation of Zhang & Brand (2017), no bias vectors are used in the layers. They concatenate all weight matrices of the hidden layers and all activation vectors (defined below) into two separate blocks and update them together with the weight matrix of the output layer, so that these three blocks are updated alternately instead of in an overall backward order as in backprop.

Carreira-Perpiñán & Wang (2014) consider a specific DNN with squared loss and sigmoid activation functions, and propose the so-called method of auxiliary coordinates (MAC). Our problem formulation is similar, but is further simplified as described in the next section.

3 The proximal block coordinate descent algorithm

3.1 Preliminaries and notations

We consider a feedforward neural network with $N$ hidden layers. Let $d_i$ be the number of nodes of the $i$-th layer (with $d_0$ the input dimension), $n$ be the number of training samples and $c$ be the number of classes. Note that $d_{N+1} = c$. We adopt the following notation: $x_j \in \mathbb{R}^{d_0}$ is the $j$-th training sample; $y_j \in \{0,1\}^{c}$ is the one-hot vector of its corresponding label, with $(y_j)_k$ the $k$-th entry of the column vector $y_j$; $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ is the weight matrix between the $(i-1)$-th and $i$-th hidden layers and $b_i \in \mathbb{R}^{d_i}$ the bias vector of the $i$-th hidden layer, for $i = 1, \dots, N$; $W_{N+1} \in \mathbb{R}^{c \times d_N}$ is the weight matrix between the last hidden layer and the output layer, and $b_{N+1} \in \mathbb{R}^{c}$ the bias vector of the output layer. We denote a general (elementwise) activation function by $\sigma$, collect the data and labels columnwise as $X := (x_1, \dots, x_n) \in \mathbb{R}^{d_0 \times n}$ and $Y := (y_1, \dots, y_n) \in \mathbb{R}^{c \times n}$, and write $\mathbf{1}_n$ for the all-ones vector in $\mathbb{R}^n$.
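For concreteness, these objects can be laid out as follows (a minimal sketch with toy, made-up dimensions; storing one sample per column is a convention of this sketch, not a requirement of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (placeholders): d[0] = input dimension, d[1..N] = hidden widths.
d = [8, 16, 16]          # N = 2 hidden layers
c, n = 4, 32             # number of classes and of training samples

# W_i maps layer i-1 to layer i; b_i is the bias of layer i.
W = [rng.normal(0.0, 0.01, size=(d[i], d[i - 1])) for i in range(1, len(d))]
b = [0.1 * np.ones((d[i], 1)) for i in range(1, len(d))]
W_out = rng.normal(0.0, 0.01, size=(c, d[-1]))   # W_{N+1}
b_out = 0.1 * np.ones((c, 1))                    # b_{N+1}

X = rng.normal(size=(d[0], n))                   # one training sample per column
Y = np.eye(c)[:, rng.integers(0, c, size=n)]     # one-hot labels, one per column

# Activation matrices V_i and pre-activations U_i, filled by a forward pass (ReLU).
V, U, H = [], [], X
for Wi, bi in zip(W, b):
    pre = Wi @ H + bi
    U.append(pre)
    H = np.maximum(pre, 0.0)
    V.append(H)
```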

3.2 Problem formulation

In the training of regularized DNNs, we are solving the following optimization problem:

$$\min_{\{W_i\},\{b_i\}} \;\; \ell\big(W_{N+1} V_N + b_{N+1}\mathbf{1}_n^\top,\, Y\big) \;+\; \sum_{i=1}^{N+1} r_i(W_i), \qquad (1)$$

where $\ell$ is a generic convex loss function and $r_i$, $i = 1, \dots, N+1$, are convex but possibly nonsmooth regularizers, $\mathcal{V} := \{V_i\}_{i=1}^{N}$ is the set of all activation matrices (one activation vector per sample, stored columnwise), $V_i = \sigma\big(W_i V_{i-1} + b_i\mathbf{1}_n^\top\big)$ for all $i = 1, \dots, N$, and $V_0 = X$. The problem (1) can be reformulated as follows:

$$\min_{\{W_i\},\{b_i\},\mathcal{V}} \;\; \ell\big(W_{N+1} V_N + b_{N+1}\mathbf{1}_n^\top,\, Y\big) \;+\; \sum_{i=1}^{N+1} r_i(W_i) \;+\; \sum_{i=1}^{N} \frac{\gamma_i}{2}\big\|V_i - \sigma\big(W_i V_{i-1} + b_i\mathbf{1}_n^\top\big)\big\|^2, \qquad (2)$$

where $\|\cdot\|$ is the standard Euclidean (Frobenius) norm and $\gamma_i > 0$ for all $i = 1, \dots, N$.

The above scheme allows for general activation functions such as the hyperbolic tangent, the sigmoid and the ReLU. However, the formulation (2) is generally hard to solve explicitly (say, for tanh and sigmoid), but it can be simplified if we treat the activation as a projection onto a convex set. For instance, the ReLU can be viewed as the projection onto the closed upper half-space: it is equivalent to imposing the constraint $V_i \in \mathcal{X}_i := \mathbb{R}_+^{d_i \times n}$, i.e., $(V_i)_{jk} \ge 0$ for all $j$ and $k$, for all $i = 1, \dots, N$. For the hyperbolic tangent and sigmoid functions, nonsmooth approximations are needed; they are discussed in Section A.1. Inspired by the formulation in Zhang et al. (2016), we further introduce a set of auxiliary variables $\mathcal{U} := \{U_i\}_{i=1}^{N}$ into the objective to get:

$$\min_{\{W_i\},\{b_i\},\mathcal{V},\mathcal{U}} \; \ell\big(W_{N+1} V_N + b_{N+1}\mathbf{1}_n^\top,\, Y\big) + \sum_{i=1}^{N+1} r_i(W_i) + \sum_{i=1}^{N} \Big[\tfrac{\gamma_i}{2}\|V_i - U_i\|^2 + \tfrac{\rho_i}{2}\big\|U_i - \big(W_i V_{i-1} + b_i\mathbf{1}_n^\top\big)\big\|^2 + \iota_{\mathcal{X}_i}(V_i)\Big], \qquad (3)$$

where $\mathcal{U} := \{U_i\}_{i=1}^{N}$ is the set of auxiliary variables, each $\mathcal{X}_i$ is a nonempty closed convex set and $\iota_{\mathcal{X}_i}$ is the indicator function of $\mathcal{X}_i$, so that $\iota_{\mathcal{X}_i}(V) = 0$ for $V \in \mathcal{X}_i$ and $\iota_{\mathcal{X}_i}(V) = +\infty$ otherwise. For the case of the ReLU activation function, $\mathcal{X}_i = \mathbb{R}_+^{d_i \times n}$. This formulation (3) is more desirable since it eliminates one variable from the optimization, which likely speeds up training. Also note that the objective function of (3) is block multiconvex, which allows us to invoke established convergence guarantees from the existing literature for the proposed algorithm.
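As a sanity check on the block-multiconvexity claim, the penalized objective can be evaluated as below (a sketch continuing the toy setup of Section 3.1; the single penalty constants `gamma` and `rho` and the choice of squared loss are simplifying assumptions of this sketch, not the exact constants of (3)):

```python
import numpy as np

def relu_indicator(V):
    """Indicator of the nonnegative orthant (the ReLU case): 0 if V >= 0, else +infinity."""
    return 0.0 if np.all(V >= 0.0) else np.inf

def penalized_objective(W, b, W_out, b_out, V, U, X, Y, gamma=1.0, rho=1.0, lam=0.0):
    """Evaluate a penalty objective in the spirit of (3), with the squared loss.

    With all other blocks held fixed, the value is convex (quadratic plus an
    indicator) in each individual block -- this is the block multiconvexity
    that the convergence analysis relies on.
    """
    N = len(W)
    value, V_prev = 0.0, X
    for i in range(N):
        pre = W[i] @ V_prev + b[i]                         # W_i V_{i-1} + b_i 1^T
        value += 0.5 * gamma * np.sum((V[i] - U[i]) ** 2)  # couples V_i and U_i
        value += 0.5 * rho * np.sum((U[i] - pre) ** 2)     # couples U_i with (W_i, b_i, V_{i-1})
        value += relu_indicator(V[i])                      # V_i constrained to X_i
        value += lam * np.sum(W[i] ** 2)                   # optional weight-decay regularizer r_i
        V_prev = V[i]
    value += 0.5 * np.sum((W_out @ V_prev + b_out - Y) ** 2)  # squared loss, as in Section 4
    value += lam * np.sum(W_out ** 2)
    return value
```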

3.3 The proposed algorithm

In our minimization algorithm, we perform a proximal step (as in the proximal point algorithm) for each block of parameters except the auxiliary variables (which are updated by direct minimization rather than by the dual gradient ascent of Zhang et al. (2016)), in a Gauss-Seidel fashion and in a backward order based on the network structure. Adaptive momentum (Lau & Yao, 2017), though not covered by the convergence analysis, is also applied after each proximal update because of its empirical usefulness for convergence. The overall algorithm is given as Algorithm 1 (see Section A.2).
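A rough illustration of one gradient-free Gauss-Seidel sweep, continuing the toy setup of Section 3.1, is sketched below. The subproblem solvers used here — closed-form ridge-type proximal updates for the weights and biases, one projected proximal-gradient step for each $V_i$, and direct minimization for each $U_i$ — are illustrative assumptions, not necessarily the exact updates of Algorithm 1, and the adaptive-momentum steps are omitted.

```python
import numpy as np

def prox_ridge(T, V_in, W_old, rho, L):
    """argmin_W  rho/2 ||T - W V_in||_F^2 + L/2 ||W - W_old||_F^2  (closed form)."""
    A = rho * (V_in @ V_in.T) + L * np.eye(V_in.shape[0])
    B = rho * (T @ V_in.T) + L * W_old
    return np.linalg.solve(A, B.T).T        # solves W A = B, with A symmetric

def bcd_sweep(W, b, W_out, b_out, V, U, X, Y, gamma=1.0, rho=1.0, L=1.0):
    """One Gauss-Seidel sweep in backward order (illustrative solver choices)."""
    N, n = len(W), X.shape[1]
    # Output layer: proximal (ridge-type) updates under the squared loss.
    W_out = prox_ridge(Y - b_out, V[N - 1], W_out, 1.0, L)
    b_out = (np.sum(Y - W_out @ V[N - 1], axis=1, keepdims=True) + L * b_out) / (n + L)
    # Hidden layers, from layer N down to layer 1.
    for i in reversed(range(N)):
        V_in = X if i == 0 else V[i - 1]
        # V_i: one proximal-gradient step on the smooth terms, then projection onto X_i (ReLU case).
        if i == N - 1:
            grad = gamma * (V[i] - U[i]) + W_out.T @ (W_out @ V[i] + b_out - Y)
        else:
            grad = gamma * (V[i] - U[i]) + rho * W[i + 1].T @ (W[i + 1] @ V[i] + b[i + 1] - U[i + 1])
        V[i] = np.maximum(V[i] - grad / L, 0.0)
        # U_i: direct minimization of gamma/2 ||V_i - U||^2 + rho/2 ||U - (W_i V_in + b_i)||^2.
        U[i] = (gamma * V[i] + rho * (W[i] @ V_in + b[i])) / (gamma + rho)
        # W_i, b_i: proximal (ridge-type) updates.
        W[i] = prox_ridge(U[i] - b[i], V_in, W[i], rho, L)
        b[i] = (rho * np.sum(U[i] - W[i] @ V_in, axis=1, keepdims=True) + L * b[i]) / (rho * n + L)
    return W, b, W_out, b_out, V, U
```

In this sketch, `bcd_sweep` would be called once per epoch on the full training set; Algorithm 1 additionally applies an adaptive-momentum step after each proximal update.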

3.4 Convergence results

The problem formulation (3) involves regularized block multiconvex optimization, and the proposed algorithm fits the general framework of Xu & Yin (2013). We analyze the convergence of Algorithm 1 under the assumptions stated in Section A.4. The proof of the following theorem and the related convergence rate are given in Section A.4.

Theorem 1 (Global convergence)

Suppose that the assumptions of Section A.4 and Proposition 1 hold, and that the sequence $\{z^k\}$ generated by Algorithm 1, where $z^k$ collects all blocks $(\{W_i^k\}, \{b_i^k\}, \{V_i^k\}, \{U_i^k\})$, has a finite limit point $\bar{z}$ at which the objective of (3) satisfies the Kurdyka-Łojasiewicz (KL) property (Definition 2; see Section A.4). Then the sequence $\{z^k\}$ converges to $\bar{z}$, which is a critical point of (3).

4 Experimental results

We conduct experiments with two different network structures on CIFAR-10 (Krizhevsky et al., 2009), which has 50K training and 10K test samples: a 3072-4K-4K-4K-10 MLP and a 3072-4K-3072-4K-10 DNN with a residual connection in the second hidden layer (ResNet (He et al., 2016)). Experimental results on MNIST are reported in Appendix B. The BCD algorithm (20 epochs) is implemented in MATLAB, while backprop (SGD; 100 epochs) is implemented in Keras with the TensorFlow backend. The squared loss and ReLU activations are used, without regularization. All weight matrices are initialized from a Gaussian distribution with a standard deviation of 0.01 and the bias vectors are initialized as vectors of all 0.1, while the activation matrices and auxiliary variables are initialized by a single forward pass. The hyperparameters of BCD and the learning rate of SGD are tuned manually. We report the training and test accuracies (the median of 5 runs) as follows:

Figure 1: Training and test accuracies on CIFAR-10 for (a) MLP with SGD, (b) MLP with BCD, (c) ResNet with SGD, (d) ResNet with BCD. Final test accuracies: (a) 0.4765; (b) 0.4682; (c) 0.494; (d) 0.4843.
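For reference, the SGD baseline described above could be configured roughly as follows (a sketch: the learning rate, batch size and pixel scaling are assumptions, since the paper only states that the learning rate is tuned manually):

```python
from tensorflow import keras

# Initializations as described in Section 4: Gaussian weights (std 0.01), biases 0.1.
init_w = keras.initializers.RandomNormal(stddev=0.01)
init_b = keras.initializers.Constant(0.1)

def dense(units, act=None):
    return keras.layers.Dense(units, activation=act,
                              kernel_initializer=init_w, bias_initializer=init_b)

model = keras.Sequential([
    keras.Input(shape=(3072,)),
    dense(4096, "relu"),
    dense(4096, "relu"),
    dense(4096, "relu"),
    dense(10),                         # linear output layer; squared loss below
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),   # assumed value
              loss="mean_squared_error",
              metrics=["categorical_accuracy"])

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train = x_train.reshape(-1, 3072).astype("float32") / 255.0       # scaling is assumed
x_test = x_test.reshape(-1, 3072).astype("float32") / 255.0
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

model.fit(x_train, y_train, epochs=100, batch_size=128,             # batch size is assumed
          validation_data=(x_test, y_test))
```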

5 Conclusion and future work

In this paper, we proposed an efficient BCD algorithm and established its convergence guarantees based on our block multiconvex formulation. Three major advantages of BCD are: (a) high per-epoch efficiency at early stages (observed in Figure 1), i.e., the training and test accuracies of BCD grow much faster than those of SGD in terms of epochs at the early stage; (b) good scalability, i.e., BCD can be implemented in a distributed and parallel manner via data parallelism on multi-core CPUs; (c) being gradient-free, i.e., no gradient computations are needed for the updates. One drawback of BCD methods is that they generally require more memory than SGD. Thus, a future direction is to study the feasibility of stochastic and parallel block coordinate descent methods for DNN training.

Acknowledgments

The authors would like to thank Ruohan Zhan for sharing experimental results on deep neural network training using BCD and ADMM algorithms.

References

  • Attouch et al. (2013) Hedy Attouch, Jérôme Bolte, and Benar Fux Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Prog., 137(1):91–129, Feb. 2013.
  • Bauschke & Combettes (2017) Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer International Publishing, 2nd edition, 2017.
  • Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
  • Bochnak et al. (1998) Jacek Bochnak, Michel Coste, and Marie-Francoise Roy. Real algebraic geometry, volume 3. Ergeb. Math. Grenzgeb. Springer-Verlag, Berlin, 1998.
  • Bolte et al. (2007a) Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17:1205–1223, 2007a.
  • Bolte et al. (2007b) Jérôme Bolte, Aris Daniilidis, Adrian Lewis, and Masahiro Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18:556–572, 2007b.
  • Bolte et al. (2014) Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
  • Carreira-Perpiñán & Wang (2014) Miguel Carreira-Perpiñán and Weiran Wang. Distributed optimization of deeply nested systems. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pp. 10–19, 2014.
  • Combettes & Pesquet (2011) Patrick L. Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Heinz H. Bauschke, Regina S. Burachik, Patrick L. Combettes, Veit Elser, D. Russell Luke, and Henry Wolkowicz (eds.), Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer New York, 2011.
  • Frerix et al. (2018) Thomas Frerix, Thomas Möllenhoff, Michael Moeller, and Daniel Cremers. Proximal backpropagation. Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
  • Jain & Kar (2017) Prateek Jain and Purushottam Kar. Non-convex optimization for machine learning. Foundations and Trends®  in Machine Learning, 10(3–4):142–336, 2017.
  • Kingma & Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research). 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html.
  • Kurdyka (1998) Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Annales de l’institut Fourier, 48:769–783, 1998.
  • Lau & Yao (2017) Tsz Kit Lau and Yuan Yao. Accelerated block coordinate proximal gradients with applications in high dimensional statistics. The 10th NIPS Workshop on Optimization for Machine Learning, 2017.
  • LeCun & Cortes (2010) Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.
  • Łojasiewicz (1963) Stanisław Łojasiewicz. Une propriété topologique des sous-ensembles analytiques réels. In: Les Équations aux dérivées partielles. Éditions du centre National de la Recherche Scientifique, Paris, pp. 87–89, 1963.
  • Łojasiewicz (1993) Stanisław Łojasiewicz. Sur la geometrie semi-et sous-analytique. Annales de l’institut Fourier, 43:1575–1595, 1993.
  • Moreau (1962) Jean-Jacques Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes Rendus de l’Académie des Sciences (Paris), Série A, 255:2897–2899, 1962.
  • Nair & Hinton (2010) Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814, 2010.
  • Nocedal & Wright (1999) Jorge Nocedal and Stephen J. Wright. Numerical optimization. Springer, 1999.
  • Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pp. 1310–1318, 2013.
  • Robbins & Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Rockafellar & Wets (1998) R. Tyrrell Rockafellar and Roger J.-B. Wets. Variational Analysis. Springer Verlag, 1998.
  • Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • Taylor et al. (2016) Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2722–2731, 2016.
  • Xu & Yin (2013) Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.
  • Zhang & Brand (2017) Ziming Zhang and Matthew Brand. Convergent block coordinate descent for training Tikhonov regularized deep neural networks. In Advances in Neural Information Processing Systems 30, pp. 1719–1728, 2017.
  • Zhang et al. (2016) Ziming Zhang, Yuting Chen, and Venkatesh Saligrama. Efficient training of very deep neural networks for supervised hashing. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1487–1495, 2016.
  • Zou & Hastie (2005) Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

Appendices

Appendix A Algorithmic details

A.1 Simplifications of activation functions

We first give the definition of the proximity operator which is required in the following analysis.

Definition 1 (Proximity operator (Moreau, 1962; Combettes & Pesquet, 2011))

Let $\gamma$ be a positive parameter, $\mathcal{H}$ a real Hilbert space (e.g., a Euclidean space) and $f : \mathcal{H} \to (-\infty, +\infty]$ a function. The proximity operator of the function $f$ is defined through
$$\operatorname{prox}_{\gamma f}(x) := \operatorname*{argmin}_{u \in \mathcal{H}} \Big\{ f(u) + \frac{1}{2\gamma}\|u - x\|^2 \Big\}, \qquad x \in \mathcal{H}.$$

If $f$ is convex, proper and lower semicontinuous, the minimization admits a unique solution, so $\operatorname{prox}_{\gamma f}$ is single-valued. If $f$ is nonconvex, then $\operatorname{prox}_{\gamma f}$ is generally set-valued.
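For example (a standard fact, not specific to this paper), if $f = \iota_C$ is the indicator function of a nonempty closed convex set $C$, then for any $\gamma > 0$,
$$\operatorname{prox}_{\gamma \iota_C}(x) = \operatorname*{argmin}_{u \in C} \frac{1}{2\gamma}\|u - x\|^2 = P_C(x),$$
so proximity operators generalize projections; this identification is used for the activation functions below.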

For the ReLU activation function, we can consider it as the projection onto the nonempty closed convex set $\mathbb{R}_+^{d} := \{x \in \mathbb{R}^d : x \ge 0\}$, where $d$ is the dimension of the input variable. This is based on Bauschke & Combettes (2017, Proposition 24.47), which states that for any proper lower semicontinuous convex function $\varphi$ on $\mathbb{R}$ and any closed interval $C \subseteq \mathbb{R}$ such that $C \cap \operatorname{dom}\varphi \neq \varnothing$,
$$\operatorname{prox}_{\varphi + \iota_C} = P_C \circ \operatorname{prox}_{\varphi},$$
where $P_C$ is the projection operator onto the nonempty closed convex set $C$. Since all the activation functions considered here act elementwise, according to Bauschke & Combettes (2017, Proposition 24.11) we can extend the above result to the Euclidean space $\mathbb{R}^d$: for a separable function $f(x) = \sum_{k=1}^{d} \varphi_k(x_k)$,
$$\operatorname{prox}_{f}(x) = \big(\operatorname{prox}_{\varphi_1}(x_1), \dots, \operatorname{prox}_{\varphi_d}(x_d)\big),$$
where $x = (x_1, \dots, x_d) \in \mathbb{R}^d$.

Likewise, the sigmoid and tanh activation functions can be used in this scheme with some adjustments. It should be noted that these two functions are not themselves projections onto nonempty closed convex sets. Instead, if we consider nonsmooth surrogates of them, they can be expressed as projections onto nonempty closed convex sets, which are much easier to compute.

For the tanh function, we use the following nonsmooth function (a.k.a. hard tanh) as an approximation:
$$\widehat{\tanh}(x) := \begin{cases} -1, & x \le -1,\\ x, & -1 < x < 1,\\ 1, & x \ge 1. \end{cases}$$
Then we have $\widehat{\tanh} = P_{[-1,1]}$ applied elementwise, i.e., the projection onto $[-1,1]^d$.

For the sigmoid function $s(x) = 1/(1 + e^{-x})$, recall that we have the following relationship:
$$s(x) = \frac{1 + \tanh(x/2)}{2}.$$
Using this transformation, the nonsmooth approximation of the sigmoid function (a.k.a. hard sigmoid; it can also be defined as $0$ if $x < -2.5$, $0.2x + 0.5$ if $-2.5 \le x \le 2.5$, and $1$ if $x > 2.5$, as in TensorFlow) is
$$\widehat{s}(x) := \frac{1 + \widehat{\tanh}(x/2)}{2} = \begin{cases} 0, & x \le -2,\\ (x+2)/4, & -2 < x < 2,\\ 1, & x \ge 2. \end{cases}$$
We define the closed convex set $C := [0,1]^d$. Then we have $\widehat{s}(x) = P_C\big((x+2)/4\big)$, applied elementwise.
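In code, these elementwise surrogates could look as follows (a sketch; the hard-sigmoid parametrization shown is the hard-tanh-based one written above, and the alternative TensorFlow variant is noted in the comment):

```python
import numpy as np

def proj_relu(z):
    """Projection onto the nonnegative orthant [0, +inf)^d: recovers the ReLU."""
    return np.maximum(z, 0.0)

def hard_tanh(z):
    """Projection onto [-1, 1]^d: the nonsmooth (hard) surrogate of tanh."""
    return np.clip(z, -1.0, 1.0)

def hard_sigmoid(z):
    """Hard-tanh-based surrogate of the sigmoid, (1 + hard_tanh(z / 2)) / 2,
    i.e. the projection onto [0, 1]^d composed with the affine map z -> (z + 2) / 4.
    (TensorFlow's variant instead clips 0.2 * z + 0.5 to [0, 1].)"""
    return np.clip((z + 2.0) / 4.0, 0.0, 1.0)
```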

A.2 The proximal BCD algorithm

Note that the extrapolation and adaptive-momentum steps in the following algorithm are not covered by the convergence results but are used in the numerical experiments.

Input: training data $X$ and $Y$; regularization parameters; penalty parameters $\gamma_i$, $\rho_i$ and proximal parameters for all layers and blocks.
Initialize $W_i$ and $b_i$ for all $i = 1, \dots, N+1$; initialize $\{V_i\}$ and $\{U_i\}$ by forward propagation.
for $k = 1, 2, \dots$ do
      update the output-layer parameters $W_{N+1}$ and $b_{N+1}$ by proximal steps;
      for $i = N, N-1, \dots, 1$ do
            update $V_i$ by a proximal step (for the ReLU case, a projection onto $\mathcal{X}_i$);
            update $U_i$ by direct minimization;
            update $W_i$ and $b_i$ by proximal steps;
      (each proximal update is followed by an adaptive-momentum step (Lau & Yao, 2017))
Output: $\{W_i\}_{i=1}^{N+1}$ and $\{b_i\}_{i=1}^{N+1}$.
Algorithm 1: Proximal Block Coordinate Descent (BCD) Algorithm for Training DNNs

A.3 Implementation details

For instance, if we take $\ell$ to be the squared loss, then each proximal subproblem in Algorithm 1 reduces to a regularized least-squares problem and admits a closed-form solution. The updates for all blocks at layers $i = N, \dots, 1$ are obtained analogously, with $V_0 = X$ used in place of the previous layer's activations when $i = 1$.
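As one representative instance (derived here under the assumption of the squared loss, no regularizer on the output layer, and a proximal parameter $L > 0$; the paper's exact update may differ in its constants), the output-layer weight update solves a ridge-type problem with the closed form
$$W_{N+1}^{k+1} \;=\; \operatorname*{argmin}_{W}\;\tfrac{1}{2}\big\|W V_N + b_{N+1}\mathbf{1}_n^\top - Y\big\|^2 + \tfrac{L}{2}\big\|W - W_{N+1}^{k}\big\|^2 \;=\; \Big[\big(Y - b_{N+1}\mathbf{1}_n^\top\big)V_N^\top + L\,W_{N+1}^{k}\Big]\Big(V_N V_N^\top + L I\Big)^{-1},$$
and the updates of the remaining weight matrices and bias vectors are analogous ridge-type solves.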

A.4 Assumptions, definitions, propositions and related proofs

Assumption 1

We make the following assumptions on the functions in (3): (i) the loss function $\ell$ is chosen such that the objective of (3), denoted $\mathcal{L}$ below, is continuous (note that the indicator function $\iota_{\mathcal{X}_i}$ is continuous on $\mathcal{X}_i$ since, by definition, $\iota_{\mathcal{X}_i}(x) = 0$ for $x \in \mathcal{X}_i$ and $+\infty$ otherwise; in general, an indicator function is an extended-value convex function, equal to $+\infty$ outside its domain) and bounded below on its domain, and problem (3) has a Nash point (Xu & Yin, 2013); (ii) for each block of variables, there exist constants $0 < \underline{L} \le \overline{L} < \infty$ such that the proximal parameters used in Algorithm 1 stay within $[\underline{L}, \overline{L}]$ at every iteration.

Definition 2 (Kurdyka-Łojasiewicz property and KL function (Bolte et al., 2014))
  1. A proper lower semicontinuous function $f : \mathbb{R}^d \to (-\infty, +\infty]$ is said to have the Kurdyka-Łojasiewicz property at $\bar{x} \in \operatorname{dom}\partial f$ if there exist $\eta \in (0, +\infty]$, a neighborhood $U$ of $\bar{x}$ and a continuous concave function $\phi(s) = c\,s^{1-\theta}$ for some $c > 0$ and $\theta \in [0, 1)$ such that for all $x \in U$ with $f(\bar{x}) < f(x) < f(\bar{x}) + \eta$, the Kurdyka-Łojasiewicz inequality holds:
$$\phi'\big(f(x) - f(\bar{x})\big)\,\operatorname{dist}\big(0, \partial f(x)\big) \ge 1.$$

  2. Proper lower semicontinuous functions which satisfy the Kurdyka-Łojasiewicz inequality at each point of $\operatorname{dom}\partial f$ are called KL functions.

Definition 3 (Semialgebraic function)
  1. A set $S \subseteq \mathbb{R}^d$ is called semialgebraic (Bochnak et al., 1998) if it can be represented as
$$S = \bigcup_{i=1}^{s} \bigcap_{j=1}^{t} \big\{x \in \mathbb{R}^d : p_{ij}(x) = 0,\; q_{ij}(x) > 0\big\},$$
     where $p_{ij}, q_{ij} : \mathbb{R}^d \to \mathbb{R}$ are real polynomial functions for $1 \le i \le s$, $1 \le j \le t$.

  2. A function $f$ is called semialgebraic if its graph $\{(x, f(x)) : x \in \operatorname{dom} f\}$ is a semialgebraic set.

Remark 1

KL functions include real analytic functions, functions definable in an o-minimal structure (Kurdyka, 1998), subanalytic functions (Bolte et al., 2007a, b), semialgebraic functions (see Definition 3) and locally strongly convex functions. Some other important facts regarding KL functions, available in Łojasiewicz (1963, 1993), Kurdyka (1998), Bolte et al. (2007a, b), Attouch et al. (2013), Xu & Yin (2013) and references therein, are summarized below.

  1. The sum of a real analytic function and a semialgebraic function is a subanalytic function, and thus a KL function.

  2. If a set $S$ is semialgebraic, so is its closure $\operatorname{cl}(S)$.

  3. If $S_1$ and $S_2$ are both semialgebraic, so are $S_1 \cup S_2$, $S_1 \cap S_2$, and $S_1 \setminus S_2$.

  4. Indicator functions of semialgebraic sets are semialgebraic, e.g., the indicator functions of the nonnegative closed half-space and of a nonempty closed interval (a concrete instance is worked out after this list).

  5. Finite sums and products of semialgebraic (real analytic) functions are semialgebraic (real analytic).

  6. The composition of semialgebraic functions is semialgebraic.
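As a concrete instance of items 3 and 4, relevant to the ReLU case (a standard verification, included here for completeness): each closed half-space $\{x \in \mathbb{R}^d : x_j \ge 0\} = \{x : x_j = 0\} \cup \{x : x_j > 0\}$ is semialgebraic by Definition 3, so by item 3 their finite intersection
$$\mathbb{R}_+^{d} = \bigcap_{j=1}^{d}\big\{x \in \mathbb{R}^d : x_j \ge 0\big\}$$
is semialgebraic, and by item 4 the indicator function $\iota_{\mathbb{R}_+^{d}}$ is semialgebraic. This is exactly the set $\mathcal{X}_i$ arising from the ReLU activation in (3).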

Proposition 1

The objective function $\mathcal{L}$ of (3) is a KL function if the loss function $\ell$ is chosen as one of the commonly used loss functions such as the squared loss, logistic loss, hinge loss or softmax cross-entropy loss, and the regularizers $r_i$ are chosen as $\ell_1$ norms, squared $\ell_2$ norms (a.k.a. weight decay), $\ell_p$ quasi-norms with $0 < p < 1$, or sums of these such as the elastic net (Zou & Hastie, 2005).

Proof

Recall that the objective $\mathcal{L}$ of (3) is the sum of the loss term, the regularizers, the quadratic penalty terms and the indicator functions $\iota_{\mathcal{X}_i}$. Any polynomial function is real analytic, so each quadratic penalty term is real analytic by Remark 1 item 5. In the same vein, the squared loss is also real analytic. The logistic loss and softmax cross-entropy loss are also real analytic (Xu & Yin, 2013). If $\ell$ is the hinge loss, i.e., $\ell(u, y) = \max\{0, 1 - yu\}$ for a label $y \in \{-1, +1\}$, it is semialgebraic, because its graph is the closure of a semialgebraic set (a finite union of sets defined by polynomial equalities and inequalities). Then again by Remark 1 item 5, the loss term is either a real analytic function (squared, logistic and softmax cross-entropy losses) or a semialgebraic function (hinge loss). The sum of the indicator functions $\iota_{\mathcal{X}_i}$ is semialgebraic by Remark 1 items 4 and 5, since each $\mathcal{X}_i$ is built from nonempty closed intervals, as described in Section A.1.

Concerning the regularization term, which is the sum of the regularizers $r_i$, note that the $\ell_1$ norm, the squared $\ell_2$ norm and the $\ell_p$ quasi-norms with $0 < p < 1$ are all semialgebraic, and thus the elastic net is also semialgebraic. By Remark 1 item 5, their sum is also semialgebraic.

Finally, using Remark 1 item 1, we conclude that $\mathcal{L}$ is subanalytic and hence a KL function.

We now present the proof of Theorem 1.

Proof (Theorem 1)

Note that $\{\mathcal{L}(z^k)\}$ is monotonically nonincreasing and converges to $\mathcal{L}(\bar{z})$. If $\mathcal{L}(z^{k_0}) = \mathcal{L}(\bar{z})$ at some iteration $k_0$, then $z^k = z^{k_0}$ for all $k \ge k_0$. It remains to consider the case $\mathcal{L}(z^k) > \mathcal{L}(\bar{z})$ for all $k$. Since $\bar{z}$ is a limit point of $\{z^k\}$ and $\mathcal{L}(z^k) \to \mathcal{L}(\bar{z})$, there must exist an integer $k_1$ such that $z^{k_1}$ is sufficiently close to $\bar{z}$. Hence, $\{z^k\}$ converges according to Xu & Yin (2013, Lemma 2.6).

Theorem 2 (Convergence rate)

Suppose that $\{z^k\}$ converges to a critical point $\bar{z}$, at which $\mathcal{L}$ satisfies the KL property with $\phi(s) = c\,s^{1-\theta}$ for some $c > 0$ and $\theta \in [0, 1)$. Then the following hold:

  1. If $\theta = 0$, $\{z^k\}$ converges to $\bar{z}$ in finitely many iterations.

  2. If $\theta \in (0, 1/2]$, then $\|z^k - \bar{z}\| \le C\tau^k$ for all $k \ge k_0$, for certain $k_0 > 0$, $C > 0$ and $\tau \in [0, 1)$.

  3. If $\theta \in (1/2, 1)$, then $\|z^k - \bar{z}\| \le C\,k^{-(1-\theta)/(2\theta-1)}$ for all $k \ge k_0$, for certain $k_0 > 0$ and $C > 0$.

These three parts correspond to finite convergence, linear convergence, and sublinear convergence, respectively.

Proof

See the proof of Theorem 2.9 of Xu & Yin (2013).

Appendix B Further experimental results

We further conduct experiments with two different network structures on the MNIST dataset (LeCun & Cortes, 2010), which has 60K training and 10K test samples: a 784-2048-2048-2048-10 MLP and a 784-2048-784-2048-10 DNN with a residual connection in the second hidden layer (ResNet). The BCD algorithm (30 epochs for the MLP; 20 epochs for the ResNet) is implemented in MATLAB, while backprop (SGD; 100 epochs) is implemented in Keras with the TensorFlow backend. The squared loss and ReLU activations are used, without regularization. All weight matrices are initialized from a Gaussian distribution with a standard deviation of 0.01 and the bias vectors are initialized as vectors of all 0.1, while the activation matrices and auxiliary variables are initialized by a single forward pass. The hyperparameters of BCD and the learning rate of SGD are chosen by manual tuning. We report the training and test accuracies (the median of 5 runs) as follows:

Figure 2: Training and test accuracies on MNIST for (a) MLP with SGD, (b) ResNet with SGD, (c) MLP with BCD, (d) ResNet with BCD. Final test accuracies: (a) and (b): 0.9533; (c): 0.9458; (d): 0.9537.

From Figure 2, we observe that the BCD algorithms require far fewer epochs to achieve similar test accuracies. Thus, the BCD method has high per-epoch efficiency compared to backprop with SGD.