# Block Coordinate Descent for Deep Learning: Unified Convergence Guarantees

Training deep neural networks (DNNs) efficiently is a challenge due to the associated highly nonconvex optimization. Recently, the efficiency of the block coordinate descent (BCD) type methods has been empirically illustrated for DNN training. The main idea of BCD is to decompose the highly composite and nonconvex DNN training problem into several almost separable simple subproblems. However, their convergence property has not been thoroughly studied. In this paper, we establish some unified global convergence guarantees of BCD type methods for a wide range of DNN training models, including but not limited to multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and residual networks (ResNets). This paper nontrivially extends the existing convergence results of nonconvex BCD from the smooth case to the nonsmooth case. Our convergence analysis is built upon the powerful Kurdyka-Łojasiewicz (KL) framework but some new techniques are introduced, including the establishment of the KL property of the objective functions of many commonly used DNNs, where the loss function can be taken as squared, hinge and logistic losses, and the activation function can be taken as rectified linear units (ReLUs), sigmoid and linear link functions. The efficiency of BCD method is also demonstrated by a series of exploratory numerical experiments.

• 17 publications
• 2 publications
• 6 publications
• 88 publications
03/24/2018

### A Proximal Block Coordinate Descent Algorithm for Deep Neural Network Training

Training deep neural networks (DNNs) efficiently is a challenge due to t...
05/23/2018

### A Unified Framework for Training Neural Networks

The lack of mathematical tractability of Deep Neural Networks (DNNs) has...
08/30/2022

### Convergence Rates of Training Deep Neural Networks via Alternating Minimization Methods

Training deep neural networks (DNNs) is an important and challenging opt...
02/06/2019

### A Convergence Analysis of Nonlinearly Constrained ADMM in Deep Learning

Efficient training of deep neural networks (DNNs) is a challenge due to ...
06/19/2022

### 0/1 Deep Neural Networks via Block Coordinate Descent

The step function is one of the simplest and most natural activation fun...
10/10/2021

### Convergence of Random Reshuffling Under The Kurdyka-Łojasiewicz Inequality

We study the random reshuffling (RR) method for smooth nonconvex optimiz...
11/15/2017

### Can CNNs Construct Highly Accurate Model Efficiently with Limited Training Samples?

It is well known that metamodel or surrogate modeling techniques have be...

## 1 Introduction

Tremendous research activities have been dedicated to deep learning due to its great success in some real-world applications such as image classification in computer vision

(Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014), and especially outperforming human in Go games (Silver et al., 2016).

The practical optimization algorithms for training neural networks can be mainly divided into three categories in terms of the amount of first- and second-order information used, namely, gradient-based, (approximate) second-order and gradient-free methods. Gradient-based methods make use of backpropagation

(Rumelhart et al., 1986)

to compute gradients of network parameters. Stochastic gradient descent (SGD) method proposed by

, it is observed that vanilla SGD fails to train a ten-hidden-layer MLPs while BCD still works and achieves a moderate accuracy within a few epochs. The other reason is that BCD and ADMM can be easily implemented in a distributed and parallel manner

(Boyd et al., 2011; Mahajan et al., 2017), therefore in favour of distributed/decentralized scenarios.

The BCD methods currently adopted in DNN training run into two categories depending on the specific formulations of the objective functions, namely, the two-splitting formulation and three-splitting formulation (shown in (2.2) and (2.4)), respectively. Examples of the two-splitting formulation include Carreira-Perpiñán and Wang (2014); Zhang and Brand (2017); Askari et al. (2018); Gu et al. (2018), whilst Taylor et al. (2016); Lau et al. (2018) adopt the three-splitting formulation. Convergence studies of BCD methods appeared recently in some restricted settings. In Zhang and Brand (2017)

, a BCD method was suggested to solve the Tikhonov regularized deep neural network training problem using a lifting trick to avoid the computational hurdle imposed by ReLU. Its convergence was established through the framework of

Xu and Yin (2013), where the block multiconvexity222A function with multi-block variables is called block multiconvex if it is convex with respect to each block variable when fixing the other blocks, and is called blockwise Lipschitz differentiable if it is differentiable with respect to each block variable and its gradient is Lipschitz continuous while fixing the others. and differentiability of the unregularized part of the objective function play central roles in the analysis. However, for other commonly used activations such as sigmoid, the convergence analysis of Xu and Yin (2013) cannot be directly applied since the multiconvexity may be violated. Askari et al. (2018) and Gu et al. (2018) extended the lifting trick introduced by Zhang and Brand (2017) to deal with a class of strictly increasing and invertible activations, and then adapted BCD methods to solve the lifted DNN training models. However, no convergence guarantee was provided in both Askari et al. (2018) and Gu et al. (2018). Following the similar lifting trick as in Zhang and Brand (2017), Lau et al. (2018) proposed a proximal BCD based on the three-splitting formulation of the regularized DNN training problem with ReLU activation. The global convergence was also established through the analysis framework of Xu and Yin (2013). However, similar convergence results for other commonly used activation functions are still lacking.

In this paper, we aim to fill these gaps. Our main contribution is to provide a general methodology to establish the global convergence333Global convergence used in this paper means that the algorithm converges starting from any finite initialization. of these BCD methods in the common DNN training settings, without requiring the block multiconvexity and differentiability assumptions as in Xu and Yin (2013). Instead, our key assumption is the Lipschitz continuity of the activation on any bounded set (see Section 3.1(b)). Specifically, Theorem 1 establishes the global convergence to a critical point at a rate of the BCD methods using the proximal strategy, while extensions to the prox-linear strategy for general losses are provided in Theorem 2 and to residual networks (ResNets) are shown in Theorem 3. Our assumptions are applicable to most cases appeared in the literature. Specifically in Theorem 1, if the loss function, activations, and convex regularizers are lower semicontinuous and either real-analytic (see Definition 1) or semialgebraic (see Definition 2), and the activations are Lipschitz continuous on any bounded set, then BCD converges to a critical point at the rate of starting from any finite initialization, where is the iteration number. Note that these assumptions are satisfied by most commonly used DNN training models, where (a) the loss function can be any of the squared, logistic, hinge, exponential or cross-entropy losses, (b) the activation function can be any of ReLU, leaky ReLU, sigmoid, tanh, linear, or polynomial functions, and (c) the regularizer can be any of the squared norm, squared Frobenius norm, the elementwise -norm, or the sum of squared Frobenuis norm and elementwise

-norm (say, in the vector case, the elastic net by

Zou and Hastie, 2005), or the indicator function of the nonnegative closed half space or a closed interval (see Proposition 1).

Our analysis is based on the Kurdyka-Łojasiewicz (KŁ) inequality (Łojasiewicz, 1993; Kurdyka, 1998) framework formulated in Attouch et al. (2013). However there are several different treatments compared to the state-of-the-art work (Xu and Yin, 2013) that enables us to achieve the general convergence guarantee aforementioned. According to Attouch et al. (2013, Theorem 2.9), the sufficient descent, relative error and continuity conditions, together with the assumption yield the global convergence of a nonconvex algorithm. In order to obtain the sufficient descent condition, we exploit the proximal strategy for all non-strongly convex subproblems (see Algorithm 2 and Lemma 1), without requiring block multiconvex assumption used in Xu and Yin (2013, Lemma 2.6). In order to establish the relative error condition, we use the Lipschitz continuity of the activation functions and do some careful treatments on the specific updates of the BCD methods (see Lemma 2), without requiring the (locally) Lipschitz differentiability of the unregularized part as used in Xu and Yin (2013, Lemma 2.6). The continuity condition is established via the lower semicontinuity assumptions of the loss, activations and regularizers. The treatments of this paper are of their own value to the optimization community. The detailed comparisons between this paper and the existing literature can be found in Section 4.

The rest of this paper is organized as follows. Section 2 describes the BCD methods when adapted to the splitting formulations of DNN training problems. Section 3 establishes their global convergence results, followed by some extensions. Section 4 illustrates the key ideas of proof with some discussions. We conclude this paper in Section 5.

## 2 DNN training via BCD

In this section, we describe the specific forms of BCD involving both two- and three-splitting formulations.

### 2.1 DNN training with variable splitting

Consider -layer feedforward neural networks including hidden layers of the neural networks. Particularly, let be the number of hidden units in the -th hidden layer for . Let and be the number of units of input and output layers, respectively. Let be the weight matrix between the -th layer and the -th layer for any 444To simplify notations, we regard the input and output layers as the -th and -th layers, respectively, and absorb the bias of each layer into .. Let be samples, where ’s are the one-hot vectors of labels. Denote , and . With the help of these notations, the DNN training problem can be formulated as the following empirical risk minimization:

 minWRn(Φ(X;W),Y), (2.1)

where , is some loss function, is the neural network model with layers and weights and is the activation function of the -th layer (generally, , i.e., the identity function) and is called empirical risk (also known as the training loss).

Note that the DNN training model (2.1) is highly nonconvex as the variables are coupled via the deep neural network architecture, which brings many challenges for the design of efficient training algorithms and also its theoretical analysis. To make Problem (2.1) more computationally tractable, variable splitting is one of the most commonly used ways (Taylor et al., 2016; Zhang and Brand, 2017; Askari et al., 2018; Gu et al., 2018; Lau et al., 2018). The main idea of variable splitting is to transform a complicated problem (where the variables are coupled highly nonlinearly) into a relatively simpler one (where the variables are coupled much looser) via introducing some additional variables.

#### 2.1.1 Two-splitting formulation.

Considering general deep neural network architectures, the DNN training problem can be naturally formulated as the following model (called two-splitting formulation)555Here we consider the regularized DNN training model. The model reduces to the original DNN training model (2.1) without regularization. :

 minW,VL0(W,V) :=Rn(VN;Y)+N∑i=1ri(Wi)+N∑i=1si(Vi) subject toVi =σi(WiVi−1),i=1,…,N, (2.2)

where denotes the empirical risk, , is the -th column of . In addition, and are extended-real-valued, nonnegative functions revealing the priors of the weight variable and the state variable (or the constraints on and ) for each , and define . In order to solve the two-splitting formulation (2.2), the following alternative minimization problem was suggested in the literature:

 minW,VL(W,V):=L0(W,V)+γ2N∑i=1∥Vi−σi(WiVi−1)∥2F, (2.3)

where

is a hyperparameter

666In (2.5), we use a uniform hyperparameter for the sum of all quadratic terms for the simplicity of notation. In practice, can be different for each quadratic term and our proof still goes through..

The DNN training model (2.2) can be very general, where: (a) can be the squared, logistic, hinge, cross-entropy or other commonly used loss functions; (b) can be ReLU, leaky ReLU, sigmoid, linear, or other commonly used activation functions; (c) can be the squared norm, the norm, the elastic net (Zou and Hastie, 2005), the indicator function of some nonempty closed convex set777The indicator function of a nonempty convex set is defined as if and otherwise. (such as the nonnegative closed half space or a closed interval ), or others; (d) can be the norm (Ji et al., 2014), the indicator function of some convex set with simple projection (Zhang and Brand, 2017), or others. Particularly, if there is no regularizer or constraint on (or ), then (or ) can be zero.

The network architectures considered in this paper exhibit generality to various types of DNNs, including but not limited to the fully (or sparse) connected MLPs (Rosenblatt, 1961), convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1998) and residual neural networks (ResNets) (He et al., 2016). For CNNs, the weight matrix is sparse and shares some symmetry structures represented as permutation invariants, which are linear constraints and up to a linear reparameterization all the main results below are still valid.

Various existing BCD algorithms for DNN training (Carreira-Perpiñán and Wang, 2014; Zhang and Brand, 2017; Askari et al., 2018; Gu et al., 2018) can be regarded as special cases in terms of the use of the two-splitting formulation (2.2). In fact, Carreira-Perpiñán and Wang (2014) considered a specific DNN training model with squared loss and sigmoid activation function, and proposed the method of auxiliary coordinate (MAC) based on the two-splitting formulation of DNN training (2.2), as a two-block BCD method with the weight variables as one block and the state variables as the other. For each block, a nonlinear least squares problem is solved by some iterative methods. Furthermore, Zhang and Brand (2017) proposed a BCD type method for DNN training with ReLU and squared loss. To avoid the computational hurdle imposed by ReLU, the DNN training model was relaxed to a smooth multiconvex formulation via lifting ReLU into a higher dimensional space (Zhang and Brand, 2017). Such a relaxed BCD is in fact a special case of two-splitting formulation (2.3) with , , where is the nonnegative closed half-space with the same dimension of , while Askari et al. (2018) and Gu et al. (2018) extended such lifting trick to more general DNN training settings, of which the activation can be not only ReLU, but also sigmoid and leaky ReLU. The general formulations studied in these two papers are also special cases of the two-splitting formulation with different and .

#### 2.1.2 Three-splitting formulation.

Note that the variables and are coupled by the nonlinear activation function in the -th constraint of the two-splitting formulation (2.2), which may bring some difficulties and challenges for solving problem (2.2) efficiently, particularly, when the activation function is ReLU. Instead, the following three-splitting formulation was used in Taylor et al. (2016); Lau et al. (2018):

 minW,V,UL0(W,V)subject toUi=WiVi−1,Vi=σi(Ui),i=1,…,N, (2.4)

where . From (2.4), the variables are coupled much more loosely, particularly for variables and . As described later, such a three-splitting formulation can be beneficial to designing some more efficient methods, though auxiliary variables ’s are introduced. Similarly, the following alternative unconstrained problem was suggested in the literature:

 (2.5)

### 2.2 Description of BCD algorithms

In the following, we describe how to adapt the BCD method to Problems (2.3) and (2.5). The main idea of the BCD method of Gauss-Seidel type for a minimization problem with multi-block variables is to update all the variables cyclically while fixing the remaining blocks at their last updated values (Xu and Yin, 2013). In this paper, we consider the BCD method with the backward order (but not limited to this as discussed later) for the updates of variables, that is, the variables are updated from the output layer to the input layer, and for each layer, we update the variables cyclically for Problem (2.3) as well as the variables cyclically for Problem (2.5). Since , the output layer is paid special attention. Particularly, for most blocks, we adopt the proximal update strategies for two major reasons: (1) To practically stabilize the training process; (2) To yield the desired “sufficient descent” property for theoretical justification. For each subproblem, we assume that its minimizer can be achieved. The BCD algorithms for Problems (2.3) and (2.5) can be summarized in Algorithms 2 and 1, respectively.

One major merit of Algorithm 2 over Algorithm 1 is that in each subproblem, almost all updates are simple proximal updates999For -update, we can regard as a new proximal function . (or just least squares problems), which usually have closed form solutions to many commonly used DNNs, while a drawback of Algorithm 2 over Algorithm 1 is that more storage memory is required due to the introduction of additional variables . Some typical examples leading to the closed form solutions include: (a) are (i.e., no regularization), or the squared norm (a.k.a. weight decay), or the indicator function of a nonempty closed convex set with a simple projection like the nonnegative closed half space and the closed interval ; (b) the loss function is the squared loss or hinge loss (see Lemma 14 in Section E.2); and (c) is ReLU (see Lemma 13 in Section E.1), leaky ReLU, or linear link function. For other cases in which and are the norm,

is the sigmoid function, and the loss

is the logistic function, the associated subproblems can be also solved cheaply via some efficient existing methods. Discussions on specific implementations of these BCD methods can be referred to Appendix A.

## 3 Global convergence analysis of BCD

In this section, we establish the global convergence of both Algorithm 1 for Problem (2.3), and Algorithm 2 for Problem (2.5), followed by some extensions.

### 3.1 Main assumptions

First of all, we present our main assumptions, which involve the definitions of real analytic and semialgebraic functions.

Let be an extended-real-valued function (respectively, be a point-to-set mapping), its graph is defined by

 Graph(h):={(x,y)∈Rp×R:y=h(x)}, (resp. Graph(h):={(x,y)∈Rp×Rq:y∈h(x)}),

and its domain by (resp. ). When is a proper function, i.e., when the set of its global minimizers (possibly empty) is denoted by

 argminh:={x∈Rp:h(x)=infh}.
###### Definition 1 (Real analytic)

A function with domain an open set and range the set of either all real or complex numbers, is said to be real analytic at if the function may be represented by a convergent power series on some interval of positive radius centered at : for some . The function is said to be real analytic on if it is real analytic at each (Krantz and Parks, 2002, Definition 1.1.5). The real analytic function over for some positive integer can be defined similarly.

According to Krantz and Parks (2002), typical real analytic functions include polynomials, exponential functions, and the logarithm, trigonometric and power functions on any open set of their domains. One can verify whether a multivariable real function on is analytic by checking the analyticity of for any .

###### Definition 2 (Semialgebraic)
1. A set is called semialgebraic (Bochnak et al., 1998) if it can be represented as

 D=s⋃i=1t⋂j=1{x∈Rp:Pij(x)=0,Qij(x)>0},

where are real polynomial functions for

2. A function (resp. a point-to-set mapping ) is called semialgebraic if its graph is semialgebraic.

According to Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), the class of semialgebraic sets is stable under the operation of finite union, finite intersection, Cartesian product or complementation. Some typical examples include polynomial functions, the indicator function of a semialgebraic set, and the Euclidean norm (Bochnak et al., 1998, page 26).

###### Assumption

Suppose that

1. the loss function is a proper lower semicontinuous101010A function is called lower semicontinuous if for any . and nonnegative function,

2. the activation functions () are Lipschitz continuous on any bounded set,

3. the regularizers and () are nonegative lower semicontinuous convex functions, and

4. all these functions , , and () are either real analytic or semialgebraic, and continuous on their domains.

According to Krantz and Parks (2002); Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), most of the commonly used DNN training models (2.2) can be verified to satisfy Section 3.1 as shown in the following proposition, the proof of which is provided in Appendix B.

###### Proposition 1

Examples satisfying Section 3.1 include:

1. is the squared, logistic, hinge, or cross-entropy losses;

2. is ReLU, leaky ReLU, sigmoid, hyperbolic tangent, linear, or polynomial activations;

3. and are the squared norm, the norm, the elastic net, the indicator function of some nonempty closed convex set (such as the nonnegative closed half space, box set or a closed interval ), or if no regularization.

### 3.2 Main theorem

Under Assumption 3.1, we state our main theorem as follows.

###### Theorem 1

Let and be the sequences generated by Algorithms 2 and 1, respectively. Suppose that Assumption 3.1 holds, and that one of the following conditions holds: (i) there exists a convergent subsequence (resp. ); (ii) is coercive111111An extended-real-valued function is called coercive if and only if as . for any ; (iii) (resp. ) is coercive. Then for any , and any finite initialization (resp. ), the following hold

1. [label=()]

2. (resp. ) converges to some (resp. ).

3. (resp. ) converges to a critical point of (resp. ).

4. at the rate where . Similarly, at the rate where .

Note that the DNN training problems (2.3) and (2.5) in this paper generally do not satisfy such a Lispchitz differentiable property, particularly, when ReLU activation is used. Compared to the existing literature, this theorem establishes the global convergence without the block multiconvex and Lipschitz differentiability assumptions used in Xu and Yin (2013), which are often violated by the DNN training problems due to the nonlinearity of the activations.

### 3.3 Extensions

We extend the established convergence results to the BCD methods for general losses with the prox-linear strategy, and the BCD methods for training ResNets.

#### 3.3.1 Extension to prox-linear

Note that in the -update of both Algorithms 2 and 1, the empirical risk is involved in the optimization problems. It is generally hard to obtain its closed-form solution except for some special cases such as the case that the loss is the squared loss. For other smooth losses such as the logistic, cross-entropy, and exponential losses, we suggest using the following prox-linear update strategies, that is, for some parameter , the -update in Algorithm 1 is

 VkN=argminVN{sN(VN)+⟨∇Rn(Vk−1N;Y),VN−Vk−1N⟩+α2∥VN−Vk−1N∥2F+γ2∥VN−Wk−1NVk−1N−1∥2F}, (3.1)

and the -update in Algorithm 2 is

 VkN=argminVN{sN(VN)+⟨∇Rn(Vk−1N;Y),VN−Vk−1N⟩+α2∥VN−Vk−1N∥2F+γ2∥VN−Uk−1N∥2F}. (3.2)

From (3.1) and (3.2), when is zero or its proximal operator can be easily computed, then -updates for both BCD algorithms can be implemented with explicit expressions. Therefore, the specific uses of these BCD methods are very flexible, mainly depending on users’ understanding of their own problems.

The claims in Theorem 1 still hold for the prox-linear updates adopted for the -updates if the loss is smooth with Lipschitz continuous gradient, as stated in the following Theorem 2.

###### Theorem 2 (Global convergence for prox-linear update)

Consider adopting the prox-linear updates (3.1), (3.2) for the -subproblems in Algorithms 2 and 1, respectively. Under the conditions of Theorem 1, if further is Lipschitz continuous with a Lipschitz constant and , then all claims in Theorem 1 still hold for both algorithms.

The proof of Theorem 2 is presented in Appendix D. It establishes the global convergence of a BCD method for the commonly used DNN training models with nonlinear losses, such as logistic or cross-entropy losses, etc. Equipped with the prox-linear strategy, all updates of BCD can be implemented easily and allow large scale distributed computations.

#### 3.3.2 Extension to ResNets Training

In this section, we first adapt the BCD method to the residual networks (ResNets) (He et al., 2016), and then extend the established convergence results of BCD to this case. Without loss of generality, similar to (2.2), we consider the following simplified ResNets training problem,

 minW,V Rn(VN;Y)+N∑i=1ri(Wi)+N∑i=1si(Vi)subject toVi−Vi−1=σi(WiVi−1),i=1,…,N. (3.3)

Since the ReLU activation is usually used in ResNets, we only consider the three-splitting formulation of (3.3):

 minW,V,U Rn(VN;Y)+N∑i=1ri(Wi)+N∑i=1si(Vi)% subject toUi=WiVi−1, Vi−Vi−1=σi(Ui), i=1,…,N,

and then adapt BCD to the following minimization problem,

 (3.4)

where , , as defined before, and

 ¯¯¯¯Lres(W,V,U):=Rn(VN;Y)+N∑i=1ri(Wi)+N∑i=1si(Vi)+γ2N∑i=1[∥Vi−Vi−1−σi(Ui)∥2F+∥Ui−WiVi−1∥2F].

When applied to (3.4), we use the same update order of Algorithm 2 but slightly change the subproblems according to the objective in (3.4). The specific BCD algorithm for ResNets is presented in Algorithm 3 in Appendix D.

Similarly, we establish the convergence of BCD for the DNN training model with ResNets (3.4) as follows.

###### Theorem 3 (Convergence of BCD for ResNets)

Let be a sequence generated by BCD for the DNN training model with ResNets (i.e., Algorithm 3). Let assumptions of Theorem 1 hold. Then all claims in Theorem 1 still hold for BCD with ResNets via replacing with .

Moreover, consider adopting the prox-linear update for the -subproblem in Algorithm 3, then under the assumptions of Theorem 2, all claims of Theorem 2 still hold for Algorithm 3.

The proof of this theorem is presented in Appendix D. ResNets is one of the most popular network architectures used in the deep learning community and has profound applications in computer vision. How to efficiently train ResNets is thus very important, especially since it is not of a fully-connected structure. The theorem, for the first time, shows that the BCD method might be a good candidate for the training of ResNets with global convergence guarantee.

## 4 Key stones and discussions

In this section, we present the keystones of our proofs followed by some discussions.

### 4.1 Main ideas of proofs

Our proofs follow the analysis framework formulated in Attouch et al. (2013), where the establishments of the sufficient descent and relative error conditions and the verifications of the continuity condition and property of the objective function are the four key ingredients. In order to establish the sufficient descent and relative error properties, two kinds of assumptions, namely, (a) multiconvexity and differentiability assumption, and (b) (blockwise) Lipschitz differentiability assumption on the unregularized part of objective function are commonly used in the literature, where Xu and Yin (2013) mainly used assumption (a), and the literature (Attouch et al., 2013; Xu and Yin, 2017; Bolte et al., 2014) mainly used assumption (b). Note that in our cases, the unregularized part of in (2.3),

 Rn(VN;Y)+γ2N∑i=1∥Vi−σi(WiVi−1)∥2F,

and that of in (2.5),

 Rn(VN;Y)+γ2N∑i=1[∥Vi−σi(Ui)∥2F+∥Ui−WiVi−1∥2F]

usually do not satisfy any of the block multiconvexity and differentiability assumption (i.e., assumption (a)), and the blockwise Lipschitz differentiability assumption (i.e., assumption (b)). For instance, when is ReLU or leaky ReLU, the functions and are non-differentiable and nonconvex with respect to -block and -block, respectively.

In order to overcome these challenges, in this paper, we first exploit the proximal strategies for all the non-strongly convex subproblems (see Algorithm 2) to cheaply obtain the desired sufficient descent property (see Lemma 1), and then take advantage of the Lipschitz continuity of the activations as well as the specific splitting formulations to yield the desired relative error property (see Lemma 2). Below we present these two key lemmas, while leaving other details in Appendix (where the verification of the KŁ property for the concerned DNN training models satisfying Section 3.1 can be referred to Proposition 2 in Section C.1, and the verification of the continuity condition is shown by (C.19) in Section C.3.2). Based on Lemmas 2 and 1, Proposition 2 and (C.19), we prove Theorem 1 according to Attouch et al. (2013, Theorem 2.9), with details shown in Appendix C.

### 4.2 Sufficient descent lemma

We state the established sufficient descent lemma as follows.

###### Lemma 1 (Sufficient descent)

Let be a sequence generated by the BCD method (Algorithm 2), under assumptions of Theorem 1, then

 ¯¯¯¯L(Pk)≤¯¯¯¯L(Pk−1)−a∥Pk−Pk−1∥2F, (4.1)

for some constant specified in the proof.

From Lemma 1, the Lagrangian sequence is monotonically decreasing, and the descent quantity of each iterate can be lower bounded by the discrepancy between the current iterate and its previous iterate. This lemma is crucial for the global convergence of a nonconvex algorithm. It tells at least the following four important items: (i) is convergent if is lower bounded; (ii) itself is bounded if is coercive and is finite; (iii) is square summable, i.e., , implying its asymptotic regularity, i.e., as ; and (iv) at the rate of . Leveraging on Lemma 1, we can establish the global convergence (i.e., the whole sequence convergence) of BCD in DNN training settings. In contrast, Davis et al. (2019) only establish the subsequence convergence of SGD in DNN training settings. Such a gap between the subsequence convergence of SGD in Davis et al. (2019) and the whole sequence convergence of BCD in this paper exists mainly because SGD can only achieve the descent property but not the sufficient descent property.

It can be noted from Lemma 1 that neither multiconvexity and differentiability nor Lipschitz differentiability assumption is imposed on the DNN training models to yield this lemma, as required in the literature (Xu and Yin, 2013; Attouch et al., 2013; Xu and Yin, 2017; Bolte et al., 2014). Instead, we mainly exploit the proximal strategy for all non-strongly convex subproblems in Algorithm 2 to establish this lemma.

### 4.3 Relative error lemma

We now present the obtained relative error lemma.

###### Lemma 2 (Relative error)

Under the conditions of Theorem 1, let be an upper bound of and for any positive integer , be a uniform Lipschitz constant of on the bounded set . Then for any positive integer , it holds that,

 ∥¯gk∥F≤¯b∥Pk−Pk−1∥F,¯gk∈∂¯¯¯¯L(Pk)

for some constant specified later in the proof, where

 ∂¯¯¯¯L(Pk):=({∂Wi¯¯¯¯L}Ni=1,{∂Vi¯¯¯¯L}Ni=1,{∂Ui¯¯¯¯L}Ni=1)(Pk).

Lemma 2 shows that the subgradient sequence of the Lagrangian is upper bounded by the discrepancy between the current and previous iterates. Together with the asymptotic regularity of yielded by Lemma 1, Lemma 2 shows the critical point convergence. Also, together with the claim (iv) implied by Lemma 1, namely, the rate of convergence of , Lemma 2 yields the rate of convergence (to a critical point) of BCD, i.e., at the rate of .

From Lemma 2, both differentiability and (blockwise) Lipschitz differentiability assumptions are not imposed. Instead, we only use the Lipschitz continuity (on any bounded set) of the activations, which is a very mild and natural condition satisfied by most commonly used activation functions. In order to achieve this lemma, we also need to do some special treatments on the specific updates of BCD algorithms as demonstrated in Section C.3.1.

## 5 Conclusion

The empirical efficiency of BCD methods in deep neural network (DNN) training has been demonstrated in the literature. However, the theoretical understanding of their convergence is still very limited and it lacks a general framework due to the fact that DNN training is a highly nonconvex problem. In this paper, we fill this void by providing a general methodology to establish the global convergence of the BCD methods for a class of DNN training models, which encompasses most of the commonly used BCD methods in the literature as special cases. Under some mild assumptions, we establish the global convergence at a rate of , with being the number of iterations, to a critical point of the DNN training models with several variable splittings. Our theory is also extended to residual networks with general losses which have Lipschitz continuous gradients. Such work may lay down a theoretical foundation of BCD methods in their applications to deep learning.

## Acknowledgment

The work of Jinshan Zeng is supported in part by the National Natural Science Foundation (NNSF) of China (No. 61603162, 61876074). The work of Yuan Yao is supported in part by HKRGC grant 16303817, 973 Program of China (No. 2015CB85600), NNSF of China (No. 61370004, 11421110001), as well as grants from Tencent AI Lab, Si Family Foundation, Baidu BDI, and Microsoft Research-Asia. The work of Shaobo Lin is supported in part by the NNSF of China (No. 61876133). The work of Tim Tsz-Kit Lau is supported by the University Fellowship funded by The Graduate School at Northwestern University, and part of this work was performed when he was at Department of Mathematics, The Hong Kong University of Science and Technology.

## Appendix A Implementations on BCD methods

In this section, we provide several remarks to discuss the specific implementations of BCD methods.

###### Remark 1 (On the initialization of parameters)

In practice, the weights

are generally initialized according to some Gaussian distributions with small standard deviations. The bias vectors are usually set as all one vectors scaling by some small constants. Given the weights and bias vectors, the auxiliary variables

and state variables are usually initialized by a single forward pass through the network.

###### Remark 2 (On the update order)

We suggest such a backward update order in this paper due to the nested structure of DNNs. Besides the update order presented in Algorithm 2, any arbitrary deterministic update order can be incorporated into the BCD methods, and our proofs still go through.

###### Remark 3 (On the distributed implementation)

One major advantage of BCD is that it can be implemented in distributed and parallel manner like in ADMM. Specifically, given servers, the total training data are distributed to these servers. Denotes as the subset of samples at server . Thus, , where denotes the cardinality of . For each layer , the state variable is divided into submatrices by column, that is, , where denotes the submatrix of including all the columns in the index set . The auxiliary variables ’s are decomposed similarly. From Algorithm 2, the updates of and do not need any communication and thus, can be computed in a parallel way. The difficult part is the update of weight , which is generally hard to parallelize. To deal with this part, there are some effective strategies suggested in the literature like Taylor et al. (2016).

## Appendix B Proof of Proposition 1

###### Proof

We verify these special cases as follows.

On the loss function : Since these losses are all nonnegative and continuous on their domains, thus, they are proper lower semicontinuous and lower bounded by 0. In the following, we only verify that they are either real analytic or semialgebraic.

1. [label=(a0)]

2. If is the squared () or exponential (), then according to Krantz and Parks (2002), they are real analytic.

3. If is the logistic loss (), since it is a composition of logarithm and exponential functions which both are real analytic, thus according to Lemma 3, the logistic loss is real analytic.

4. If is the cross-entropy loss, that is, given , , where is performed elementwise and for any , which can be viewed as a linear combination of logistic functions, then by 2 and Lemma 3, it is also real analytic.

5. If is the hinge loss, that is, given , for any , by Lemma 4(1), it is semialgebraic, because its graph is , a closure of the set , where

 D={(u,z):1−u⊤y−z=0,1−u≻0}∪{(u,z):z=0,u⊤y−1>0}.

On the activation function : Since all the considered specific activations are continuous on their domains, they are Lipschitz continuous on any bounded set. In the following, we only need to check that they are either real analytic or semialgebraic.

1. [label=(b0)]

2. If is a linear or polynomial function, then according to Krantz and Parks (2002), is real analytic.

3. If is sigmoid, , or hyperbolic tangent, , then the sigmoid function is a composition of these two functions where and (resp. and in the hyperbolic tangent case). According to Krantz and Parks (2002), and in both cases are real analytic. Thus, according to Lemma 3, sigmoid and hyperbolic tangent functions are real analytic.

4. If is ReLU, i.e.,