1 Introduction
Tremendous research activity has been dedicated to deep learning due to its great success in real-world applications such as image classification in computer vision
(Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012; Sainath et al., 2013), statistical machine translation (Devlin et al., 2014), and notably surpassing human performance in the game of Go (Silver et al., 2016). The practical optimization algorithms for training neural networks can be divided into three main categories in terms of the amount of first- and second-order information used, namely, gradient-based, (approximate) second-order, and gradient-free methods. Gradient-based methods make use of backpropagation
(Rumelhart et al., 1986) to compute gradients of network parameters. The stochastic gradient descent (SGD) method proposed by Robbins and Monro (1951) forms the basis of this category. Much research endeavour has been dedicated to adaptive variants of vanilla SGD in recent years, including AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), Adam (Kingma and Ba, 2015) and AMSGrad (Reddi et al., 2018). (Approximate) second-order methods mainly include Newton's method (LeCun et al., 2012), L-BFGS and conjugate gradient (Le et al., 2011). Despite the great success of these gradient-based methods, they may suffer from the vanishing-gradient issue when training deep networks (Goodfellow et al., 2016). As an alternative that overcomes this issue, gradient-free methods have recently been adapted to DNN training, including (but not limited to) block coordinate descent (BCD) methods (Carreira-Perpiñán and Wang, 2014; Zhang and Brand, 2017; Lau et al., 2018; Askari et al., 2018; Gu et al., 2018) and the alternating direction method of multipliers (ADMM) (Taylor et al., 2016; Zhang et al., 2016). The surge of attention on these two classes of algorithms has two main reasons. One is that they are gradient-free, and thus able to deal with non-differentiable nonlinearities and to potentially avoid the vanishing-gradient issue (Taylor et al., 2016; Zhang and Brand, 2017). As shown in Figure 1, vanilla SGD fails to train a ten-hidden-layer MLP, while BCD still works and achieves a moderate accuracy within a few epochs. The other is that BCD and ADMM can be easily implemented in a distributed and parallel manner
(Boyd et al., 2011; Mahajan et al., 2017), and are therefore well suited to distributed/decentralized scenarios. The BCD methods currently adopted in DNN training fall into two categories depending on the specific formulation of the objective function, namely, the two-splitting formulation and the three-splitting formulation (shown in (2.2) and (2.4), respectively). Examples of the two-splitting formulation include Carreira-Perpiñán and Wang (2014); Zhang and Brand (2017); Askari et al. (2018); Gu et al. (2018), whilst Taylor et al. (2016); Lau et al. (2018) adopt the three-splitting formulation. Convergence studies of BCD methods have appeared recently, albeit in restricted settings. In Zhang and Brand (2017)
, a BCD method was suggested to solve the Tikhonov-regularized deep neural network training problem, using a lifting trick to avoid the computational hurdle imposed by ReLU. Its convergence was established through the framework of
Xu and Yin (2013), where the block multiconvexity (a function with multi-block variables is called block multiconvex if it is convex with respect to each block variable when the other blocks are fixed, and blockwise Lipschitz differentiable if it is differentiable with respect to each block variable and its gradient is Lipschitz continuous while the others are fixed) and the differentiability of the unregularized part of the objective function play central roles in the analysis. However, for other commonly used activations such as sigmoid, the convergence analysis of Xu and Yin (2013) cannot be directly applied since the multiconvexity may be violated. Askari et al. (2018) and Gu et al. (2018) extended the lifting trick introduced by Zhang and Brand (2017) to deal with a class of strictly increasing and invertible activations, and then adapted BCD methods to solve the lifted DNN training models. However, no convergence guarantee was provided in either Askari et al. (2018) or Gu et al. (2018). Following a similar lifting trick to that of Zhang and Brand (2017), Lau et al. (2018) proposed a proximal BCD method based on the three-splitting formulation of the regularized DNN training problem with ReLU activation. Its global convergence was also established through the analysis framework of Xu and Yin (2013). However, similar convergence results for other commonly used activation functions are still lacking. In this paper, we aim to fill these gaps. Our main contribution is to provide a general methodology to establish the global convergence (global convergence in this paper means that the algorithm converges starting from any finite initialization) of these BCD methods in common DNN training settings, without requiring the block multiconvexity and differentiability assumptions of Xu and Yin (2013). Instead, our key assumption is the Lipschitz continuity of the activations on any bounded set (see Section 3.1(b)).
Specifically, Theorem 1 establishes the global convergence to a critical point, at a rate of $\mathcal{O}(1/k)$, of the BCD methods using the proximal strategy; extensions to the prox-linear strategy for general losses are provided in Theorem 2, and to residual networks (ResNets) in Theorem 3. Our assumptions cover most cases appearing in the literature. Specifically, in Theorem 1, if the loss function, activations, and convex regularizers are lower semicontinuous and either real analytic (see Definition 1) or semialgebraic (see Definition 2), and the activations are Lipschitz continuous on any bounded set, then BCD converges to a critical point at the rate of $\mathcal{O}(1/k)$ starting from any finite initialization, where $k$ is the iteration number. Note that these assumptions are satisfied by most commonly used DNN training models, where (a) the loss function can be any of the squared, logistic, hinge, exponential or cross-entropy losses, (b) the activation function can be any of ReLU, leaky ReLU, sigmoid, tanh, linear, or polynomial functions, and (c) the regularizer can be any of the squared $\ell_2$ norm, the squared Frobenius norm, the elementwise $\ell_1$ norm, or the sum of the squared Frobenius norm and the elementwise $\ell_1$
norm (say, in the vector case, the elastic net by
Zou and Hastie, 2005), or the indicator function of the nonnegative closed half-space or of a closed interval (see Proposition 1). Our analysis is based on the Kurdyka-Łojasiewicz (KŁ) inequality (Łojasiewicz, 1993; Kurdyka, 1998) framework formulated in Attouch et al. (2013). However, there are several treatments different from the state-of-the-art work (Xu and Yin, 2013) that enable us to achieve the general convergence guarantees above. According to Attouch et al. (2013, Theorem 2.9), the sufficient descent, relative error and continuity conditions, together with the KŁ assumption, yield the global convergence of a nonconvex algorithm. To obtain the sufficient descent condition, we exploit the proximal strategy for all non-strongly convex subproblems (see Algorithm 2 and Lemma 1), without requiring the block multiconvexity assumption used in Xu and Yin (2013, Lemma 2.6). To establish the relative error condition, we use the Lipschitz continuity of the activation functions and carefully treat the specific updates of the BCD methods (see Lemma 2), without requiring the (locally) Lipschitz differentiability of the unregularized part as used in Xu and Yin (2013, Lemma 2.6). The continuity condition is established via the lower semicontinuity assumptions on the loss, activations and regularizers. These treatments are of independent value to the optimization community. Detailed comparisons between this paper and the existing literature can be found in Section 4.
The rest of this paper is organized as follows. Section 2 describes the BCD methods when adapted to the splitting formulations of DNN training problems. Section 3 establishes their global convergence results, followed by some extensions. Section 4 illustrates the key ideas of proof with some discussions. We conclude this paper in Section 5.
2 DNN training via BCD
In this section, we describe the specific forms of BCD for both the two- and three-splitting formulations.
2.1 DNN training with variable splitting
Consider $N$-layer feedforward neural networks with $N-1$ hidden layers. In particular, let $d_i$ be the number of hidden units in the $i$-th hidden layer for $i = 1, \dots, N-1$, and let $d_0$ and $d_N$ be the numbers of units of the input and output layers, respectively. Let $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ be the weight matrix between the $(i-1)$-th layer and the $i$-th layer for any $i = 1, \dots, N$ (to simplify notation, we regard the input and output layers as the $0$-th and $N$-th layers, respectively, and absorb the bias of each layer into $W_i$). Let $\{(x_j, y_j)\}_{j=1}^n \subset \mathbb{R}^{d_0} \times \mathbb{R}^{d_N}$ be $n$ samples, where the $y_j$'s are the one-hot vectors of labels. Denote $X := (x_1, \dots, x_n) \in \mathbb{R}^{d_0 \times n}$, $Y := (y_1, \dots, y_n) \in \mathbb{R}^{d_N \times n}$, and $\mathcal{W} := \{W_i\}_{i=1}^N$. With the help of these notations, the DNN training problem can be formulated as the following empirical risk minimization:
(2.1) $\min_{\mathcal{W}} \; \mathcal{R}_n(\mathcal{W}) := \frac{1}{n} \sum_{j=1}^n \ell(\Phi(\mathcal{W}; x_j), y_j),$
where $\ell$ is some loss function, $\Phi(\mathcal{W}; x) := \sigma_N(W_N \sigma_{N-1}(W_{N-1} \cdots \sigma_1(W_1 x)))$ is the neural network model with $N$ layers and weights $\mathcal{W}$, $\sigma_i$ is the activation function of the $i$-th layer (generally, $\sigma_N = \mathrm{Id}$, i.e., the identity function), and $\mathcal{R}_n$ is called the empirical risk (also known as the training loss).
Note that the DNN training model (2.1) is highly nonconvex, as the variables are coupled via the deep neural network architecture, which brings many challenges to the design of efficient training algorithms and to their theoretical analysis. To make Problem (2.1) more computationally tractable, variable splitting is one of the most commonly used techniques (Taylor et al., 2016; Zhang and Brand, 2017; Askari et al., 2018; Gu et al., 2018; Lau et al., 2018). The main idea of variable splitting is to transform a complicated problem, in which the variables are coupled highly nonlinearly, into a relatively simpler one, in which the variables are coupled much more loosely, by introducing additional variables.
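To make the splitting idea concrete, the following sketch (in NumPy; the function and variable names are ours, not from any reference implementation) contrasts the nested objective of a toy two-layer ReLU network with a penalized split form in the spirit of the formulations below, where auxiliary state variables stand in for the nested compositions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def nested_objective(W1, W2, X, Y):
    # Original form (2.1): the loss sees the full nested composition.
    return np.sum((relu(W2 @ relu(W1 @ X)) - Y) ** 2)

def split_objective(W1, W2, V1, V2, X, Y, gamma=1.0):
    # Split form: states V1, V2 replace the layer outputs, and the
    # constraints V_i = sigma(W_i V_{i-1}) are relaxed into quadratic
    # penalties. The loss now touches only V2.
    return (np.sum((V2 - Y) ** 2)
            + 0.5 * gamma * np.sum((V1 - relu(W1 @ X)) ** 2)
            + 0.5 * gamma * np.sum((V2 - relu(W2 @ V1)) ** 2))
```

When the states satisfy the constraints exactly, the penalties vanish and the two objectives coincide; for finite $\gamma$, the split form trades exact feasibility for block-wise simple subproblems.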
2.1.1 Two-splitting formulation.
Considering general deep neural network architectures, the DNN training problem can be naturally formulated as the following model, called the two-splitting formulation (here we consider the regularized DNN training model; without regularization, it reduces to the original DNN training model (2.1)):
(2.2) $\min_{\mathcal{W}, \mathcal{V}} \; \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) \quad \text{subject to} \quad V_i = \sigma_i(W_i V_{i-1}), \; i = 1, \dots, N,$
where $\mathcal{R}_n(V_N; Y) := \frac{1}{n} \sum_{j=1}^n \ell((V_N)_{:,j}, y_j)$ denotes the empirical risk, $\mathcal{V} := \{V_i\}_{i=1}^N$, $V_0 := X$, and $(V_N)_{:,j}$ is the $j$-th column of $V_N$. In addition, $r_i$ and $s_i$ are extended-real-valued, nonnegative functions revealing the priors on the weight variable $W_i$ and the state variable $V_i$ (or the constraints on $W_i$ and $V_i$) for each $i$. In order to solve the two-splitting formulation (2.2), the following alternative minimization problem was suggested in the literature:
(2.3) $\min_{\mathcal{W}, \mathcal{V}} \; \mathcal{L}(\mathcal{W}, \mathcal{V}) := \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) + \frac{\gamma}{2} \sum_{i=1}^N \|V_i - \sigma_i(W_i V_{i-1})\|_F^2,$
where $\gamma > 0$ is a hyperparameter (we use a uniform hyperparameter $\gamma$ for the sum of all quadratic terms for simplicity of notation; in practice, $\gamma$ can be different for each quadratic term and our proof still goes through). The DNN training model (2.2) can be very general, where: (a) $\ell$ can be the squared, logistic, hinge, cross-entropy or other commonly used loss functions; (b) $\sigma_i$ can be ReLU, leaky ReLU, sigmoid, linear, or other commonly used activation functions; (c) $r_i$ can be the squared $\ell_2$ norm, the $\ell_1$ norm, the elastic net (Zou and Hastie, 2005), the indicator function of some nonempty closed convex set (the indicator function $\iota_{\mathcal{C}}$ of a nonempty convex set $\mathcal{C}$ is defined as $\iota_{\mathcal{C}}(x) = 0$ if $x \in \mathcal{C}$ and $+\infty$ otherwise), such as the nonnegative closed half-space or a closed interval, or others; (d) $s_i$ can be the $\ell_1$ norm (Ji et al., 2014), the indicator function of some convex set with a simple projection (Zhang and Brand, 2017), or others. In particular, if there is no regularizer or constraint on $W_i$ (or $V_i$), then $r_i$ (or $s_i$) can be zero.
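Several of the regularizers listed in (c) and (d) admit cheap proximal maps, which is what makes the BCD subproblems described later inexpensive. A small sketch of these standard operators (textbook formulas, not code from the paper; the function names are ours):

```python
import numpy as np

def prox_squared_l2(z, t):
    # prox of t/2 * ||.||^2 (weight decay): uniform shrinkage toward 0.
    return z / (1.0 + t)

def prox_l1(z, t):
    # prox of t * ||.||_1: elementwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proj_nonneg(z):
    # prox of the indicator of the nonnegative closed half-space:
    # simply the Euclidean projection.
    return np.maximum(z, 0.0)

def prox_elastic_net(z, t, alpha, beta):
    # prox of t * (alpha * ||.||_1 + beta/2 * ||.||^2):
    # soft-threshold first, then shrink.
    return prox_l1(z, t * alpha) / (1.0 + t * beta)
```

Each map has a closed form, so a block update that consists of a quadratic term plus one of these regularizers can be solved exactly in one shot.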
The network architectures considered in this paper exhibit generality to various types of DNNs, including but not limited to fully (or sparsely) connected MLPs (Rosenblatt, 1961), convolutional neural networks (CNNs) (Fukushima, 1980; LeCun et al., 1998) and residual neural networks (ResNets) (He et al., 2016). For CNNs, the weight matrix is sparse and shares some symmetry structure represented by permutation invariance; these are linear constraints, and up to a linear reparameterization all the main results below remain valid.
Various existing BCD algorithms for DNN training (Carreira-Perpiñán and Wang, 2014; Zhang and Brand, 2017; Askari et al., 2018; Gu et al., 2018) can be regarded as special cases in terms of the use of the two-splitting formulation (2.2). In fact, Carreira-Perpiñán and Wang (2014) considered a specific DNN training model with squared loss and sigmoid activation, and proposed the method of auxiliary coordinates (MAC) based on the two-splitting formulation (2.2), as a two-block BCD method with the weight variables as one block and the state variables as the other. For each block, a nonlinear least squares problem is solved by some iterative method. Furthermore, Zhang and Brand (2017) proposed a BCD-type method for DNN training with ReLU and squared loss. To avoid the computational hurdle imposed by ReLU, the DNN training model was relaxed to a smooth multiconvex formulation via lifting ReLU into a higher-dimensional space (Zhang and Brand, 2017). Such a relaxed BCD is in fact a special case of the two-splitting formulation (2.3) in which $s_i$ is the indicator function of the nonnegative closed half-space with the same dimension as $V_i$. Askari et al. (2018) and Gu et al. (2018) extended this lifting trick to more general DNN training settings, in which the activation can be not only ReLU but also sigmoid or leaky ReLU. The general formulations studied in these two papers are also special cases of the two-splitting formulation with different $r_i$ and $s_i$.
2.1.2 Three-splitting formulation.
Note that the variables $W_i$ and $V_i$ are coupled by the nonlinear activation function in the $i$-th constraint of the two-splitting formulation (2.2), which may bring difficulties and challenges to solving problem (2.2) efficiently, particularly when the activation function is ReLU. Instead, the following three-splitting formulation was used in Taylor et al. (2016); Lau et al. (2018):
(2.4) $\min_{\mathcal{W}, \mathcal{V}, \mathcal{U}} \; \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) \quad \text{subject to} \quad U_i = W_i V_{i-1}, \; V_i = \sigma_i(U_i), \; i = 1, \dots, N,$
where $\mathcal{U} := \{U_i\}_{i=1}^N$. In (2.4), the variables are coupled much more loosely, particularly the variables $W_i$ and $V_i$. As described later, such a three-splitting formulation can be beneficial for designing more efficient methods, though the auxiliary variables $U_i$ are introduced. Similarly, the following alternative unconstrained problem was suggested in the literature:
(2.5) $\min_{\mathcal{W}, \mathcal{V}, \mathcal{U}} \; \bar{\mathcal{L}}(\mathcal{W}, \mathcal{V}, \mathcal{U}) := \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) + \frac{\gamma}{2} \sum_{i=1}^N \left( \|U_i - W_i V_{i-1}\|_F^2 + \|V_i - \sigma_i(U_i)\|_F^2 \right).$
2.2 Description of BCD algorithms
In the following, we describe how to adapt the BCD method to Problems (2.3) and (2.5). The main idea of the BCD method of Gauss-Seidel type for a minimization problem with multi-block variables is to update all the variables cyclically, fixing the remaining blocks at their last updated values (Xu and Yin, 2013). In this paper, we consider the BCD method with the backward order (but not limited to this, as discussed later) for the updates of variables; that is, the variables are updated from the output layer to the input layer, and for each layer, we update the variables $\{V_i, W_i\}$ cyclically for Problem (2.3) and the variables $\{V_i, U_i, W_i\}$ cyclically for Problem (2.5). Since $\sigma_N$ is the identity, the output layer receives special attention. For most blocks, we adopt proximal update strategies for two major reasons: (1) to practically stabilize the training process; and (2) to yield the desired "sufficient descent" property for the theoretical justification. For each subproblem, we assume that its minimizer can be achieved. The BCD algorithms for Problems (2.3) and (2.5) are summarized in Algorithms 1 and 2, respectively.
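The backward sweep with proximal terms can be illustrated on a toy instance with a single linear layer, where every block minimizer has a closed form (this is our own sketch in the spirit of the two-splitting objective (2.3), not the paper's pseudocode; `alpha` plays the role of the proximal parameter):

```python
import numpy as np

def objective(W, V, X, Y, gamma):
    # Penalized two-splitting objective with one linear layer:
    # ||V - Y||^2 + gamma/2 * ||V - W X||^2.
    return np.sum((V - Y) ** 2) + 0.5 * gamma * np.sum((V - W @ X) ** 2)

def bcd_sweep(W, V, X, Y, gamma=1.0, alpha=0.1):
    # One backward-order sweep: V-block, then W-block; each step exactly
    # minimizes the objective plus a proximal term alpha/2 * ||. - prev||^2.
    V = (2.0 * Y + gamma * (W @ X) + alpha * V) / (2.0 + gamma + alpha)
    A = gamma * (X @ X.T) + alpha * np.eye(X.shape[0])
    W = (gamma * (V @ X.T) + alpha * W) @ np.linalg.inv(A)
    return W, V
```

Because each block update exactly minimizes the objective plus its proximal term, every sweep decreases the objective; this is the mechanism behind the sufficient descent property established later.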
One major merit of Algorithm 2 over Algorithm 1 is that in each subproblem, almost all updates are simple proximal updates (for the $V_N$-update, the empirical risk term can be absorbed into a new proximal function) or just least squares problems, which usually admit closed-form solutions for many commonly used DNNs. A drawback of Algorithm 2 relative to Algorithm 1 is that more memory is required due to the introduction of the additional variables $U_i$. Some typical examples leading to closed-form solutions include: (a) $r_i$ and $s_i$ are zero (i.e., no regularization), the squared $\ell_2$ norm (a.k.a. weight decay), or the indicator function of a nonempty closed convex set with a simple projection, such as the nonnegative closed half-space or a closed interval; (b) the loss function $\ell$ is the squared or hinge loss (see Lemma 14 in Section E.2); and (c) $\sigma_i$ is ReLU (see Lemma 13 in Section E.1), leaky ReLU, or a linear link function. In other cases, in which $r_i$ and $s_i$ are the $\ell_1$ norm, $\sigma_i$ is the sigmoid function, or the loss $\ell$ is the logistic loss, the associated subproblems can also be solved cheaply via some efficient existing methods. Discussions on specific implementations of these BCD methods can be found in Appendix A.
3 Global convergence analysis of BCD
In this section, we establish the global convergence of both Algorithm 1 for Problem (2.3), and Algorithm 2 for Problem (2.5), followed by some extensions.
3.1 Main assumptions
First of all, we present our main assumptions, which involve the definitions of real analytic and semialgebraic functions.
Let $h: \mathbb{R}^p \to \mathbb{R} \cup \{+\infty\}$ be an extended-real-valued function (respectively, $h: \mathbb{R}^p \rightrightarrows \mathbb{R}^q$ be a point-to-set mapping). Its graph is defined by $\mathrm{Graph}(h) := \{(x, v) \in \mathbb{R}^p \times \mathbb{R} : v = h(x)\}$ (resp. $\{(x, v) \in \mathbb{R}^p \times \mathbb{R}^q : v \in h(x)\}$), and its domain by $\mathrm{dom}(h) := \{x \in \mathbb{R}^p : h(x) < +\infty\}$ (resp. $\{x \in \mathbb{R}^p : h(x) \neq \emptyset\}$). When $h$ is a proper function, i.e., when $\mathrm{dom}(h) \neq \emptyset$, the set of its global minimizers (possibly empty) is denoted by $\arg\min h := \{x \in \mathbb{R}^p : h(x) = \inf h\}.$
Definition 1 (Real analytic)
A function $h$, with domain an open set $U \subset \mathbb{R}$ and range the set of either all real or all complex numbers, is said to be real analytic at $u \in U$ if it may be represented by a convergent power series on some interval of positive radius centered at $u$: $h(x) = \sum_{j=0}^{\infty} \alpha_j (x - u)^j$ for some $\{\alpha_j\} \subset \mathbb{R}$. The function is said to be real analytic on $U$ if it is real analytic at each $u \in U$ (Krantz and Parks, 2002, Definition 1.1.5). Real analytic functions over $\mathbb{R}^p$ for some positive integer $p > 1$ can be defined similarly.
According to Krantz and Parks (2002), typical real analytic functions include polynomials, exponential functions, and the logarithm, trigonometric and power functions on any open set of their domains. One can verify whether a multivariable real function $h$ on $\mathbb{R}^p$ is analytic by checking the analyticity of $t \mapsto h(x + ty)$ for any $x, y \in \mathbb{R}^p$.
Definition 2 (Semialgebraic)

A set $\mathcal{D} \subset \mathbb{R}^p$ is called semialgebraic (Bochnak et al., 1998) if it can be represented as
$\mathcal{D} = \bigcup_{i=1}^{s} \bigcap_{j=1}^{t} \{x \in \mathbb{R}^p : p_{ij}(x) = 0, \; q_{ij}(x) > 0\},$
where $p_{ij}, q_{ij}$ are real polynomial functions for $1 \le i \le s$, $1 \le j \le t$.

A function $h$ (resp. a point-to-set mapping) is called semialgebraic if its graph $\mathrm{Graph}(h)$ is semialgebraic.
According to Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), the class of semialgebraic sets is stable under finite union, finite intersection, Cartesian product and complementation. Typical examples of semialgebraic functions include polynomial functions, the indicator function of a semialgebraic set, and the Euclidean norm (Bochnak et al., 1998, page 26).
Assumption
Suppose that

(a) the loss function $\ell$ is a proper, lower semicontinuous (a function $h$ is called lower semicontinuous at $x_0$ if $\liminf_{x \to x_0} h(x) \ge h(x_0)$), and nonnegative function;

(b) the activation functions $\sigma_i$ ($i = 1, \dots, N-1$) are Lipschitz continuous on any bounded set;

(c) the regularizers $r_i$ and $s_i$ ($i = 1, \dots, N$) are nonnegative, lower semicontinuous, convex functions; and

(d) all these functions ($\ell$, $\sigma_i$, $r_i$, and $s_i$) are either real analytic or semialgebraic, and continuous on their domains.
According to Krantz and Parks (2002); Łojasiewicz (1965); Bochnak et al. (1998) and Shiota (1997, I.2.9, page 52), most of the commonly used DNN training models (2.2) can be verified to satisfy the assumptions in Section 3.1, as shown in the following proposition, whose proof is provided in Appendix B.
Proposition 1
Examples satisfying the assumptions in Section 3.1 include:

$\ell$ is the squared, logistic, hinge, or cross-entropy loss;

$\sigma_i$ is the ReLU, leaky ReLU, sigmoid, hyperbolic tangent, linear, or polynomial activation;

$r_i$ and $s_i$ are the squared $\ell_2$ norm, the $\ell_1$ norm, the elastic net, the indicator function of some nonempty closed convex set (such as the nonnegative closed half-space, a box set, or a closed interval), or zero if there is no regularization.
3.2 Main theorem
Under Assumption 3.1, we state our main theorem as follows.
Theorem 1
Let $\{\mathcal{Q}^k := (\mathcal{W}^k, \mathcal{V}^k, \mathcal{U}^k)\}_{k \in \mathbb{N}}$ and $\{\bar{\mathcal{Q}}^k := (\mathcal{W}^k, \mathcal{V}^k)\}_{k \in \mathbb{N}}$ be sequences generated by Algorithms 2 and 1, respectively. Suppose that Assumption 3.1 holds, and that one of the following conditions holds: (i) there exists a convergent subsequence of $\{\mathcal{Q}^k\}$ (resp. $\{\bar{\mathcal{Q}}^k\}$); (ii) the empirical risk $\mathcal{R}_n(\cdot; Y)$ is coercive (an extended-real-valued function $h$ is called coercive if and only if $h(x) \to +\infty$ as $\|x\| \to +\infty$); (iii) the objective of (2.5) (resp. (2.3)) is coercive. Then for any $\gamma > 0$ and any finite initialization $\mathcal{Q}^0$ (resp. $\bar{\mathcal{Q}}^0$), the following hold:

(a) the objective value sequence converges to some finite value;

(b) $\{\mathcal{Q}^k\}$ (resp. $\{\bar{\mathcal{Q}}^k\}$) converges to a critical point of the objective of (2.5) (resp. (2.3));

(c) $\min_{1 \le t \le k} \|\mathcal{Q}^t - \mathcal{Q}^{t-1}\|_F^2 \to 0$ at the rate $\mathcal{O}(1/k)$, where $k$ is the iteration number; similarly, $\min_{1 \le t \le k} \|\bar{\mathcal{Q}}^t - \bar{\mathcal{Q}}^{t-1}\|_F^2 \to 0$ at the rate $\mathcal{O}(1/k)$.
Compared to the existing literature, this theorem establishes global convergence without the block multiconvexity and Lipschitz differentiability assumptions used in Xu and Yin (2013), which are often violated by DNN training problems due to the nonlinearity of the activations; in particular, the DNN training problems (2.3) and (2.5) considered in this paper generally do not satisfy the Lipschitz differentiability property when the ReLU activation is used.
3.3 Extensions
We extend the established convergence results to the BCD methods for general losses with the prox-linear strategy, and to the BCD methods for training ResNets.
3.3.1 Extension to prox-linear
Note that in the $V_N$-update of both Algorithms 1 and 2, the empirical risk $\mathcal{R}_n$ is involved in the optimization subproblem. It is generally hard to obtain a closed-form solution except in special cases such as the squared loss. For other smooth losses such as the logistic, cross-entropy, and exponential losses, we suggest the following prox-linear update strategy: for some parameter $\bar{\gamma} > 0$, the $V_N$-update in Algorithm 1 is
(3.1) $V_N^{k+1} \in \arg\min_{V_N} \left\{ s_N(V_N) + \langle \nabla \mathcal{R}_n(V_N^k; Y), V_N - V_N^k \rangle + \frac{\bar{\gamma}}{2} \|V_N - V_N^k\|_F^2 + \frac{\gamma}{2} \|V_N - \sigma_N(W_N^k V_{N-1}^k)\|_F^2 \right\},$
and the $V_N$-update in Algorithm 2 is
(3.2) $V_N^{k+1} \in \arg\min_{V_N} \left\{ s_N(V_N) + \langle \nabla \mathcal{R}_n(V_N^k; Y), V_N - V_N^k \rangle + \frac{\bar{\gamma}}{2} \|V_N - V_N^k\|_F^2 + \frac{\gamma}{2} \|V_N - \sigma_N(U_N^k)\|_F^2 \right\}.$
From (3.1) and (3.2), when $s_N$ is zero or its proximal operator can be computed easily, the $V_N$-updates for both BCD algorithms can be implemented with explicit expressions. The specific uses of these BCD methods are therefore very flexible, depending mainly on users' understanding of their own problems.
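A minimal sketch of one prox-linear step in the spirit of (3.1) and (3.2), in generic single-variable form (`grad_loss` and `prox_s` are placeholders we introduce for the smooth loss gradient and the regularizer's proximal map; they are not names from the paper):

```python
import numpy as np

def prox_linear_step(v, grad_loss, prox_s, step):
    # Linearize the smooth loss at v, take a gradient step of size `step`
    # (playing the role of 1/gamma-bar), then apply the proximal map of
    # the possibly nonsmooth regularizer s.
    return prox_s(v - step * grad_loss(v), step)

# Example regularizer: lam * ||v||_1, whose prox is soft-thresholding.
def soft_threshold(z, t, lam=0.5):
    return np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)
```

For the smooth loss $\frac{1}{2}\|v - y\|^2$ with unit step, the update lands at the minimizer of $\frac{1}{2}\|v - y\|^2 + \lambda \|v\|_1$ in one step and then stays there; for general smooth losses it is the standard forward-backward (ISTA-type) iteration.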
The claims in Theorem 1 still hold when the prox-linear strategy is adopted for the $V_N$-updates, provided the loss is smooth with Lipschitz continuous gradient, as stated in the following Theorem 2.
Theorem 2 (Global convergence for prox-linear update)
Let the assumptions of Theorem 1 hold, and in addition let the loss $\ell$ be smooth with Lipschitz continuous gradient on any bounded set. Then all claims of Theorem 1 still hold for Algorithms 1 and 2 with the $V_N$-updates replaced by the prox-linear updates (3.1) and (3.2), respectively.
The proof of Theorem 2 is presented in Appendix D. It establishes the global convergence of the BCD method for commonly used DNN training models with nonlinear losses, such as the logistic or cross-entropy losses. Equipped with the prox-linear strategy, all updates of BCD can be implemented easily and allow large-scale distributed computation.
3.3.2 Extension to ResNets training
In this section, we first adapt the BCD method to residual networks (ResNets) (He et al., 2016), and then extend the established convergence results of BCD to this case. Without loss of generality, and similarly to (2.2), we consider the following simplified ResNets training problem:
(3.3) $\min_{\mathcal{W}, \mathcal{V}} \; \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) \quad \text{subject to} \quad V_i = \sigma_i(W_i V_{i-1}) + V_{i-1}, \; i = 1, \dots, N.$
Since the ReLU activation is usually used in ResNets, we only consider the three-splitting formulation of (3.3), with constraints $U_i = W_i V_{i-1}$ and $V_i = \sigma_i(U_i) + V_{i-1}$ for $i = 1, \dots, N$,
and then adapt BCD to the following minimization problem,
(3.4) $\min_{\mathcal{W}, \mathcal{V}, \mathcal{U}} \; \mathcal{R}_n(V_N; Y) + \sum_{i=1}^N r_i(W_i) + \sum_{i=1}^N s_i(V_i) + \frac{\gamma}{2} \sum_{i=1}^N \left( \|U_i - W_i V_{i-1}\|_F^2 + \|V_i - \sigma_i(U_i) - V_{i-1}\|_F^2 \right),$
where $\gamma > 0$, and $\mathcal{W}$, $\mathcal{V}$, $\mathcal{U}$, $r_i$, $s_i$ are as defined before, with $V_0 := X$.
When applied to (3.4), we use the same update order as in Algorithm 2 but slightly change the subproblems according to the objective in (3.4). The specific BCD algorithm for ResNets is presented in Algorithm 3 in Appendix D.
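To make the residual coupling concrete, the following toy sketch (our own, with a single square residual block and ReLU) writes out the penalized objective of (3.4); when the auxiliary variables satisfy the constraints exactly, the penalty terms vanish:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def resnet_split_objective(W, U, V, V0, Y, gamma=1.0):
    # One residual block: the constraints U = W V0 and V = relu(U) + V0
    # are relaxed into quadratic penalties as in (3.4). W must be square
    # so that the skip connection V0 has matching dimensions.
    return (np.sum((V - Y) ** 2)
            + 0.5 * gamma * np.sum((U - W @ V0) ** 2)
            + 0.5 * gamma * np.sum((V - relu(U) - V0) ** 2))
```

The only change relative to the feedforward case is the extra $V_{i-1}$ term inside the second penalty, so the same block updates apply with minor modifications.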
Similarly, we establish the convergence of BCD for the DNN training model with ResNets (3.4) as follows.
Theorem 3 (Convergence of BCD for ResNets)
Let $\{(\mathcal{W}^k, \mathcal{V}^k, \mathcal{U}^k)\}_{k \in \mathbb{N}}$ be a sequence generated by BCD for the DNN training model with ResNets (i.e., Algorithm 3), and let the assumptions of Theorem 1 hold. Then all claims in Theorem 1 still hold for BCD with ResNets, with the objective of (2.5) replaced by that of (3.4).
Moreover, if the prox-linear update is adopted for the $V_N$-subproblem in Algorithm 3, then under the assumptions of Theorem 2, all claims of Theorem 2 still hold for Algorithm 3.
The proof of this theorem is presented in Appendix D. ResNets are among the most popular network architectures in the deep learning community and have profound applications in computer vision. How to train ResNets efficiently is thus very important, especially since they are not of a fully-connected structure. The theorem shows, for the first time, that the BCD method can be a good candidate for training ResNets with a global convergence guarantee.
4 Keystones and discussions
In this section, we present the keystones of our proofs, followed by some discussions.
4.1 Main ideas of proofs
Our proofs follow the analysis framework formulated in Attouch et al. (2013), where the establishment of the sufficient descent and relative error conditions and the verification of the continuity condition and the KŁ property of the objective function are the four key ingredients. In order to establish the sufficient descent and relative error properties, two kinds of assumptions on the unregularized part of the objective function are commonly used in the literature: (a) multiconvexity and differentiability, and (b) (blockwise) Lipschitz differentiability, where Xu and Yin (2013) mainly used assumption (a), and Attouch et al. (2013); Xu and Yin (2017); Bolte et al. (2014) mainly used assumption (b). Note that in our case, the unregularized part of the objective in (2.3), which involves the coupling terms $\|V_i - \sigma_i(W_i V_{i-1})\|_F^2$, and that of the objective in (2.5), which involves the coupling terms $\|U_i - W_i V_{i-1}\|_F^2$ and $\|V_i - \sigma_i(U_i)\|_F^2$, usually satisfy neither the block multiconvexity and differentiability assumption (i.e., assumption (a)) nor the blockwise Lipschitz differentiability assumption (i.e., assumption (b)). For instance, when $\sigma_i$ is ReLU or leaky ReLU, the functions $\|V_i - \sigma_i(W_i V_{i-1})\|_F^2$ and $\|V_i - \sigma_i(U_i)\|_F^2$ are non-differentiable and nonconvex with respect to the $W_i$-block and the $U_i$-block, respectively.
To overcome these challenges, we first exploit the proximal strategy for all the non-strongly convex subproblems (see Algorithm 2) to cheaply obtain the desired sufficient descent property (see Lemma 1), and then take advantage of the Lipschitz continuity of the activations as well as the specific splitting formulations to yield the desired relative error property (see Lemma 2). Below we present these two key lemmas, leaving other details to the Appendix (the verification of the KŁ property for the concerned DNN training models satisfying the assumptions in Section 3.1 is given in Proposition 2 in Section C.1, and the verification of the continuity condition is shown by (C.19) in Section C.3.2). Based on Lemmas 1 and 2, Proposition 2 and (C.19), we prove Theorem 1 according to Attouch et al. (2013, Theorem 2.9), with details shown in Appendix C.
4.2 Sufficient descent lemma
We state the established sufficient descent lemma as follows.
Lemma 1 (Sufficient descent)
Let $\{\mathcal{Q}^k := (\mathcal{W}^k, \mathcal{V}^k, \mathcal{U}^k)\}_{k \in \mathbb{N}}$ be a sequence generated by the BCD method (Algorithm 2), and let $\bar{\mathcal{L}}$ denote the objective of (2.5). Under the assumptions of Theorem 1,

(4.1) $\bar{\mathcal{L}}(\mathcal{Q}^k) \le \bar{\mathcal{L}}(\mathcal{Q}^{k-1}) - a \|\mathcal{Q}^k - \mathcal{Q}^{k-1}\|_F^2$

for some constant $a > 0$ specified in the proof.
From Lemma 1, the Lagrangian sequence $\{\bar{\mathcal{L}}(\mathcal{Q}^k)\}$ is monotonically decreasing, and the descent of each iterate is lower bounded by the squared discrepancy between the current iterate and its predecessor. This lemma is crucial for the global convergence of a nonconvex algorithm. It yields at least the following four important items: (i) $\{\bar{\mathcal{L}}(\mathcal{Q}^k)\}$ is convergent if $\bar{\mathcal{L}}$ is lower bounded; (ii) $\{\mathcal{Q}^k\}$ itself is bounded if $\bar{\mathcal{L}}$ is coercive and $\bar{\mathcal{L}}(\mathcal{Q}^0)$ is finite; (iii) $\{\|\mathcal{Q}^k - \mathcal{Q}^{k-1}\|_F\}$ is square summable, i.e., $\sum_{k=1}^{\infty} \|\mathcal{Q}^k - \mathcal{Q}^{k-1}\|_F^2 < +\infty$, implying its asymptotic regularity, i.e., $\|\mathcal{Q}^k - \mathcal{Q}^{k-1}\|_F \to 0$ as $k \to \infty$; and (iv) $\min_{1 \le t \le k} \|\mathcal{Q}^t - \mathcal{Q}^{t-1}\|_F^2 \to 0$ at the rate of $\mathcal{O}(1/k)$. Leveraging Lemma 1, we can establish the global convergence (i.e., whole-sequence convergence) of BCD in DNN training settings. In contrast, Davis et al. (2019) only establish the subsequence convergence of SGD in DNN training settings. This gap between the subsequence convergence of SGD in Davis et al. (2019) and the whole-sequence convergence of BCD in this paper exists mainly because SGD achieves only a descent property, not the sufficient descent property.
Note from Lemma 1 that neither the multiconvexity and differentiability assumption nor the Lipschitz differentiability assumption is imposed on the DNN training models to yield this lemma, as required in the literature (Xu and Yin, 2013; Attouch et al., 2013; Xu and Yin, 2017; Bolte et al., 2014). Instead, we mainly exploit the proximal strategy for all non-strongly convex subproblems in Algorithm 2 to establish it.
4.3 Relative error lemma
We now present the obtained relative error lemma.
Lemma 2 (Relative error)
Under the conditions of Theorem 1, let $B$ be an upper bound of $\|\mathcal{Q}^k\|_F$ for any positive integer $k$, and let $L_B$ be a uniform Lipschitz constant of the activations $\sigma_i$ on the bounded set $\{Q : \|Q\|_F \le B\}$. Then for any positive integer $k$, it holds that

$\mathrm{dist}(0, \partial \bar{\mathcal{L}}(\mathcal{Q}^k)) \le b \|\mathcal{Q}^k - \mathcal{Q}^{k-1}\|_F$

for some constant $b > 0$, depending on $\gamma$, $B$ and $L_B$, specified in the proof.
Lemma 2 shows that the subgradient of the Lagrangian at the current iterate is bounded above, in norm, by the discrepancy between the current and previous iterates. Together with the asymptotic regularity of $\{\mathcal{Q}^k\}$ yielded by Lemma 1, Lemma 2 implies convergence to a critical point. Moreover, together with claim (iv) implied by Lemma 1, namely the $\mathcal{O}(1/k)$ rate of convergence of $\min_{1 \le t \le k} \|\mathcal{Q}^t - \mathcal{Q}^{t-1}\|_F^2$, Lemma 2 yields the rate of convergence (to a critical point) of BCD, i.e., $\min_{1 \le t \le k} \mathrm{dist}(0, \partial \bar{\mathcal{L}}(\mathcal{Q}^t))^2 \to 0$ at the rate of $\mathcal{O}(1/k)$.
In Lemma 2, neither the differentiability nor the (blockwise) Lipschitz differentiability assumption is imposed. Instead, we only use the Lipschitz continuity (on any bounded set) of the activations, a very mild and natural condition satisfied by most commonly used activation functions. To achieve this lemma, we also need some careful treatment of the specific updates of the BCD algorithms, as demonstrated in Section C.3.1.
5 Conclusion
The empirical efficiency of BCD methods in deep neural network (DNN) training has been demonstrated in the literature. However, the theoretical understanding of their convergence is still very limited, and a general framework has been lacking because DNN training is a highly nonconvex problem. In this paper, we fill this void by providing a general methodology to establish the global convergence of BCD methods for a class of DNN training models, which encompasses most of the commonly used BCD methods in the literature as special cases. Under mild assumptions, we establish the global convergence, at a rate of $\mathcal{O}(1/k)$ with $k$ being the number of iterations, to a critical point of the DNN training models with several variable splittings. Our theory also extends to residual networks with general losses that have Lipschitz continuous gradients. This work may lay a theoretical foundation for BCD methods in their applications to deep learning.
Acknowledgment
The work of Jinshan Zeng is supported in part by the National Natural Science Foundation (NNSF) of China (No. 61603162, 61876074). The work of Yuan Yao is supported in part by HKRGC grant 16303817, 973 Program of China (No. 2015CB85600), NNSF of China (No. 61370004, 11421110001), as well as grants from Tencent AI Lab, Si Family Foundation, Baidu BDI, and Microsoft Research Asia. The work of Shaobo Lin is supported in part by the NNSF of China (No. 61876133). The work of Tim Tsz-Kit Lau is supported by the University Fellowship funded by The Graduate School at Northwestern University, and part of this work was performed when he was at Department of Mathematics, The Hong Kong University of Science and Technology.
Appendix
Appendix A Implementations on BCD methods
In this section, we provide several remarks to discuss the specific implementations of BCD methods.
Remark 1 (On the initialization of parameters)
In practice, the weights are generally initialized according to Gaussian distributions with small standard deviations, and the bias vectors are usually set as all-one vectors scaled by some small constant. Given the weights and bias vectors, the auxiliary variables and state variables are usually initialized by a single forward pass through the network.
Remark 2 (On the update order)
We suggest such a backward update order in this paper due to the nested structure of DNNs. Besides the update order presented in Algorithm 2, any deterministic update order can be incorporated into the BCD methods, and our proofs still go through.
Remark 3 (On the distributed implementation)
One major advantage of BCD is that it can be implemented in a distributed and parallel manner, as in ADMM. Specifically, given $s$ servers, the training data are distributed across them. Denote by $\mathcal{D}_j$ the subset of samples at server $j$; thus $n = \sum_{j=1}^{s} |\mathcal{D}_j|$, where $|\mathcal{D}_j|$ denotes the cardinality of $\mathcal{D}_j$. For each layer $i$, the state variable $V_i$ is divided into $s$ submatrices by column, that is, $V_i = [V_i^{(1)}, \dots, V_i^{(s)}]$, where $V_i^{(j)}$ denotes the submatrix of $V_i$ comprising the columns indexed by $\mathcal{D}_j$. The auxiliary variables $U_i$ are decomposed similarly. From Algorithm 2, the updates of the $V_i$ and $U_i$ blocks need no communication and thus can be computed in parallel. The difficult part is the update of the weights $W_i$, which is generally hard to parallelize; effective strategies for this part have been suggested in the literature, e.g., Taylor et al. (2016).
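A sketch of the column-wise decomposition described above, with the servers simulated sequentially in one process (in a real deployment each slice would live on a different machine; the helper names are ours):

```python
import numpy as np

def partition_columns(V, num_servers):
    # Split the state matrix by columns: each server owns one sample block.
    return np.array_split(V, num_servers, axis=1)

def parallel_state_update(V, update_fn, num_servers):
    # The V- and U-updates act independently on each column block, so the
    # servers can apply update_fn without communicating; the results are
    # then concatenated back (here sequentially, for illustration).
    parts = partition_columns(V, num_servers)
    return np.concatenate([update_fn(P) for P in parts], axis=1)
```

For any column-separable update (e.g., an elementwise proximal map), the distributed result matches the centralized one exactly, which is what makes the state updates embarrassingly parallel.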
Appendix B Proof of Proposition 1
Proof
We verify these special cases as follows.
On the loss function $\ell$: since these losses are all nonnegative and continuous on their domains, they are proper, lower semicontinuous, and bounded below by 0. In the following, we only verify that they are either real analytic or semialgebraic.


If $\ell$ is the squared or the exponential loss, then according to Krantz and Parks (2002), it is real analytic.

If $\ell$ is the logistic loss, then since it is a composition of the logarithm and exponential functions, both of which are real analytic, the logistic loss is real analytic by Lemma 3.

If $\ell$ is the hinge loss, $\ell(t) = \max\{0, 1 - t\}$, then by Lemma 4(1) it is semialgebraic, because its graph is the closure of a finite union of sets defined by polynomial equalities and inequalities.
On the activation functions $\sigma_i$: since all the considered activations are continuous on their domains, they are Lipschitz continuous on any bounded set. In the following, we only need to check that they are either real analytic or semialgebraic.


If $\sigma_i$ is a linear or polynomial function, then according to Krantz and Parks (2002), it is real analytic.

If $\sigma_i$ is the sigmoid, $\sigma_i(t) = \frac{1}{1 + e^{-t}}$, or the hyperbolic tangent, $\sigma_i(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}$, then it can be written as a composition $g \circ h$ of two real analytic functions, e.g., $h(t) = e^{-t}$ and $g(u) = \frac{1}{1 + u}$ in the sigmoid case. According to Krantz and Parks (2002), $g$ and $h$ are real analytic in both cases. Thus, by Lemma 3, the sigmoid and hyperbolic tangent functions are real analytic.

If $\sigma_i$ is ReLU, i.e.,